This is the fourth blogpost in a series of blogposts about how the Oracle database handles redo. The previous blogpost talked about the work cycle of the log writer: https://fritshoogland.wordpress.com/2018/02/12/a-look-into-oracle-redo-part-3-the-log-writer-work-cycle-overview/. This post looks into the execution of the kcrfw_redo_write_driver function, which is executed inside ksbcti.

Now that we are familiar with how the logwriter works in general, we need to take a closer look at the kcrfw_redo_write_driver function. First let me be clear: the kcrfw_redo_write_driver function inside ksbcti is called every time the logwriter process times out on its semaphore, which is every 3 seconds, so it is also called when nothing has been written into the public redo strands. In fact, when the log writer times out on semtimedop (which means it has not (yet) been semctl'ed or semop'ed), it doesn't know whether there are entries in the public redo strands at that point in time.

A lot goes on in the kcrfw_redo_write_driver function. The following traces are small snippets of execution from the function, not every detail of it.

The first thing of interest is the logwriter obtaining an LWN (log write number) and an SCN.

 | | > kcsnew3(0x600113b8, 0x7ffc3dd10d28, ...)
 | | | > kcsnew8(0x600113b8, 0x7ffc3dd10b70, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0

Please mind that at this time it is unknown whether there is anything to write in the public redo strands.

So the kcsnew3 function is executed, with 0x600113b8 as first argument. That is the address of kcsgscn_, the instance global SCN. The kcsnew3 function calls the kcsnew8 function, which calls kslgetl for latch address 0x60049800. That address belongs to the 'lgwr LWN SCN' latch in my instance. In the function call overview there is nothing more to see. If we look at the pina memory trace, we see a lot more; this is what is executed in between kslgetl and kslfre:

kcsnew8+107:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0x1/1()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):R:4:0xc5/197()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):W:4:0xc6/198()
kcsnew8+125:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b7e/7744382()
kcsnew8+131:0x00007fffa1071ae0():W:4:0/0()
kcsnew8+140:0x00007fffa1071b12():W:1:0/0()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b7e/7744382()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):W:8:0x762b7f/7744383()
kcsnew8+165:0x00007fffa1071ae0():R:4:0/0()
kcsnew8+184:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsnew8+193:0x00007f520ba8ae90():R:8:0x7a823620/2055353888(shared pool|permanent memor,duration 1,cls perm+769 )
kcsnew8+212:0x000000007a823e40(shared pool|permanent memor,duration 1,cls perm+7695936 ):R:8:0x719beed8/190604(shared pool|ksu:stats_freel,duration 1,cls freeabl+24 )
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):R:8:0xc6/198()
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):W:8:0xc7/199()
kcsnew8+223:0x00007fffa1071bf0():W:8:0x762b7f/7744383()
kcsnew8+232:0x0000000060016260(fixed sga|var:kcrfsg_+56 shared pool|x_kcrfws.lwn_scn+0 ):W:8:0x762b7f/7744383()
kcsnew8+238:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0/0()
kcsnew8+246:0x0000000060016288(fixed sga|var:kcrfsg_+96 shared pool|pointer:lgwr lwn scn latch+0 ):R:8:0x60049800/1610913792(fixed sga|(parent)latch:lgwr LWN SCN+0 fixed sga|var:lwn_scn_lat_+0 )
kcsnew8+250:0x00007fffa1071ac8():W:8:0x10e3ed3f/283372863()

What is visible here is that at kcrfsg_ offset 76, '1' is written. This is quite probably a flag indicating the LWN related entries are being modified (this is a guess; of course the LWN SCN latch essentially performs the same function), especially since it is set back to '0' at the end of the function (kcsnew8+238). At kcrfsg_ offset 72 a number is read and increased by 1. This is most likely the LWN number. If you take a look at the number of gets for the 'lgwr LWN SCN' latch, you will see the number of gets is very close (albeit not equal) to the LWN number.

Then the LWN SCN is fetched from kcsgscn_; it is first read, increased by one to 7744383 and written back into kcsgscn_ (get and advance). A little further, the SCN from kcsgscn_ is also written into the kcrfsg_ struct at offset 56, after which the '1' that was set earlier is reset to '0' at offset 76. This SCN is visible in x$kcrfws as the LWN SCN.
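
As a sketch in C of what the memory trace suggests kcsnew8 does under the 'lgwr LWN SCN' latch (the struct and field names are my shorthand for the annotated offsets, not Oracle's; this is purely an illustration of the get-and-advance, not Oracle code):

#include <stdio.h>
#include <stdint.h>

/* Shorthand for the kcrfsg_ fields touched in the trace; names are mine. */
struct kcrfsg_sketch {
    uint32_t lwn_flag;    /* kcrfsg_+76: 1 while the LWN fields are in flux */
    uint32_t lwn_number;  /* kcrfsg_+72: LWN number, incremented per cycle  */
    uint64_t lwn_scn;     /* kcrfsg_+56: LWN SCN, visible in x$kcrfws       */
};

static uint64_t kcsgscn = 7744382;   /* instance global SCN (kcsgscn_)      */

static uint64_t new_lwn_scn(struct kcrfsg_sketch *k)
{
    /* the real code takes the 'lgwr LWN SCN' latch (kslgetl) here */
    k->lwn_flag   = 1;               /* mark LWN fields as being modified   */
    k->lwn_number += 1;              /* advance the LWN number              */
    k->lwn_scn    = ++kcsgscn;       /* get and advance the global SCN      */
    k->lwn_flag   = 0;               /* modification done                   */
    /* the real code frees the latch (kslfre) here */
    return k->lwn_scn;
}

int main(void)
{
    struct kcrfsg_sketch k = { 0, 197, 0 };
    printf("LWN SCN: %llu\n", (unsigned long long)new_lwn_scn(&k)); /* 7744383 */
    printf("LWN nr : %u\n", k.lwn_number);                          /* 198     */
    return 0;
}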

The next thing to look at is the function kcrfw_gather_lwn a little further:

 | | > kcrfw_gather_lwn(0x7ffc3dd10d68, 0x77eced90, ...)
 | | | > kslgetl(0x76fe0c10, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7ffc3dd10d68, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x76fe0c10, 0x118b2, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7ffc3dd10d68, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x76fe0cb0, 0x19, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff

In the kcrfw_gather_lwn function, first latch address 0x76fe0c10 is taken in immediate mode, which is redo allocation child #1, and then the function kcrfw_gather_strand is called, which returns 0, after which the latch is freed. This pattern repeats itself with a different latch address, 0x76fe0cb0, which is redo allocation child #2. This gives a hint that the two public strands are active. Looking at the memory access trace again, a fixed SGA variable is used to find out how many strands are active:

7038:kcrfw_gather_lwn+76:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()

A little further, information is obtained from kcrfsg_ about the redo allocation latch and the KCRFA structure address:

7052:kcrfw_gather_lwn+223:0x0000000060016570(fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
7053:kcrfw_gather_lwn+230:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )

Then the redo allocation latch child #1 is obtained via kslgetl, after which kcrfw_gather_lwn calls kcrfw_gather_strand:

kcrfw_gather_strand+2:0x00007fffa1071b00():W:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1071af8():W:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+29:0x00007fffa1071af0():W:8:0x76fdf2b0/1996354224(shared pool|permanent memor,duration 1,cls perm+16114352 )
kcrfw_gather_strand+36:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_strand+53:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1071de8():R:8:0x762b7f/7744383()
kcrfw_gather_strand+79:0x0000000076fdf390(shared pool|KCRFA+8 shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x762b7c/7744380()
kcrfw_gather_strand+86:0x0000000076fdf3a8(shared pool|KCRFA+32 shared pool|NEXT_BUF_NUM_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x65e/1630()
kcrfw_gather_strand+121:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):R:4:0x65e/1630()
kcrfw_gather_strand+169:0x00007fffa1071bf8():W:4:0x65e/1630()
kcrfw_gather_strand+176:0x00007fffa1071bfc():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1071af0():R:8:0x76fdf2b0/1996354224(shared pool|permanent memor,duration 1,cls perm+16114352 )
kcrfw_gather_strand+186:0x00007fffa1071af8():R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+193:0x00007fffa1071b00():R:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+194:0x00007fffa1071b08():R:8:0x31f0f00/52367104()

In kcrfw_gather_strand, we see the sequence we have seen before: at offset 0 of kcrfsg_, the KCRFA struct address is looked up, after which the function investigates addresses in the KCRFA struct for the first public redo strand only. I've annotated these in the pinatrace snippet above. At offset 36 in the function, the last buffer written to the online redo logfile is obtained from offset 208 in KCRFA, which is public strand buffer 1630. A little further, at offsets 86 and 121 in the function kcrfw_gather_strand, the most recently used public strand buffer that is written to is obtained at KCRFA offset 32 (NEXT_BUF_NUM_KCRFA_CLN), and the highest buffer number that has been investigated (gathered) by the logwriter process at KCRFA offset 212 (LAST_BUF_GATHERED_KCRFA). In all these cases, the public redo strand buffer number is 1630. This means no redo has been written into the first public redo strand.
After reading the KCRFA information for the first public redo strand, the latch for this struct is freed.

A little further, the latch is obtained that protects the second public strand, and kcrfw_gather_strand is executed again to read the KCRFA information for this strand:

kcrfw_gather_strand+2:0x00007fffa1071b00():W:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1071af8():W:8:0x76fdf4b0/1996354736(shared pool|KCRFA+296 shared pool|permanent memor,duration 1,cls perm+16114864 )
kcrfw_gather_strand+29:0x00007fffa1071af0():W:8:0x76fdf1f0/1996354032(shared pool|permanent memor,duration 1,cls perm+16114160 )
kcrfw_gather_strand+36:0x0000000076fdf580(shared pool|KCRFA+504 shared pool|permanent memor,duration 1,cls perm+16115072 ):R:4:0x1d/29()
kcrfw_gather_strand+53:0x0000000076fdf5a8(shared pool|KCRFA+544 shared pool|permanent memor,duration 1,cls perm+16115112 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1071de8():R:8:0x762b7f/7744383()
kcrfw_gather_strand+79:0x0000000076fdf4b8(shared pool|KCRFA+304 shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x762ae7/7744231()
kcrfw_gather_strand+86:0x0000000076fdf4d0(shared pool|KCRFA+328 shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0x1d/29()
kcrfw_gather_strand+121:0x0000000076fdf584(shared pool|KCRFA+508 shared pool|permanent memor,duration 1,cls perm+16115076 ):R:4:0x1d/29()
kcrfw_gather_strand+169:0x00007fffa1071bf8():W:4:0x1d/29()
kcrfw_gather_strand+176:0x00007fffa1071bfc():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1071af0():R:8:0x76fdf1f0/1996354032(shared pool|permanent memor,duration 1,cls perm+16114160 )
kcrfw_gather_strand+186:0x00007fffa1071af8():R:8:0x76fdf4b0/1996354736(shared pool|KCRFA+296 shared pool|permanent memor,duration 1,cls perm+16114864 )
kcrfw_gather_strand+193:0x00007fffa1071b00():R:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+194:0x00007fffa1071b08():R:8:0x31f0f00/52367104()

This is exactly the same function, just reading the same information for a different strand, which means at a different offset in KCRFA. If you subtract the size of the information for a single strand, which we calculated to be 296 (see blogpost number 2), from the KCRFA offsets, you get the same numbers as above (504-296=208=LAST_BUF_WRITTEN). Once you understand what is read, it should be fairly easy to see nothing has been written into the second strand either. So at this point, the logwriter knows there is nothing to write.
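
Summarising the gather phase so far as a sketch: the field names below follow the pinatrace annotations, but the struct is only shorthand for reading those KCRFA offsets (not the real layout), and the comparison is my interpretation of how 'nothing to write' is concluded, not Oracle code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Shorthand for the per strand KCRFA fields read in kcrfw_gather_strand. */
struct strand_sketch {
    uint32_t next_buf_num;       /* KCRFA+32 : NEXT_BUF_NUM_KCRFA_CLN      */
    uint32_t last_buf_written;   /* KCRFA+208: LAST_BUF_WRITTEN            */
    uint32_t last_buf_gathered;  /* KCRFA+212: LAST_BUF_GATHERED_KCRFA     */
};

/* my interpretation: a strand has work when redo was generated past the
 * last buffer written/gathered; in the traces all three values were equal */
static bool strand_has_work(const struct strand_sketch *s)
{
    return s->next_buf_num != s->last_buf_written ||
           s->last_buf_gathered != s->last_buf_written;
}

int main(void)
{
    /* the values from the two strands in the traces above */
    struct strand_sketch strands[] = { { 1630, 1630, 1630 }, { 29, 29, 29 } };
    int kcrf_max_strands = 2;    /* fixed sga variable kcrf_max_strands_   */
    bool work = false;

    for (int i = 0; i < kcrf_max_strands; i++) {
        /* the real code takes the redo allocation child latch for strand i,
         * calls kcrfw_gather_strand, and frees the latch again */
        if (strand_has_work(&strands[i]))
            work = true;
    }
    printf("anything to write: %s\n", work ? "yes" : "no");  /* prints: no */
    return 0;
}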

What does the logwriter do when there is nothing to write? It executes a ‘null write’:

 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x40, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x40, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x769a5620
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x40, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x40
 | | | | > kcscu8(0x60016290, 0x7ffc3dd10a98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7ffc3dd10a38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x40, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7ffc3dd08730, 0x7ffc3dd10750, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x769a5628, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3

This means there is housekeeping to do, despite no public redo strand buffers needing to be written. The most crucial things it does, as far as I can detect, are updating the on disk SCN in the kcrfsg_ struct with the already set LWN SCN, and posting any outstanding processes that are sleeping on a semaphore via post/wait. These are done in the kcrfw_post function, whose name probably means (kernel cache redo file write) 'post write'; however, there was nothing to write (null write), since there was nothing in the public redo strands. If you look at the above function trace, you see that inside kcrfw_do_null_write, kcrfw_post is called.

Inside kcrfw_post, kcscu8 (kernel cache service get current SCN) is called with 0x60016290 as first argument, which is the memory address of offset 104 in the kcrfsg_ struct in the fixed SGA, externalised as ON_DISK_SCN in X$KCRFWS, in order to read the current value of the on disk SCN:

kcscu8+2:0x00007fffa10718a0():W:8:0x7fffa1071bb0/140735894985648()
kcscu8+13:0x00007fffa1071898():W:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcscu8+20:0x00007fffa1071890():W:8:0x7f520ba98f40/139990359707456()
kcscu8+27:0x00000000600162a0(Fixed Size(pgsz:2048k)|+90784 fixed sga|var:kcrfsg_+120 ):R:4:0/0()
kcscu8+31:0x00000000600162a4(Fixed Size(pgsz:2048k)|+90788 fixed sga|var:kcrfsg_+124 ):R:4:0/0()
kcscu8+39:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):R:8:0x762b7e/7744382()
kcscu8+43:0x00000000600162a0(Fixed Size(pgsz:2048k)|+90784 fixed sga|var:kcrfsg_+120 ):R:4:0/0()
kcscu8+56:0x00007fffa1071b18():W:8:0x762b7e/7744382()
kcscu8+59:0x00007fffa1071898():R:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcscu8+63:0x00007fffa1071890():R:8:0x7f520ba98f40/139990359707456()
kcscu8+70:0x00007fffa10718a0():R:8:0x7fffa1071bb0/140735894985648()
kcscu8+71:0x00007fffa10718a8():R:8:0x3202150/52437328()

The next function that is called is kcsaj8 (kernel cache service adjust SCN), with 0x60016290 as first argument, indicating it is going to change the on disk SCN:

kcsaj8+105:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):R:8:0x762b7e/7744382()
kcsaj8+105:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):W:8:0x762b7f/7744383()
(these are only the relevant lines from the memory trace for the function kcsaj8)

What this means is that the LWN SCN that was obtained at the beginning of the write cycle using the kcsnew3 function is now written into the on disk SCN too, despite nothing actually having been written to disk. The obvious question at this point is: 'if the on disk SCN is increased without anything actually written, is there an SCN value that actually tracks the SCN of truly written ("on disk") redo?' To spoil the next blogpost about actual redo writing: this is administered in an SCN value called the 'real redo SCN'.

After kcrfw_post returns, the execution returns from kcrfw_do_null_write and kcrfw_redo_write_driver, ending the ‘null write’.

This is the third part of a series of blogposts on how the Oracle database handles redo. The previous part talked about the memory area that stores redo strand information: https://fritshoogland.wordpress.com/2018/02/05/a-look-into-oracle-redo-part-2-the-discovery-of-the-kcrfa-structure/.

The single most important process in the Oracle database for handling redo is the log writer, whose primary task is flushing the redo information that other Oracle database processes put in the public redo strands to disk. Now that we have investigated the public redo strands and concurrency (first part) and kcrfsg_ and the KCRFA structure (second part), it seems logical to me to look at the log writer.

Simply because we have seen a foreground process take the redo allocation latch, 'fiddle around' in the kcrfsg_ and KCRFA structures and release the latch, it is a pretty safe guess that the logwriter uses these structures to determine whether it needs to write. It looks like the kcrfsg_ structure contains global redo information like redo log file information, and the KCRFA structure strand specific information.

First we need to get an understanding of what the logwriter process is actually doing. The below investigation is done to see how the idle and write work cycles work; I explicitly excluded additional tasks like redo log file switches.

First of all, quite surprisingly actually, there is little information about what the log writer is actually doing. The only well known facts about the logwriter are that it writes the log buffer, instrumented by the 'log file parallel write' wait event, and sleeps on a semaphore in the wait event 'rdbms ipc message'. In order to learn more, I've run a debugtrace on an idle logwriter process, stripped excess information, and then captured only the calls one level from the stack depth the logwriter process is on in between the work cycles:

$ awk -n '$0 ~ /^\ \x3e\ /' lgwr_full_cycle_stripped.txt
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<
 > ksl_exit_main_loop_wait(0x9d64fb00c, 0x7, ...)
 > ksbcti(0x12b84c18, 0x7f2c1eb804b0, ...)          (nr 1)
 > dbktFlush(0, 0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)          (nr 2)
 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)          (nr 3)
 > sltrgatime64(0, 0, ...)
 > ksl_enter_main_loop_wait(0x1, 0x4ccef9975, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<
 > ksl_exit_main_loop_wait(0x9d67d8b01, 0x7, ...)
 > ksbcti(0x12b84c18, 0x7f2c1eb804b0, ...)          (nr 1)
 > dbktFlush(0, 0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)          (nr 2)
 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)          (nr 3)
 > sltrgatime64(0, 0, ...)
 > ksl_enter_main_loop_wait(0x1, 0x4cd05fb07, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<

ksarcv means kernel service asynchronous receive message; this is the function that sets up the semaphore and puts the logwriter process to sleep. This also explains the ksl_exit_main_loop_wait and ksl_enter_main_loop_wait functions: these are called exactly before (enter) and after (exit) the ksarcv function. The dbktFlush function manages trace files; this function is used to write messages to the logwriter tracefile. The sltrgatime64 function is a function that calls clock_gettime. That leaves one function that can be seen in my example: ksbcti. ksbcti means kernel service background processes call timeout/interrupts. This is a function that performs several different tasks based on its first argument.
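
To condense the trace above into a runnable sketch of the timed-out idle cycle (the function names are the ones from the trace, implemented here as trivial stubs; the loop itself is my interpretation, not Oracle code):

#include <stdio.h>

/* Stubs standing in for the Oracle functions seen in the debugtrace above. */
static void ksl_enter_main_loop_wait(void) { }
static void ksl_exit_main_loop_wait(void)  { }
static void ksarcv(void)       { puts("ksarcv: sleep on semaphore, max 3 seconds"); }
static void dbktFlush(void)    { }
static void sltrgatime64(void) { }
static void ksbcti(const char *task) { printf("ksbcti: %s\n", task); }

int main(void)
{
    /* two iterations of the idle cycle, matching the trace above */
    for (int cycle = 0; cycle < 2; cycle++) {
        ksl_enter_main_loop_wait();
        ksarcv();                 /* here the 3 second semaphore timeout expires */
        ksl_exit_main_loop_wait();

        ksbcti("nr 1: kews_timeout, administer resource usage statistics");
        dbktFlush();              /* trace file housekeeping */
        sltrgatime64();
        ksbcti("nr 2: kcrfw_redo_write_driver, the actual redo write");
        ksbcti("nr 3: dataguard/RAC related checks");
        sltrgatime64();
    }
    return 0;
}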

The first ksbcti call following ksl_exit_main_loop_wait (with first argument 0x12b84c18) seems to be the one that, among other things, administers the log writer process resource usage details in the SGA:

 | | > kews_timeout(0x7f2c1eb93ec0, 0, ...)
 | | | > sltrgftime64(0x7f2c1eb93ec0, 0, ...)
 | | | | > clock_gettime@plt(0x1, 0x7ffc3dd10ad0, ...)
 | | | | | > clock_gettime(0x1, 0x7ffc3dd10ad0, ...)
 | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | < clock_gettime+0x00000000003a returns: 0
 | | | < sltrgftime64+0x00000000004c returns: 0x9cb6163fa
 | | | > slcpums(0x7f2c23ba3c58, 0x191ae1d5, ...)
 | | | | > getrusage(0x1, 0x7ffc3dd10a30, ...)
 | | | | | > fthread_self(0x1, 0x7ffc3dd10a30, ...)
 | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | <> getrusage_ext(0x1, 0x7ffc3dd10a30, ...)
 | | | | < getrusage+0x00000000000f returns: 0
 | | | < slcpums+0x00000000008f returns: 0x31d592
 | | | > kews_sqlst_flush(0xc, 0x72fd7d40, ...)
 | | | | > sltrgftime64(0xc, 0x72fd7d40, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7ffc3dd10a60, ...)
 | | | | | | > clock_gettime(0x1, 0x7ffc3dd10a60, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x9cb61723f
 | | | | > slcpums(0x7f2c23ba3c58, 0x7ae0ee70, ...)
 | | | | | > getrusage(0x1, 0x7ffc3dd109c0, ...)
 | | | | | | > fthread_self(0x1, 0x7ffc3dd109c0, ...)
 | | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | | <> getrusage_ext(0x1, 0x7ffc3dd109c0, ...)
 | | | | | < getrusage+0x00000000000f returns: 0
 | | | | < slcpums+0x00000000008f returns: 0x31e417
 | | | < kews_sqlst_flush+0x00000000016a returns: 0x7f2c23b9e900
 | | < kews_timeout+0x00000000052c returns: 0x7f2c23b9e900

kews_timeout means kernel event wait statistics, so these are statistics updated because of the process timing out. sltrgftime64 is an O/S specific call wrapper to read time from the operating system; it calls clock_gettime, which reads the system clock (argument 1 is CLOCK_MONOTONIC, a clock that is not affected by 'discontinuous' changes to the clock time; think of daylight saving time changes for example). Similarly, slcpums is an O/S specific call wrapper to read CPU usage; it calls getrusage to obtain CPU accounting data from the operating system. kews_sqlst_flush seems to be related to automatic tuning features; this *probably* (I haven't investigated further) flushes status data like time and cpu to a buffer, so features like sql monitor and adaptive features can use it.
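
These are ordinary Linux library calls. A minimal standalone example of what sltrgftime64 and slcpums essentially do (CLOCK_MONOTONIC for elapsed time, getrusage for CPU time; the trace shows getrusage(0x1, ...), which on Linux corresponds to RUSAGE_THREAD) could look like this:

#define _GNU_SOURCE             /* for RUSAGE_THREAD on Linux */
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

int main(void)
{
    struct timespec ts;
    struct rusage ru;

    /* CLOCK_MONOTONIC (value 1): not affected by discontinuous clock changes */
    clock_gettime(CLOCK_MONOTONIC, &ts);

    /* CPU usage of the calling thread, like the getrusage(1, ...) in the trace */
    getrusage(RUSAGE_THREAD, &ru);

    printf("monotonic: %ld.%09ld s\n", (long)ts.tv_sec, (long)ts.tv_nsec);
    printf("user cpu : %ld.%06ld s\n", (long)ru.ru_utime.tv_sec,
           (long)ru.ru_utime.tv_usec);
    return 0;
}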

The second ksbcti call (with first argument 0x12b84be0) is the one that performs the actual redo write, which is done by the kcrfw_redo_write_driver function:

 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)
 | > ksbckbast(0, 0, ...)
 | < ksbckbast+0x000000000061 returns: 0x1
 | > ksumcl(0, 0, ...)
 | < ksumcl+0x000000000045 returns: 0x1540
 | > kcrfw_redo_write_driver(0, 0, ...) 
...

The next blogpost will look into what the kcrfw_redo_write_driver function actually does. To spoil it a bit: the kcrfw_redo_write_driver function is called every 3 seconds, even if there's nothing to write!

The third ksbcti call (with first argument 0x12b84bfc) performs a function or series of functions that are not clear to me, and not a lot is done inside the functions called in ksbcti to give me a hint:

 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)
 | > ksbmsg(0, 0, ...)
 | < ksbmsg+0x000000000040 returns: 0x60065d90
 | > ksbxiaf(0, 0, ...)
 | < ksbxiaf+0x000000000058 returns: 0x795b5828
 | > kjci_action(0, 0, ...)
 | < kjci_action+0x000000000035 returns: 0x60065d90
 | > kcfsmpoll(0, 0, ...)
 | < kcfsmpoll+0x000000000030 returns: 0x60065d90
 | > ksuwaitsysevent(0, 0, ...)
 | < ksuwaitsysevent+0x0000000001e7 returns: 0x795b4aa8
 | > krdrsb_lgwr_chintr(0, 0, ...)
 | | > krsh_trace(0x2000, 0, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | < krdrsb_lgwr_chintr+0x00000000008b returns: 0
 | > ksbckbast(0, 0, ...)
 | < ksbckbast+0x000000000061 returns: 0x1
 < ksbcti+0x00000000018f returns: 0x1

Of course there are several things that point to features that are not used in my database; krdrsb points to dataguard and kjci points to RAC.

If you look again at the sequence, you might wonder why the logwriter is not executing the redo write function immediately after waking up from the semaphore in ksarcv, in order to be able to write as soon as a process semctl's/semop's the semaphore asking the logwriter to write. Well, actually there is another sequence of function calls that can happen:

 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > ksl_exit_main_loop_wait(0x6003ef40, 0x7af972b8, ...)
 > kcrfw_redo_write_driver(0x7ffc3dd122a0, 0x18, ...)         <<<<<
 > ksbcti(0x12b84c74, 0x7f2c1eb804b0, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > dbktFlush(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksl_enter_main_loop_wait(0x1, 0x7a6032e8, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)

Here you see that kcrfw_redo_write_driver is called immediately after ksarcv and ksl_exit_main_loop_wait, not inside ksbcti. What happened in ksarcv is that the semtimedop call returned 0, indicating the semaphore was semctl'ed or semop'ed, instead of timing out after 3 seconds. If you paid close attention, you saw that the first argument of the ksbcti call following kcrfw_redo_write_driver is different (0x12b84c74). I am not sure why, because the same functions are called as in the call with 0x12b84bfc as first argument (the ones hinting at dataguard and RAC). The ksarcv call that follows does not put the process to sleep on a semaphore; it reads the asynchronous messages structure in the fixed SGA (ksasga_), after which it returns immediately. What follows are dbktFlush and sltrgatime64 for administrative purposes (writing to the tracefile if necessary, and updating resource usage information), after which the process prepares the semaphore sleep using ksl_enter_main_loop_wait and then goes to sleep on a semaphore in ksarcv.
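
The underlying mechanism is the Linux semtimedop() call with a 3 second timeout: it returns 0 when the semaphore is posted by another process, and -1 with errno set to EAGAIN when it times out. A minimal standalone illustration (a private semaphore set, nothing Oracle specific):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    /* create a private semaphore set with one semaphore (value 0 on Linux) */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) { perror("semget"); return 1; }

    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec timeout = { .tv_sec = 3, .tv_nsec = 0 };

    /* wait for a post, but give up after 3 seconds -- like the logwriter */
    if (semtimedop(semid, &op, 1, &timeout) == 0)
        printf("posted: someone did a semop/semctl on the semaphore\n");
    else if (errno == EAGAIN)
        printf("timed out after 3 seconds: nothing posted\n");
    else
        perror("semtimedop");

    semctl(semid, 0, IPC_RMID);  /* clean up the semaphore set */
    return 0;
}

Run as-is this prints the timeout message after 3 seconds, which is exactly the idle behaviour of the logwriter; a post from another process would make semtimedop return immediately with 0.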

Please note this is the general cycle; there are other tasks that the log writer performs. One example of such a task is that the log writer checks whether the process monitor (PMON) process is still alive, which is done every minute.

This is the second post in a series of blogposts on Oracle database redo internals. If you landed on this blogpost without having read the first blogpost, here is a link to the first blogpost: https://fritshoogland.wordpress.com/2018/01/29/a-look-into-oracle-redo-part-1-redo-allocation-latches/ The first blogpost contains all the versions used and a synopsis on what the purpose of this series of blogposts is.

In the first part, I showed how the principal access to the public redo strands is controlled by redo allocation latches, and showed a snippet of trace information of memory accesses of a foreground session when using the first public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

And a snippet of trace information of memory accesses of a foreground session when using the second public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

If you pay really close attention, you see the memory locations moved from 0x76fdf3?? / 0x76fdf4?? to 0x76fdf4?? / 0x76fdf5??, all in permanent memory in the shared pool. Would this mean the information for each strand is kept in a struct per strand? Let's take a memory access at a specific offset in the kcrfw_redo_gen_ext function for the first strand, take the access at exactly the same offset in kcrfw_redo_gen_ext for the second strand (which means it's probably doing the same thing), and calculate the difference between the memory addresses:

strand 0
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8
strand 1
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8

Now take the memory addresses, and calculate the difference: 0x76fdf520 – 0x76fdf3f8 = 0x128 = 296

Okay, so would the size of the probable struct per strand be 296 bytes? Another question: where would this potential struct start?

Often, the location of a struct is fetched from another location, just like we saw earlier for the redo allocation latch, especially if the struct is created more dynamically than the fixed SGA structures, which are not dynamic by nature. Upon close examination of the similarities between both pina traces, I found usage of a pointer in the kcrfsg_ struct at offset +0:

kcrfw_redo_gen_ext+14362:0x0000000060016228(fixed sga|var:kcrfsg_+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )

The difference between the memory address I took and the pointer address in kcrfsg_ at offset 0 for the first public redo strand is:
0x76fdf3f8 – 0x76fdf388 = 0x70 = 112
The difference between the memory address I took for the second public redo strand and the pointer in kcrfsg_ at offset 0 is:
0x76fdf520 – 0x76fdf388 = 0x198 = 408
Then subtract the difference we calculated earlier between the memory locations of the first and second public redo strands:
408 – 296 = 112
Bingo!
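
In other words, the per strand information appears to be laid out as an array of 296 byte entries starting at the address kept in kcrfsg_+0, with the field we picked sitting at offset 112 inside each entry. A quick standalone check of that arithmetic (the addresses are the ones observed in the pina traces above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* addresses observed in the pinatrace snippets above */
    const uintptr_t kcrfa_base = 0x76fdf388;  /* pointer stored at kcrfsg_+0  */
    const uintptr_t strand0    = 0x76fdf3f8;  /* access for the first strand  */
    const uintptr_t strand1    = 0x76fdf520;  /* same access, second strand   */

    const uintptr_t stride = strand1 - strand0;   /* size of one entry        */
    const uintptr_t offset = strand0 - kcrfa_base; /* offset inside the entry */

    printf("stride: %lu bytes\n", (unsigned long)stride);               /* 296 */
    printf("offset: %lu bytes\n", (unsigned long)offset);               /* 112 */
    printf("strand1 check: %lu\n",
           (unsigned long)(strand1 - kcrfa_base - stride));             /* 112 */
    return 0;
}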

It would be really helpful to have some kind of explanation of this structure… I searched through My Oracle Support (only very seldom does the regular knowledge base provide internals information; the best chance for internals is the bug database, the rest is… well, let me say: not that helpful. I also hate that other tabs are enabled if nothing is found), oraus.msg, oradebug dumplist, and oradebug doc component for any tracing or dumps that could give more information, but nothing seemed to point to any explanation. I dumped the shared pool, but I found the same information (the memory location being permanent memory) as I obtained from x$ksmsp. Also, since the pinatrace annotation takes care of any x$ view that has its source address in the SGA, it seems there is no way other than to puzzle out all the individual locations in what I assume is a structure that holds per strand metadata.

I decided to go through the x$ views one more time, and found an x$ view that I already knew of, which points to both 'strands' and 'kcrf': x$kcrfstrand. However, the reason it is not annotated is that this x$ view has its addresses in the PGA:

SQL> select addr from x$kcrfstrand;

ADDR
----------------
00007F14CE31AE00
00007F14CE31AE00

This means that when querying this x$ view, the results are gathered from wherever the source is (the source can be file headers, like redo (x$kccle) and controlfile (x$kcccf) for example; in this case we will see that the actual source of x$kcrfstrand is the SGA), stored temporarily in the PGA, potentially manipulated, and then shown from the PGA, as we can see from the addresses (addr field in x$kcrfstrand).

However, what I can do to investigate where x$kcrfstrand takes its information from, is run the pina (memory access) trace on the execution of 'select * from x$kcrfstrand', and try to see where it takes its data from. This is what the relevant part looks like:

qerfxFetch+1563:0x00007ffee590a638():W:8:0x10ea8781/283805569()
kcrf_strand_ftcb+2:0x00007ffee590a630():W:8:0x7ffee590b130/140732749885744()
kcrf_strand_ftcb+10:0x00007ffee590a620():W:8:0x7f7780569f40/140151230996288()
kcrf_strand_ftcb+23:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()
kcrf_strand_ftcb+29:0x00000000600169c8(Fixed Size(pgsz:2048k)|+92616 fixed sga|var:kcrf_actv_pvt_strands_+0 ):R:4:0/0()
kcrf_strand_ftcb+35:0x00007f7780448558():R:4:0/0()
kcrf_strand_ftcb+52:0x0000000060016228(Fixed Size(pgsz:2048k)|+90664 fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrf_strand_ftcb+64:0x00007ffee590a618():W:8:0xcf7cb25/217565989()
_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
__intel_ssse3_rep_memcpy+160:0x00000000157b9488():R:4:0x8000/32768()
__intel_ssse3_rep_memcpy+184:0x0000000076fdf4a0(shared pool|permanent memor,duration 1,cls perm+16114848 ):R:16:0x00000000200040000/8590196736(??)
__intel_ssse3_rep_memcpy+3438:0x00007f7780448670():W:16:0x0000ec00000000000/4151755906482176()
...
__intel_ssse3_rep_memcpy+9254:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:16:0xfe56e00000fe56e00000/-2305841916834480128()
__intel_ssse3_rep_memcpy+9258:0x00007f7780448560():W:16:0xfe56e00000fe56e00000/-2305841916834480128()

Going from top to bottom through this snippet:
– qerfxFetch is the query execute rowsource function to query x$ tables; kcrf_strand_ftcb is a helper function (kernel cache recovery file strand fetch copy buffer is my guess) to fetch the kcrf strand data from the shared pool structure; the other functions are memory copy functions.
– Interestingly, the kcrf_strand_ftcb function reads offset 0 of kcrfsg_, just like we have seen before. This is quite logical if you think about it: this struct is dynamically allocated in the shared pool, which means it doesn't have a fixed address, so it needs to be looked up every time something must be read from it.
– The next functions are memcpy functions; I've shown the relevant ones. The first memcpy (_intel_fast_memcpy) probably is a helper function that switches to an intel optimised memcpy, __intel_ssse3_rep_memcpy, which takes advantage of intel's SIMD optimisations (SSSE3 means Supplemental Streaming SIMD Extensions 3).
– The memcpy functions at lines 11 and 12 show it’s reading 16 bytes from 0x76fdf4a0, and then writing 16 bytes to 0x7f7780448670.
– 0x76fdf4a0 lies in the struct for the second strand: kcrfsg_+0 = 0x76fdf388, and 0x76fdf388 + 296 = 0x76fdf4b0, which equals 0x76fdf4a0 + 16 (the copy size).
– 0x7f7780448670 is quite probably an address in the process' PGA, which is used for the intermediate representation of the struct's data for displaying it in the x$ view.

Even more interesting, this confirms that x$kcrfstrand is the externalisation of the struct that is used together with kcrfsg_, and the calculations that were made previously are correct; kcrfsg_+0 = 0x76fdf388, and that address with 296 added is 0x76fdf4b0, which is the same as 0x76fdf4a0 + 16 (if you look into the above example, the memcpy function copies 16 bytes at a time).

I think it's pretty safe to assume this kcrf related struct in the shared pool is what is commonly referred to as 'KCRFA' in x$kcrfstrand. Of course we can only guess what the 'A' in 'kernel cache recovery file A' means; I would say 'auxiliary', but it could also very well be the A in 'strAnd'. It is a structure that clearly was needed when Oracle changed the concept of a single log buffer into multiple public redo strands, because identical information for multiple strands needs to be stored at the same time, and the information in kcrfsg_ is singular in nature.
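
Putting the offsets found in this series together, a partial sketch of a single KCRFA entry as it appears to be laid out in the SGA could look as follows. Only the fields whose offsets were annotated in the traces are filled in; everything marked unknown is exactly that, and the field names are the pinatrace annotation names, not necessarily Oracle's:

#include <stdint.h>

/* Partial sketch of one per strand KCRFA entry (296 bytes), assembled from the
 * offsets annotated in the pinatrace snippets in this series; the unknown
 * areas are padding in this sketch only, their real contents are not known. */
struct kcrfa_entry_sketch {
    uint8_t  unknown0[32];        /* +0..31  : contains an SCN at offset 8     */
    uint32_t next_buf_num_cln;    /* +32     : NEXT_BUF_NUM_KCRFA_CLN          */
    uint8_t  unknown1[172];       /* +36..207                                  */
    uint32_t last_buf_written;    /* +208    : LAST_BUF_WRITTEN                */
    uint32_t last_buf_gathered;   /* +212    : LAST_BUF_GATHERED_KCRFA         */
    uint8_t  unknown2[32];        /* +216..247                                 */
    uint32_t total_bufs;          /* +248    : TOTAL_BUFS_KCRFA (131072 here)  */
    uint8_t  unknown3[44];        /* +252..295: pads the entry to 296 bytes    */
};

/* the entry for strand n lives at the address stored at kcrfsg_+0, plus n*296 */
_Static_assert(sizeof(struct kcrfa_entry_sketch) == 296,
               "one KCRFA entry spans 296 bytes");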

Please mind that because of the intermediate representation of the x$ table in the PGA, I could not use the trick I showed in my 'drilldown into the logwriter inner working and communication', which is to query a single column in the x$ view and see which memory addresses in this struct are touched; the entire struct is copied into the PGA and processed before it is used to show the contents, even if only a single column of a single row is queried.

Also consider that since the struct data is copied into the process' PGA in order to make this information human readable, not all offsets (in x$kqfco) of the PGA representation might be applicable to the SGA struct. In fact, I think I have proven that the KCRFA per strand metadata width is 296. Now look at the column offsets from x$kqfco for x$kcrfstrand:

SQL> select c.kqfconam, c.kqfcooff, c.kqfcosiz
from x$kqfco c, x$kqfta t
where t.indx = c.kqfcotab
and t.kqftanam = 'X$KCRFSTRAND'
order by c.kqfcooff;

KQFCONAM                         KQFCOOFF   KQFCOSIZ
------------------------------ ---------- ----------
ADDR                                    0          8
INDX                                    0          4
INST_ID                                 0          4
CON_ID                                  0          2
STRAND_HDR_SCN_KCRFA_CLN                0          8
HIGHEST_SCN_KCRFA_CLN                   0          8
PNEXT_BUF_KCRFA_CLN                    32          8
NEXT_BUF_NUM_KCRFA_CLN                 40          4
BYTES_IN_BUF_KCRFA_CLN                 48          8
AVAILABLE_BUFS_KCRFA_CLN               56          4
MEM_BUFS_AVAILABLE_KCRFA_CLN           60          4
STRAND_HEADER_BNO_KCRFA_CLN            72          4
FLAG_KCRFA_CLN                         76          1
PVT_STRAND_STATE_KCRFA_CLN             77          1
HIGHEST_SUBSCN_KCRFA_CLN               78          2
LWN_BUF_NUM_KCRFA_CLN                  80          4
STRAND_NUM_ORDINAL_KCRFA_CLN           84          4
STRAND_INFO_PTR_KCRFA_CLN              88          8
ACTION_KCRFA                          184          4
LOG_RESIDUE_KCRFA                     188          1
LAST_BUF_WRITTEN_KCRFA                216          4
LAST_BUF_GATHERED_KCRFA               220          4
CONS_LOW_BUFNO_KCRFA                  224          4
FIRST_BUF_KCRFA                       240          8
LAST_BUF_KCRFA                        248          8
TOTAL_BUFS_KCRFA                      256          4
STRAND_SIZE_KCRFA                     260          4
LOWBUF_KCRFA                          272          4
HIBUF_KCRFA                           276          4
WRAPHIBUF_KCRFA                       280          4
LOG_SEQ_KCRFA                         284          4
PTR_KCRF_PVT_STRAND                   304          8
INDEX_KCRF_PVT_STRAND                 312          4
SPACE_KCRF_PVT_STRAND                 316          4
TXN_KCRF_PVT_STRAND                   320          2

This shows that the representation of the KCRFA struct in the PGA apparently has a width of 322 bytes, while we know the KCRFA struct itself is 296 bytes. I am not sure why this is; the most obvious reason would be that some data is already processed in the PGA to make it easier to show. This might also very well be the reason there is a need for copying it to the PGA first, instead of displaying it directly from its shared pool location.

This will be a series of posts about Oracle database redo handling. The database in use is Oracle version 12.2.0.1, with PSU 170814 applied. The operating system version is Oracle Linux Server release 7.4. In order to look into the internals of the Oracle database, I use multiple tools; very simple ones like the X$ views and oradebug, but also advanced ones, specifically the intel PIN tools (https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool). One of these tools is 'debugtrace', which produces pretty usable output by itself (an indented list of function calls and returns), for which I essentially filter out some data; another one is 'pinatrace', which does not produce directly usable output, because it provides instruction pointers and memory addresses. However, the information it provides is extremely useful (and detailed!) once you translate the addresses to Oracle function names and memory region names. I have written a tool to do that: https://gitlab.com/FritsHoogland/pinatrace_annotate.git

These blog posts are meant for investigating Oracle database internals. This is a microscopic view into how the database works. I do believe that by understanding the mechanics, you can assess and resolve issues very easily and quickly, because of a fundamental understanding of how the database works.

First of all, to immediately show the value of looking into the internals, a clear documentation bug: this entry from the online documentation of 12.2: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cncpt/memory-architecture.html#GUID-C2AD1BF6-A5AE-42E9-9677-0AA08126864B (not sure how long this link will be valid, Oracle is notorious for changing it) states that the log buffer is a single buffer called the log buffer. In fact, the log buffer is not a single buffer, but an area that contains at least two public redo strands, which has been the case since at least Oracle 10g. These strands are externalised in x$kcrfstrand:

SQL> select indx,strand_size_kcrfa, first_buf_kcrfa, last_buf_kcrfa from x$kcrfstrand where last_buf_kcrfa != '00';

      INDX STRAND_SIZE_KCRFA FIRST_BUF_KCRFA  LAST_BUF_KCRFA
---------- ----------------- ---------------- ----------------
         0          67108864 00000000D8000000 00000000DBFFFE00
         1          67108864 00000000DC000000 00000000DFFFFE00

The number of public redo buffers is dependent on the undocumented parameter “_log_parallelism_max”. A quick question out on twitter revealed that systems with a high count of processors could have a higher public redo strand count.

The public strands are allocated in the ‘Redo Buffers’ memory allocation:

SQL> select "AREA NAME", "START ADDR", to_char((to_number(rawtohex("START ADDR"),'xxxxxxxxxxxxxxxx')+"SEGMENT SIZE"),'XXXXXXXXXXXXXXXX'),pagesize from x$ksmssinfo; 

AREA NAME                        START ADDR       TO_CHAR((TO_NUMBE   PAGESIZE
-------------------------------- ---------------- ----------------- ----------
Variable Size                    0000000061000000          D8000000    2097152
Redo Buffers                     00000000D8000000          E07C6000    2097152
Fixed Size                       0000000060000000          60A00000    2097152
skgm overhead                    00000000E1000000          E1003000       4096

Each public strand is protected by a redo allocation latch:

SQL> select addr, child#, name, gets, misses, immediate_gets, immediate_misses from v$latch_children where name = 'redo allocation';

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE17F0         20 redo allocation               5          0              0                0
0000000076FE1750         19 redo allocation               5          0              0                0
0000000076FE16B0         18 redo allocation               5          0              0                0
0000000076FE1610         17 redo allocation               5          0              0                0
0000000076FE1570         16 redo allocation               5          0              0                0
0000000076FE14D0         15 redo allocation               5          0              0                0
0000000076FE1430         14 redo allocation               5          0              0                0
0000000076FE1390         13 redo allocation               5          0              0                0
0000000076FE12F0         12 redo allocation               5          0              0                0
0000000076FE1250         11 redo allocation               5          0              0                0
0000000076FE11B0         10 redo allocation               5          0              0                0

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE1110          9 redo allocation               5          0              0                0
0000000076FE1070          8 redo allocation               5          0              0                0
0000000076FE0FD0          7 redo allocation               5          0              0                0
0000000076FE0F30          6 redo allocation               5          0              0                0
0000000076FE0E90          5 redo allocation               5          0              0                0
0000000076FE0DF0          4 redo allocation               5          0              0                0
0000000076FE0D50          3 redo allocation               5          0              0                0
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

Redo allocation latch child# 1 is the latch protecting allocations in public redo buffer 0, and child# 2 is the latch protecting allocations in public redo buffer 1. Two questions you would probably ask yourself:
1: why are there 20 redo allocation latches whilst there are only two public redo buffers?
2: how can you tell child# 1 is bound to public redo buffer 0, and child# 2 is bound to public redo buffer 1?

Answers to these questions:
1: the 18 other redo allocation latches are because of a mechanism called ‘private redo strands’, see https://blog.tanelpoder.com/files/Oracle_Performance_Scalability_10g_11g.pdf, https://fritshoogland.wordpress.com/2016/11/15/redo-a-blogpost/
2: in order to be sure about what that redo allocation latch protects, you need to look at the actual function calls.

For this, I am using a tool from intel, called ‘pin tools’, which is a suite of tools, of which one is called ‘debugtrace’: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

First I am going to look into how a latch is chosen.

Attach a debugtrace to a normal foreground session that executes an insert and a commit, and filter the output to make it more readable using a custom filter: https://gitlab.com/snippets/1686145

Now search for the latch address (in lowercase) 0x76fe0c10 (the address of the redo allocation latch, see above query from v$latch_children):

 | | | | | | | | | | | | | | | > kcrfw_redo_gen_ext(0x3, 0x7ffd52f513e0, ...)
 | | | | | | | | | | | | | | | | > kcobrh(0x7ffd52f513e0, 0x3, ...)
 | | | | | | | | | | | | | | | | < kcobrh+0x00000000014b returns: 0x1b4
 | | | | | | | | | | | | | | | | > kcb_get_redo_version(0x3, 0x7ffd52f513a0, ...)
 | | | | | | | | | | | | | | | | < kcb_get_redo_version+0x0000000000c4 returns: 0x1
 | | | | | | | | | | | | | | | | > kcmglscn8(0x7ffd52f508d0, 0, ...)
 | | | | | | | | | | | | | | | | | > kcscu8(0x600113b8, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | < kcmglscn8+0x000000000054 returns: 0x756239
 | | | | | | | | | | | | | | | | > kscn_to_ub8_impl(0x7ffd52f50838, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | < kscn_to_ub8_impl+0x00000000003e returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)                             <<<<<<<<
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > ksvgcls(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < ksvgcls+0x000000000021 returns: 0
 | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7ffd52f508d8, ...)
 | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab838, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0x7ffd52f50848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kscnbur3_int_impl(0x756239, 0xd8eab824, ...)
 | | | | | | | | | | | | | | | | < kscnbur3_int_impl+0x000000000030 returns: 0x756239
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756231, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kslfre(0x76fe0c10, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0

If we look at the work of Andrey Nikolaev (https://andreynikolaev.files.wordpress.com/2016/03/latches_and_mutexes_in_oracle_12c.pdf), we know that the first argument of kslgetl is the latch address, and the second argument is 0 for no wait mode or 1 for willing to wait mode. What becomes clear is that a foreground process gets the redo allocation latch in nowait mode. Of course this is widely known and documented. The return code of 0x1 means the latch was successfully taken. The latch that is fetched before the redo allocation latch of course is the redo copy latch:

SQL> @flba 76fdebe0
ADDR             NAME
---------------- ------------------------
0000000076FDEBE0 redo copy
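The @flba script itself is not shown in this post; a lookup of a latch by its address can be done with a query roughly along these lines (a minimal sketch, not the actual script; a latch can be a parent or a child latch, hence both views are checked):

select addr, name from v$latch_parent where addr = hextoraw('0000000076FDEBE0')
union all
select addr, name from v$latch_children where addr = hextoraw('0000000076FDEBE0');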

If we look again at the figures of the two redo allocation latches protecting the public redo strands:

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15
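For reference, these figures come straight from v$latch_children; a query like the following produces them (a sketch, column formatting left out):

select addr, child#, name, gets, misses, immediate_gets, immediate_misses
from v$latch_children
where name = 'redo allocation'
order by child#;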

It becomes apparent that child# 1 is used way more than child# 2, which is visible in the column 'immediate_gets'. How does that work? Let's manually take the first redo allocation child latch (child #1), and then issue an insert again with debugtrace on it.

session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1

session 2; debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1

First child #1 is tried, but because that latch is already taken, indicated by return code 0, the session turns to the second child latch.

What will happen when both latches are taken?

session 1:

SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1
SQL> oradebug call kslgetl 0x76FE0CB0 0 0 2442
Function returned 1

This is what it looks like in the debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | | | | | | | | | | | | | | | > kslges(0x76fe0cb0, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | > ksuSdiInProgress(0x2, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | | | | | | | | | | | | | > kskthbwt(0, 0, ...)
 | | | | | | | | | | | | | | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | | | | | | | | | | | | | | > kslwlmod(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | > kstmgetsectick(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | < kstmgetsectick+0x00000000003a returns: 0x5a4bd4b9
 | | | | | | | | | | | | | | | | | | < kslwlmod+0x0000000000cb returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | > kslwtbctx(0x7ffc45d6a988, 0x7a63c588, ...)

What this shows is the foreground session first trying to get one redo allocation child latch in immediate mode, which failed, then the other redo allocation child latch in immediate mode, which failed too, after which it had no other option than to get a redo allocation latch in willing to wait mode. For that, it performed a latch get for the child latch it tried first (0x76fe0cb0) again, now with the second argument set to 1, indicating willing to wait mode. Because the latch was still taken, kslgetl called kslges in order to spin for it (a series of retries of getting the latch); on Linux the spinning happens inside kslges, so it is not visible in a debugtrace. After it spun 20000 times, the process registers a wait (kslwtbctx), and then goes to sleep on a semaphore inside the wait:

 | | | | | | | | | | | | | | | | | | < kslwtbctx+0x00000000050b returns: 0x77c000a4
 | | | | | | | | | | | | | | | | | | > skgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | > kslwait_conv_wait_time(0xffffffff, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | | > sskgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | | > semop@plt(0x98000, 0x7ffc45d6a888, ...)

This is the modern latch implementation where a session goes to sleep on a semaphore after spinning for a latch in willing to wait mode. In order to learn more about that, look at the work of Andrey Nikolaev. The semaphore implementation means that as soon as the latch becomes available, the session is posted and can do its allocation in the public redo strand.

However, this does not tie a child latch to a public strand; it only shows how a session bunny-hops the public redo allocation latches in nowait mode until it hits a blind wall (when all of them are taken), after which it starts waiting in willing to wait mode, waiting to be posted.

In order to see the connection between the redo allocation latches and the actual public strand memory areas, we need to look one level deeper, which means looking at the (main) memory accesses of a process. This can be done using the pin tools' pinatrace tool, and adding annotations to the pinatrace in order to understand what is going on. For this, I created a tool: https://gitlab.com/FritsHoogland/pinatrace_annotate

First, create a pinatrace of a normal insert on a test database. A word of warning: if you think a function call trace generates a lot of detailed data, the pinatrace, alias memory access trace, generates *WAY* more. Even a modest statement like 'select count(*) from dual' generates heaps of data.

Back to the pinatrace of the normal insert; generate the trace, and then annotate it using the pinatrace_annotate tool. This should show the insert taking redo allocation latch child #1. To validate, search for the latch address again:

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007fff5d88efe8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007fff5d88efc8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007fff5d88efe8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

What this shows is the kcrfw_redo_gen_ext function reading address 0x60016570, which is in the fixed SGA and belongs to a 'fixed sga variable' called kcrfsg_ at offset 840, and reading the value 0x76fe0c10, which we already know is the address of redo allocation latch child #1. The latch address is written into address 0x7fff5d88efe8, which is in the process' PGA, for which we get no annotation because there is nothing between the round brackets. Later, the latch address in the PGA is read again.

We know the kcrfsg_ variable/struct from my earlier investigation into how the logwriter works. This is a struct in fixed sga that holds data important to kcrf (kernel cache redo file). My current guess is sg means ‘SGA’.

Only a little further in the trace, we see that the kslgetl function is invoked, and indeed it is trying to get this latch:

kslgetl+69:0x0000000076fe0c1f(Variable Size(pgsz:2048k)|+368970783 shared pool|(child)latch#1:redo allocation+15 shared pool|permanent memor,duration 1,cls perm+16120863 ):R:1:0x37/55()

This is what a successful latch get looks like in the trace:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0/0()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x1a/26()

The work of Andrey Nikolaev describes that when a latch is taken, the Oracle pid of the holder is written at offset 0. Here we see the kslgetl function read a 0 at offset 0, indicating the latch is free, after which the value 26, the Oracle pid of this process, is written into it, meaning this process has now taken the latch.
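As a side note, the Oracle pid found at offset 0 can be mapped to a server process with a simple lookup (a sketch; in v$process, PID is the Oracle pid and SPID the operating system process id):

select pid, spid, pname from v$process where pid = 26;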

The next step is to see if there is any indication of it pointing to the public redo strand. It should be close to the latch get, because a latch should be held for as short a time as possible. Essentially, what I did then is search for 'Redo Buffers'. I found this a little further on in the kcrfw_redo_gen_ext function:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

There's a lot to say about this. The first thing to point out is that these are all accesses to memory locations in a segment in the shared pool, allocated as permanent memory, somewhere around 0x76fdf300-0x76fdf400. The first public strand is explicitly mentioned (by my annotation of the memory structures):

kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )

As well as the second public strand:

kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )

I am confident that the reference to the second public strand here is just a coincidence: the value that is written, 0xdeadbeef, is a well-known 'not used' filler value, and it merely happens to fall inside the address range of PUB_REDO_1.

The second thing to notice is that a lot of information is read and written very close together (0x76fdf300-0x76fdf400). The only location I know that holds this kind of redo related information is the previously mentioned kcrfsg_ struct, but this seems to be another location that holds critical information about redo, of which all we know is that it sits in the shared pool (not in fixed SGA), and that it is allocated as permanent memory.

I tried various things to get more information about this memory location, like dumping the SGA heap, inspecting this memory location in a heap dump to see if there is a sub heap that I can dump for more information (there isn't), and asking on twitter (which sometimes helps), but I got no information. I spent a considerable time searching the bug database, and still got no answers….
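For completeness, this is the kind of SGA heap dump I mean; heapdump level 2 requests the SGA heap, and the dump ends up in the session's trace file:

SQL> oradebug setmypid
SQL> oradebug dump heapdump 2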

I decided to prove usage of the second public strand by manually obtaining the latch for the first strand, so an insert has to take the second latch, and then verifying whether that indeed means the second public redo strand is used. This means taking the first latch using oradebug in one session, then starting another database session, attaching a pinatrace to it, and executing an insert.

Once the pinatrace is annotated, search for 0x76fe0c10 (the first redo allocation latch) again. The start is exactly the same: the latch address is fetched from kcrfsg_ at offset 840.

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007ffcca22e6e8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007ffcca22e6c8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007ffcca22e6e8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

Now look at the same offsets in the function kslgetl (+138) again:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0x27/39()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x27/39()

Now we see the latch already has a value at offset zero, 39 (the Oracle pid of the session that took the latch with oradebug), indicating the latch is taken. Since the session executing the insert found the first redo allocation latch busy, this happens first:

kcrfw_redo_gen_ext+14267:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):R:4:0x1/1()
kcrfw_redo_gen_ext+14273:0x00007ffcca22e6d8():R:4:0x2/2()
kcrfw_redo_gen_ext+14283:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):W:4:0x2/2()

Because the latch for the first public redo strand was taken, an additional strand is enabled by setting the fixed sga variable kcrf_actv_strands_ to 2, indicating the second strand is now active too.

Then kcrfsg_ at offset 840 is read again, but a higher value is saved into PGA memory. In fact, the latch size is known (160 bytes), so in order to get to the next latch, the session picks up the latch address of child #1 and adds 160 to arrive at child #2.

kcrfw_redo_gen_ext+14335:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+14342:0x00007ffcca22e6e8():W:8:0x76fe0cb0/1996360880(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 )
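Indeed, the two redo allocation child latch addresses from v$latch_children are exactly 160 bytes (0xa0) apart, which can be verified with a query along these lines (a sketch):

select child#, addr,
       to_number(rawtohex(addr),'XXXXXXXXXXXXXXXX')
       - lag(to_number(rawtohex(addr),'XXXXXXXXXXXXXXXX')) over (order by child#) bytes_from_previous
from v$latch_children
where name = 'redo allocation'
order by child#;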

This latch is free, and is taken by this session:

kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):R:8:0/0()
kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):W:8:0x1a/26()

Now let’s look at the same function locations with the first redo allocation latch taken, and the second redo allocation latch acquired:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

Here we see reads and writes in the same fashion, including the 0xdeadbeef write, but now we see PUB_REDO_1, alias the second public redo strand. This proves the second redo allocation child latch is really protecting the second public redo strand!

This post is an introduction to pinatrace annotate version 2, which is a tool to annotate the output of the Intel Pin tools ‘pinatrace’ tool.

The pinatrace tool generates a file with every single memory access of a process. Please realise what this means: this is every single read from main memory or write to main memory from the CPU. This allows you to get an understanding what happens within a C function. This means you can determine what information or data is accessed in what function. Needless to say this is a tool for internals investigations and research, not something for normal daily database maintenance and support. Also, the performance of the process that you attached to is severely impacted, and it can only be turned off by stopping the process. Do not use this on a production database, use this at your own risk for research and investigational purposes only.

The general way to use the pin tool is to run something together with pin, for example ‘ls’:

$ ./pin -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -- ls

This generates an output file that looks like this:

#
# Memory Access Trace Generated By Pin
#
0x00007f49de60f173: W 0x00007ffd160c46f8  8     0x7f49de60f178
0x00007f49de60f840: W 0x00007ffd160c46f0  8                  0
0x00007f49de60f844: W 0x00007ffd160c46e8  8                  0
0x00007f49de60f846: W 0x00007ffd160c46e0  8                  0
0x00007f49de60f848: W 0x00007ffd160c46d8  8                  0
0x00007f49de60f84a: W 0x00007ffd160c46d0  8                  0
0x00007f49de60f84f: W 0x00007ffd160c46c8  8                  0
0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe
0x00007f49de60f866: W 0x00007f49de82fbd8  8      0x5f25f53b3dc
0x00007f49de60f877: R 0x00007f49de82ff90  8           0x221e00
0x00007f49de60f881: W 0x00007f49de8309a8  8     0x7f49de82fe00
0x00007f49de60f888: W 0x00007f49de830998  8     0x7f49de60e000

This in itself is not very helpful. Let me explain what you are seeing: the first column is the instruction pointer address, in other words the place in the executable at which the memory transfer was initiated, 'W' and 'R' are self explanatory, the third column is the memory address the read or write is directed to, the fourth column is the size of the memory IO, and the fifth is the value.

Luckily, there is another way to run pin, which is attaching to an already running process. This is what makes the memory access trace usable with an Oracle process. The way this is done is quite simple: first get the Linux process number, then run pin specifying that process number:

$ ps -ef | grep [L]OC
oracle    7250  7249  0 09:42 ?        00:00:00 oracletest (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ./pin -pid 7250 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

There is an important caveat here: because pin inserts itself into the running process, the pin library (source/tools/SimpleExamples/obj-intel64/pinatrace.so in this case) must be available to this process. So if you downloaded, extracted and linked pin tools as root, and then as root try to attach to an oracle process, it will not work because the oracle process can not reach the library. So: extract and link the pin tools as oracle.

The output of the pinatrace (a file called pinatrace.out by default) is in $ORACLE_HOME/dbs when you attach to an Oracle database process.

Back to the pinatrace output: what if we could see the C function that was at the instruction pointer address, the name of the Oracle memory area for the memory address and for the fifth column the name of the Oracle memory area again in case the value is actually a pointer? That is exactly what my pinatrace annotate oracle tool does!

How to use it: first clone the pinatrace_annotate repository:

git clone https://gitlab.com/FritsHoogland/pinatrace_annotate.git

Then set the O_BINARY variable at line 6 of pinatrace_annotate_oracle.sh to point to your oracle binary.

The next thing is to generate the oracle memory area CSV files. I added a script to generate the memory ranges for the PGA too, however you should be aware that PGA contents are very volatile, and it requires an execution in the Oracle PID you point it to before v$process_memory_detail gets populated. Sometimes I skip the PGA detail step because it doesn't add too much value.
To run the scripts for generating the memory area CSV files:

$ sqlplus / as sysdba
...
SQL> @get_memory_ranges
...
	76 14,20,@1	   8645 		    SYS 			   sqlplus@memory-presentation.local (TNS V1-V3)
...
enter pid number to get pga details: 76
old   1: alter session set events 'immediate trace name pga_detail_get level &pid'
new   1: alter session set events 'immediate trace name pga_detail_get level 76'

Session altered.

execute something (eg. select * from dual) in the target process, then press enter

Now head over to the session you want the PGA memory details from and execute something like 'select * from dual', then press enter in the sqlplus session in which you ran get_memory_ranges.sql. This will then run for a while; it generates pipe-delimited files that sqlite can parse.
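If you want to verify that the PGA detail collection actually got populated for the pid you entered, you can check v$process_memory_detail directly (a sketch; 76 is the Oracle pid used above):

select pid, category, name, heap_name, bytes
from v$process_memory_detail
where pid = 76
order by bytes desc;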

After the CSV files have been generated, create the sqlite database for use by the tool:

$ ./pinatrace_annotate_oracle.sh -c
generate symbols from /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle.
create and populate table symbol_ranges with symbols.
create index.
create table memory_ranges.
import memory_ranges.csv
import memory_ranges_xtables.csv
import memory_ranges_pga.csv
create index.

Now the pinatrace_annotate_oracle.sh tool is all set for usage!

Example usage
Let's use an actual example to see where the memory trace provides benefit: latches. Nowadays, latches are not the issue they were in the Oracle 8 days. In those days, the latches protecting the database buffer doubly linked lists, the 'cache buffers chains' latches, were not taken in shared mode. That meant that especially index root blocks could suffer latch contention when there was a lot of concurrency.

Some great work has been done on the internals of latches by Andrey Nikolaev; see his latch internals presentation. His work was done on Solaris with DTrace; I did some of that on Linux (part 1, part 2 and part 3).

Let's replay a willing to wait latch get and see what that looks like in the annotated pinatrace output! Set up the host with the pin tools (link them to get the pinatrace.so library), and clone pinatrace_annotate and configure it to point to the oracle binary you are using.

1. session 1:

$ sqlplus / as sysdba
SQL> select distinct sid from v$mystat;

2. session 2:

$ sqlplus / as sysdba
SQL> @get_memory_ranges

identify session 1, and fill out the pid.
take note of the O/S pid.
it will ask you to execute something in the session for which the pid was entered.

3. session 1:

SQL> select * from dual;

4. session 2:
press enter, and wait for @get_memory_ranges to finish.

SQL> exit
$ ./pinatrace_annotate_oracle.sh -c
$ sqlplus / as sysdba
SQL> select addr from v$latch where name = 'cache table scan latch';

note the latch address, it’s 0x6004A400 in my case.

SQL> oradebug setmypid
SQL> oradebug call kslgetl 0x6004A400 0 0 2442

result: ‘Function returned 1’ indicating a successful latch get.

5. session 1:

SQL> oradebug setmypid

6. session 3:
go to the directory where pin is extracted and linked, which is where the pin executable is located. Execute the following, changing the pid to the O/S pid obtained in step 2.

$ ./pin -pid 8645 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -o pina_latch.txt

now the memory trace is enabled for pid 8645

7. session 1:

SQL> oradebug call kslgetl 0x6004A400 1 0 2442

this will hang; it is waiting for the latch, which is held by session 2.

8. session 2:

SQL> oradebug call kslfre 0x6004A400

this will return ‘Function returned 0’ indicating it freed the latch.

9. session 1:
the oradebug call will now return ‘Function returned 1’ indicating it has gotten the latch.

10. session 3:

$ cp /u01/app/oracle/product/12.2.0.1/dbhome_1/dbs/pina_latch.txt ~/pinatrace_annotate/pina_latch.txt

as indicated, the pinatrace output file will be in $ORACLE_HOME/dbs, and with this I copy the current state of the trace to the pinatrace_annotate directory where my pinatrace annotate tool is cloned.

$ cd ~/pinatrace_annotate
$ ./pinatrace_annotate_oracle.sh pina_latch.txt > pina_latch_annotated.txt

This will run for a while…

Once pinatrace_annotate_oracle.sh is done, open the pina_latch_annotated.txt file. A short guide on how this looks; the first two lines in my file are:

function+offset:memory address(annotation+offset):Read/Write:size:value hex/dec(pointer annotation+offset)
??:0x00007fff224e6498():R:8:0x11060785/285607813()

The first line is the header, which is displayed once, and essentially explains it all. The second line starts with '??'. As indicated by the header, the first field is the function. Any call to a function ('symbol') that is not in the oracle executable, like functions in shared libraries, does not have an address in the executable. In fact, the place at which the shared libraries, and thus their functions, are loaded is randomised by default, which explains why such an address cannot be resolved against the executable.

The second column is the address memory is read from or written into. In the second line, there is nothing in between the parentheses, meaning this address is not in a known memory range.

The 'R' or 'W' is obvious, and the number is the amount of memory read or written.

The fifth field shows the content of the read or write in hexadecimal and in decimal. Here again the parentheses are empty. I look up the value in the known memory areas, and if the value falls in a known memory area, I add that area, including the offset, as annotation. What you should realise is that I don't know whether a value really is a memory address and thus a pointer, or whether it just coincidentally falls within a known memory area. I do think knowing that a value lies in a known memory area is handy and useful, since pointers are heavily used in the C language.

Another example:

nsbasic_brc+548:0x00007f33d3026690(pga|Other, Fixed Uga, Fixed UGA heap+88712 ):R:8:0x7f33d3019600/139860560156160(pga|Other, Fixed Uga, Fixed UGA heap+35320 )

Here the function is a function in the oracle executable, nsbasic_brc, which at offset 548 read 8 bytes from memory address 0x00007f33d3026690. That address was a known memory area at the time 0_get_pga_detail.sql was run (while 'select * from dual' was executed): it is in the PGA, in the Fixed UGA heap at offset 88712. The value it read was 0x7f33d3019600, which is an address in the same chunk of memory at offset 35320, and therefore likely to be a pointer.

Now let's look further into the annotated pinatrace file. If you search for 'kslgetl', you will find the actual latch get executed with oradebug. This is what that looks like on my system:

kslgetl+2:0x00007fff224e4ec0():W:8:0x7fff224e4ff0/140733768945648()
kslgetl+13:0x00007fff224e4ea0():W:8:0x7fff224e5870/140733768947824()
kslgetl+17:0x00007f33d8095f40(pga|Other, free memory, top call heap+243776 ):R:8:0x7f33d8095f40/139860644552512(pga|Other, free memory, top call heap+243776 )
kslgetl+26:0x00007fff224e4e98():W:8:0x1/1()
kslgetl+33:0x00007fff224e4e90():W:8:0/0()
kslgetl+37:0x00007fff224e4ea8():W:8:0x4/4()
kslgetl+44:0x00007f33d8087e78(pga|Other, free memory, top call heap+186232 ):R:8:0x7a63d268/2053362280(shared pool|permanent memor,duration 1,cls perm+5702248 )
kslgetl+58:0x00007fff224e4eb0():W:8:0x7fff224e57d0/140733768947664()
kslgetl+65:0x00007fff224e4e60():W:8:0/0()
kslgetl+69:0x000000006004a40f(fixed sga|(parent)latch:cache table scan latch+15 fixed sga|var:kcb_table_scan_latch_+15 ):R:1:0x37/55()
kslgetl+80:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):R:8:0/0()
kslgetl+96:0x000000006004a410(fixed sga|(parent)latch:cache table scan latch+16 fixed sga|var:kcb_table_scan_latch_+16 ):R:4:0x1/1()
kslgetl+103:0x000000007a63d5f0(shared pool|X$KSUPR.KSLLALOW+0 shared pool|permanent memor,duration 1,cls perm+5703152 ):R:4:0/0()
kslgetl+109:0x0000000011141524():R:4:0x1cffe/118782()
kslgetl+122:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0x6004a400/1610916864(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 )
kslgetl+131:0x00007f33d8087ea8(pga|Other, free memory, top call heap+186280 ):R:8:0x4b/75()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):W:8:0x22/34()
kslgetl+1898:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0/0()
kslgetl+1909:0x00000000600119b4(fixed sga|var:kcbccc_+0 ):R:1:0/0()
kslgetl+1938:0x00007fff224e4e60():R:8:0/0()
kslgetl+1942:0x00007fff224e4e38():W:8:0x10a6084b/279316555()

The two lines at kslgetl+138 show that the function read the beginning of the latch memory and found a number, indicating the Oracle process id that has the latch taken already. In other words, the process found a latch that has already been taken by another process, Oracle pid 34.

If you scroll down further you see kslgetl called kslges, which is what happens for a willing to wait latch get if the latch it wants to take is already taken, and then further down, you will see this pattern:

kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()

This is what the spinning for the latch looks like from a memory access perspective. It means that spinning on a latch does not mean the entire latching function (kslgetl or kslges) is called over and over to try to take the latch; rather, when kslgetl finds the latch taken, it calls kslges, which does some housekeeping, and then in a really tight loop reads the latch memory at offset 0 to detect when it is freed. We can see it is a loop by looking at the offset numbers in the function: 930, 937 and 944, which are the offsets we see over and over.

In fact, if I grep for that line and count the number of lines, we can see this version of Oracle (12.2.0.1) still spins 20'000 times for a latch in willing to wait mode:

grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | wc -l
20001

Yes, this shows 20001 executions; remember we freed the latch in the other session?

$ grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | tail -3
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0/0()

The last line shows kslges finding the latch in a free state, which is after it has been woken from sleeping on a semaphore.

A modern version of Oracle doesn't sleep for a fixed time and then spin again; it goes to sleep on a semaphore after it has spun 20'000 times, and then waits until it gets posted. That can be seen in the pinatrace output too, but if you want to look at the function flow, it might be better to generate a debugtrace: the pinatrace shows what information is used within a function, and might be too verbose. In fact, the function flow towards the semaphore call is kslges > skgpwwait > sskgpwwait.

Conclusion
pinatrace_annotate_oracle.sh is a tool to annotate a tracefile generated by the pin tool ‘pinatrace’ when running the oracle binary. It shows information about any memory access done by a process, which allows you to understand what information is obtained in what function of Oracle. This is a tool only for research and debug purposes.

Recently I was looking in the trace directory in the diag dest of my (12.2.0.1) instance, and found some trace files which were different from the others:

$ ls | grep dia0
test_dia0_9711_base_1.trc
test_dia0_9711_base_1.trm
test_dia0_9711_lws_1.trc
test_dia0_9711_lws_1.trm
test_dia0_9711.trc
test_dia0_9711.trm

The dia0 ‘test_dia0_9711.trc’ file is fully expected. But what are these ‘..lws_1.trc’ and ‘..base_1.trc’ files? And ‘base’ is something that I understand, but what would ‘lws’ mean? Lunatics Without Skateboards?

First, let’s look into the normally named trace file of the dia0 process:

$ cat test_dia0_9711.trc
HM: Early Warning and/or Long Waiting Session (LWS) is enabled.
    The output can be found in trace files of the form:
    test_dia0_9711_lws_n.trc where n is a number between 1 and 5.

*** 2017-11-26T12:18:25.849759+00:00
HM: Moving the default trace file to: test_dia0_9711_base_1.trc


*** 2017-11-26T16:45:33.193729+00:00
HM: Tracing moved from base trace file: test_dia0_9711_base_1.trc

HM: This is a single non-RAC instance.  Hang detection is enabled.
HM: Hang resolution scope is OFF.
HM: Cross Boundary hang detection is enabled.
HM: Cross Cluster hang detection is enabled.
HM: CPU load is not being monitored.
HM: IO load is NOT abnormally high - 0% outliers.

HM: Tracing is now being performed to this file.
2017-11-26 16:45:33.193 :kjzdshn(): Shutting down DIAG

Ah! So LWS means 'Long Waiting Sessions'. This seems to be functionality that produces trace files for executions that do not perform optimally.

Let’s look into the ‘base’ trace file:

*** 2017-11-26T12:18:25.849773+00:00
HM: Tracing moved from default trace file: test_dia0_9711.trc

HM: This is a single non-RAC instance.  Hang detection is enabled.
HM: Hang resolution scope is OFF.
HM: Cross Boundary hang detection is enabled.
HM: Cross Cluster hang detection is enabled.
HM: CPU load is not being monitored.
HM: IO load is NOT abnormally high - 0% outliers.

*** 2017-11-26T12:18:25.849903+00:00
HM: Early Warning and/or Long Waiting Session (LWS) is enabled.
    The output can be found in trace files of the form:
    test_dia0_9711_lws_n.trc where n is a number between 1 and 5.

*** 2017-11-26T12:18:30.870494+00:00
HM: Detected incompatible setting 'INSTANCE' for Hang Resolution Scope
    - disabling hang resolution.

I get the feeling this was originally made for RAC, but is now also active for single instance; some RAC related features seem to be turned off. The actual session information appears to be stored in the 'lws' trace files.

This is a sample of what is in a lws trace file:

*** 2017-11-26T13:14:25.053171+00:00
HM: Early Warning - Session ID 58 serial# 53774 OS PID 30760 (FG) is
     'not in a wait' for 33 seconds
     last waited on 'resmgr:cpu quantum'
     p1: 'location'=0x3, p2: 'consumer group id'=0x4a2f, p3: ' '=0x0

                                                     IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     1      0      0      0      0      0      0      0 not in wait


HM: Dumping Short Stack of pid[71.30760] (sid:58, ser#:53774)
Short stack dump:
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-sslnxadd()+38<-pfrinstr_ADDN()+103<-pfrrun_no_tool()+60<-pfrrun()+919<-plsql_run()+756<-peicnt()+288<-kkxexe()+717<-opiexe()+22531<-kpoal8()+2679<-opiodr()+1229<-ttcpip()+1257<-opitsk()+1940<-opiino()+941<-opiodr()+1229<-opidrv()+1021<-sou2o()+145<-opimai_real()+455<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

HM: Current SQL: declare
        x number := 0;
begin
        for c in 1..1000000000 loop
                x := x + c;
        end loop;
end;

I had a sample PL/SQL loop running to generate load (which is really silly), and this running for 33 seconds apparently generates an 'Early Warning'. It shows what the session was doing from a wait based analysis perspective ('not in a wait', last wait 'resmgr:cpu quantum'), it shows some statistics for the hang analyser, and it dumps a short stack, which is fully expected: ksedsts() to __sighandler() are the routines for generating the stack dump, the actual function that was executing was sslnxadd(), which is a system specific linux add function, which is called by pfrinstr_ADDN(), which is the PL/SQL add function, which is called by pfrrun_no_tool(), which is the PL/SQL fast interpretation loop function, and so on. The last thing it dumps is the current SQL, which is the silly PL/SQL code.

Is there anything inside the Oracle database that exposes information related to this? Yes, it seems there are a few views:

V$HANG_INFO
V$HANG_SESSION_INFO
V$HANG_STATISTICS
DBA_HANG_MANAGER_PARAMETERS

Doing some simple tests, I could not get V$HANG_INFO and V$HANG_SESSION_INFO populated. V$HANG_STATISTICS contains 87 statistics, and displays current data, of which most in my single instance test database are 0.
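To quickly see which hang statistics actually carry a value, a query like this will do (a sketch; I am assuming NAME and VALUE columns here, as in most statistics views, so describe the view first to be sure):

select name, value from v$hang_statistics where value > 0;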

One thing that confused me at first was the abbreviation ‘HM’ for hang monitor, because the very same abbreviation is used in the diagnostic repository for ‘health monitor’.

It seems the hang manager is a quite extensive framework for detecting hangs and hang related issues; I peeked at the documented and undocumented parameters containing "hang", of which there is quite an impressive amount, quite likely belonging to the hang manager framework:

SQL> @parms
Enter value for parameter: hang
Enter value for isset:
Enter value for show_hidden: Y
_global_hang_analysis_interval_secs                10                                                                     TRUE     FALSE      FALSE
_hang_allow_resolution_on_single_nonrac            FALSE                                                                  TRUE     FALSE      FALSE
_hang_analysis_num_call_stacks                     3                                                                      TRUE     FALSE      FALSE
_hang_base_file_count                              5                                                                      TRUE     FALSE      FALSE
_hang_base_file_space_limit                        100000000                                                              TRUE     FALSE      FALSE
_hang_blocked_session_percent_threshold            10                                                                     TRUE     FALSE      FALSE
_hang_bool_spare1                                  TRUE                                                                   TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_bool_spare2                                  TRUE                                                                   TRUE     FALSE      FALSE
_hang_cross_boundary_hang_detection_enabled        TRUE                                                                   TRUE     FALSE      FALSE
_hang_cross_cluster_hang_detection_enabled         TRUE                                                                   TRUE     FALSE      FALSE
_hang_delay_resolution_for_libcache                TRUE                                                                   TRUE     FALSE      FALSE
_hang_detection_enabled                            TRUE                                                                   TRUE     FALSE      FALSE
_hang_enable_processstate                          TRUE                                                                   TRUE     FALSE      FALSE
_hang_hang_analyze_output_hang_chains              TRUE                                                                   TRUE     FALSE      FALSE
_hang_hiload_promoted_ignored_hang_count           2                                                                      TRUE     FALSE      FALSE
_hang_hiprior_session_attribute_list                                                                                      TRUE     FALSE      FALSE
_hang_hung_session_ewarn_percent                   34                                                                     TRUE     FALSE      FALSE
_hang_ignored_hang_count                           1                                                                      TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_ignored_hangs_interval                       300                                                                    TRUE     FALSE      FALSE
_hang_ignore_hngmtrc_count                         1                                                                      TRUE     FALSE      FALSE
_hang_int_spare1                                   0                                                                      TRUE     FALSE      FALSE
_hang_int_spare2                                   0                                                                      TRUE     FALSE      FALSE
_hang_log_io_hung_sessions_to_alert                FALSE                                                                  TRUE     FALSE      FALSE
_hang_log_verified_hangs_to_alert                  FALSE                                                                  TRUE     FALSE      FALSE
_hang_long_wait_time_threshold                     0                                                                      TRUE     FALSE      FALSE
_hang_lws_file_count                               5                                                                      TRUE     FALSE      FALSE
_hang_lws_file_space_limit                         100000000                                                              TRUE     FALSE      FALSE
_hang_max_session_hang_time                        96                                                                     TRUE     FALSE      FALSE
_hang_monitor_archiving_related_hang_interval      300                                                                    TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_msg_checksum_enabled                         TRUE                                                                   TRUE     FALSE      FALSE
_hang_resolution_allow_archiving_issue_termination TRUE                                                                   TRUE     FALSE      FALSE
_hang_resolution_confidence_promotion              FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_global_hang_confidence_promotion  FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_percent_hung_sessions_threshold   300                                                                    TRUE     FALSE      FALSE
_hang_resolution_policy                            HIGH                                                                   TRUE     FALSE      FALSE
_hang_resolution_promote_process_termination       FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_scope                             OFF                                                                    TRUE     FALSE      FALSE
_hang_running_in_lrg                               FALSE                                                                  TRUE     FALSE      FALSE
_hang_short_stacks_output_enabled                  TRUE                                                                   TRUE     FALSE      FALSE
_hang_signature_list_match_output_frequency        10                                                                     TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_statistics_collection_interval               15                                                                     TRUE     FALSE      FALSE
_hang_statistics_collection_ma_alpha               30                                                                     TRUE     FALSE      FALSE
_hang_statistics_high_io_percentage_threshold      25                                                                     TRUE     FALSE      FALSE
_hang_terminate_session_replay_enabled             TRUE                                                                   TRUE     FALSE      FALSE
_hang_trace_interval                               32                                                                     TRUE     FALSE      FALSE
_hang_verification_interval                        46                                                                     TRUE     FALSE      FALSE
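The @parms script itself is not part of this post; a query that produces a similar listing of documented and undocumented parameters is the classic join of x$ksppi and x$ksppcv (a sketch; this requires connecting as SYS):

select i.ksppinm name, v.ksppstvl value, v.ksppstdf isdefault
from x$ksppi i, x$ksppcv v
where i.indx = v.indx
and lower(i.ksppinm) like '%hang%'
order by i.ksppinm;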

For the sake of experimenting with the hang manager I locked a table using 'lock table t2 in exclusive mode' in one session, and performed an insert on the same table in another session, which then blocks on the TM enqueue. A minimal reproduction looks like this (a sketch, assuming a simple test table T2 exists):
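session 1:

SQL> lock table t2 in exclusive mode;

session 2, which will block waiting on 'enq: TM - contention':

SQL> -- t2 is assumed to be a simple single-column test table
SQL> insert into t2 values (1);

After the second session had been blocked for a while, the 'base' trace file showed the following information: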

*** 2017-12-11T06:25:10.364037+00:00
HM: Hung Sessions (local detect) - output local chains
===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (test.test)
                   os id: 7526
              process id: 38, oracle@memory-presentation.local (TNS V1-V3)
              session id: 27
        session serial #: 32599
    }
    is waiting for 'enq: TM - contention' with wait info:
    {
                      p1: 'name|mode'=0x544d0003
                      p2: 'object #'=0x57e6
                      p3: 'table/partition'=0x0
            time in wait: 1 min 35 sec
           timeout after: never
                 wait id: 74
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000123 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 26.493425 sec
                     wait id: 73               p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000025 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000002 sec
                     wait id: 72               p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000005 sec
              3.       event: 'SQL*Net break/reset to client'
                 time waited: 0.000055 sec
                     wait id: 71               p1: 'driver id'=0x62657100
                                               p2: 'break?'=0x0
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (test.test)
                   os id: 7481
              process id: 24, oracle@memory-presentation.local (TNS V1-V3)
              session id: 55
        session serial #: 9654
    }
    which is waiting for 'SQL*Net message from client' with wait info:
    {
                      p1: 'driver id'=0x62657100
                      p2: '#bytes'=0x1
            time in wait: 1 min 39 sec
           timeout after: never
                 wait id: 6714
                blocking: 1 session
            wait history:
              * time between current wait and wait #1: 0.000006 sec
              1.       event: 'SQL*Net message to client'
                 time waited: 0.000004 sec
                     wait id: 6713             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000203 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 17.115734 sec
                     wait id: 6712             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000007 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000004 sec
                     wait id: 6711             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
    }

Chain 1 Signature: 'SQL*Net message from client'<='enq: TM - contention'
Chain 1 Signature Hash: 0x163c4cba
-------------------------------------------------------------------------------

===============================================================================

If you wait a little longer, there is another type of dia0 trace file that appears: '..vfy_1.trc', which is a file containing 'verified hangs'. These appear once for a given hang situation, and look like this:

One or more possible hangs have been detected on your system.
These could be genuine hangs in which no further progress will
be made without intervention, or it may be very slow progress
in the system due to high load.

Previously detected and output hangs are not displayed again.
Instead, the 'Victim Information' section will indicate that
the victim is from an 'Existing Hang' under the 'Existing or'
column.

'Verified Hangs' below indicate one or more hangs that were found
and identify the final blocking session and instance on which
they occurred. Since the current hang resolution state is 'OFF',
there will be no attempt to resolve any of these hangs unless
hang resolution is enabled.  Terminating this session may well
free up the hang. Be aware that if the column named 'Fatal BG'
is marked with 'Y', terminating the process will cause the
instance to crash.

Any hangs with a 'Hang Resolution Action' of 'Unresolvable'
will be ignored. These types of hangs will either be resolved
by another layer in the RDBMS or cannot be resolved as they may
require external intervention.

Deadlocks (also named cycles) are currently NOT resolved even if
hang resolution is enabled.  The 'Hang Type' of DLCK in the
'Verified Hangs' output identifies these hangs.

Below that are the complete hang chains from the time the hang
was detected.

The following information will assist Oracle Support Services
in further analysis of the root cause of the hang.


*** 2017-12-11T07:05:07.989085+00:00
HM: Hang Statistics - only statistics with non-zero values are listed

                     number of hangs detected 1
                        number of local hangs 1
  hangs ignored due to application contention 1
      hangs:total number of impacted sessions 2
    hangs:minimum number of impacted sessions 2
    hangs:maximum number of impacted sessions 2
                      current number of hangs 1
            current number of active sessions 1
              current number of hung sessions 2

*** 2017-12-11T07:05:07.989122+00:00
Verified Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
     1 HANG VICSELTD    1    58     2     2    LOW  LOCAL IGNRD:HngRslnScpOFF
 Hang Ignored Reason: Hang resolution is disabled.

  Inst  Sess   Ser             Proc  Wait    Wait
   Num    ID    Num      OSPID  Name Time(s) Event
  ----- ------ ----- --------- ----- ------- -----
      1     45  5781     27709    FG     137 enq: TM - contention
      1     58 15676     27700    FG     146 SQL*Net message from client

Wait event statistics for event 'SQL*Net message from client'

Self-Resolved Hang Count                        : 0
Self-Resolved Hang Total Hang Time in seconds   : 0
Self-Resolved Hang Minimum Hang Time in seconds : 0
Self-Resolved Hang Maximum Hang Time in seconds : 0
Resolved Hang Count                             : 0
Total Hung Session Count                        : 1
Total Wait Time in micro-seconds                : 59799468
Total Wait Count                                : 67
Moving Average of Wait Time in micro-seconds    : 5220981

Victim Information
                                                         Existing or  Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG       New Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       1      1      58    15676      27700     N           New Hang        1

It appears Oracle adds more and more useful information to its trace files, information that can only be obtained from those trace files. Yet no monitoring seems to actually pick this up, which, to my knowledge, includes Oracle’s own Enterprise Manager.

The hang manager tracing talks about outliers. Upon grepping through the dia0 trace files I found an explanation of what dia0 considers an outlier:

*** 2017-11-26T16:45:33.193479+00:00
HM: Hang Statistics - only statistics with non-zero values are listed

            current number of active sessions 1
  instance health (in terms of hung sessions) 100.00%

Dumping Wait Events with non-zero hung session counts
                       or non-zero outliers wait counts
(NOTE: IO outliers have wait times greater than 32 ms.
       non-IO outliers have wait times greater than 64 s.
       IDLE wait events are not output.)

                                                     IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     0      0      0   1366      2      0      0      1 log file parallel write
     1      0      0   2206 4294966594      1 4294966272      0 resmgr:cpu quantum
     0      0      0      1      1      0      0      1 external table read
     1      0      0      0      0      0      0      0 not in wait


*** 2017-11-26T16:45:33.193686+00:00
TotalCPUTm:4230 ms TotalRunTm:16027340 ms %CPU: 0.03%

Recently I was triggered to look at the ‘automatic big table caching’ feature introduced in Oracle version 12.1.0.2 by Roger Macnicol’s blogposts about Oracle database IO and caching or not caching (https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-1 and https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-2). If you want to read about the feature in general, search for the feature name; you’ll find several blogposts about it.

If you are not familiar with automatic big table caching: it’s a feature that allows you to dedicate a percentage of the buffer cache to scanning big tables. Big tables are tables that are not small :-), which technically means tables bigger (in number of blocks) than the “_small_table_threshold” parameter, which defaults to 2% of the buffer cache or 20 blocks, whichever is bigger. Whether blocks are stored in the big table cache area depends on the segment “temperature”, which essentially is a kind of touch count. There is one other caveat of using this feature, which I currently find weird: once a database is operating in clustered mode (RAC), serial scans can no longer use the automatic big table cache; it is then used exclusively for caching blocks scanned by parallel query workers.
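To get a feel for the feature, here is a minimal SQL sketch of enabling it and looking at the big table cache and the segment temperatures; the value of 40 percent is just an arbitrary example value, not a recommendation:

-- dedicate (for example) 40% of the buffer cache to the big table cache
alter system set db_big_table_cache_percent_target = 40;

-- overall state of the big table cache area
select * from v$bt_scan_cache;

-- the "temperature" (touch count) per tracked segment
select dataobj#, size_in_blks, temperature, policy
from   v$bt_scan_obj_temps;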

The purpose of this blogpost is to investigate the IO implementation of using the big table caching feature. The available general read code paths that I am aware of in Oracle are:
– Single block reads, visible via the wait event ‘db file sequential read’, which always go to the buffer cache.
– Multiple single block reads, visible via the wait ‘db file parallel read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘db file scattered read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the process’ PGA.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the in-memory area.

For full scans of a segment, essentially the choices are to go buffered, or go direct. The question that I asked myself is: what code path is used when the automatic big table caching feature is used?

The reason for asking myself this is that the difference between a cached multiblock read and a direct path read is significant. The significance lies in whether multiple IO requests can be submitted for multiple extents leveraging asynchronous IO, which is possible in the direct path read code path, or whether IO requests can only be submitted for a single extent at a time, which is what happens in the db file scattered read code path. To summarise the essence: can I only have a single IO request outstanding (scattered read), or multiple IO requests (direct path read)?

The answer is very simple to find: enable the big table caching feature, and then read a table that is considered big while the big table cache is empty. A table that is read into the big table area shows the ‘db file scattered read’ wait event. What is interesting is a scan of a table that does not fit in the big table area: when a big table is not read into the big table area, the wait event switches to ‘direct path read’. Or actually I should say: the general rules for choosing between buffered and direct path reads then apply, which potentially means a direct path read.
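A minimal sketch of such a test; the table name T_BIG is a placeholder for a table that exceeds “_small_table_threshold”, and checking v$session_event is simply my way of seeing which wait event shows up:

-- make sure the scan has to perform physical IO
alter system flush buffer_cache;

-- scan the big table
select /*+ full(t) */ count(*) from t_big t;

-- see which multiblock read wait event this session used
select event, total_waits, time_waited_micro
from   v$session_event
where  sid = sys_context('userenv', 'sid')
and    event in ('db file scattered read', 'direct path read');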

I want to emphasise again the importance of the IO code path decision. If you have an Oracle database server that uses modern high performance hardware for its disk subsystem, not choosing the correct multiblock read code path can severely limit throughput.

For example, if your IO subsystem has an average IO time of 1ms, the maximal theoretical throughput for buffered IO is 1/0.001*1MB=1GBps (in reality it will be way lower). Direct path reads start off with 2 IO requests, which obviously means double the amount of buffered IO, and this can, depending on an Oracle internal resource calculation, grow up to 32 IO requests in flight at the same time.
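As a back-of-the-envelope illustration (assuming 1 MB per multiblock IO request and a constant 1 ms per IO, which are simplifications):

-- theoretical throughput ceilings with 1 ms per 1 MB IO request
select 1/0.001 * 1  as buffered_mb_per_sec,    -- one IO outstanding:            ~1000 MB/s
       1/0.001 * 2  as direct_min_mb_per_sec,  -- direct path starts with 2 IOs: ~2000 MB/s
       1/0.001 * 32 as direct_max_mb_per_sec   -- and can grow to 32 IOs:        ~32000 MB/s
from dual;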
