A look into oracle redo, part 7: adaptive log file sync
This is the seventh part of a blog series about oracle redo.
Adaptive log file sync is a feature that probably came with Oracle version 11.2. Probably means I looked at the undocumented parameters of Oracle version 11.1 and do not see any of the ‘_adaptive_log_file_sync*’ parameters. It was actually turned off by default with versions 11.2.0.1 and 11.2.0.2, and was turned on by default since version 11.2.0.3.
The adaptive log file sync feature means the logwriter can make the committing sessions switch from the traditional post/wait mechanism for redo write notification to a mechanism where the committing session picks a time to sleep after which it checks for the redo writing progress itself. The reasons for doing so that I can see are reducing the amount of work the logwriter needs to do, because semop’ing a semaphore can only be done serially, and prevent the logwriter process from the situation of getting scheduled off CPU because the processes it’s activating via semop are getting higher priorities than the logwriter.
An important thing to realise is that the adaptive log file sync mechanism only potentially can change a single communication event in the database, which is when a session needs to wait for its redo to be written to disk by the logwriter after a commit. Nothing else is changed and all other post/wait inter-process communication is done exactly the same.
Is this feature a good idea? In part 6 of my Oracle redo series I have shown that a process is checking the on-disk SCN even if the log file sync method is post/wait, but it needs to be posted before it will continue. I have seen cases where huge numbers of foreground processes were doing lots of commits per second, especially in these cases I can see how the logwriter can greatly benefit from not having to do a lot of semop calls after writing, and thus not spend its time on what the logwriter actually should be doing, which is writing. I have not seen any great benefits in databases where there’s a low redo writing and commit rate.
But how does this feature work? In order to investigate that I first executed a pinatrace on a sysdba session and executed:
SQL> alter system set "_use_adaptive_log_file_sync" = false scope = memory;
Then used my pinatrace annotate script to annotate the pinatrace output, and simply grepped for ‘kcrf’; I highly suspect the adaptive log file sync feature to be a ‘kernel cache redo file’ (kcrf) based feature:
$ grep kcrf pinatrace_17029_ann.txt 11452:kcrf_sync_adaptive_set+2:0x00007ffef43aecd0():W:8:0x7ffef43af1b0/140732995924400() 11453:kcrf_sync_adaptive_set+10:0x00007ffef43aecb0():W:8:0x7f703fa2ef40/140120080707392() 11454:kcrf_sync_adaptive_set+17:0x00007ffef43aecb8():W:8:0x115b3da0/291192224() 11455:kcrf_sync_adaptive_set+37:0x00007ffef43aecc0():W:8:0x7ffef43aefe8/140732995923944() 11456:kcrf_sync_adaptive_set+44:0x00007ffef43aecc8():W:8:0x7f703f9ec9a0/140120080435616() 11457:kcrf_sync_adaptive_set+48:0x00007ffef43aeca8():W:8:0xcf7c7f5/217565173() 11506:kcrf_sync_adaptive_set+81:0x00007ffef43aeca8():W:8:0xcf7c816/217565206() 11555:kcrf_sync_adaptive_set+100:0x0000000060016a34(Fixed Size(pgsz:2048k)|+92724 fixed sga|var:kcrf_sync_sleep_usecs_+0 shared pool|(indx:0)X$KSXRSG+49636 ):R:4:0/0() 11556:kcrf_sync_adaptive_set+161:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):W:4:0/0() 11557:kcrf_sync_adaptive_set+202:0x0000000060016a40(Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 ):W:4:0/0() 11558:kcrf_sync_adaptive_set+352:0x00007ffef43aecc8():R:8:0x7f703f9ec9a0/140120080435616() 11559:kcrf_sync_adaptive_set+356:0x00007ffef43aecc0():R:8:0x7ffef43aefe8/140732995923944() 11560:kcrf_sync_adaptive_set+360:0x00007ffef43aecb8():R:8:0x115b3da0/291192224() 11561:kcrf_sync_adaptive_set+364:0x00007ffef43aecb0():R:8:0x7f703fa2ef40/140120080707392() 11562:kcrf_sync_adaptive_set+371:0x00007ffef43aecd0():R:8:0x7ffef43af1b0/140732995924400() 11563:kcrf_sync_adaptive_set+372:0x00007ffef43aecd8():R:8:0x1034e7e/16993918() 77244:kzam_check_limit+342:0x0000000060016398(Fixed Size(pgsz:2048k)|+91032 fixed sga|var:kcrfsg_+368 shared pool|(indx:0)X$KSXRSG+47944 ):R:4:0xd3/211()
If the feature is set/changed (the database was in adaptive log file sync mode, because “_use_adaptive_log_file_sync” was not set), it must write in a shared memory location for the other processes to be able to see the setting. That means two things: 1) any non-shared memory location can not be relevant to the setting, so we can exclude these 2) in order to change something, there needs to be a WRITE to set it.
To make the investigation easier to do, I added an option to the annotate script to write the pinatrace_annotate output to a sqlite3 database. In order to do this, use the ‘-d’ (database) switch to the pinatrace_annotate.sh script.
First we need to know the max SGA address of the instance, this can be found in the memory_ranges.db sqlite database, which is created when the memory details are prepared using ‘./pinatrace_annotate.sh -c’:
$ sqlite3 memory_ranges.db SQLite version 3.21.0 2017-10-24 18:55:49 Enter ".help" for usage hints. sqlite> select printf('%x',max(end_address)) from memory_ranges; e0000000
(please mind you need an sqlite3 version higher than 3.7 (which comes with Oracle Linux) to use printf. I downloaded the latest version from https://www.sqlite.org/download.html)
Okay so anything higher than 0xe0000000 is not in the SGA. Now use the database that was created using the ‘-d’ switch when I annotated the pinatrace output of ‘alter system set “_use_adaptive_log_file_sync” = false scope = memory’:
$ sqlite3 pinatrace_17029.txt.db SQLite version 3.21.0 2017-10-24 18:55:49 Enter ".help" for usage hints. sqlite> select line, function_offset, printf('%x',memory_address), memory_annotation, read_write, size, value, value_annotation from annotation where memory_annotation like '%kcrf%' and memory_address <= printf('%d',0xe0000000) and read_write = 'W'; 11556|kcrf_sync_adaptive_set+161|60016a38|Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 |W|4|0|?? 11557|kcrf_sync_adaptive_set+202|60016a40|Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 |W|4|0|??
This means that when I execute the alter system command, something is written into the fixed SGA, into a variable called ‘kcrf_alfs_info_’. I am pretty sure ‘alfs’ means adaptive log file sync here (and not many occasions of Gordon Shumway). There is a zero written in a 4 byte memory location at kcrf_alfs_info_+0, and there’s a zero written at kcrf_alfs_info_+8.
Now let’s repeat this, but with the pinatrace output when executing ‘alter system set “_use_adaptive_log_file_sync” = true scope = memory’:
$ sqlite3 pinatrace_8204.txt.db SQLite version 3.21.0 2017-10-24 18:55:49 Enter ".help" for usage hints. sqlite> select line, function_offset, printf('%x',memory_address), memory_annotation, read_write, size, value, value_annotation from annotation where memory_annotation like '%kcrf%' and memory_address <= printf('%d',0xe0000000) and read_write = 'W'; 82476|kcrf_sync_adaptive_set+161|60016a38|Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 |W|4|1|?? 82477|kcrf_sync_adaptive_set+202|60016a40|Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 |W|4|0|??
Here the same thing is happening as previously, when it was set to false, with the difference that at kcrf_alfs_info_+0, there is a one written.
Now let’s repeat this again, but with the pinatrace output when executing ‘”_use_adaptive_log_file_sync” = polling_only scope = memory’. This the third value that is allowed for “_use_adaptive_log_file_sync”. In case you are wondering how I knew about this being an allowed parameter value, there’s a script in Tanel Poder’s script bundle that can list them, called pvalid.sql:
SQL> @pvalid adaptive_log_file_sync Display valid values for multioption parameters matching "adaptive_log_file_sync"... old 11: LOWER(NAME_KSPVLD_VALUES) LIKE LOWER('%&1%') new 11: LOWER(NAME_KSPVLD_VALUES) LIKE LOWER('%adaptive_log_file_sync%') PAR# PARAMETER ORD VALUE DEFAULT ------ -------------------------------------------------- ---------- ------------------------------ ------- 1840 _use_adaptive_log_file_sync 1 TRUE DEFAULT _use_adaptive_log_file_sync 2 FALSE _use_adaptive_log_file_sync 3 POLLING_ONLY
Here is the memory trace of setting “_use_adaptive_log_file_sync” to polling only:
$ sqlite3 pinatrace_25469.txt.db SQLite version 3.21.0 2017-10-24 18:55:49 Enter ".help" for usage hints. sqlite> select line, function_offset, printf('%x',memory_address), memory_annotation, read_write, size, value, value_annotation from annotation where memory_annotation like '%kcrf%' and memory_address <= printf('%d',0xe0000000) and read_write = 'W'; 94687|kcrf_sync_adaptive_set+161|60016a38|Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 |W|4|1|?? 94688|kcrf_sync_adaptive_set+175|60016a40|Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 |W|4|1|?? 94689|kcrf_sync_adaptive_set+185|60016a78|Fixed Size(pgsz:2048k)|+92792 fixed sga|var:kcrf_alfs_info_+64 shared pool|(indx:0)X$KSXRSG+49704 |W|4|0|?? 94690|kcrf_sync_adaptive_set+191|60016a7c|Fixed Size(pgsz:2048k)|+92796 fixed sga|var:kcrf_alfs_info_+68 shared pool|(indx:0)X$KSXRSG+49708 |W|4|0|??
That’s interesting! Now the values set at offsets 0 and 8 are both one, and there are two additional offsets written to, 64 and 68, to which a zero is written.
It is my interpretation that zero at kcrf_alfs_info_+0 means the adaptive log file sync feature is turned off. In a database starting from version 11.2.0.3 or higher, this can only be the case if the parameter “_use_adaptive_log_file_sync” has been explicitly set to false.
This is a foreground session in a database that where I have set “_use_adaptive_log_file_sync” to false, and performed an insert and commit:
$ grep kcrf_alfs_info_\+[08] insert_ann.txt 72891:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0/0() 81847:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0/0() 81929:kcrf_commit_force_int+602:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0/0() 81998:kcrf_commit_force_int+1503:0x0000000060016a90(Fixed Size(pgsz:2048k)|+92816 fixed sga|var:kcrf_alfs_info_+88 shared pool|(indx:0)X$KSXRSG+49728 ):R:4:0x3e8/1000()
And this is a foreground session in a database where I removed the parameter “_use_adaptive_log_file_sync”, which means the parameter defaults to true, and executed an insert and commit:
$ grep kcrf_alfs_info_\+[08] pinatrace_2444_ann.txt 71189:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0x1/1() 80148:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0x1/1() 80230:kcrf_commit_force_int+602:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0x1/1() 80231:kcrf_commit_force_int+615:0x0000000060016a40(Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 ):R:4:0/0() 80254:kcrf_commit_force_int+844:0x0000000060016a90(Fixed Size(pgsz:2048k)|+92816 fixed sga|var:kcrf_alfs_info_+88 shared pool|(indx:0)X$KSXRSG+49728 ):R:4:0x550/1360()
As you can see at row 80230 of the memory trace, the kcrf_commit_force_int function reads kcrf_alfs_info_+0 and finds a one, indicating the adaptive log file sync feature is turned on, and as a result needs to find out if it’s set to post/wait or polling, which is done by looking at kcrf_alfs_info_+8 (row 80231), and finds a zero. The kcrf_commit_force_int execution here is the second stage of the invocation of kcrf_commit_force_int, which is with 1 as second argument, which is when the function performs a wait for the logwriter to write its redo. One way of detecting this usage of kcrf_commit_force_int is by looking at the functions lead to this invocation of kcrf_commit_force_int (ksupop, kcbdsy), because in the pinatrace the function arguments are not visible.
The zero for kcrf_alfs_info_+8 probably means that the adaptive feature still uses post/wait to wait for the logwriter, which is reasonably easy to verify by looking if we see the functions that are associated with the post/wait interface, of which the a few obvious ones are kslawe for setting up the post/wait entry, and another one is a kslgetl/kslfre combination that obtains and frees the ‘post/wait queue’ latch:
79629:kslawe+381:0x0000000060006a60(fixed sga|var:kslpwqs_+0 ):R:8:0x6dd76db0/1842834864(Variable Size(pgsz:2048k)|+219639216 shared pool|(child)latch#1:post/wait queue+0 shared pool|permanent memor,duration 1,cls perm+1322416 ) 79630:kslawe+388:0x00007ffcb79aecf0():W:8:0x6dd851b0/1842893232(shared pool|(child)latch#305:post/wait queue+0 shared pool|permanent memor,duration 1,cls perm+1380784 ) 79631:kslawe+392:0x00007ffcb79aecd8():W:8:0x10a69edd/279355101() 79632:kslgetl+2:0x00007ffcb79aecd0():W:8:0x7ffcb79aed20/140723388869920() 79633:kslgetl+13:0x00007ffcb79aecb0():W:8:0x132/306() 79634:kslgetl+17:0x00007f303824ff40(??):R:8:0x7f303824ff40/139845077106496(??) 79635:kslgetl+26:0x00007ffcb79aeca8():W:8:0x21/33() 79636:kslgetl+33:0x00007ffcb79aeca0():W:8:0x130/304() 79637:kslgetl+37:0x00007ffcb79aecb8():W:8:0x130/304() 79638:kslgetl+44:0x00007f3038241e78(??):R:8:0x6dde6f10/1843293968(shared pool|(indx:71)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+1781520 ) 79639:kslgetl+58:0x00007ffcb79aecc0():W:8:0x131/305() 79640:kslgetl+65:0x00007ffcb79aec70():W:8:0x130/304() 79641:kslgetl+69:0x000000006dd851bf(shared pool|(child)latch#305:post/wait queue+15 shared pool|permanent memor,duration 1,cls perm+1380799 ):R:1:0x37/55() 79642:kslgetl+80:0x000000006dde7268(shared pool|(indx:71)X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+1782376 ):R:8:0/0()
Above we see the a snippet of part of the kslawe function reading the start address of the child latches from the fixed SGA variable kslpwqs_, then kslgetl is called for post/wait queue child latch number 305.
From the previous article we know that a committing session checks the on disk SCN in the kcrfsg_ struct in order to understand if the redo that it put into the public redo strand has been written or not. If the logwriter did not increase the on disk SCN beyond or equal to the commit SCN yet, the committing session loops executing the following functions:
– kslawe: registering itself in the post-wait queue
– kcscu8: check on disk SCN
– kslwaitctx: main waiting routine, semtimedop: wait to be posted for 100ms
– (if not posted) ksliwat_cleanup: remove post-wait entry
– (if not posted) kcscu8: check on disk SCN
Of course when the instance is in adaptive log file sync mode (which any normal database should be nowadays!), it chooses between post-wait and polling, and when post-wait is chosen, the above function sequence is valid, this is independent of being dedicated in post-wait mode or being in post-wait mode because the adaptive log file sync mechanism chose it.
How does this look like when the database in polling mode?
– kcscu8: check on disk SCN
– sltrusleep, nanosleep: wait for a certain amount of time
Yet, it is that simple: all it does is verify the on disk SCN and if it has not progressed equal to or beyond the session’s commit SCN, it sleeps in nanosleep.
The time sleeping in nanosleep as far as I seen, has changed: prior to Oracle version 12.2, the time was calculated dynamically, probably directly or indirectly from the IO time of the logwriter. As far as I can see, in version 12.2 the sleeping time changed to sleeping 1000ns. This sleeping time was the topic of my previous article because a 1000ns/1us sleep seems to be lesser than linux can sleep.
Pingback: A look into oracle redo: index and overview | Frits Hoogland Weblog