A look into oracle redo, part 8: generate redo

This blogpost looks at the very start of oracle redo: the generation of it. In order to do that, I start off with a very simple table, and look at the redo generation part. I guess the regular readers of this blogpost series understand that redo generation is closely connected with making changes made to blocks. This post therefore is not only about redo generation, but also about how the technical implementation of block changes.

I created a simple table (create table test (f1 varchar2(10)) with no index to make the execution as simple as possible, and simply insert rows (insert into test values (‘a’)). It could be argued that not having an index makes it not the most real life scenario, and this might be right. However, the goal here is to make the execution as simple as possible.

I then looked at the execution of the SQL, and created an overview of the relevant functions that are executed in my session:

insexe
  qerltcStart
  qerltcFetch
    qerltcSingleRowLoad
      qerltcSimpleInsRowCBK
        kdtSimpleInsRow
          kdtgrs
          kdtwrp
            kdtchg
              ktbchg2
                ktuchg2
                  ktcbgn
                  ktubnd
                    ktuGetTxForXcb
                      ksqgtlctx
                  ktugur
                  kcbchg1
                    kcbchg1_main
                      ktiimu_chg
                      kcb_block_status
                      ktiimu_chg					
                      ktiimu_chg
                      kcopcv
                      kcopcv
                      kcopcv
                      kcrfw_redo_gen_ext
                        kcrfw_copy_cv
                        kcscur_rba
                        kcbapl
                          kcbhlchk
                          kcoapl
                            kco_issue_callback
                              kturdh
                            kco_blkchk
                            kcoadv_hdr
                            kcbhfix_tail
                        kcbapl
                          kcbhlchk
                          kcoapl
                            kco_issue_callback
                              kturdb
                            kco_blkchk
                            kcoadv_hdr
                            kcbhfix_tail
                        kcbapl
                          kcbhlchk
                          kcoapl
                            kco_issue_callback
                              kdoirp
                            kcoadv_hdr
                            kcbhfix_tail	
                        kcrfw_partial_checksum
    qerltcPostInsertProcessing
    qerltcSetupOerdef
    qerltcFreeMemory

How to read this: I started looking at the functions that are executed when an insert statement is executed at the ‘insexe’ function, which is the main insert function. An indention means it is called from the previous less indented function.

The insertion code calls query execute rowsource load table conventional functions (qerltc). Conventional here means that it’s not a direct insertion. The qerltcStart function essentially initialises memory which is needed to perform the execution of the insert statement.

After the rowsource layer, the data layer is entered using the kdtSimpleInsRow. The first thing that needs to be done here is to find space in a block in which the row can be inserted. Finding a block is done in the kdtgrs function (kernel data table get space for rows). Once a block is found, the code inserts the row using the kdtwrp function (kernel data table write row piece). Via the kdtchg function (kernel data table change) function, the transaction layer is entered.

The transaction layer function that are then used are ktbchg2 (kernel transaction block (header) change) and ktuchg2 (kernel transaction undo called from write logs and perform changes), after which the transaction is initialised in ktcbgn (kernel transaction control begin) next is ktubnd (kernel transaction undo bind undo segment. Inside ktubnd not only the undo segment is selected (‘bound’), but the transaction enqueue lock is set too (ksqgtlctx), which locks the row logically. Further down below you will see Oracle also needs to provide a physical way of locking access when it makes changes to prevent from wrong data being read.

Then we enter the cache layer via the kcbchg1 function (kernel cache buffers change). The cache layer is the actual layer that executes functions to change the block. The actual main function which performs block layer functions is kcbchg1_main.

In order for Oracle’s recovery mechanism to work, the redo must be generated before the block change, so a block can be recovered using the redo. Therefore, kcrfw_redo_gen_ext (kernel cache redo file write/broadcast SCN main redo generation function (12c)) is called. Inside kcrfw_redo_gen_ext, the redo change vectors are copied into the public redo strand using the kcrfw_copy_cv function (kernel cache redo file write/broadcast SCN copy change vectors), after which the changes to the block in the cache are done using the kcbapl (kernel cache buffers apply).

There are three kcbapl executions; the first one applies the change to the undo header where the transaction is assigned a slot which points to the actual undo data inside the undo segment, visible inside the kco_issue_callback function kturdh (kernel transaction undo write undo header data to undo block), the second one applies the change to the undo block with the previous version of the data, so a change be undone and read consistency can be provided. This is also visible inside the kco_issue_callback function: kturdb (kernel transaction undo write undo data to undo block), and the second invocation of kcbapl has the function kdoirp (kernel data operations insert row piece), which performs the actual change of the block to insert the row.

If you look closely at the functions inside kcbapl, you see that a check is done to the buffer prior the the change (kcbhlchk, kernel cache buffers header logical check), and after the change the block metadata is changed to reflect the change (kcoadv_hdr, kernel cache operations advance SCN in buffer header and kcbhfix_tail, kernel cache buffers header update tail).

After the change, a checksum is applied to the generated redo using kcrfw_partial_checksum.

This concludes the insert. A row lock is obtained, the redo change vectors have been copied into the public redo strand, and the changes have been applied to the undo and table blocks.

The functions return, and execution is back at the query execute rowsource layer (qer), which performs some post processing using qerltcPostInsertProcessing, qerltcSetupOerdef and qerltcFreeMemory.

A question that I asked myself is where the actual redo change vectors are created. It’s known that kcrfw_copy_cv copies the change vectors into the public redo strand, but these need to created before you can copy them. Are these created on the spot in kcrfw_copy_cv? Probably not. Of course this dives so deep into the actual working of the Oracle engine that Oracle does not reveal anything about it (that I am aware of).

Luckily, using the Intel Pin tools trace we can see memory access, and therefore we can track the memory area’s that a process is using, and see where (in which function) memory is written to before it is written into the public redo strand. First let’s look the kcrfw_copy_cv function:

40626:kcrfw_copy_cv+544:0x00007ffcd200f778():W:8:0x10e3d3e5/283366373()
40627:_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
40628:__intel_ssse3_rep_memcpy+30:0x00000000137d0c90():R:4:0xf2afbf82/4071604098(??)
40629:__intel_ssse3_rep_memcpy+8722:0x00007ffcd2011678():R:16:0x5223030ffff10100542b760/-263813982931104()
40630:__intel_ssse3_rep_memcpy+8727:0x00000000d80cbe54(Redo Buffers(pgsz:2048k)|+835156 redo|PUB_REDO_0+835156 ):W:16:0x5223030ffff10100542b760/-263813982931104()
40631:__intel_ssse3_rep_memcpy+8732:0x00007ffcd2011688():R:16:0x000010ffff00000000/4785070309113856()
40632:__intel_ssse3_rep_memcpy+8737:0x00007ffcd201168c():R:16:0x10ffff0000000040200/-4503599627107840()
40633:__intel_ssse3_rep_memcpy+8742:0x00000000d80cbe64(Redo Buffers(pgsz:2048k)|+835172 redo|PUB_REDO_0+835172 ):W:16:0x000010ffff00000000/4785070309113856()
40634:__intel_ssse3_rep_memcpy+8747:0x00000000d80cbe68(Redo Buffers(pgsz:2048k)|+835176 redo|PUB_REDO_0+835176 ):W:16:0x10ffff0000000040200/-4503599627107840()
40635:__intel_ssse3_rep_memcpy+8752:0x00007ffcd200f778():R:8:0x10e3d3e5/283366373()
40636:kcrfw_copy_cv+569:0x00007ffcd200f890():R:4:0x2/2()

What this shows, is the kcrfw_copy_cv function calling _intel_fast_memcpy. The _intel_fast_memcpy is using a specialised memcpy function, __intel_ssse3_rep_memcpy, which uses a processor extension called ‘ssse3’. At line numbers 40629, 40631 and 40632 16 bytes are read and at 40630, 40633 and 40634 the reads are written into the public redo strand.

In order to understand where the data in these 16 bytes memory addresses has been written, I use an extension to my pinatrace annotate oracle script that creates a sqlite database of the annotation results using the ‘-d’ switch.

In order to see where the contents of the memory address 0x00007ffcd2011678 have been written, fire up sqlite3 with the created database:

$ sqlite3 insert.txt.db
sqlite> select * from annotation where memory_address between printf('%d',0x00007ffcd2011678) and printf('%d',0x00007ffcd2011678)+15 and line < 40629;
14439|kssadd+322|140723831772792|??|W|8|279514983|??
14444|kpdbIdToUid+231|140723831772792|??|R|8|279514983|??
16562|kglget+33|140723831772792|??|W|8|0|??
16563|kglget+37|140723831772800|??|W|8|0|??
17380|kglget+290|140723831772792|??|R|8|0|??
17381|kglget+294|140723831772800|??|R|8|0|??
17393|kglpin+13|140723831772792|??|W|8|1810933992|Variable Size(pgsz:2048k)|+183544040 shared pool|KGLH0^d0901451 DS:free memory,duration 1,cls free+96 shared pool|KGLH0^d0901451,duration 1,cls recr+2792
17406|kglpin+77|140723831772804|??|W|4|1|??
17621|kglpin+857|140723831772800|??|W|4|1|??
17720|kglpin+1115|140723831772800|??|R|4|1|??
17726|kglpin+1164|140723831772804|??|R|4|1|??
18697|kglpin+1993|140723831772792|??|R|8|1810933992|Variable Size(pgsz:2048k)|+183544040 shared pool|KGLH0^d0901451 DS:free memory,duration 1,cls free+96 shared pool|KGLH0^d0901451,duration 1,cls recr+2792
18755|kzpchkbu_internal+133|140723831772792|??|W|4|12|??
18763|kzpchkbu_internal+187|140723831772800|??|W|1|2|??
18772|kzpchkbu_internal+230|140723831772801|??|W|1|0|??
18778|kzpchkbu_internal+285|140723831772798|??|W|2|0|??
18818|kzpchkbu_internal+567|140723831772792|??|R|4|12|??
18867|kzpchkbu_internal+1241|140723831772796|??|W|1|0|??
19084|kzpchkbu_internal+1496|140723831772796|??|R|1|0|??
19088|kzpchkbu_internal+1524|140723831772792|??|R|4|12|??
19090|kzpchkbu_internal+1528|140723831772801|??|R|1|0|??
19218|kzpchkbu_internal+1747|140723831772801|??|R|1|0|??
20936|kglati+52|140723831772792|??|W|8|139724814752480|??
21062|kglati+190|140723831772800|??|W|8|80|??
22201|kglati+407|140723831772792|??|R|8|139724814752480|??
22203|kglati+415|140723831772800|??|R|8|80|??
30665|__intel_memset+2691|140723831772800|??|W|16|0|??
31116|__intel_memset+2691|140723831772800|??|W|16|0|??
38371|ktuchg2+9341|140723831772792|??|W|1|5|??
38372|ktuchg2+9345|140723831772793|??|W|1|2|??
39504|kcopcv+207|140723831772794|??|W|2|35|??
39507|kcopcv+221|140723831772800|??|W|4|272|??
39511|kcopcv+249|140723831772796|??|W|4|4294901763|??
39517|ub8_to_kscn_impl+21|140723831772804|??|W|4|7744340|??
40116|kcrfw_redo_gen_ext+1260|140723831772796|??|R|2|3|??
sqlite>

The query is simple: look up all records which were generated before line 40629, which did access the memory region that is read for copying into the public redo strand.

This is the complete output, please be aware that the high memory addresses are PGA memory addresses. PGA memory is highly fluent in nature. If you look in memory access traces you see that nearly every layer that is used during execution of SQL and PL/SQL requires allocation of memory area’s to store information. In fact, it’s so fluent that some memory area’s can be reused during repetitions in execution within the same statement.

The “__intel_memset’ functions are functions that set a memory area to a certain value, in this case zero. This might be an initialisation of memory. In order to be sure, lookup the lines (30665/31116), and see from what function they were called:

30620:kdtgrs+233:0x00007ffcd2012188():W:8:0/0()
30621:kdtgrs+244:0x000000006c20fb98(Variable Size(pgsz:2048k)|+186710936 shared pool|PLMCD^6daf6103 DS:permanent memor,duration 4,cls perm+2472 shared pool|PLMCD^6daf6103,duration 4,cls freeabl+2512 ):R:1:0x8/8
()
30622:kdtgrs+250:0x000000006c20fba4(Variable Size(pgsz:2048k)|+186710948 shared pool|PLMCD^6daf6103 DS:permanent memor,duration 4,cls perm+2484 shared pool|PLMCD^6daf6103,duration 4,cls freeabl+2524 ):R:1:0x80/
128()
30623:kdtgrs+273:0x00007ffcd20110e8():W:8:0x10e0cc66/283167846()
30624:_intel_fast_memset+7:0x00000000157f62a0():R:8:0x137fff/1277951()
30625:__intel_memset+1072:0x00000000157f6724():R:4:0x2/2()
30626:__intel_memset+1128:0x00000000062c7240():R:8:0x92/146()
30627:__intel_memset+2633:0x00000000157b9480():R:4:0x600000/6291456()
30628:__intel_memset+2660:0x00007ffcd2011430():W:16:0x0000000000000000/0()
30629:__intel_memset+2664:0x00007ffcd2011440():W:16:0x0000000000000000/0()
30630:__intel_memset+2669:0x00007ffcd2011450():W:16:0x0000000000000000/0()
30631:__intel_memset+2674:0x00007ffcd2011460():W:16:0x0000000000000000/0()
30632:__intel_memset+2686:0x00007ffcd2011470():W:16:0x0000000000000000/0()
30633:__intel_memset+2691:0x00007ffcd2011480():W:16:0x0000000000000000/0()

It’s called from kdtgrs. And by looking at the long list of memset commands, this clearly is initialisation of memory. This means that any allocation before the memset is something completely different. This means these functions did provide the redo information:

38371|ktuchg2+9341|140723831772792|??|W|1|5|??
38372|ktuchg2+9345|140723831772793|??|W|1|2|??
39504|kcopcv+207|140723831772794|??|W|2|35|??
39507|kcopcv+221|140723831772800|??|W|4|272|??
39511|kcopcv+249|140723831772796|??|W|4|4294901763|??
39517|ub8_to_kscn_impl+21|140723831772804|??|W|4|7744340|??

The ktuchg2 function hints at undo, and the kcopcv function is a function that seems to examine the block and create redo information based on the block. The ub8_to_kscn_impl is a function for the conversion of SCN numbers.

Let’s take the next memory region, and see what functions manipulated region:

38375|ktuchg2+9375|140723831772816|??|W|2|0|??
38376|ktuchg2+9380|140723831772818|??|W|2|0|??
38377|ktuchg2+9385|140723831772820|??|W|2|0|??
38378|ktuchg2+9390|140723831772822|??|W|2|0|??
39506|kcopcv+216|140723831772816|??|W|2|0|??
39510|kcopcv+237|140723831772814|??|W|2|65535|??
39518|ub8_to_kscn_impl+37|140723831772810|??|W|2|0|??
39519|ub8_to_kscn_impl+41|140723831772808|??|W|2|0|??
39523|kcopcv+326|140723831772812|??|W|1|1|??
39526|kcopcv+403|140723831772813|??|W|1|0|??

A little more of the same, ktuchg2/kcopcv/ub8_to_kscn_imp, although in another order.
And the third memory region:

38375|ktuchg2+9375|140723831772816|7ffcd2011690|??|W|2|0|0|??
38376|ktuchg2+9380|140723831772818|7ffcd2011692|??|W|2|0|0|??
38377|ktuchg2+9385|140723831772820|7ffcd2011694|??|W|2|0|0|??
38378|ktuchg2+9390|140723831772822|7ffcd2011696|??|W|2|0|0|??
38383|ktuchg2+9426|140723831772826|7ffcd201169a|??|W|2|32|20|??
39506|kcopcv+216|140723831772816|7ffcd2011690|??|W|2|0|0|??
39510|kcopcv+237|140723831772814|7ffcd201168e|??|W|2|65535|ffff|??
39523|kcopcv+326|140723831772812|7ffcd201168c|??|W|1|1|1|??
39526|kcopcv+403|140723831772813|7ffcd201168d|??|W|1|0|0|??
39895|kcobrh+157|140723831772824|7ffcd2011698|??|W|2|4|4|??

More of the same, although another function was added ‘kcobrh’. I am not familiar with this function, although my suspicion is this is a function to generate a header. Of course these 3 regions were copied using the same call.

The next memcpy call copies two 16 byte regions:

40641:kcrfw_copy_cv+544:0x00007ffcd200f778():W:8:0x10e3d3e5/283366373()
40642:_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
40643:__intel_ssse3_rep_memcpy+49:0x00000000137d0a40():R:4:0xf2afc3cc/4071605196(??)
40644:__intel_ssse3_rep_memcpy+9244:0x00007ffcd201174c():R:16:0x120880000000000000/612489549322387456()
40645:__intel_ssse3_rep_memcpy+9249:0x00000000d80cbe88(Redo Buffers(pgsz:2048k)|+835208 redo|PUB_REDO_0+835208 ):W:16:0x120880000000000000/612489549322387456()
40646:__intel_ssse3_rep_memcpy+9254:0x00007ffcd201173c():R:16:0x100069600e810008a1c0/7593084918960792000()
40647:__intel_ssse3_rep_memcpy+9258:0x00000000d80cbe78(Redo Buffers(pgsz:2048k)|+835192 redo|PUB_REDO_0+835192 ):W:16:0x100069600e810008a1c0/7593084918960792000()
40648:__intel_ssse3_rep_memcpy+9262:0x00007ffcd200f778():R:8:0x10e3d3e5/283366373()
40649:kcrfw_copy_cv+569:0x00007ffcd200f890():R:4:0x2/2()

(lines 40644 and 40646)
Which memory is written to by different functions. In fact, these were the functions that I suspected to generate the redo information:

37790|ktuchg2+808|140723831773004|??|W|2|2|??
38266|ktugur+1575|140723831773006|??|W|2|136|??
38270|ktugur+1618|140723831773004|??|R|2|2|??
38271|ktugur+1625|140723831773004|??|W|2|18|??
38272|ktugur+1629|140723831773008|??|W|1|0|??
38279|ktugur+1663|140723831773012|??|W|2|0|??
38281|ktugur+1677|140723831773014|??|W|2|0|??
38283|ktugur+1690|140723831773016|??|W|4|0|??
38291|ktugur+1756|140723831773004|??|R|2|18|??
38292|ktugur+1763|140723831773004|??|W|2|18|??
38297|ktugur+1811|140723831773004|??|W|2|18|??

One write of ktuchg2 and then ktugur (kernel transaction undo generate undo and redo).

I gone through all of the memory area’s that are written to the public redo strand, and found the functions in which redo is generated:
– ktuchg2
– kcopcv
– kcobrh
– ktugur
Other functions that I have found in other regions of memory:
– kdtwrp
– kcosze
– ktugul

There is one other thing I found that I didn’t suspect. During the investigation I used a debugger to stop on certain functions (in fact, that is how I gotten the list of functions at the start of this article) during the insert. When doing that, I tried querying the table in order to see what happened visibly to the table at which I performed the insert while it was happening.

I found that the kcbchg1_main function performs cache buffers chains latch gets without calling kslgetl/ksl_get_shared_latch(_int). This has been documented on a few places, it is using a macro to perform the latch get. However, when stopping (by means of breaking execution using a debugger), I found that my other session simply executing a ‘select’ query hung on the latch.

Also, after the cache buffers chain latch get, the block is getting modified. That’s obvious of course, we are doing an insert. However, during the entire time between kcopcv up to the two kcbapl functions, the block is not accessible for any querying, and keeps any other query waiting in the wait event ‘buffer busy wait’. This is a direct violation of the mantra that writers don’t block readers.

Now, before you jump to conclusions, please bare with me a little more. Is this a problem? No, this is the direct consequence of concurrency control, which, in a shared system, always must happen. Certain things can only be done one at a time. In fact, that is the whole purpose of a latch.

However, I do want to make you think about it. These concurrency control waits simply must happen in order not to disturb changes being made to a block, or read an inconsistent version of a block. And this probably in much the same way has been for a long time this way in Oracle, and it has not a problem. Generally. Where this does become a problem, is when you have your processes randomly sleep for longer periods of time. Because of your inserting process is stalling during executing the kcrfw_redo_gen_ext function, other processes will get a buffer busy wait.

So when does this happen? Why am I saying this? Well, these random waits of processes is something that actually happens when your system is overloaded on CPU, or when your system is actively swapping. Now think about this in a broader sense. Whenever you run virtualised, and you do not have strict policies enforced on CPU and memory allocation, and especially if there is oversubscription of CPU and/or memory, these random waits can be applied by the hypervisor. So, never -ever- let your database server run short on CPU or memory, because the database will start showing a very peculiar waits, which could put you on your wrong foot very easily.

Conclusion
I actually tried putting in three distinct messages in this blogpost. The first one is to show how a change (an insert) is done internally. This shows that Oracle is build up in layers, and a lot of these layers are crossed even when you do the simplest of things, like inserting a single row in a table. Yes, this is not simple or even plain confusing. That’s not something I can change. However, I do think that this information is crucial in order to understand how the database works, and thus how to troubleshoot issues and understand what happens.

When a change is made to a block, first the redo is generated, then the undo is written, and then a row is inserted into the data block.

The second message is research on where the redo change vectors are generated. This is background information that will not help with anything concrete, it’s just background information for improving understanding how Oracle works. The research shows that redo is generated in a couple of functions:
– ktuchg2
– kcopcv
– kcobrh
– ktugur
– kdtwrp
– kcosze
– ktugul

At certain points, Oracle serialises access to blocks in order to make sure changes are not disturbed, and other processes do not read an inconsistent version of a block. This is technical locking, not the logical transaction locking that Oracle provides using enqueues. This is the third piece of information that I put in this article. This is directly usable and concrete information, and this is crucial, especially in today’s world of using virtualisation on premises or in the cloud.

If processes get stuck during these times of serialisation, all kinds of waits become visible, which might look like other problems; my first thought with buffer busy waits is a block being read into the buffer cache from disk or being written from the buffer cache onto disk, not being modified in the cache. In fact, if you start inserting into a freshly truncated table, I found I could get TX lock waits while doing a select. Again: this is not obvious.

So the last paragraph warns to make absolutely sure your database server is not oversubscribed on any resource, especially CPU and memory, because it can show itself in unpredictable ways, making it harder to diagnose, and thus to solve.

2 comments

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.