Transactions and SCNs

It’s general knowledge that the Oracle database is ACID compliant, and that SCNs or ‘system change numbers’ are at the heart of this mechanism. This blog post dives into the details of how the Oracle engine uses these numbers.

Oracle database version 12.1.0.2.161018
Operating system version: OL 7.2, kernel: 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

Redo generation
Whenever DML is executed, redo is generated in the form of ‘change vectors’. These change vectors are copied into the redo buffer as part of the transaction, during the transaction. The function that performs this action is called ‘kcrfw_copy_cv()’. This can be derived by watching the foreground process perform memory copies into the memory area of the redo buffer.

In order to do this, you first need to find the memory area of the redo buffer. This can be done by executing ‘oradebug setmypid’ and ‘oradebug ipc’ as sysdba, and examining the resulting trace file:

 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1  4784130 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008600000 default       readwrite

This shows the redo buffers are allocated between 0x96400000 and 0x9E934000 (actual address and actual address + subarea size, 0x96400000 + 0x8534000).
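
As a quick sanity check on that range, the arithmetic can be redone in a few lines of Python. This is only a sketch: the variable and function names are mine, and the values are copied from the trace excerpt above. The address tested at the end is one of the destination addresses that shows up in the gdb output further down in this post.

```python
# Values copied from the 'oradebug ipc' trace excerpt above.
actual_addr = 0x96400000    # "Actual Addr" of the Redo Buffers subarea
subarea_size = 0x08534000   # "Subarea size" (equal to the area's total size)
segment_size = 0x08600000   # "Segment size" (larger than the subarea)

# First address past the end of the redo buffers.
end_addr = actual_addr + subarea_size

# Adding the segment size instead gives a different (higher) address.
segment_end = actual_addr + segment_size


def in_redo_buffers(addr):
    """Return True if addr lies inside the redo buffer subarea."""
    return actual_addr <= addr < end_addr


print(hex(actual_addr), hex(end_addr), hex(segment_end))
print(in_redo_buffers(0x96817454))
```

The upper bound used in the gdb breakpoint condition in this post is the value of end_addr, not segment_end.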

The function Oracle uses to copy memory is called “_intel_fast_memcpy()”. This probably is an optimised version of the standard C memcpy() function. What matters here are the arguments of the memcpy function:

$ man memcpy
NAME        
       memcpy - copy memory area
SYNOPSIS    
       #include <string.h>
       void *memcpy(void *dest, const void *src, size_t n);

So the first argument of memcpy is a pointer to the destination, the second argument is a pointer to the source, and the third argument is the size of the memory area to be copied.

In order to find the function that uses memcpy() to copy into the redo buffers area, we can use the following gdb macro:

break _intel_fast_memcpy if $rdi > 0x96400000 && $rdi < 0x9E934000
commands
silent
printf "_intel_fast_memcpy dest %lx, source %lx, size %d\n", $rdi, $rsi, $rdx
frame 1
c
end

The macro creates a conditional breakpoint on the function _intel_fast_memcpy(). The condition is that the CPU register RDI (first argument, destination) should be higher than 0x96400000 and lower than 0x9E934000, which is the memory area the redo log buffers are mapped to. If the breakpoint and its conditions are triggered, the function name is printed, together with the arguments to memcpy, and ‘frame 1’ to list the function that called _intel_fast_memcpy().

To execute this macro, start an Oracle foreground session, attach to the session with gdb, and execute the macro. Next, execute an insert statement. In my case, I created a very simple table called ‘T’ with one column, T, of type varchar2(1). Please mind that you need to watch out for deferred segment creation, so ideally you first create the table and insert a few rows to take care of that.

This is what the gdb side looks like when executing an insert statement:

_intel_fast_memcpy dest 96817454, source 7ffc9fcc9598, size 36
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817478, source 7ffc9fcc9658, size 32
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817498, source 7faf6ad677f0, size 44
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174c4, source 7ffc9fcc9678, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174d8, source 7ffc9fcc95e4, size 76
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817524, source 7faf6ad67888, size 8
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 9681752c, source 7faf6ad67760, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817540, source 7faf6ad69ff8, size 40
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817568, source 7faf6ad6a070, size 24
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817580, source 7faf6ad69f60, size 49
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968175b4, source 89753568, size 1
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()

The destination of each memcpy() is the log buffer; the breakpoint condition made sure of that. The source of all except the last memcpy() is the PGA (visible by the high addresses); the last memcpy() copies from the SGA. That memory address points to an SQLA component (cursor) in the SGA.
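
The gdb output above can also be checked mechanically. The following Python sketch (the names are mine, and the frame lines are left out) parses the memcpy lines, adds up the sizes, and tests whether each copy starts where the previous one ended, rounded up to a multiple of 4 bytes. That 4-byte rounding is only my reading of these eleven lines, not documented behaviour.

```python
import re

# memcpy lines copied from the gdb session above (frame lines omitted).
GDB_OUTPUT = """\
_intel_fast_memcpy dest 96817454, source 7ffc9fcc9598, size 36
_intel_fast_memcpy dest 96817478, source 7ffc9fcc9658, size 32
_intel_fast_memcpy dest 96817498, source 7faf6ad677f0, size 44
_intel_fast_memcpy dest 968174c4, source 7ffc9fcc9678, size 20
_intel_fast_memcpy dest 968174d8, source 7ffc9fcc95e4, size 76
_intel_fast_memcpy dest 96817524, source 7faf6ad67888, size 8
_intel_fast_memcpy dest 9681752c, source 7faf6ad67760, size 20
_intel_fast_memcpy dest 96817540, source 7faf6ad69ff8, size 40
_intel_fast_memcpy dest 96817568, source 7faf6ad6a070, size 24
_intel_fast_memcpy dest 96817580, source 7faf6ad69f60, size 49
_intel_fast_memcpy dest 968175b4, source 89753568, size 1
"""

LINE_RE = re.compile(r"dest ([0-9a-f]+), source ([0-9a-f]+), size (\d+)")


def parse(output):
    """Return a list of (dest, source, size) tuples, one per memcpy line."""
    return [(int(d, 16), int(s, 16), int(n))
            for d, s, n in LINE_RE.findall(output)]


def round_up(n, align=4):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


copies = parse(GDB_OUTPUT)
total_bytes = sum(size for _, _, size in copies)

# Does every copy start at the previous destination plus the previous
# size, rounded up to 4 bytes?
contiguous = all(nxt[0] == cur[0] + round_up(cur[2])
                 for cur, nxt in zip(copies, copies[1:]))

print(len(copies), total_bytes, contiguous)
```

For this particular insert the copies turn out to be laid out back to back in the redo buffer, which fits the idea of one redo record being assembled piece by piece.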

Okay, at this point we have established that the redo of a DML operation is put in the redo buffers by a function called kcrfw_copy_cv(). Also please mind that modern Oracle versions put redo into the redo buffers while the DML is executing, not after it!

To move further on the topic of transactions, let’s look at when the current SCN is fetched and when kcrfw_copy_cv() is called. To do this, I use the following gdb macro:

break kcrfw_copy_cv
commands
silent
printf "kcrfw_copy_cv\n"
c
end
awatch *0x6001fbb0
commands
c
end

When running this gdb macro against an Oracle session running an insert statement, the following output is generated:

Hardware access (read/write) watchpoint 2: *0x6001fbb0

Old value = 347661
New value = 347666
0x000000000ccbdfb3 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbeb04 in kcsgssn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbdfb3 in kcscur3 ()
kcrfw_copy_cv
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()

Actually, I was surprised by this output. I expected a call to generate a unique SCN for this change. However, the functions that are visible are:
– kcscur3: a function to read an SCN (the global SCN, but it is also used to read other SCNs, like the LWN SCN and the on-disk SCN).
– kcsgrsn: kernel cache service get recent SCN, for reading the global SCN.
– kcsgcsn: kernel cache service get current SCN, for reading the global SCN.
– kcsgssn: kernel cache service get snapshot SCN, for reading the global SCN.
All of these functions READ an SCN value, but do not modify it. In order to update the SCN, the function kcsgbsn (kernel cache service get batched SCN) has to be called. The first five lines of the output do show what looks like a memory write (an old and a new value), but that really is a read: gdb reports the value as changed because another process modified the global SCN since gdb last saw it.

So, conclusion at this point is that DML (in this case an insert statement) just reads the global SCN, and does not seem to update it.

This got me thinking: if an insert statement just reads the global SCN instead of generating a unique SCN, should it be possible for the changes (redo vectors, actually) of DML executed simultaneously in different sessions on different segments to share the same SCN? That sounds counterintuitive given what I have been taught about Oracle.

This can be tested in the following way:
1 – Have two foreground sessions in different schemas execute an insert statement at exactly the same time. This can be done in iTerm2 using the ‘broadcast input’ option, which sends the same input to multiple panes at the same time. The command executed in both panes is: insert into t values ('a');

2 – Have gdb attached to the two sessions, with an access watchpoint on the global SCN to identify the SCN that both sessions used while executing the insert, and a breakpoint at kcrfw_copy_cv() to understand which SCN it uses. I used the following gdb macro:

break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end
awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

3 – To validate what is visible in gdb, dump the current logfile and find the SCN as shown in gdb. The current logfile can be dumped to trace using the following anonymous PL/SQL block:

----------------------------------------
-- This script will dump the current redolog file.
-- If the log file is big, this might take some time.
--
--
-- Original Author : Riyaj Shamsudeen
-- Modifications by: Frits Hoogland
-- No implied or explicit warranty !
----------------------------------------
set serveroutput on size 1000000
declare
  v_sqltext varchar2(255);
begin
 select 'alter system dump logfile '||chr(39)||member||chr(39) into v_sqltext
  from v$log lg, v$logfile lgfile
  where lg.group# = lgfile.group# and
  lg.sequence# = ( select sequence# from v$log where status='CURRENT' )
  and rownum <2;
  dbms_output.put_line ('Executing :'||v_sqltext);
  execute immediate v_sqltext;
end;
/

In step 2, I got the following gdb output for both inserts:

(1st session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

This is as expected based on our observation, because the insert statement only reads the global SCN. Now let’s dump the current redo log file as shown in step 3 above, and see if we can find SCN 5c860.
Here are the redo records for both changes:

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.0010 LEN: 0x01a8 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.01b8 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

Note that the SCN seen in the gdb sessions appears in both redo records, and that both the SCN and the SUBSCN values are identical. Also note that the RBA (redo byte address) is different.
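
To make the relation between the two RBAs a bit more concrete: an RBA consists of three hexadecimal fields, the log sequence number, the block number inside the log file, and the byte offset inside that block. The Python sketch below (the function and variable names are mine) splits the two RBAs shown above and checks whether the second record starts exactly where the first one ends, using the LEN field of the first record.

```python
def parse_rba(rba):
    """Split 'sequence.block.offset' (hex fields) into three integers."""
    seq, block, offset = (int(part, 16) for part in rba.split("."))
    return seq, block, offset


# RBA and LEN values copied from the two redo record headers above.
first = parse_rba("0x000014.0000a50e.0010")
first_len = 0x01a8
second = parse_rba("0x000014.0000a50e.01b8")

# Same log sequence and same block?
same_block = first[:2] == second[:2]

# Does the second record start right after the first one?
adjacent = same_block and first[2] + first_len == second[2]

print(first, second, same_block, adjacent)
```

So, at least in this test, the redo records of the two sessions sit directly next to each other in the same redo block, and still carry the same SCN.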

Both redo records contain 3 changes:

(1st session)
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0005c801 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:32 AFN:3 DBA:0x000000f9 OBJ:4294967295 SCN:0x0000.0005c800 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600c OBJ:20455 SCN:0x0000.0005c857 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

(2nd session)
CHANGE #1 CON_ID:0 TYP:0 CLS:35 AFN:3 DBA:0x00000110 OBJ:4294967295 SCN:0x0000.0005c7b2 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:36 AFN:3 DBA:0x00000895 OBJ:4294967295 SCN:0x0000.0005c78c SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600e OBJ:20455 SCN:0x0000.0005c858 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

These 3 changes are consistent with an insert of a single row; this is visible by the opcodes:
5.2=update rollback segment header
5.1=undo block
11.2=insert row piece

This means the DML of the two inserts has exactly the same SCN! A logical question you might come up with is: how is Oracle then able to apply these in the correct order, which is one of the fundamental reasons for having SCNs? For that, it’s important to realise this is just the change to a block, not a committed transaction yet. In order to get a full understanding of how the SCN mechanism works for at least a single row insert, we need to change the gdb macro from the test above:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    if $rip >= 0xccbe340 && $rip <= 0xccbe7b0
      printf "kcsgbsn - SCN: %x (@%x)\n", *0x6001fbb0, $rip
    end
    c
  end
break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end

The function kcsgbsn means: kernel cache service get batched SCN. This function DOES update the global SCN(!). However, this function does not get called during the insert. It does get called when commit is called (or rollback).

So, in order to take another step in understanding SCN handling by Oracle, let’s use the updated version of the gdb macro, insert another row into the table in two sessions, and then also commit:

(1st session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcsgbsn - SCN: 7e39c (@ccbe63a)
kcrfw_copy_cv

In both sessions, the first three lines of output (two kcscur3 reads and a kcrfw_copy_cv call) belong to the insert, and the remaining lines belong to the commit.
During the insert, we see both sessions using the same SCN (7e39b), which can be verified by dumping the current redo log file:

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.0010 LEN: 0x01c0 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.01d0 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

Next we see kcsgbsn() being called. As has been documented in multiple places, the current value of the global SCN is read, and then advanced (get and advance). This is visible in the two kcsgbsn lines of the first session: 0x7e39b is read, and then advanced to 0x7e39c. However, to my surprise, the second session does the same: its kcsgbsn lines show it going from 0x7e39b to 0x7e39c too, after which it rereads the SCN one additional time. The surprise is that both sessions seem to have the same SCN for the commit. Validation from the dump of the current redo log file:

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.0174 LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.01dc LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:19 AFN:3 DBA:0x00000090 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

Yes, both commits (visible by the opcode (OP) 5.4) have the same SCN. At this point I think it’s fitting to theorise why this is happening. I think it makes sense because the table is protected for consistency at another level in the database, namely the locking mechanisms (TX locks for rows, TM locks for segments), which guarantee that two transactions running at the same time do not compromise data consistency. And because of that, it’s not a problem to batch them with the same SCN.

Conclusion
Changes in the database are synchronised by system change numbers, also known as SCNs. SCNs are the way Oracle serialises changes to data. However, this post shows that the Oracle engine tries to maximise the number of changes inside an SCN, and as such an SCN is not a highly granular serialisation mechanism. This is not a problem, since Oracle’s transactional protection mechanisms (like TX and TM locks) work alongside SCNs to make sure no conflicting changes can happen at the same time. The changes themselves just read and pick the current global SCN from the instance; a commit of these changes does increase the global SCN, but as little as possible. You might find multiple changes in the same SCN, and you might even find commits with the same SCN.

Bonus topic: finding kcscur3.
In case you wondered what this little macro means, and how I got the memory addresses:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

awatch is a memory access watch function. The commands executed when this watch triggers are:
-silent (do not display output)
-if $rip >= 0xccbdfa0 && $rip <= 0xccbe010 (if the instruction pointer contains a value higher or equal to 0xccbdfa0 and the instruction pointer contains a value lower or equal to 0xccbe010)
-printf "kcscur3 - SCN: %x\n", *0x6001fbb0 (print the label ‘kcscur3 - SCN:’, followed by the value stored at memory address 0x6001fbb0, in hexadecimal format)
The text 'kcscur3' in the printf gives away that these addresses are actually the memory addresses where the function kcscur3 is located. How to find these? The starting address can be found in a variety of ways, for example in gdb: 'info function kcscur3', or in the shell using nm: 'nm oracle | grep kcscur3'. It is a bit harder to find the ending address of a function. However, it's still quite easy to do with nm: 'nm -S oracle | grep kcscur3':

[root@o12102fs bin]# nm -S oracle | grep kcscur3
000000000ccbdfa0 0000000000000070 T kcscur3

The first column is the starting memory address of the kcscur3 function, the second column is the length (in hexadecimal) of the function code. 0xccbdfa0+0x70=0xccbe010.
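
The same calculation can be scripted, which is handy when you want address ranges for more than one function. This is a minimal Python sketch with helper names of my own; it takes one line of 'nm -S' output and builds the $rip test for the gdb commands. Note that start + size is the first address past the function, so the sketch uses a strict '<' for the upper bound, where the macro in this post uses '<='.

```python
def function_range(nm_line):
    """Parse one 'nm -S' output line into (name, start, end).

    end is start + size, i.e. the first address past the function.
    """
    start_hex, size_hex, _symbol_type, name = nm_line.split()
    start = int(start_hex, 16)
    return name, start, start + int(size_hex, 16)


def gdb_condition(start, end):
    """Build the instruction pointer range test for a gdb 'if'."""
    return "if $rip >= 0x%x && $rip < 0x%x" % (start, end)


# Line copied from the nm output above.
name, start, end = function_range(
    "000000000ccbdfa0 0000000000000070 T kcscur3")

print(name, hex(start), hex(end))
print(gdb_condition(start, end))
```

Feeding it the nm line of another function gives the range for that function in the same way.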

Update October 24, 2016:
Roger noted the following: Minor correction, Oracle doesn’t do DML directly, it first generates change vectors which are then applied to the local database which is effectively running in recovery. This way Oracle uniquely guarantees recovery on the primary or physical standby will always do exactly the same thing as the original DML.

Stefan noted the following: Hey Frits. Great work – as always. However i am not quite sure about your kcrfw_copy_cv() PGA interpretation for private redo strands. AFAIK private redo strands reside in permanent memory allocations (in SGA) and you should see memory copy operations between these memory chunks and the public redo buffer(s) on commit. Your GDB breakpoint should only kick in under these circumstances as anything else is handled in private redo (at least on your mentioned database version) and no memory operations on redo buffer. I also modeled a short test case with 12.1.0.1 on OEL 6.7 to verify my theory/assumption and the result was as expected – not quite sure about your observed and mentioned PGA to redo log buffer copy operations then.

I should add that when I run snapper on the insert in my database, the statistics ‘IMU- failed to get a private strand’ and ‘IMU pool not allocated’ are increased, giving a hint no private strands are used.

4 comments
  1. It’s late and I suspect I have not really read this post properly but the first question which jumped out was;

    Is the processing the same if the table being inserted has “rowdependencies” defined, meaning the SCN is recorded at row level not block level?
