Since the direct path read feature for serial processes was discovered after it became available in Oracle 11.2.0.1 (as far as I know; I haven’t checked Oracle 11.1), there have been a lot of blog posts on when this happens. A lot of these do not specify the Oracle version, which is a shortcoming in my opinion, because different versions make different decisions.

The purpose of this blogpost is to show the results of my tests on when the Oracle database engine switches from buffered to direct path reads and vice versa. The database engine probably also bases this decision on internally kept statistics, like general activity and object usage, which means my tests might show different results in your database. For that reason I included an anonymous PL/SQL block in this post so you can replay the same test in your own database; you only have to create the table yourself.

There are a few basic rules that are applied by the database engine for the buffered or direct path read full segment scan decision:
– If the segment size is smaller than 5 * _SMALL_TABLE_THRESHOLD (11.2.0.1 and 11.2.0.2) or _SMALL_TABLE_THRESHOLD (11.2.0.3 and higher), it will be read into the buffercache (also known as a buffered read); any block already in the buffercache will be read from the cache.
– If the segment size is bigger than 5 * the buffer cache size (controlled by _VERY_LARGE_OBJECT_THRESHOLD, which is set to 500 by default and appears to be a percentage of the buffercache), it will be read via direct path; any blocks that are in the cache are ignored. If blocks of this segment are dirty, they are checkpointed to disk before the segment is read, which is visible with the events ‘messages’ and ‘enq: KO – fast object checkpoint’.
– If a segment is between the above mentioned sizes, the database engine makes a runtime decision. In other words: you can get different behaviour in different scenarios. The most logical name for a segment of this size is a medium table. Any time a direct path read is chosen, dirty blocks are checkpointed to disk first, visible with the events ‘messages’ and ‘enq: KO – fast object checkpoint’. A query to look up these underscore parameters in your own database is shown below.
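
The values of _SMALL_TABLE_THRESHOLD and _VERY_LARGE_OBJECT_THRESHOLD are instance specific. This is a minimal sketch of how to look them up, as SYSDBA, since it queries the x$ tables:

select i.ksppinm parameter, v.ksppstvl value
from x$ksppi i, x$ksppcv v
where i.indx = v.indx
and i.ksppinm in ('_small_table_threshold', '_very_large_object_threshold');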

Starting from 11.2.0.3, the starting point for the direct path decision changed from 5 * _SMALL_TABLE_THRESHOLD to _SMALL_TABLE_THRESHOLD. When an NSMTIO trace is taken, an extra line appears in the trace when a segment is sized between _SMALL_TABLE_THRESHOLD and 5 * _SMALL_TABLE_THRESHOLD.
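
For reference, this is how I generate such a trace on my 12.1 instance, using the unified tracing event syntax (treat this as a sketch; the exact syntax may differ in older versions). The NSMTIO lines end up in the session’s trace file in the diagnostic destination:

alter session set events 'trace[nsmtio]';
select /*+ full(t_over_stt) */ count(*) from t_over_stt;
alter session set events 'trace[nsmtio] off';
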
Partial trace lines for a medium segment sized bigger than 5 * _SMALL_TABLE_THRESHOLD (12.1.0.2):

NSMTIO: kcbism: islarge 1 next 0 nblks 21128 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 21128 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 21128 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

On the first line, nblks indicates the segment size, and kcbstt indicates _SMALL_TABLE_THRESHOLD.

Partial trace lines for a medium segment sized smaller than 5 * _SMALL_TABLE_THRESHOLD (12.1.0.2):

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: hit age_diff adjts last_ts nbuf nblk has_val kcbisdbfc cache_it 898 8181 7283 182931 3668 1 0 1
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

The first line shows the segment size (3668) being just above _SMALL_TABLE_THRESHOLD (3658).
What is interesting is the extra line with kcbcmt1. Fields like hit, age_diff, adjts and last_ts seem to indicate that additional statistics on (recent) activity are taken into account.

For the results shown below, I created an anonymous PL/SQL block that is included at the end of this blogpost. Also, I flushed the buffercache so the measurement starts off with an empty cache. If you are running this on a database that is really in use, you shouldn’t flush the buffercache, and you will probably see smaller numbers of blocks being read physically during the buffered reads.
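
Flushing the buffer cache is done like this (only do this on a test database, it empties the entire buffer cache for all sessions):

SYS@v12102 AS SYSDBA> alter system flush buffer_cache;

System altered.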

Version 11.2.0.1, _SMALL_TABLE_THRESHOLD = 1436 (_STT*5=7180), segment size 7040.

table: T_OVER_STT size: 7040 blocks, 330171 rows.
cached - physical reads cache/direct: (10)		749/0
full scan - physical reads cache/direct:			  6129/0
cached - physical reads cache/direct: (20)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (30)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (40)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (50)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (60)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (70)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (80)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (90)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

The first 10% is read into the buffer cache, then a full scan is initiated. The full scan is buffered too, because the segment size is smaller than 5 * _SMALL_TABLE_THRESHOLD. All subsequent scans are served from the cache, because the segment has been read into the cache already, which is why they do not cause any physical reads.

Version 11.2.0.2, _SMALL_TABLE_THRESHOLD = 1492 (_STT*5=7460), segment size 1920.

table: T_JUST_OVER_STT size: 1920 blocks,  rows.
cached - physical reads cache/direct: (10)		239/0
full scan - physical reads cache/direct:			  1623/0
cached - physical reads cache/direct: (20)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (30)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (40)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (50)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (60)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (70)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (80)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (90)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

Here the same thing is happening: 11.2.0.2 behaves identically to 11.2.0.1 with a segment that is smaller than 5 * _SMALL_TABLE_THRESHOLD. First the initial 10% of the blocks is read, then a full scan reads the remaining blocks into the cache, after which all subsequent scans read from the cache, so no physical reads are shown.

Version 11.2.0.3, _SMALL_TABLE_THRESHOLD = 1482 (_STT*5=7410), segment size 1664.

table: T_OVER_STT size: 1664 blocks, 74404 rows.
cached - physical reads cache/direct: (10)		251/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (20)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (30)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (40)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (50)		252/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (60)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (70)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (80)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (90)		252/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (100)		 44/0
full scan - physical reads cache/direct:			  0/1548

Here we see the database behave differently! The same pattern as shown earlier is executed: first a percentage of the table is read into the buffer cache (forced to a buffered read), then a full scan is executed, which now scans via direct path, for a segment that is smaller than 5 * _SMALL_TABLE_THRESHOLD. Another interesting thing to see is that despite the last cached/buffered scan reading ALL the blocks into the cache, a full scan keeps scanning via direct path. Please mind we are scanning a segment bigger than _SMALL_TABLE_THRESHOLD, but smaller than 5 * _SMALL_TABLE_THRESHOLD.

Careful testing shows that all versions up to Oracle 12.1.0.2 behave consistently like this. A final consideration: as mentioned earlier, the testing in my lab is done on idle databases, often just started up for the sake of the test, so dynamics and (internally kept) runtime statistics might show something else in a database that is really in use.

How about a segment that is truly bigger than 5 * _SMALL_TABLE_THRESHOLD?
Version 11.2.0.1, _SMALL_TABLE_THRESHOLD=1436, 5*_STT=7180, segment size 21504.

table: T2 size: 21504 blocks, 1000000 rows.
cached - physical reads cache/direct: (10)		2135/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (20)		2142/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (30)		2016/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (40)		2149/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (50)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (60)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (70)		2168/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (80)		2168/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (90)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (100)		 2044/0
full scan - physical reads cache/direct:			  0/0

Here we see a table that is bigger than 5 * _SMALL_TABLE_THRESHOLD, and we see that the size makes a difference! When the number of blocks in the cache is close to 100%, the full scan might choose a buffered scan instead of a direct path read. In fact, if I change the starting percentage to 90% and the increment to 1%, we can see the switching point is at 99%:

table: T2 size: 21504 blocks, 1000000 rows.
cached - physical reads cache/direct: (90)		18898/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (91)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (92)		124/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (93)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (94)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (95)		128/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (96)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (97)		252/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (98)		128/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (99)		256/0
full scan - physical reads cache/direct:			  132/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

When 99% of the blocks are read into the buffercache, you can see the full scan that follows choosing a buffered scan method, and reading the remaining 132 blocks into the cache.

This behaviour is consistent up to Oracle version 12.1.0.2. Just to make sure we are all on the same page again: this is in my test databases. I would love to hear if you see something different in real life.

Another thing that influences direct path reads is the number of dirty blocks. If v_read_update in the procedure below is set to UPDATE, then instead of scanning a certain percentage into the buffercache, a certain percentage of the blocks is updated, after which the same full scan is executed, and then the update is rolled back. Please mind I used a table that is bigger than 5 * _SMALL_TABLE_THRESHOLD in the tests below. When the procedure is set to UPDATE, only the full scan is shown, not the update. The number between brackets is the percentage of the table that has been updated prior to the full scan.
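
To run this variant, the only change needed in the anonymous PL/SQL block at the end of this post is the v_read_update variable in the declaration section:

	-- read or update (uppercase).
	v_read_update varchar2(10) := 'UPDATE';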

Version 11.2.0.1

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 0/20941
full scan - physical reads cache/direct: (70)			 0/20941
full scan - physical reads cache/direct: (80)			 4084/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.2

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 0/20941
full scan - physical reads cache/direct: (70)			 0/20941
full scan - physical reads cache/direct: (80)			 4084/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.3

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.4

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 12.1.0.1

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 12.1.0.2

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

During repeated tests, the above figures show there is a difference between Oracle versions 11.2.0.1/11.2.0.2 and versions 11.2.0.3 and above. Versions 11.2.0.1 and 11.2.0.2 seem to have their switching point from direct path to buffered at approximately 75% dirty blocks, while versions 11.2.0.3 and higher have their switching point at approximately 50%.

I am convinced there are more things that influence this. In fact, I had to re-run the tests a couple of times (flushing the buffercache before each run of course) in order to come up with the figures above. This means that sometimes this can show fluctuations, even in my idle database.

This is the anonymous PL/SQL procedure to test this in your own database:

declare
	-- this is a direct path read tester.
	-- please mind the user executing this anonymous PL/SQL block must have the following rights:
	-- - alter session
	-- - select on v$mystat and v$statname
	--
        -- -> because user_tables is used, statistics must be correct!
        --
	-- fill out the following details
	-- table to read -> it's best to flush the buffer cache if you want to start with a low percentage of blocks in the cache!
	v_table varchar2(30) := 'YOUR_TABLE_HERE';
	-- percentage to start at for reading blocks in the cache.
	v_current_pct number := 10;
	-- the amount to add to the percentage after the run has been executed.
	v_step number := 10;
	-- read or update (uppercase).
	v_read_update varchar2(10) := 'READ';
	--
	v_nr_blocks number;
	v_numrows number;
	v_readrows number;
	v_rd_cache number;
	v_rd_direct number;
	v_tmp number;
	v_read_to_cache varchar2(200) := 'select /*+ full('||v_table||') */ count(*) from '||v_table||' where rownum < :1';
	v_update varchar2(200) := 'update '||v_table||' set id=id+1 where rownum < :1';
	v_try_read varchar2(200) := 'select /*+ full('||v_table||') */ count(*) from '||v_table;
	v_ret number;
begin
	select blocks into v_nr_blocks from user_segments where segment_type = 'TABLE' and segment_name = v_table;
	select num_rows into v_numrows from user_tables where table_name = v_table;
	dbms_output.put_line('table: '||v_table||' size: '||v_nr_blocks||' blocks, '||v_numrows||' rows.');
	if v_read_update = 'READ' then
	while v_current_pct <= 100 loop
		v_readrows := (v_nr_blocks / 100) * v_current_pct * (v_numrows / v_nr_blocks);
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate 'alter session set events ''10949 trace name context forever, level 1''';
		execute immediate v_read_to_cache into v_ret using v_readrows;
		execute immediate 'alter session set events ''10949 trace name context forever, level 0''';
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('cached - physical reads cache/direct: ('||v_current_pct||')              '||v_rd_cache||'/'||v_rd_direct);
		--
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate v_try_read into v_ret;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('full scan - physical reads cache/direct:                          '||v_rd_cache||'/'||v_rd_direct);
		v_current_pct := v_current_pct + v_step;
	end loop;
	end if;
	if v_read_update = 'UPDATE' then
	v_current_pct := 10;
	v_step := 10;
	dbms_output.put_line('--update');
	while v_current_pct <= 100 loop
		v_readrows := (v_nr_blocks / 100) * v_current_pct * (v_numrows / v_nr_blocks);
		execute immediate v_update using v_readrows;
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate v_try_read into v_ret;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('full scan - physical reads cache/direct: ('||v_current_pct||')                    '||v_rd_cache||'/'||v_rd_direct);
		execute immediate 'rollback';
		v_current_pct := v_current_pct + v_step;
	end loop;
	end if;
end;
/

This script should be run as the owner of the table you want to investigate for direct path reads, with serveroutput enabled (set serveroutput on in SQL*Plus) so the results are displayed.
The Oracle user must have ALTER SESSION rights to be able to set the event that disables direct path reads, and it needs select rights on V$MYSTAT and V$STATNAME, and obviously select rights on the table it wants to read. Another important thing is that the table must have accurate statistics. If you gather statistics after you have run the PL/SQL block, you should flush the buffer cache and flush the shared pool, in order to get rid of the cached cursors and statistics.
It first reads a certain percentage of the table with event 10949 set, which disables direct path reads, then the event is disabled and a full table scan is executed. Then the percentage variable is increased, the buffered scan is executed again, followed by the full scan, and so on.

Change the v_table definition to the table you want to scan.
What you additionally can do is change v_current_pct (10 by default), the starting percentage to read buffered, and v_step (10 by default), the amount by which the percentage is increased. One of the things I did is change v_current_pct to 90 and v_step to 1, as shown below.
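
That change only touches two variable defaults in the declaration section of the block:

	-- percentage to start at for reading blocks in the cache.
	v_current_pct number := 90;
	-- the amount to add to the percentage after the run has been executed.
	v_step number := 1;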

(warning: this is a rather detailed technical post on the internal workings of the Oracle database’s commit, specifically the interaction between the committing foreground processes and the log writer)

After the Trivadis Performance Days I was chatting with Jonathan Lewis. I presented my ‘Profiling the log writer and database writer’ presentation, in which I state that the foreground (user/server) process looks at the commit SCN in order to determine if its logbuffer contents have been written to disk by the logwriter(s). Jonathan suggested looking deeper into this matter, because looking at the commit SCN might not be the way it truly works.

The reasoning is that the foreground process flushes its log information from its private strand into the public logbuffer, and as such only needs to keep track of the log blocks it allocated in the public logbuffer to see if these have been written. Different processes can allocate different blocks in the public log buffer, which do not necessarily have to be written in SCN order. In other words: it could be hard to figure out at write time what the exact on-disk SCN of the redo is.

In order to verify this, let’s first recap what I tell about how a foreground process behaves on commit:

This is a schematic drawing of a foreground process committing.
There are two timelines, one of the foreground process, and one of the logwriter.
The foreground process commits. The commit copies the changes (redo vectors) from the process’ private strand into the public logbuffer (not shown, all indicated with “commit;”), after which it signals (“posts”) the logwriter via the semctl() system call.

When idle, the logwriter sleeps in a system call called semtimedop() for 3 seconds, after which it performs some housekeeping tasks (measuring time and resource usage for example) and then calls semtimedop() again. When signalled, or when it finds contents in the public log buffer, the log writer writes the public log buffer contents to the online redo log file(s), via AIO if available. This is indicated by the io_submit() and io_getevents() calls.

In the meantime, after executing semctl(), the foreground process enters the function kcrf_commit_force(), or the kcrf_commit_force_int() function with Oracle 12. Inside this function, it calls the function kcscur3() three times. If the log writer was able to write the public log buffer contents before kcscur3() has been executed 3 times, the foreground will not enter the wait event ‘log file sync’, and will not sleep on semtimedop() to wait for being signalled by the log writer, which is why I put it in a gray bar (indicating “optional”). Also, in that case, because the foreground process is not waiting to be signalled, the log writer does not execute semctl(). The log writer knows which process(es) it needs to signal via a list that administers posters and waiters, called the “post-wait queue”, which is not externalised. If the log writer did not write the public log buffer contents fast enough, the foreground process registers that it started waiting (in the aforementioned post-wait queue), and starts sleeping in semtimedop() for 100ms, after which it executes kcscur3() two times, performs some housekeeping actions like measuring time and resource usage, and then calls semtimedop() again. This repeats until the process receives a message from the log writer via semctl().

Starting from Oracle version 11.2.0.3, the underscore parameter “_use_adaptive_log_file_sync” is set to TRUE by default, which enables a feature called “adaptive log file sync”. This setting means the database can and will (!) switch between post-wait mode, which has just been described, and a mode called ‘polling’. Polling means the processes that requested the log writer to write do not wait until they are posted by the log writer, but look at the log writer write progress themselves, and continue once the log writer has written their specific log blocks from the public log buffer.
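
Whether polling has actually been used on an instance can be glimpsed from the statistics; on my 11.2.0.3 and 12.1 instances there are statistics whose names start with ‘redo synch poll’. Treat the following as a sketch, the exact statistic names may differ per version:

SYS@v12102 AS SYSDBA> select name, value from v$sysstat where name like 'redo synch poll%';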

This is how that looks schematically:

As you can see, it is quite like the post-wait mode, only the sleeping is done using nanosleep() instead of semtimedop(), and there is no semctl() on the log writer timeline. The foreground process issues the same kcscur3() calls, but because these measure SCN progress, the foreground can determine if its public log buffer contents have been written to disk, and stop waiting and continue processing.

I drew the conclusion that the kcscur3() function is used to determine the commit SCN based on profiling the function call sequence and logical reasoning. Encouraged by Jonathan’s reasoning, let’s try to see if we can dig deeper.

The first thing we need to do, is see if we can obtain more information on the kcscur3() function. One way of doing that, is investigating the function’s arguments. This can be done without source code or debug symbols, because the arguments of a function are passed via CPU registers on X86_64 Linux: first to fourth arguments are in registers RDI, RSI, RDX, RCX.

A way to do this is to insert data into a table, then attach to this server process using gdb and execute the following gdb macro:

break kcscur3
commands
silent
printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
continue
end

After continuing the foreground via gdb, gdb shows kcscur3 is executed 6 times:

kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Interesting! But what are these numbers? Values? Addresses?
In order to understand if these arguments mean anything, let’s first get the addresses of the general shared memory areas. This can be done using the ipc command with oradebug:

SYS@v12102 AS SYSDBA> oradebug setmypid
Statement processed.
SYS@v12102 AS SYSDBA> oradebug ipc
IPC information written to the trace file
SYS@v12102 AS SYSDBA> @lt

@lt is a script to look at the current active trace file. The relevant contents of this trace file are the shared memory areas:

Handle:            0x113c2940 `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102'
 Dump of unix-generic realm handle `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102', flags = 00000000
  key 604726764 actual_key 604726764 num_areas 4 num_subareas 4
  primary shmid: 112590854 primary sanum 3 version 3
  deferred alloc: FALSE (0) def_post_create: FALSE (0) exp_memlock: 2050M
 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002ca788 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2 112492547 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cb000 0000000000400000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000077000000 Minimum Subarea size 01000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0 112525316 0x00000061000000 0x00000061000000 0x00000061000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000077000000 0000000077000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008d35000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1 112558085 0x000000d8000000 0x000000d8000000 0x000000d8000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008d35000 0000000008e00000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000003000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3 112590854 0x000000e1000000 0x000000e1000000 0x000000e1000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000003000 0000000000003000 default       readwrite

We see the 4 shared areas and their memory addresses:
– Fixed size, start address 0x60000000, size 0x2cb000
– Variable size, start address 0x61000000, size 0x77000000
– Redo buffers, start address 0xd8000000, size 0x8d35000
– Skgm overhead, start address 0xe1000000, size 0x3000

If we combine this information with the kcscur3() arguments, we see that the first argument points to the ‘fixed size’ area, in other words: the fixed SGA. The metadata of the fixed SGA variables is listed in x$ksmfsv. Please mind this view does not list all the contents of the fixed SGA; the latches, for example, are in the fixed SGA too.

We got 3 addresses which are inside the fixed SGA on calling commit: 0x6001fbb0, 0x60027c68 and 0x60027c98. Let’s see if we can find them in x$ksmfsv:

SYS@v12102 AS SYSDBA> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
2  where to_number('6001fbb0','XXXXXXXX')
3  between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX') and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcsgscn_                                                         000000006001FBB0         48

SYS@v12102 AS SYSDBA> c/6001fbb0/60027c68/
  2* where to_number('60027c68','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

SYS@v12102 AS SYSDBA> c/60027c68/60027c98/
  2* where to_number('60027c98','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

The first fixed SGA address, 0x6001fbb0, points to a variable called kcsgscn_. Some sources on the internet report this is the current SCN. (Kernel Cache System Global SCN?)
Let’s test this! First I set a watchpoint on 0x6001fbb0 in gdb, then query v$database.current_scn:

(gdb) awatch *0x6001fbb0
SYS@v12102 AS SYSDBA> select current_scn from v$database;

This triggers the watchpoint!

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 11278094
New value = 11278163
0x000000000cc77983 in kcscur3 ()

Ah! So when querying the current_scn, the kcscur3() function is used too, and apparently kcscur3() can change a value as well (we see an old value and a new value). I press ‘c’ and enter to let the debugger continue the program it is debugging. In fact, I have to do this multiple times, in the functions kcscur3(), kcsgssn(), kcsgcsn() and kcsgbsn(). The watchpoint shows memory address 0x6001fbb0 is left with a number exactly one higher than the one shown by the query ‘select current_scn from v$database’.

The second and third fixed SGA addresses, 0x60027c68 and 0x60027c98, both point to a variable called kcrfsg_. Actually, they do not point to the starting address of the variable, but rather to “somewhere” inside it. First let’s see what values are stored at these addresses in the variable kcrfsg_ using gdb:

(gdb) x/dw 0x60027c68
0x60027c68:	11278915
(gdb) x/dw 0x60027c98
0x60027c98:	11278917

These are large numbers, which are highly likely to be SCNs of some sort.

The variable kcrfsg_, which is quite probably a C “struct” (a variable composed of multiple variables, like a record in a table), is linked with the x$ view x$kcrfws:

SYS@v12102 AS SYSDBA> select addr from x$kcrfws;

ADDR
----------------
0000000060027C38

The address reported is 8 bytes into kcrfsg_. The x$kcrfws view is only used by the v$ view v$xstream_capture, and because of that it is reported to have something to do with replication. That is incorrect. My current assumption is that x$kcrfws means Kernel Cache Redo Write Status.

In order to figure out which field in x$kcrfws is linked to which memory address (0x60027c68 and 0x60027c98), I use gdb once again, with a watchpoint on a memory address. Oradebug also provides this functionality, but it doesn’t work in my setup (mprotect error 22). I attach gdb to a SYSDBA session, and execute:

(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 1: *0x60027c68

Then query the fields in x$kcrfws one by one. It turns out, memory address 0x60027c68 is accessed for the fields lwn_scn_bas and lwn_scn_wrp, and memory address 0x60027c98 is accessed for the fields on_disk_scn_bas and on_disk_scn_wrp.
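
The queries used for this are plain selects of the individual columns, for example (SYSDBA is needed to query the x$ tables):

SYS@v12102 AS SYSDBA> select lwn_scn_bas, lwn_scn_wrp from x$kcrfws;
SYS@v12102 AS SYSDBA> select on_disk_scn_bas, on_disk_scn_wrp from x$kcrfws;

Each select makes the server process touch the corresponding memory location, which fires the hardware watchpoint in the attached gdb session.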

So, what do we know now? It becomes apparent that Oracle uses the kcscur3() function for reading SCN values. The function seems to be generic and can be used for multiple locations holding different types of SCNs. We witnessed it reading the instance current SCN, the on-disk SCN and the LWN SCN.

The on-disk SCN (x$kcrfws.on_disk_scn_(bas|wrp)) seems to be a registration of up to which SCN has been written by the logwriter, and the LWN SCN (x$kcrfws.lwn_scn_(bas|wrp)) is a registration of up to which SCN is in the current log write number (LWN). The log write number (LWN) seems to be a number assigned to a group of redo blocks for writing them out in one batch.

This information is needed to make more sense of how my foreground process works. In order to make the tracing of the foreground more meaningful, we need to add a break on semctl(), to understand when all redo vectors have been copied into the public log buffer and the foreground actually starts waiting on the log writer and peeking at its progress. It is also handy to add breaks on semtimedop() and nanosleep(), so we know which log write mode is in use:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000cc77970 <kcscur3>
	breakpoint already hit 6 times
        silent
        printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
        continue
(gdb) break semctl
Breakpoint 2 at 0x3bfdaeb030
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "semctl\n"
>continue
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3bfdaeb060
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>printf "semtimedop\n"
>continue
>end
(gdb) break nanosleep
Breakpoint 4 at 0x3bfde0ef90
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>printf "nanosleep\n"
>continue
>end
(gdb) disable
(gdb)

I disabled all breakpoints (disable command) and did the insert. After the insert, the enable command enables all breakpoints again; I then commit and watch the function call sequence:

(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

So, after a foreground process has written its change vectors into the public log buffer and has requested the log writer to write using semctl(), it doesn’t poll a single location for an SCN, but reads both the on-disk SCN and the current SCN! It also seems my log writer was speedy enough to write, because there is no nanosleep() nor semtimedop() call.

It seems we need to manipulate the logwriter too, in order to get the call sequences I showed in my slides. In order to do that, I opened another terminal session and attached a debugger session to the lgwr process as well. At this point I should point out that I made sure my Oracle (12.1.0.2) database was using the single lgwr process and not the log writer workers, by setting the “_use_single_log_writer” undocumented parameter to true. (do NOT do this on a production database; neither set an undocumented parameter without the blessing of Oracle support, nor attach gdb to a log writer process!!)
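
For reference, this is roughly how I set that parameter in my lab instance (a sketch, and again: lab only). The value is set in the spfile and picked up at instance startup, hence the restart:

SYS@v12102 AS SYSDBA> alter system set "_use_single_log_writer"=true scope=spfile;
SYS@v12102 AS SYSDBA> shutdown immediate
SYS@v12102 AS SYSDBA> startup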

I added a break on the io_getevents_0_4 function, which is the function for reaping asynchronously submitted IO, and then disabled the breakpoint:

break io_getevents_0_4
disable
continue

I also disabled the breakpoints in the debugger session attached to the foreground session, and inserted another record. After the insert, I enabled the breakpoints in both gdb sessions, and entered commit. Now the log writer debugger session will break on io_getevents_0_4, and the foreground debugger session should show a repeating pattern:

(gdb) enable
(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
...

The nanosleep() and kcscur3() calls for the addresses 0x60027c98 and 0x60027c68 continue to be executed. After continuing the log writer debugger session, the foreground debugger session shows:

kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Bingo! We see the nanosleep() call, so we are in polling mode, not in post-wait mode. And because we artificially stopped the log writer from progressing, the foreground process reads the on-disk SCN and LWN SCN, then calls nanosleep(), then scans the on-disk and LWN SCNs again, etcetera, until the on-disk SCN gets higher than the foreground process’ commit SCN. Interestingly, it seems that once kcscur3() on address 0x60027c98 was able to identify that the log writer progressed beyond its commit SCN, the next kcscur3() call does not read address 0x60027c68, but instead reads address 0x6001fbb0, alias the current SCN.

Okay, so this points to the on-disk SCN being responsible for commit progress. If this is actually true, we should now take a look at the log writer and see if the log writer indeed changes the on-disk SCN at memory location 0x60027c98 after the write. For this I can use the memory watch facility of the debugger again:

(gdb) awatch *0x60027c98
Hardware access (read/write) watchpoint 2: *0x60027c98
(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 3: *0x60027c68
(gdb) c
Continuing.

This will quite probably break immediately on access of these memory areas; the log writer looks at them a lot. Here’s a snippet of the log writer debugger session approximately during the commit of the foreground:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 3: *0x60027c68

Old value = 11174890
New value = 11174892
0x000000000cc77ac4 in kcsnew3 ()
(gdb) c
Continuing.

Breakpoint 1, io_getevents_0_4 (ctx=0x7f165de70000, min_nr=2, nr=128, events=0x7ffc696611b8, timeout=0x7ffc696621b8)
    at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x000000000cc77983 in kcscur3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x0000000002d70bb9 in kcsadj3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Old value = 11174890
New value = 11174892
0x0000000002d70c13 in kcsadj3 ()

First we see a function called kcsnew3() change the value at 0x60027c68. Then we see the breakpoint at io_getevents() indicating the log writer looks for submitted IOs to return, in other words: the log writer was asked to write something or found something to write. After the write we see the log writer reading memory location 0x60027c98 using the kcscur3() function, just as our foreground process does. Then we find a function called kcsadj3(), which first reads the on-disk SCN location 0x60027c98 and then changes it. This indicates the log writer increases the value of the on-disk SCN at 0x60027c98 after it has written, using the function kcsadj3() (a safe bet is Kernel Cache Service Adjust). This is the value a foreground process in polling mode reads to understand whether the log writer has written the contents the process put in the public log buffer.

Of course this still isn’t the ultimate proof: I break on a few arbitrary functions, and there can be all kinds of other things going on. It is still possible that the foreground process keeps the list of log buffer blocks and really uses that, and only uses the SCN values for verification, because I am looking at only a small subset of the functions it executes.

However, there is a way to test this!
=> Another word of caution, the next manipulation can render your database into a smoking pile of bits and bytes! <=
The log writer mode needs to be polling. Also, for simplicity, in Oracle 12 “_use_single_log_writer” must be set to true, to be sure the log writer is performing all the tasks, and not log writer slaves.

In order to prove that a foreground session is truly looking at 0x60027c98 for the on-disk SCN to understand if the log writer has written its log buffer contents, we let the log writer stop before it has adjusted the on-disk SCN. That can be accomplished by attaching to the log writer with gdb, and breaking on io_getevents_0_4:

break io_getevents_0_4
disable
continue

Now the breakpoint is set, disabled, and execution is resumed.

The next thing to do is go to the sqlplus foreground session and insert a row of data. Once the insert is done, go to the debugger session, stop execution, enter “enable”, and continue. The debugger resumes execution again, but will now break once it encounters io_getevents. Now go back to the foreground sqlplus session and enter commit.

This will trigger the log writer to write, and break execution because of the breakpoint. Because the log writer is stopped, the foreground session will appear to be “hanging”. In fact, what the foreground session is doing is sleeping in nanosleep(), then executing kcscur3() to look at the on-disk and LWN SCNs, then sleeping in nanosleep() again, etc.

This next part is experimental!
Because the log writer is stopped at the io_getevents() call, it hasn’t executed kcsadj3() yet to update the on-disk SCN at 0x60027c98. What we will do is read the current value at 0x60027c98, and increase it using gdb. If the foreground is truly only waiting for the value at 0x60027c98, increasing the value at this memory address should give the sqlplus session its prompt back, because it will think the commit fully succeeded, as if the log writer had performed all its functions.
First query the on-disk SCN:

(gdb) x/dw 0x60027c98
0x60027c98:	11175042

So, the on-disk SCN value is 11175042. Mind you, the sqlplus session appears to be hanging, but really is “polling” the on-disk SCN, and the log writer has stopped executing because of the breakpoint.

Let’s see if increasing the on-disk SCN makes the sqlplus session think the commit succeeded. Please mind I randomly increased the value from above by changing the 7 to 8 on the 5th position.

(gdb) set {int}0x60027c98 = 11185042

Voila! The foreground process continues and returns the prompt, indicating it thinks the log writer has finished!

This means that Oracle uses the log writer on-disk SCN for processes to determine if their log buffer contents have been written to disk.

When the log writer is in post-wait mode, this mechanism is in place too, but the foreground process, once waiting, needs to wait for the log writer to post it via semctl(). There is one exception to this, which is true for both the polling and the post-wait mechanisms: right after a foreground process signalled the log writer to write, it will call kcscur3() looking for the on-disk SCN and if it finds the on-disk SCN beyond its commit SCN, there will not be a wait triggered. In all other cases the process will register a wait “log file sync”.

Conclusion
The kcscur3() function is a function to read and update SCNs. There are multiple SCNs, like the current SCN, the on-disk SCN and the LWN SCN, which can all be read with kcscur3(). Also, kcscur3() is not exclusive to commit: when querying the current_scn field in v$database, this function is used too.

When a process commits, right after it signalled the log writer, it checks for the log writer progress via the on-disk SCN. If the log writer was fast enough to have the contents flushed to disk and the on-disk SCN updated, the process will not wait for ‘log file sync’ at all.

Both log writer modes, post-wait and polling, look for the log writer write progress via the on-disk SCN. Both modes will show the wait ‘log file sync’ if the above “fast log writer exception” didn’t happen.

When the log writer is in post-wait mode, a process without the “fast log writer exception” will register itself in the post-wait queue and sleep on a semaphore waiting to be posted by the log writer, and register that it is waiting for the event ‘log file sync’. Still, it does keep track of the on-disk SCN (twice), with a check of the LWN SCN in between, via calls to kcscur3(). After being signalled by the log writer, it again verifies the on-disk SCN and looks at the current SCN, after which it continues.

When the log writer is in polling mode, a process without the “fast log writer exception” will register a wait for the event ‘log file sync’. It then sleeps in nanosleep() and only executes two checks: the on-disk SCN and the LWN SCN, via calls to kcscur3(). Once the log writer has progressed the on-disk SCN beyond the process’ commit SCN, the process continues.

This blogpost is about how to install and run the Oracle database in docker. Please mind this is not an officially supported virtualisation platform for the Oracle database. This is a proof of concept setup.

Linux host setup.
In my setup, I used a Linux host, freshly installed with Oracle Linux 6.7, which is going to be used as the docker server. Please mind you need to leave disk space (or a disk device) unused for the commonly documented docker setup with the btrfs driver; the root filesystem uses an ext4 filesystem by default. For this proof of concept setup, 20G of disk space for the operating system alone is enough. I used the minimal linux installation.

The first step is to add a disk, or use a partition, format it with btrfs and mount it (a way to verify the result is shown after the numbered explanation below):

# lsblk
# yum install btrfs-progs
# mkfs.btrfs -L btrfs1 /dev/sdb
# vi /etc/fstab: LABEL=btrfs1 /var/lib/docker btrfs defaults 0 1
# mkdir /var/lib/docker
# mount -a

1- list block devices. This showed me my added block device for docker was /dev/sdb.
2- because of the minimal install, I need to add the btrfs userspace programs.
3- create a btrfs filesystem on /dev/sdb, and label it btrfs1.
4- edit /etc/fstab, make it mount the btrfs filesystem to /var/lib/docker.
5- create the mount point for the btrfs filesystem.
6- mount all the mounts listed in /etc/fstab.
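
To verify the filesystem is mounted and of type btrfs, any mount check will do, for example:

# df -hT /var/lib/docker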

The second step is to install docker and do a preliminary configuration (a quick verification of this configuration is shown after the numbered explanation below):

# vi /etc/yum.repos.d/public-yum-ol6.repo: enable public_ol6_addons
# yum install docker
# vi /etc/sysconfig/docker: add "-s btrfs" to other_args
# service docker start
# chkconfig docker on
# printf "limit memlock unlimited unlimited\n" > /etc/init/docker.conf

1- enable public_ol6_addons, in which the docker rpm sits.
2- install docker.
3- make docker use the btrfs driver.
4- start docker.
5- enable autostart for docker.
6- reset the memlock ulimit for the containers (needed for the Oracle database SGA later on).
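
At this point it is a good idea to verify docker picked up the btrfs storage driver; docker info reports a ‘Storage Driver’ line:

# docker info | grep -i storage

If it does not report btrfs, revisit the “-s btrfs” addition in /etc/sysconfig/docker and restart the docker service.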

Now we can go into the true docker part of it. Docker lives, as far as I can see at this point, in the /var/lib/docker directory. In order to build my oracle database image, I create a ‘dockerfiles’ directory, and in it a ‘build-oracle-12102’ directory:

mkdir -p /var/lib/docker/dockerfiles/build-oracle-12102

Next up, the dockerfile (named Dockerfile by default) needs to be made in the build-oracle-12102 directory to build an image:

FROM	oraclelinux:6
MAINTAINER frits.hoogland@gmail.com
RUN	groupadd -g 54321 oinstall
RUN	groupadd -g 54322 dba
RUN	useradd -m -g oinstall -G oinstall,dba -u 54321 oracle
RUN	yum -y install oracle-rdbms-server-12cR1-preinstall perl wget unzip
RUN	mkdir /u01
RUN	chown oracle:oinstall /u01
USER	oracle
WORKDIR /home/oracle
ENV	mosUser=you@example.com mosPass=supersecret DownList=1,2
RUN	wget https://dl.dropboxusercontent.com/u/7787450/getMOSPatch.sh
RUN	wget https://dl.dropboxusercontent.com/u/7787450/responsefile_oracle12102.rsp
RUN	echo "226P;Linux x86-64" > /home/oracle/.getMOSPatch.sh.cfg
RUN	sh /home/oracle/getMOSPatch.sh patch=17694377
RUN	unzip p17694377_121020_Linux-x86-64_1of8.zip
RUN	unzip p17694377_121020_Linux-x86-64_2of8.zip
RUN	rm p17694377_121020_Linux-x86-64_1of8.zip p17694377_121020_Linux-x86-64_2of8.zip
RUN	/home/oracle/database/runInstaller -silent -force -waitforcompletion -responsefile /home/oracle/responsefile_oracle12102.rsp -ignoresysprereqs -ignoreprereq
USER	root
RUN	/u01/app/oraInventory/orainstRoot.sh
RUN	/u01/app/oracle/product/12.1.0.2/dbhome_1/root.sh -silent
RUN	rm -rf /home/oracle/responsefile_oracle12102.rsp /home/oracle/getMOSPatch.sh /home/oracle/database
USER	oracle
WORKDIR	/home/oracle
RUN     mkdir -p /u01/app/oracle/data
RUN     wget https://dl.dropboxusercontent.com/u/7787450/manage-oracle.sh
RUN     chmod 700 /home/oracle/manage-oracle.sh
RUN     wget https://dl.dropboxusercontent.com/u/7787450/db_install.dbt
EXPOSE  1521
CMD	/home/oracle/manage-oracle.sh

1- FROM; this is the (operating system) image which is used as a base image. If it is not available locally, it will be pulled from the central docker repository. oraclelinux:6 means Oracle Linux version 6. This serves as the base image on which all the other things are built.
2- MAINTAINER; the maintainer of the new image, obviously.
3/4/5- RUN; the groups and user for the oracle database (oinstall, dba, oracle) are created. This looks like a redundant step, as this is supposed to be done by the oracle-rdbms-server-12cR1-preinstall rpm. However, this does not work. By manually creating them this way, the user and group ids are assigned correctly.
6- RUN; yum -y installs all the necessary packages. Oddly, it seems unzip is not in the preinstall package?
7- RUN; /u01 is the directory in which the database software is installed.
8- RUN; chown user:group is used to change the user and group permissions of /u01.
9- USER; this changes the current user for execution to ‘oracle’.
10- WORKDIR; this sets the work directory.
11- ENV; this sets a few environment variables needed by the getMOSPatch.sh script. Please mind you need to fill out your own My Oracle Support username (mosUser) and password (mosPass). Leave DownList as it is.
12- RUN; this downloads a modified version of the getMOSPatch.sh script by Maris Elsins (thank you Maris and John Piwowar). The only modification I made is to make the downloads pre-selectable by setting the DownList environment variable.
13- RUN; this downloads a response file for the installation.
14- RUN; this echoes the platform/language selection for the patches to download.
15- RUN; sh /home/oracle/getMOSPatch.sh patch=17694377 runs getMOSPatch.sh and downloads file 1 and 2 of patch 17694377, which is the patch for the Oracle database version 12.1.0.2.
16/17- RUN; unzips the two files as the result of the patch download in step 15.
18- RUN; cleanup the zip files after they have been extracted.
19- RUN; runs the installer with the downloaded response file, and a few extra switches.
20- USER; here we switch to root, because after installation there are two scripts which need to be run as root.
21/22- RUN; runs the two post-install root scripts.
23- RUN; cleans up the getMOSPatch.sh script, the response file and the database directory, which contains the database installation media.
24/25- USER/WORKDIR; change back to the oracle user.
26- RUN; create a directory on which we hook the persistent storage of the database
27- RUN; download manage-oracle.sh, the main script that sets up the listener and the database.
28- RUN; sets the mode of the manage-oracle.sh script to 700 (read/write/execute for the oracle user).
29- RUN; download the database installation template.
30- EXPOSE; make port 1521 (the listener port) available externally.
31- CMD; make /home/oracle/manage-oracle.sh the main command (pid 1 in the container).

After creating the Dockerfile, we can build an image with the description in the Dockerfile:

# cd /var/lib/docker/dockerfiles/build-oracle-12102
# docker build -t "oracle-12102" .

(mind the trailing dot)

This will take some time, as it downloads the operating system base image and the database software, and performs the installation. If all goes well, you end up with two images:

# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
oracle-12102        latest              28c18b083371        42 seconds ago      12.25 GB
oraclelinux         6                   cfc75fa9f295        11 days ago         156.2 MB

These are the image we started with (oraclelinux) and our freshly created image “oracle-12102”.

In order to start the container, we need to prepare a directory for the database files, which is mounted into the container and is persistent (so we can actually store and keep everything, which is one of the basic things a database is used for). For this I create a directory, and set its ownership to the user id and group id of the oracle user inside the container:

# mkdir -p /var/lib/docker/db/frits
# chown 54321:54321 /var/lib/docker/db/frits

Now that we made all preparations, we can start a container, in which the database will run. The manage-oracle.sh script creates or starts a database named after the hostname set for the container. I don’t know yet whether that is the best choice, but it currently seems a good way to automate this. This is how a container is started based on the image we just created. The startup line looks rather long, because a couple of settings need to be made, which are saved by docker:

# docker run --ipc=host --volume=/var/lib/docker/db/frits:/u01/app/oracle/data --name frits --hostname frits --detach=true oracle-12102
25efb5d26aad31e7b06a8e2707af7c25943e2e42ec5c432dc9fa55f0da0bdaef

If you wonder what the long hexadecimal string is, this is the full container ID. Now the container is starting. There are a couple of important things to see in the docker run line:
--ipc=host This makes the host’s IPC namespace available to the container; otherwise my container could only see 32MB of shared memory (a quick check follows after this list).
--volume=/var/lib/docker/db/frits:/u01/app/oracle/data This makes the docker host’s directory /var/lib/docker/db/frits available in the container at /u01/app/oracle/data (--volume=host:container). This is persistent storage, unlike the container’s image.
--name=frits This gives the docker container the name frits.
--hostname=frits This makes the container set its hostname to frits.
--detach=true This makes the container not be connected to the starting shell, because otherwise the shell signals would be honoured by the container, meaning CTRL-C would stop the container.
oracle-12102 This is the image on which the container is based.
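
A quick way to verify the effect of --ipc=host is to list the SysV shared memory segments from inside the container; because the host’s IPC namespace is shared, the Oracle SGA segments should show up once the instance is started. This is a sketch, assuming ipcs is available in the image:

# docker exec frits ipcs -m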

If you want to see what is going on in the container, the first thing is to use the docker logs command. This will show the standard out of the main running process. The container will stop if the main running process terminates (CMD in our Dockerfile, which is starting manage-oracle.sh). In the manage-oracle.sh I put a tail on the alert.log of the database after starting the listener, so you can use ‘docker logs’ to see how the database is doing (-f means follow):

# docker logs -f frits

LSNRCTL for Linux: Version 12.1.0.2.0 - Production on 11-AUG-2015 15:33:30

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Starting /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 12.1.0.2.0 - Production
System parameter file is /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=frits)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 12.1.0.2.0 - Production
Start Date                11-AUG-2015 15:33:30
Uptime                    0 days 0 hr. 0 min. 5 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
The listener supports no services
The command completed successfully

SQL*Plus: Release 12.1.0.2.0 Production on Tue Aug 11 15:33:36 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 1048576000 bytes
Fixed Size		    2932336 bytes
Variable Size		  247464336 bytes
Database Buffers	  658505728 bytes
Redo Buffers		  139673600 bytes
Database mounted.
Database opened.
SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Database Characterset is WE8MSWIN1252
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process AQPC
Tue Aug 11 15:33:55 2015
AQPC started with pid=30, OS id=88
Starting background process CJQ0
Tue Aug 11 15:33:57 2015
CJQ0 started with pid=61, OS id=150
Completed: ALTER DATABASE OPEN
Tue Aug 11 15:34:10 2015
===========================================================
Dumping current patch information
===========================================================
No patches have been applied
===========================================================

If you followed the examples in this blogpost, this is not what you would see, because there isn’t a database yet. Instead, if you look with docker logs, you will see:

# docker logs -f frits

LSNRCTL for Linux: Version 12.1.0.2.0 - Production on 11-AUG-2015 15:44:51

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Starting /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 12.1.0.2.0 - Production
System parameter file is /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=frits)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 12.1.0.2.0 - Production
Start Date                11-AUG-2015 15:44:51
Uptime                    0 days 0 hr. 0 min. 5 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
The listener supports no services
The command completed successfully
Creating and starting Oracle instance
2% complete
3% complete
5% complete

This is dbca working to create a database for you.

If you wonder what is going on inside the container, you can create a shell in it using the ‘docker exec’ command. This only works when the container is running, not when it is stopped:

# docker exec -ti frits bash
[oracle@frits ~]$

Unlike an attached start, a docker exec bash session can exit without interrupting the container.

In order to see running containers, use docker ps:

# docker ps
CONTAINER ID        IMAGE                 COMMAND                CREATED             STATUS              PORTS               NAMES
304d273d9424        oracle-12102:latest   "/bin/sh -c /home/or   2 minutes ago       Up 2 minutes        1521/tcp            frits

The container can be stopped:

# docker stop frits
frits

It is not visible anymore as a running container with docker ps:

# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

However, you can see stopped containers when you add ‘-a’ to ps:

# docker ps -a
CONTAINER ID        IMAGE                 COMMAND                CREATED             STATUS                     PORTS               NAMES
304d273d9424        oracle-12102:latest   "/bin/sh -c /home/or   25 minutes ago      Exited (0) 2 minutes ago                       frits

You can start a stopped container with docker start:

# docker start -a=false frits
frits

If you want containers to start on boot, you must automate the docker start command on the operating system.
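
One way to automate this, assuming your docker version supports restart policies, is to add a restart policy when the container is created; the container then has to be stopped and removed first, just like the recreation for publishing the port shown below:

# docker run --ipc=host --volume=/var/lib/docker/db/frits:/u01/app/oracle/data --name frits --hostname frits --detach=true --restart=always oracle-12102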

Please mind the exposed port on the container (1521, the listener) is local to the container, and not visible outside of the docker host. If you want to make the port visible outside of the docker host, you must publish it from the container to the host. In order to do that, you must recreate the container. This requires you to remove it first:

# docker stop frits
frits
# docker rm frits
frits
# docker run --ipc=host --volume=/var/lib/docker/db/frits:/u01/app/oracle/data --name frits --hostname frits --detach=true --publish=1521:1521 oracle-12102
84421ae6ee344df7d78aba5f0f5314894137f120a05b94870fcc0f805287b632

Because the database is on persistent storage, it can just startup again.
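
With the port published, a client outside the docker host can connect to the listener using easy connect. This is a sketch; ‘dockerhost’ stands for the hostname of the docker host, and the service name is an assumption based on the database being named after the container hostname:

$ sqlplus system@//dockerhost:1521/frits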

In my previous post, I looked at non shared latches and how the latching is done by Oracle. This post is a description on how the latching works for shared latches.

The information is quite internal; if you landed on this page, it might be a good idea to start with my first post on this topic: first post.

A famous example for shared latches is the ‘cache buffers chains’ latch.

For the sake of the test I quite randomly scanned a test table, and had a little gdb script to look at the function call ksl_get_shared_latch:

break ksl_get_shared_latch
  commands
    silent
    printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
    c
  end

After getting the results, I looked up the latch addresses (laddr) in V$LATCH.ADDR, and searched for a cache buffers chains latch.
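
This is roughly how such a lookup can be done. It is a sketch: the address 94af8768 is the laddr value printed by the gdb script, and because cache buffers chains is a latch with many child latches, the child latch view is the one to look in (the parent latches are in v$latch):

SQL> select name, addr
       from v$latch_children
      where rawtohex(addr) like '%94AF8768';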

Next, I set up the same construction as in the previous investigation into non shared latches: I started two sqlplus / as sysdba sessions. In the first session, I simply took a shared latch in exclusive mode (16 as the last argument).

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x94af8768 1 0 2303 16
Function returned 1

This takes latch 0x94af8768 in mode 16, exclusive mode.

I essentially did the same as in the previous investigation: I ran perf record on the session trying to get the latch, and looked with perf report at the functions which were used during the spinning for the shared latch. If the spinning is not clearly visible, set “_spin_count” to a higher value.
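
Increasing the spin count for this test can be done at the instance level, roughly like this. This is a sketch meant for a test instance only; underscore parameters should not be changed without guidance from Oracle support, and whether the new value is picked up dynamically may depend on the version:

SQL> alter system set "_spin_count" = 20000;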

After the investigation of perf report, I worked out the relevant functions. Here is the gdb script with the relevant functions that I discovered:

break ksl_get_shared_latch
  commands
    silent
    printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
    c
  end
break kslgess
  commands
    silent
    printf "kslgess %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslskgs
  commands
    silent
    printf "kslskgs %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break *0xc291a9
  commands
    silent
    printf " kslskgs loop: %d\n", $r15d
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

If you attach to the second process and source the breakpoint script, this is what you will see:

ksl_get_shared_latch laddr:94af8768, willing:1, where:0, why:2303, mode:16
kslgess 94af8768, 16, 1, 0
kslskgs 94af8768, 1, 464092136, 464092728
 kslskgs loop: 2000
 kslskgs loop: 1999
...
 kslskgs loop: 2
 kslskgs loop: 1
kslwlmod 464092480, -1780205088, -1800435864, 1
kslskgs 94af8768, 1, 464092136, 464092728
 kslskgs loop: 1
skgpwwait 464091920, 1628424256, -1780204168, 0
sskgpwwait 464091920, 1628424256, -1780204168, 0
semop 1409027, 464091720, 1, -1

1-A shared latch is gotten using the distinct ksl_get_shared_latch function call. In the ksl_get_shared_latch function an attempt to get the latch is made, and the function returns if it succeeded.
2-Next is kslgess (kernel service latch get spin shared (latch)). It does not appear a latch get attempt is made in this function.
3-kslskgs: kernel service latch spin kernel get shared (this is a guess). Here the latch is tried up to _spin_count-1 times.
4-8 this is the kslskgs function spinning over a latch get.
9-kslwlmod: kernel service latch wait list modification. Here the process is (almost?) done spinning, and registers itself in the post wait interface so it gets posted when the latch is freed.
10/11-here the latch is tried one last time.
12-skgpwwait: system kernel generic post wait wait: the function to set up the waiting for getting posted.
13-sskgpwwait: system system kernel generic post wait wait: probably the port specific code for the post wait setup.
14-semop: semaphore operation: the session starts sleeping on a semaphore, waiting to get posted. As has been indicated previously, this makes the process stay off the CPU (not runnable) unless it is posted.

This also means that instead of the (very) CPU heavy spinning on latches that was done in the old days, this way is very light on the CPU and seems to be very efficient, or at least more efficient than the old days.

A word of caution: it seems a shared latch gotten in exclusive mode can not be freed; the session that has gotten the latch will hang when calling kslfre, and the session that was waiting for the latch got:

ORA-03113: end-of-file on communication channel
ORA-24323: value not allowed

This seems to be expected behaviour, as indicated by Stefan Koehler and Andrey Nikolaev, and is not fixed in Oracle version 12.1.0.2.

Oracle DBAs who are so old that they remember the days before Oracle 11.2 probably remember the tuning efforts for latches. I can still recall the latch number for cache buffers chains from the top of my head: number 98. In the older days this was another number, 157.

But it seems latches have become less of a problem in the modern days of Oracle 11.2 and higher. Still, when I generate heavy concurrency I can see some latch waits. (I am talking about you and SLOB mister Closson).

I decided to look into latches on an Oracle 12.1.0.2 instance on Oracle Linux 7. This might also be a good time to go through how you think they work for yourself; it might be different from what you think or have been taught.

In order to understand how latching works, I searched for Oracle related traces. I could find event 10005, but it did not return anything latching related. My current understanding is that event 10005 is Oracle KST tracing, for which the results are put in X$TRACE.

Luckily, I could get a great headstart by studying the work of Andrey Nikolaev. However, that work seems to be strictly Solaris based.

I decided to take a look at how this works on Linux. In order to do this, I set up a system for the specific purpose of this test. Disclaimer: The techniques below are for educational purposes only, and should never be done on a real database!

The work of Andrey shows kslgetl() as the overall latching function, which does:
kslgetl
– sskgslgf (immediate latch get)
– kslges (wait latch get)
— kskthbwt
— kslwlmod (setup wait list)
— sskgslgf (immediate latch get)
— skgpwwait (sleep latch get)
— semop

In order to do latch gets and generate latch misses in a very predictable way, I used the method that Andrey shows on his website (which he attributed to Tanel), which is using oradebug call to call the latch get function. In order to get latch waits, you need at least two processes doing something with a latch: one holding the latch, and another one requesting it. In order to facilitate this, I set up two sqlplus / as sysdba sessions.

Taking a latch manually can be done using the kslgetl or the ksl_get_shared_latch functions. Freeing a latch is done using the kslfre function. The kslfre function just takes the latch address as argument. The kslgetl and ksl_get_shared_latch functions take the following arguments:
1-latch address
2-immediate get (0 means yes, 1 means no)
3-where (X$KSLLW.INDX)
4-why (X$KSLWSC.INDX)
5-mode (8=shared,16=exclusive; only for ksl_get_shared_latch function)
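
A latch address to use as the first argument can be looked up by name in V$LATCH. As a sketch, this is how the address of the latch used in the examples below can be found (the address will obviously be different on your system):

SQL> select addr, name, level# from v$latch where name = 'cache table scan latch';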

Immediate mode get for a non shared latch.
An immediate mode get just tries to fetch a latch once. In order to manually do an immediate latch get, this is what I did:

Session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 1

This means session 1 has gotten latch 0x60023a80. I got the latch with an immediate get (second argument 0), but that does not really matter here: the session did get the latch.

If you want to check this, simply use V$LATCHHOLDER view to verify (in another session):

SQL> select * from v$latchholder;

        PID        SID LADDR            NAME                                                                   GETS     CON_ID
---------- ---------- ---------------- ---------------------------------------------------------------- ---------- ----------
        38        134 0000000060023A80 cache table scan latch                                                  709          0

Please mind that with a manually gotten latch, you need to free the latch before you try to do anything else with your session, otherwise you will encounter a (non critical) ORA-600. Freeing a latch is done using oradebug call kslfre with one argument: the latch address, as shown below.
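
For example, freeing the latch that was taken in session 1 above is done like this:

SQL> oradebug call kslfre 0x60023a80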

Session 2:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 0

The ‘Function returned 0’ means the immediate latch get failed. As you can see this was an immediate get because the second argument is 0.

In order to understand which functions are involved, I first used the perf record linux utility. However, because the immediate get does not spin, and a latch get by all means has speed and low overhead as its principal design criteria, I could not see the function.

This meant I needed to go to one of the tools I have used extensively in the past: gdb (the GNU debugger). You need to attach to the Oracle database server (shadow) process locally on the database server. Here is what I did:

# gdb -p 4600
...
(gdb) set pagination off
(gdb) rbreak ^ksl.*
...
Breakpoint 262 at 0x8ea7b0
<function, no debug info> ksl_event_stats_rollup;
(gdb) commands 1-262
type commands for breakpoint(s) 1-262, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

The first gdb command turns off having to press enter for every full screen of gdb output; the second command sets breakpoints on all functions in the oracle executable that start with ‘ksl’. The commands command defines what gdb executes when breakpoints 1-262 are encountered, which is ‘c’: continue.

Now, with the debugger set, I executed the kslgetl function again:

Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 247, 0x000000000c8e10a0 in kslws_check_waitstack ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 242, 0x000000000c8ddcb0 in kslgetl ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()

It is important to understand Oracle does a lot of other stuff outside the latch get via kslgetl. Most of the stuff above are functions which start with kslwt, which is the Oracle wait interface. A couple of times a shared latch is taken (as can be seen by the function ksl_get_shared_latch), and freed (kslfre). The important part here is: kslgetl is executed once, and did not go into any other function to try to get the latch.

Getting a latch in willing to wait mode for a non shared latch.
Now let’s do something a bit more exciting: getting a taken latch in willing to wait mode. The first session can do exactly the same, just take the latch. The second session needs to be changed a little bit to indicate it is willing to wait:

SQL> oradebug call kslgetl 0x60023a80 1 0 2442

This will call additional functions. In order to understand what these functions are, I used perf record, perf report and perf script.
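
For reference, this is roughly how the perf recording and reporting was done. It is a sketch; 4600 is assumed to be the operating system pid of the server process of the session doing the willing to wait get:

# perf record -g -p 4600
(press CTRL-C once the latch get attempt has been running for a while)
# perf report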

I was able to create a smaller, more specific gdb script to see what is going on:

break kslgetl
  commands
    silent
    printf "kslgetl laddr:%x, willing:%d, where:%d, why:%d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslges
  commands
    silent
    printf "kslges %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

This is how the gdb output looks when the latch get in willing to wait mode is executed:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
kslwlmod 13311368, -1780327896, 1610758784, 1
skgpwwait 13311608, -1767360, -1780326976, 0
sskgpwwait 13311608, -1767360, -1780326976, 0
semop 360451, 13310840, 1, -1

Interestingly, if the latch is not taken, this is how the latch get sequence looks:

kslgetl laddr:60023a80, willing:1, where:0, why:2442

In other words, for getting a non shared latch in willing to wait mode:

1-the function kslgetl is called, which tries to fetch the latch.
If the latch can be taken, the function returns; if not:
2-the function kslges (kernel service latch get spinning) is called, which supposedly also tries to take the same latch.
If the latch still can not be taken, the next function is:
3-the function kslwlmod (kernel service latch waiting list modify) is entered.
In this function the process registers itself as waiting in the post/wait list.
4-the function skgpwwait (system kernel generic post/wait wait) is entered.
This function sets up the waiting for the process so it can be posted.
5-the function sskgpwwait (system system kernel generic post/wait wait)
My current understanding is that the ‘ss’ functions contain the platform specific code for database functions.
6-the (operating system) function semop (semaphore operation) is called.
This will make the process sleep waiting on a semaphore (operating system signalling mechanism). This way, the process will not be runnable on the CPU unless the semaphore is posted.

The information that is missing here, is the spinning. The earlier work of Andrey Nikolaev showed that in the Solaris port of the database, a distinct function (sskgslgf [immediate] and sskgslspin [spin]) was used to get the latch, which made it easy to count.

Some searching around showed that a CPU register holds this information. Add this to the above gdb script:

break *0xc29b51
  commands
    silent
    printf " kslges loop: %d\n", $ecx
    c
  end

And try to get a non shared taken latch in willing to wait mode:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
 kslges loop: 19999
 kslges loop: 19998
...
 kslges loop: 1
 kslges loop: 0
kslwlmod 1208568840, -1780327896, 1610758784, 1
skgpwwait 1208569080, -1281169344, -1780326976, 0
sskgpwwait 1208569080, -1281169344, -1780326976, 0
semop 360451, 1208568312, 1, -1

So…this reveals that getting a non shared latch in willing to wait mode will spin 10*_spin_count times, in other words: not _spin_count times, which is 2000 by default. What is even more interesting is that for the described type of latch, there is no (short) timed sleep done: the kslges function spins 10*_spin_count times for the latch, then puts the process on the waiting list, and goes to sleep on a semaphore. The way the latching mechanism works has been described (by Oracle, AFAIK) as: a process spins _spin_count times trying to get a latch, then goes to sleep for some time, then spins trying to get the latch again, goes to sleep, etc.

I strace’d the process holding the latch to see if it does a semctl system call to signal the waiting process when the latch is freed, and indeed that is what is happening. This is different from the “old days”, where processes spinning on latches (cache buffers chains latches most of the time) were really hammering the system and would eat up a lot of CPU time slices trying to get a latch.
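
For reference, the strace on the latch holder looked roughly like this. It is a sketch; HOLDER_PID is a placeholder for the operating system process id of the session holding the latch (which can be found via v$latchholder and v$process):

# strace -e trace=semctl,semop -p $HOLDER_PID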

Part of the classic latching problem, specifically the cache buffers chains latching problem, quite probably is mitigated by having shared latches, which were introduced in Oracle in different versions for different ports.

Watch out for a second blogpost where I do the same investigation for shared latches.

This is a question that I played with for a long time. There have been statements on logical IO performance (“Logical IO is x times faster than Physical IO”), but nobody could answer the question what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name “session logical reads”. However, if you divide the total time a query took by the number of logical reads, the logical IO time is too high, because that assumes all the time the query took was spent on doing logical IO, which obviously is not the case: there is time spent on parsing, maybe physical IO, etc. Also, when doing that, you calculate an average, and averages are known to hide actual behaviour.

Luckily, Red Hat Enterprise Linux and Oracle Linux version 7 ship kernel version 3.10 as the stock kernel. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain action (the probed event) starts; a return probe is an action triggered when that action finishes. The Oracle UEK kernel version 3, at the time of writing version 3.8, does not support this.

My current knowledge is consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind all kernel function name translations, like kcbgtcr (kernel cache buffers get consistent read), are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.

With the 3.10 version kernel, we can create a small systemtap script to measure the time between the start and the end of the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself; this is the reason you see /usr/local/bin/stap in the shebang.
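
Building systemtap from source goes roughly like this. This is a sketch; it needs the usual development tools plus the elfutils development packages, and the default install prefix puts the stap binary in /usr/local/bin:

# git clone git://sourceware.org/git/systemtap.git
# cd systemtap
# ./configure
# make
# make install

With systemtap in place, this is the measurement script: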

#!/usr/local/bin/stap

global latency

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])
}

Now make the script executable, and run it against a database session:

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

...
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 79542
>kcbgtcr
<kcbgtcr, latency(us): 4
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 13
>kcbgtcr
<kcbgtcr, latency(us): 1
...

This shows the function being executed by the database session. However, it seems the time spent in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

#!/usr/local/bin/stap
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])
}

probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")
		println(@hist_log(latency_histogram))
	}
}

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:

...

==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority is 2 microseconds, but there are also quite a lot of executions that fall completely outside the 1 and 2 microsecond buckets, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am pretty much turning the squelch down and making a lot of information visible (this is a warning!). Here is the systemtap script I am using:

#!/usr/local/bin/stap
#
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
	}
	printf("\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiosq0") {
	printf("=PARSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiexe") {
	printf("=EXECUTE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opifch2") {
	printf("=FETCH\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiclo") {
	printf("=CLOSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	printf("=KSLWTBCTX\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	printf("=KSLWTECTX\n")
}
probe process("/lib64/libaio.so.1").function("io_submit") {
	printf("=IO_SUBMIT\n")
}
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
	printf("=IO_GETEVENTS\n")
}
probe process("/lib64/libc.so.6").function("pread64") {
	printf("=PREAD\n")
}

Warning! This script will run for a long time before it is compiled as a kernel module!! Systemtap works by dynamically creating a kernel module from the systemtap script and inserting it into the kernel. Because I injected a huge chunk of code to run in the kernel, a lot of things are potentially influenced. I found the Oracle database dropping cores, for example.

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgtcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<” means returning from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is "< kdst_fetch", in other words: returning from kernel data scan table fetch. So this function has performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further we encounter kcbgtcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblgt(), requests a block. This means that blocks read in this code path come from PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process. While previously we saw kcbgtcr() take 2us, here, probably because of all the probes, the kcbgtcr function took 62us.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take excessively more time?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
=IO_SUBMIT
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
=IO_GETEVENTS
=IO_GETEVENTS
169481 oracle_5487_tes(5487):              > kcflwi
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTBCTX
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTECTX
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, until line 17. After kcbgtcr>kcbldrget (the consistent read request function choosing direct path reads), the kcblgt function does not return immediately, but rather starts off a lot of extra code path.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it: if you want to read blocks, you have to check whether they are available or not, and if they are not available, you have to fetch them.

As a conclusion: if kcbgtcr(), together with kcbgcur() and a couple of other functions, is considered the logical IO, then a logical IO has a variable time instead of a consistent one. Because physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means that it is technically incorrect to state that a physical IO is slower than a logical IO, because the physical IO is part of the logical IO request that needed it, and as such a physical IO can not be slower than a logical IO.

A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the Oracle engine must have made the decision to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication of how much could potentially be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and getting an understanding of when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides a (non documented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information in the trace is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)
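
The trace is written to the trace file of the session; an easy way to find its location (11g and higher) is:

SQL> select value from v$diag_info where name = 'Default Trace File';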

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;

  COUNT(*)
----------
   1000000

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks.
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this means to express. It starts with a minus sign, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo (Obect) here is by Oracle. This describes MTT, the medium table threshold, which is 5 times the small table threshold. It says that the threshold is MTT, which is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the exadata code base is always enabled. Probably in version 10 this had to be enabled by linking in the code, akin to linking in functionality such as RAC and SDO (spatial) by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line, which is described above. Here islarge is set to 1. This means the object is considered too large to be a small segment.
The next line is the kcbimd function; based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0: this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1: this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold is 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.
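
If you want to check the value of _small_table_threshold (and _very_large_object_threshold) in your own instance, a query against the underscore parameter X$ tables can be used. This is a sketch, to be run as SYSDBA; the X$ structures are version dependent:

SQL> select ksppinm, ksppstvl
       from x$ksppi i, x$ksppcv v
      where i.indx = v.indx
        and ksppinm in ('_small_table_threshold', '_very_large_object_threshold');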

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;
    169999

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different from the total number of blocks in user_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1’, which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down, the kcbivlo line says it is not large either (is_large 0); ‘large’ here means larger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0’.

An important point is that the switch to considering a direct path read, in other words considering a segment medium sized, happens simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line with segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment being bigger than MTT (5*STT) and smaller than VLOT, which is not true: the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.
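
If you want to verify the ntcache figure yourself, counting the buffers for the data object id in V$BH gives a comparable number. This is a sketch; 21988 is the data object id (objd) from the trace above, and 'free' buffers should be excluded:

SQL> select count(*) from v$bh where objd = 21988 and status != 'free';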

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database dependent.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable for me are diomrc (quite probably direct IO multiblock read count), which is set to the multiblock read count value, and dios (quite probably direct IO slots), which shows the starting value of the direct IO slots: the number of IOs the database will issue asynchronously when starting a full segment scan. Fully automatically, Oracle will measure throughput and CPU usage and determine if more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on 11.2.0.3 and 12.1.0.2, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless 99% or more of the blocks were in the local cache. This is quite contrary to the popular belief that the threshold for switching to reading blocks into the buffer cache is 50% of the blocks in cache. In all honesty, I have presented the switch point value as being 50% too.

When adding writes to the mix, it gets even more interesting. I first did an update of approximately 40% of the blocks, and did not commit. Tracing a simple count(*) on the entire table (this is on 11.2.0.3, which gives less information) shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, doing direct path reads, and ckpt is set to 1 (I think indicating the need to checkpoint), which seems logical if my session wants to do a direct path read of modified blocks.
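For completeness, this is roughly how such a test can be replayed; a sketch, assuming a table called T2 with a numeric ID column (the exact statements and the 40% figure will differ in your own case):

-- update roughly 40% of the rows and do not commit,
-- so the changed blocks remain dirty in the buffer cache
update t2 set id = id where mod(id, 10) < 4;

-- now trace the full segment scan decision
alter session set events 'trace[nsmtio]';
select count(*) from t2;
alter session set events 'trace[nsmtio] off';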

Now this is how it looks when I update 50% of the table, and then run a select count(*) on the table twice.
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd… the first time I do a direct path read, and the second time a no-direct, alias buffered, read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so the scan should be buffered.

Actually, a hint is visible in the first run. If we were to do a direct path read, how come ckpt: 0? I cannot see how it would be possible to do a direct path scan when there are changed (dirty) blocks of the segment in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';
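The trace file of the current session can be located via v$diag_info (a sketch; this works on 11g and higher):

select value as tracefile
from v$diag_info
where name = 'Default Trace File';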

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait event ‘cell multiblock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ in the nsmtio trace, this is actually doing a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle cannot and will not guarantee anything about it, but this way the tracing could tell the wrong story.
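A way to double check what actually happened, independent of the nsmtio trace, is to look at the session statistics before and after the scan; a minimal sketch (the statistic names are the ones present in v$statname on 11.2 and 12.1):

select sn.name, ms.value
from v$mystat ms, v$statname sn
where ms.statistic# = sn.statistic#
and sn.name in ('physical reads direct', 'table scans (direct read)', 'table scans (long tables)');

If the ‘physical reads direct’ and ‘table scans (direct read)’ deltas stay at zero while the scan was waiting on multiblock read events, the scan was buffered, regardless of what the nsmtio trace says.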

Conclusions

The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read into the buffercache for a full table scan.
– A segment that is bigger than 500% of the buffercache is always read via direct path for a full table scan.
– A segment that is sized between _small_table_threshold and 500% of the buffer cache is medium, and a full table scan of it can be done using either direct path reads or buffered reads.
– The tracing on 12.1.0.2 gives a hint that there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and one sized bigger than MTT. This is because the function kcbcmt1 shows aging/timing information on the blocks only when a segment is smaller than MTT.
– For a medium sized segment, a full table scan is done via direct path reads, unless 99% or more of its blocks are in the cache.
– For a medium sized segment, a full table scan is done via the buffercache if 50% or more of its blocks are dirty.

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.
