
This is the second blogpost on using PL/SQL inside SQL. If you landed on this page and have not read the first part, click this link and read that first. I got some reactions to the first article, one of which was: what does this look like with ‘pragma udf’ in the function?

Pragma udf (‘user defined function’) is a way to speed up calling PL/SQL functions from SQL, available starting from version 12. If you want to know more about the use of pragma udf, and when it does and does not help, please google for it.

create or replace function add_one( value number ) return number is
        pragma udf;
        l_value number(10):= value;
begin
        return l_value+1;
end;
/

select sum(add_one(id)) from t2;

As you can see, really the only thing you have to do is add ‘pragma udf’ in the declaration section of PL/SQL.

Here is what the flamegraph looks like:

What is visible is that between the PL/SQL interpreter (pfrrun) and the function that makes the operand evaluation switch to PL/SQL (evapls), there now is only one function, peidxrex. However, inside the evapls function two additional functions are called (kkxmsagof and kkxmsagif, not fully readable in the graph) which take noticeable time. The conclusion at this point is that pragma udf does it in yet another way than both a native PL/SQL function and subquery factoring.

Profiling this using the systemtap script:

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

Shows:

# stap -x 92509 plsql.stap
Begin.
^C
evapls time:      2211412
pfrrun time:       804178

So, that’s very close to using this function with subquery factoring, just a bit longer (2192685). This strictly depends on what is actually done, so mileage may vary for your own use case.

While we are at it, let’s have a look at what this looks like when no PL/SQL is used at all, so:

select sum(id+1) from t2;

Here it is:

The function used for adding is now evaaddrset. From the relative size of the kdst_fetch function it can be seen that the addition takes far less time. Let’s measure it with a changed version of the systemtap script:

global evaaddrset_time, evaaddrset_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset") {
	if ( pid() == target() )
		evaaddrset_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset").return {
	if ( pid() == target() )
		evaaddrset_tot+=local_clock_us()-evaaddrset_time
}

probe end {
	printf("\nevaaddrset time: %12d\n", evaaddrset_tot)
}

This is what the output looks like:

# stap -x 92509 plsql.stap
Begin.
^C
evaaddrset time:        43389

A simple calculation shows that doing the addition native in SQL only takes 43389/2211412*100=2% of the runtime of PL/SQL with pragma udf.

Whenever you use PL/SQL in SQL statements, the Oracle engine needs to switch from doing SQL to doing PL/SQL, and switch back after it is done. Generally, this is called a “context switch”. This is an example of that:

-- A function that uses PL/SQL 
create or replace function add_one( value number ) return number is
        l_value number(10):= value;
begin
        return l_value+1;
end;
/
-- A SQL statement that uses the PL/SQL function
select sum(add_one(id)) from t2;

Of course the functionality of the function is superfluous: it can easily be done in ‘pure’ SQL with ‘select sum(id+1) from t2’. But that is not the point.
Also, I added a sum() function for the sake of preventing output to the screen for every row.

The first thing to check is whether there is a difference in performance between executing with sum(id+1) and sum(add_one(id)). If there isn’t, we can stop here🙂

TS@frits > set timing on
TS@frits > select sum(id+1) from t2;

 SUM(ID+1)
----------
5.0000E+11

Elapsed: 00:00:00.19
TS@frits > select sum(add_one(id)) from t2;

SUM(ADD_ONE(ID))
----------------
      5.0000E+11

Elapsed: 00:00:01.13

This statement executes a full table scan; I closely verified that the IO times were alike.
But it looks like there is a whopping difference between including PL/SQL and not: 113/19*100=595%, or worded differently: almost six times slower.

How does this work? Let’s have a look using stapflames. The idea behind flame graphs in general is to understand in which (C code) functions (user mode and kernel mode) the time is spent. Because of the full backtrace (all the called functions on top of each other), it gives insight into how a program is working.

The first thing we need to establish is what PL/SQL looks like from the perspective of C functions. For that reason, I created a bogus PL/SQL program to profile:

declare
t number:=0;
begin
while t < 1000000 loop
t:=t+1;
end loop;
end;
/

Yes, that is right: the only thing this anonymous PL/SQL block does is declare a number variable named ‘t’, and then loop adding one to the variable until it reaches 1000000. Again, what this program does is neither interesting nor functional; the only thing it needs to do is run, so when we profile the program we are sure it is doing PL/SQL.

I ran this anonymous PL/SQL block using my stapflame utility to generate a flamegraph, and this is what that looks like:
I have taken the flamegraph of all time considered on cpu by the Oracle database.

First of all, one important property of flamegraphs is shown: the horizontal order does not represent the execution sequence. If you look at the kpoal8 function, you see there are two different paths taken from this function: opiexe (oracle program interface execute) and opiosq0 (oracle program interface prepare to parse). Of course the PL/SQL block is first parsed and then executed, so the order is different than shown.

What is also very visible is that almost all execution time is spent in a function called ‘pfrrun’, which seems to be the main function executing PL/SQL. On top of that we can see some functions which roughly resemble the functionality used in the PL/SQL block: pfrinstr_ADDN (the addition, t:=t+1) and pfrinstr_RELBRNCH (the loop). This also gives a fair indication PL/SQL is interpreted rather than compiled. Anyway, what is important from this little test is that we can distinguish running PL/SQL from SQL by the pfrrun function.

Now let’s look at a flamegraph of running a PL/SQL function in the SQL statement:
The flamegraph shows all the time considered running on CPU for executing the statement ‘select sum(add_one(id)) from t2’. There is a lot to see!

When we look on top of the function kpoal8, we see the function opifch2. This means the vast majority of the time is spent in the fetch phase of the SQL statement. On top of the opifch2 function we see two functions which start with qer; ‘qer’ probably means Query Execute Rowsource. ‘qertbFetch’ is the fetch procedure for table scans. Then we see the kdsttgr function (kernel data scan table get row), and then the ultra fast table scan function (kdstf; kernel data scan table full) followed by a lot of zeros and/or ones and ending with ‘km’ or ‘kmP’. There are a lot of kdstf functions in the Oracle executable; I assume the zeros and ones after ‘kdstf’ are a bitmap of potentially needed functions during the full scan, so the full table scan function can omit a lot of choices at runtime, leading to better CPU efficiency. See an article by Tanel on what this means.

In the full table scan function, there are two main functions which consume time, kdst_fetch and qesaFastAggNonDistSS. ‘kdst_fetch’ and deeper functions are functions related to doing IO for reading the data from the data file. ‘qesaFastAggNonDistSS’ is the main function for actually processing the data. The function qesaFastAggNonDistSS has two main functions consuming its time, evaopn2 and a function visible as ‘sl..’, which is actually a function called slnxsum, in other words, the sum() function. The function evaopn2 is a function to evaluate operands. This evaluation is the path towards executing the PL/SQL function.

Actually, when carefully looking at the evaopn2 function, we see the slnxsum function, and ‘evapls’, which is the function to switch to PL/SQL. The two main functions in ‘evapls’ are kgmexec and opiomc. Again here the order is switched; what happens here is first a cursor must be mapped for executing PL/SQL (opiomc function), after which it can be executed (kgmexec function).

In order to understand what the time taken by “switching” to PL/SQL is, we can take the total time the query engine is processing everything PL/SQL related, which is the total time taken by the ‘evapls’ function, and measure the time actually running PL/SQL, which is the time taken by the ‘pfrrun’ function. This can be accomplished with a simple systemtap script:
(please mind you need a recent systemtap version, preferably gotten from https://sourceware.org/git/systemtap.git, and kernel version 3.10 to be able to use return probes)

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

This is what it looks like on my machine:

# stap -x 29680 plsql.stap
Begin.
^C
evapls time:      3203637
pfrrun time:       951830

So, the overhead or context switching time, which must be Oracle server code executing between the evapls function, where it determines it needs to execute PL/SQL, and the pfrrun function, which is the PL/SQL interpreter, is on my machine:
(1-951830/3203637)*100=70%
Wow!

One way of reducing this overhead is using subquery factoring, also known as the ‘with clause’. To use the function that way, this is how the SQL should be written:

with
function add_one( value number ) return number is
	l_value number(10):= value;
begin
	return l_value+1;
end;
select sum(add_one(id)) from t2;
/

Let’s have a look at the flamegraph of this construction:
It becomes apparent that with subquery factoring, there are far fewer functions between the evapls and pfrrun functions.
Normal PLSQL: kgmexec, kkxmpexe, kkxdexe, peidxexe, peidxr_run, plsql_run
Subquery factoring: kkxmss_speedy_stub, peidxrex
Also mind there is no codepath for mapping a cursor.

Let’s have a look at the timing:

# stap -x 29680 plsql.stap
Begin.
^C
evapls time:      2192685
pfrrun time:       880230

The time spent in PL/SQL, measured as the total time spent in the evapls function, was reduced by 32% ((1-2192685/3203637)*100).
However, if you calculate the overhead, it is still pretty significant: (1-880230/2192685)*100=60%

Conclusion
The simplest conclusion I can make is: do not use PL/SQL if you can solve it in SQL, like in the example above. This does not mean you should never use PL/SQL; on the contrary: in a lot of cases processing should be done where the data is, and sometimes you need a procedural language for that.

I made a lot of assumptions in this little investigation. The function naming (the translation from the Oracle C function name to the functionality it is supposed to deliver) is speculation.

The context switch between SQL mode and PL/SQL mode looks like it is technically setting up the execution environment for PL/SQL. Indeed this takes time, and the true PL/SQL execution time when repeatedly executing PL/SQL is very low in my case. Please mind actual times will differ on different systems. However, the main conclusion is that using PL/SQL in SQL execution probably is not the most performant thing to do, including using subquery factoring.

There’s been a lot of work in the area of profiling. One of the things I have recently fallen in love with is Brendan Gregg’s flamegraphs. I work mainly on Linux, which means I use perf for generating stack traces. Luca Canali put a lot of effort in generating extended stack profiling methods, including kernel (only) stack traces and CPU state, reading the wait interface via direct SGA reading and kernel stack traces and getting userspace stack traces using libunwind and ptrace plus kernel stack and CPU state. I was inspired by the last method, but wanted more information, like process CPU state including runqueue time.

I started playing around with systemtap, and was able to read a process’ CPU state including run queue time. This involves using the kernel tapset scheduler, which unfortunately needs the kernel debug info packages (kernel-uek-debuginfo and kernel-uek-debuginfo-common, available via oss.oracle.com). It is not hard to include wait interface information; this is work Luca and I collaborated on in the past. I created a systemtap script called cpu_and_wait_profile.stap, which shows the oracle database state transitions between on cpu and in a wait, as well as kernel CPU state information. This is how it should be executed and what it outputs:

# stap -x 6641 cpu_and_wait_profile.stap
Begin.
w     - 388 (no begin)
c     1    	tot:         334	on:         333	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     2 384	tot:           5	on:           5	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
c     3    	tot:         644	on:         644	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     4 212	tot:          58	on:          41	off:          17	q:           5	ti:           9	tu:           0	#slices:    1
c     5    	tot:         371	on:         371	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     6 212	tot:         146	on:          58	off:          88	q:          14	ti:          69	tu:           0	#slices:    1
c     7    	tot:        1787	on:        1745	off:          42	q:          37	ti:           0	tu:           0	#slices:    2
w     8 212	tot:         265	on:          30	off:         234	q:          12	ti:         218	tu:           0	#slices:    1

The first column indicates if the process is inside an Oracle wait event(w), or is considered running on cpu (c) by the database.
The second column is a serial number. The third column is the wait event number if the process is considered inside a wait, or empty if on CPU.
The column ‘tot’ is the total time (in microseconds) spent on cpu or inside a wait event.
The column ‘on’ is the time spent truly running on CPU. Obviously, ‘off’ is all the time not spent running on the CPU.
The columns ‘q’, ‘ti’ and ‘tu’ are off CPU states. ‘q’ is time spent in the CPU runqueue. This is not a kernel state; a process gets the state ‘TASK_RUNNING’ to indicate it is willing to run, after which it’s the task of the scheduler to manage willing-to-run processes and get them onto a CPU. ‘ti’ is a kernel state, which means ‘TASK_INTERRUPTIBLE’. This is a state in which the process is taken off the CPU, because it is waiting for something to complete. ‘Something’ means a disk IO, a timer to expire, etc. ‘tu’ means ‘TASK_UNINTERRUPTIBLE’, which is used if a process should only continue when a specific condition is met, and reacting to signals would be problematic.
The last column ‘#slices’ is the number of times the process has gotten on cpu.
If you look at the example output above, you see that the process started running, and remained running until sequence number 4 (the sequence number is the second column). Sequence number 4 is an Oracle wait event, number 212 (direct path read). The earlier wait event number 384 was passed without actually waiting; the total time is 5us, and on cpu was 5us too (!). The total time spent in the wait event in sequence #4 is 58us, of which 41us was spent on cpu, and 17us off cpu. The off cpu time is composed of 5us run queue time (q) and 9us ‘TASK_INTERRUPTIBLE’ time, which totals 14us, leaving 3us off cpu/unaccounted for. This is time taken by the state transitions and context switches.

The actual sequence of events of the CPU state is: TASK_RUNNING (on cpu), then TASK_INTERRUPTIBLE is entered, which is actually waiting for IOs in this case (wait event ‘direct path read’, remember?). The ‘TASK_INTERRUPTIBLE’ state means the process is stopped from processing by the kernel (taken off cpu), which is logical, because it means the process is deliberately waiting for something before it can continue. Once the condition is met (IO(s) ready in this case), the process can continue. To continue, the process state is set to ‘TASK_RUNNING’, and it is put on a runqueue. This means there is no explicit process state ‘in run queue’. This state (state set to ‘TASK_RUNNING’ but not running on CPU yet) is shown with ‘q’. Once the process has a high enough priority, the scheduler switches the process to running on the CPU again.

Okay, so at this point we have a (systemtap) script that can very precisely count the time spent by a process. Wouldn’t it be great if we could see a flame graph per sequence number? I spent a great deal of time trying to figure out a way to combine the two, until I learned about the ‘-T’ switch of perf record:

    -T, --timestamp       Sample timestamps

Great!! The way this works is that perf includes timestamps during recording (perf record), which are printed when the perf recording is externalised with the ‘perf script’ command:

oracle_92213_fv 92213 34075.900988: cycles:
        ffffffff810483da native_write_msr_safe ([kernel.kallsyms])
        ffffffff8102bf91 intel_pmu_enable_all ([kernel.kallsyms])
        ffffffff810263cc x86_pmu_enable ([kernel.kallsyms])
        ffffffff811221db perf_pmu_enable ([kernel.kallsyms])
        ffffffff81124d09 perf_event_context_sched_in ([kernel.kallsyms])
        ffffffff811252c5 __perf_event_task_sched_in ([kernel.kallsyms])
        ffffffff810962ce finish_task_switch ([kernel.kallsyms])
        ffffffff8159f81d __schedule ([kernel.kallsyms])
        ffffffff8159fec9 schedule ([kernel.kallsyms])
        ffffffff8119e56b pipe_wait ([kernel.kallsyms])
        ffffffff8119f030 pipe_read ([kernel.kallsyms])
        ffffffff81195c37 do_aio_read ([kernel.kallsyms])
        ...etc...

‘34075.900988’ is the timestamp. However, what is this number?? I searched for quite some time, and there is no clear description to be found. It clearly is not epoch time.

Some experimentation showed that -apparently- the number is seconds since startup with microsecond granularity. Further experimentation using systemtap showed that exactly the same number can be fetched with the systemtap local_clock_us() function. This makes it possible to link perf stacktraces with systemtap output!! I created a script (run_stapflame.sh) that runs perf record -g and systemtap at the same time, then combines the information from both tools (meaning the systemtap runtime data is pushed into the stack trace information), after which flame graphs are created.

When a process is not running, there will be no perf stack traces, because there is no running process for perf to take a stack trace from. So only when running on CPU (TASK_RUNNING state excluding run queue time) should there be perf data. Also, the systemtap times are accurately measured, but the stack traces of perf are sampled. This means it is missing data (by definition: sampling means you look up something at a certain interval, which means you are not looking between the intervals). What I have done is extrapolate the perf samples found for an Oracle CPU or wait interval relative to the time in the measured interval. This means that if the time in the interval is 100us, and I get two collapsed stack traces with 1 and 3 samples, the extrapolated times will be 1/(1+3)*100us=25us and 3/(1+3)*100us=75us respectively. This is not scientific, but it is meant to give an idea. Also, in order to show something useful in flame graphs, all the data needs to be based on the same data type, so I need to transform the number of stack traces to time.

I created a github project stapflame for my files.

First, you need to install the debuginfo kernel packages, as has been mentioned in this blogpost.
Then, you need to execute eventsname.sql in order to generate eventsname.sed, which is used to translate wait event numbers to wait event names. Wait event numbers change between Oracle database versions, and could potentially change after applying a PSU. eventsname.sed must be in the same directory as the run_stapflame.sh script.
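I can’t reproduce eventsname.sql here, but conceptually the number-to-name mapping it generates comes from v$event_name; a hedged illustration of that mapping (not the actual script) is:

-- illustration only, not the actual contents of eventsname.sql
select event#, name from v$event_name order by event#;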
Then, you need to fetch stackcollapse-perf.pl and flamegraph.pl from Brendan Gregg’s github flamegraph repository. These need to be in the same directory as the run_stapflame.sh script too, and have the execute bit set.

Once the requirements are met, you can use the run_stapflame.sh script:

# ./run_stapflame.sh 123

The first argument must be the PID of an existing Oracle foreground process.
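If you need to find the operating system PID of your own foreground session, a sketch of a query for that (assuming a dedicated server connection) is:

-- hedged sketch: find the OS process id of the current (dedicated server) session
select p.spid
from v$process p, v$session s
where p.addr = s.paddr
and s.sid = sys_context('userenv','sid');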
Running run_stapflame.sh will compile and run the systemtap script. Once both systemtap and perf are running, the following message is displayed:

profiling started, press enter to stop

Now execute what you want to be profiled in the Oracle session. Once you are done, return to the session where you ran run_stapflame.sh, and press enter to stop the profiling.
Depending on how much systemtap and perf information this generated, the script will process for some time (it is coded in bash script, it’s not highly optimised, in fact it’s quite inefficient with larger sets(!)).
There will be some messages from perf indicating how many samples it captured:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.213 MB perf.data (~9286 samples) ]

And systemtap complaining about missing symbols for the kernel module it just created (for itself :-)):

No kallsyms or vmlinux with build-id 532f482ae316500802639d35de6c302fdb685afa was found
[stap_b6a486219fd483305d4991b483feb397_16565] with build id 532f482ae316500802639d35de6c302fdb685afa not found, continuing without symbols

The stapflames in the example mentioned here are of a simple ‘select count(*) from table’ in oracle, resulting in a direct path read.

This is the resulting flame graph of my original idea. In short, this does not look very useful in this case, and might never be useful as soon as you have more than approximately 20-100 cpu-wait combinations.

However, other flame graphs are more useful; look at this flame graph, it’s about process CPU state (stack traces grouped by process CPU state, which means that Oracle events (ON CPU and the wait events) can be found for every CPU state). It’s nice to see that the kernel state TASK_RUNNING (which is on CPU; RUNQUEUE is specified independently) is 46.86% of the total time. Of this time, almost all time (40.35% of the total time) is in the Oracle state ‘ON CPU’. After TASK_RUNNING, TASK_INTERRUPTIBLE time is 49.56% of the total time. Almost all time in TASK_INTERRUPTIBLE has gone to the wait event direct path read. There is not a great deal of time spent in the run queue (1.74% of the total time). If you look up through the stacktraces in TASK_RUNNING and ON CPU, you see that of the 40.35% of running on CPU, 15.33% is spent executing the io_submit function. This means 15.33/40.35*100=37.99% of the time on CPU is spent on submitting IO requests. Also in TASK_RUNNING and ON CPU, 7.57% is spent in the function sxorchk, which is block checksumming (db_block_checksum; set to ‘typical’ by default).

Another interesting view is this flamegraph; this one shows only the wait event ‘direct path read’. As you can see, most of the wait event time is not spent running on CPU. That is quite logical for a wait event🙂. 96.03% of the time spent in the wait event ‘direct path read’ is in the ‘TASK_INTERRUPTIBLE’ state. 2.68% of the time in the wait event ‘direct path read’ is spent in the TASK_RUNNING on CPU state. Despite being inside wait event time, it’s very logical to have some time spent running on CPU, because you need to run on the CPU to reap IO requests.

I love to hear comments, additions, corrections or any other feedback!

Credits and other references:
# Brendan Gregg and his work on FlameGraphs.
# Luca Canali and his work on stack profiling, which inspired me to create this tool, and proofreading.
# Tanel Poder and his work on Oracle internals in general.
# Klaas-Jan Jongsma for proofreading.

Footnotes:
– The systemtap script contains the full path to the Oracle database executable in the userspace function probes. This obviously must be changed to reflect the path to the Oracle executable of the process the systemtap script is profiling. Alternatively, it can just list the executable name (“oracle”), but then the executable must be in $PATH.

This post is about manually calling and freeing a shared latch. Credits should go to Andrey Nikolaev, who covered this in his presentation at UKOUG Tech 15. I am very sorry I missed it.

Essentially, in my Oracle 12 and shared latches part 2 blogpost, which is about shared latches, I showed how to get a shared latch:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x94af8768 1 0 2303 16
Function returned 1

Which works okay, but leaves a bit of a mess when freed:

SQL> oradebug call kslfre 0x93FC02B8
ORA-03113: end-of-file on communication channel
ORA-24323: value not allowed

(please read part 2 to get the full story on how to set it and how to use it for investigations)

Andrey Nikolaev shows there is another function, kslgetsl_w(latch address, wait, why, where, mode), which can be used to get a shared latch (like the function ksl_get_shared_latch) but without the oradebug session blowing up in your face:

SQL> oradebug call kslgetsl_w 0x93faec78 1 0 2329 16
Function returned 1

And after testing the latch:

SQL> oradebug call kslfre 0x93faec78
Function returned 1

However, when querying v$latchholder from the same session as the oradebug call, an ORA-600 is thrown:

SQL> select * from v$latchholder;
select * from v$latchholder
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [504], [0x095306210], [2], [1], [session idle bit], [1], [0x093FAEC78], [cache buffers chains], [], [], [], []

After which the latch is freed.

The error message strongly hints at latches being taken in an incompatible order. The session idle bit seems to be a latch taken to indicate the session state switching from active to inactive (according to an article here), and indeed, while holding a cache buffers chains latch, a session should not switch state at all. However, a second sysdba session can be used to query v$latchholder.
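As a sketch, this is the kind of query I mean for that second SYSDBA session while the latch is held (the column choice is mine):

-- hedged: run from a second SYSDBA session while the latch is held manually
select pid, sid, laddr, name, gets from v$latchholder;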

Some very limited investigation shows that the kslgetsl_w function is not used by the database itself, which rather uses ksl_get_shared_latch. Again, the investigation was very limited; there might be situations in which this function is used.

Conclusion
The reason for this blogpost is to point out an alternative for using the ksl_get_shared_latch function manually for doing investigation. This post also shows that you should not change or play with internals on production systems, because it might lead to all kinds of additional behaviour or plain errors, for which the ORA-600 is a good example.

This article is about the Oracle 12c in-memory option, and specifically looks at how the background worker processes do IO to populate the in-memory column store.

Hardware: Apple Macbook with VMWare Fusion 7.1.3.
Operating system: Oracle Linux 6.7, kernel: 3.8.13-118.el6uek.x86_64.
Database version: Oracle 12.1.0.2
Patch: opatch lspatches
19392604;OCW PATCH SET UPDATE : 12.1.0.2.1 (19392604)
19303936;Database Patch Set Update : 12.1.0.2.1 (19303936)

But first things first: let’s set up the in-memory option with a test table. The first thing to consider is creating the in-memory area to store the objects. I only want a single table stored in the in-memory area, so I can very simply look at the size of the object:

TS@fv12102 > select segment_name, round(bytes/power(1024,2)) "MB" from user_segments where segment_name = 'T2';

SEGMENT_NAME                                     MB
---------------------------------------- ----------
T2                                              168

So, that means that an in-memory area of 200MB should suffice, let’s set it and bounce the instance:

SYS@fv12102 AS SYSDBA> alter system set inmemory_size = 200m scope=spfile sid='*';

System altered.

SYS@fv12102 AS SYSDBA> startup force;
ORACLE instance started.

Total System Global Area 1048576000 bytes
Fixed Size		    2932336 bytes
Variable Size		  226492816 bytes
Database Buffers	  469762048 bytes
Redo Buffers		  139673600 bytes
In-Memory Area		  209715200 bytes
Database mounted.
Database opened.

The ‘In-Memory Area’ line in the startup output shows the in-memory area is created!

Now I need to set the inmemory attribute for my table T2. This code example shows the inmemory attribute for my T2 table before it is set, then alters the table to belong in the in-memory area, and shows the inmemory attribute again:

TS@fv12102 > select inmemory from user_tables where table_name = 'T2';

INMEMORY
--------
DISABLED

TS@fv12102 > alter table t2 inmemory;

Table altered.

TS@fv12102 > select inmemory from user_tables where table_name = 'T2';

INMEMORY
--------
ENABLED

If we now look at the V$IM_SEGMENTS, the view is empty:

SYS@fv12102 AS SYSDBA> select owner, segment_name, bytes normal_size, bytes_not_populated, inmemory_size, populate_status from v$im_segments;

no rows selected

This is because no inmemory priority has been specified when we altered the table to inmemory, which means the default priority NONE is used, meaning the table will only be loaded into memory once it is touched.
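If you do want the worker processes to populate the segment without waiting for a first touch, a priority other than NONE can be set; a minimal sketch (I kept the default NONE for the tests in this post):

-- hedged sketch: request population without a first touch of the table
alter table t2 inmemory priority high;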

Now, in order to understand how the in-memory area is populated, we need to execute a sql trace including waits on the Wnnn process or processes that populate the in-memory area. However, that is not as straightforward as it seems: there are a couple of these (background) processes already in my instance, but more can be created if necessary. Luckily, we can use Oracle’s improvements to the diagnosability infrastructure and set sql trace with a filter on the process name:

SYS@fv12102 AS SYSDBA> alter system set events 'sql_trace {process: pname = w} level 8';

System altered.

Now that we enabled sql_trace for all Wnnn processes, let’s touch the table, and see what is in the trace!

First touch the table:

TS@fv12102 > select count(*) from t2;

  COUNT(*)
----------
   1000000

That should result in the T2 table being read into the in-memory area:

SYS@fv12102 AS SYSDBA> select owner, segment_name, bytes normal_size, bytes_not_populated, inmemory_size, populate_status from v$im_segments;

OWNER  SEGMENT_NAME		      NORMAL_SIZE BYTES_NOT_POPULATED INMEMORY_SIZE POPULATE_
------ ------------------------------ ----------- ------------------- ------------- ---------
TS     T2				176160768		    0	   53673984 COMPLETED

Bingo!

Now go to the trace directory, and look for w processes:

[oracle@bigmachine [fv12102] trace]$ ls -l *_w*
-rw-r----- 1 oracle oinstall   2145 Nov 14 08:52 fv12102_w000_30938.trc
-rw-r----- 1 oracle oinstall    166 Nov 14 08:52 fv12102_w000_30938.trm
-rw-r----- 1 oracle oinstall   8616 Nov 14 09:10 fv12102_w000_32380.trc
-rw-r----- 1 oracle oinstall    752 Nov 14 09:10 fv12102_w000_32380.trm
-rw-r----- 1 oracle oinstall  35401 Nov 14 09:10 fv12102_w002_31964.trc
-rw-r----- 1 oracle oinstall   2612 Nov 14 09:10 fv12102_w002_31964.trm
-rw-r----- 1 oracle oinstall 232444 Nov 14 09:10 fv12102_w001_30940.trc
-rw-r----- 1 oracle oinstall   7120 Nov 14 09:10 fv12102_w001_30940.trm

Because of its size, my guess was that the largest trace file would contain the trace output of reading the T2 table into the in-memory area. That is correct.

If a worker process was already up before doing an action, we can see it sleeps for 5 seconds in the wait event ‘Space Manager: slave idle wait’, after which it does some (internal) housekeeping, which means reading its CPU usage using getrusage(), after which it sleeps on semtimedop() for 5 seconds again. This is what that looks like when traced with waits:

WAIT #0: nam='Space Manager: slave idle wait' ela= 5000827 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7189216230

*** 2015-11-14 08:52:32.981
WAIT #0: nam='Space Manager: slave idle wait' ela= 5000350 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7194217089

*** 2015-11-14 08:52:37.982
WAIT #0: nam='Space Manager: slave idle wait' ela= 5001004 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7199218359

*** 2015-11-14 08:52:42.982
WAIT #0: nam='Space Manager: slave idle wait' ela= 5000000 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7204218487

When the worker starts working on populating the in-memory cache, it first reads some dictionary views (partobj$, compression$ and hist_head$), after which it begins to read the table into memory:

WAIT #140530969221688: nam='Disk file operations I/O' ela= 21 FileOperation=2 fileno=4 filetype=2 obj#=20434 tim=7207107727
WAIT #140530969221688: nam='direct path read' ela= 78 file number=4 first dba=22416 block cnt=8 obj#=20434 tim=7207125617
WAIT #140530969221688: nam='direct path read' ela= 113 file number=4 first dba=22425 block cnt=7 obj#=20434 tim=7207126486
WAIT #140530969221688: nam='direct path read' ela= 136 file number=4 first dba=22432 block cnt=8 obj#=20434 tim=7207127263
WAIT #140530969221688: nam='direct path read' ela= 60 file number=4 first dba=22441 block cnt=7 obj#=20434 tim=7207128054
WAIT #140530969221688: nam='direct path read' ela= 68 file number=4 first dba=22448 block cnt=8 obj#=20434 tim=7207128718
WAIT #140530969221688: nam='direct path read' ela= 114 file number=4 first dba=22457 block cnt=7 obj#=20434 tim=7207129417
WAIT #140530969221688: nam='direct path read' ela= 106 file number=4 first dba=22464 block cnt=8 obj#=20434 tim=7207130248
WAIT #140530969221688: nam='direct path read' ela= 100 file number=4 first dba=22473 block cnt=7 obj#=20434 tim=7207130878
WAIT #140530969221688: nam='direct path read' ela= 202 file number=4 first dba=22480 block cnt=8 obj#=20434 tim=7207132096
WAIT #140530969221688: nam='direct path read' ela= 173 file number=4 first dba=22489 block cnt=7 obj#=20434 tim=7207132857
WAIT #140530969221688: nam='direct path read' ela= 293 file number=4 first dba=22496 block cnt=8 obj#=20434 tim=7207134040

The event ‘Disk file operations I/O’ is the instrumentation of meta-operations which need to happen before a process can work on a file. In our case, FileOperation=2 means opening a file.

Then we see the event ‘direct path read’! That means that the worker uses the modern method of reading data, instead of the older method indicated by ‘db file scattered read’.

The reason this is important is that the ‘scattered read’ method does not scale very well. The basic setup of the ‘scattered read’ method is:
1. Determine the chunk to read (limited by DB_FILE_MULTIBLOCK_READ_COUNT, extent borders, space management blocks and blocks already in the cache).
2. Read the chunk.
3. Repeat until the high water mark is reached.
Even if this is done via asynchronous IO, never more than one chunk is read at a time.

If we look at the ‘direct path read’ method, it does this, in a simplified way:
1. Determine the number of IOs allowed in flight (default/minimum two!).
2. Submit reads for chunks of the segment to scan, the number of IOs as determined in 1.
3. Wait for at least one IO to return.
4. Repeat until the high water mark is reached.

Essentially, using the ‘direct path read’ method a single process can take advantage of the available bandwidth of modern storage and asynchronous IO by issuing multiple IOs at the same time.

During tracing I noticed the worker processes writing to the database’s diagnostics destination, however to a directory ‘log’, to a file ‘imdb_ORACLE_SID.log’, which seems to contain information about chunks read by worker processes:

Sat Nov 14 09:51:10 2015
c:0o:20434d:20434/St.RDBA: 0xac00 No.blks: 896
c:0o:20434d:20434/St.RDBA: 0xaf84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xb384 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xb784 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xbb84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xbf84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xc384 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xc784 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xcb84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xcf84 No.blks: 644
c:0o:20434d:20434/rows buff: 460730

They also write to a directory ‘imdb’ inside the ‘log’ directory, containing what looks like the same data in XML format:

<msg time='2015-11-14T08:52:45.889+01:00' org_id='oracle' comp_id='rdbms'
 msg_id='kdmlScanAndCreateCU:1537:858585840' type='UNKNOWN' group='IM_populate'
 level='16' host_id='bigmachine.local' host_addr='127.0.0.1'>
 <txt>c:0o:20434d:20434/St.RDBA: 0x5790 No.blks: 8
 </txt>
</msg>

It reveals a code location, kdmlScanAndCreateCU, of which the first letters probably mean Kernel Data in-Memory.

In most circumstances, and in my opinion, this information should be printed when some kind of debug flag is set, not ‘just because’, because writing these statistics costs time. I also don’t know if these files are maintained by ADR. If not, doing a lot of in-memory transactions can fill it up rather quickly.

In the previous post on the decision between buffered and direct path reads I showed that the decision is dependent on the version. Up to and including version 11.2.0.2 the size of a segment needs to be five times the small table threshold in order to be considered for direct path reads, and starting from 11.2.0.3 the database starts considering direct path reads from the small table threshold onwards. The lower limit just discussed is the small table threshold (or five times the small table threshold with lower versions); the upper limit is called the “very large object threshold” (VLOT) and is five times the size of the buffer cache, which is the threshold above which a table scan always goes via direct path.
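As a reminder, the relevant (undocumented) parameters can be looked at with a query along these lines (a sketch, to be run as SYSDBA; underscore parameters are for investigation only):

-- hedged sketch: show the small table threshold and VLOT settings as SYSDBA
select a.ksppinm parameter, c.ksppstvl value
from x$ksppi a, x$ksppcv c
where a.indx = c.indx
and a.ksppinm in ('_small_table_threshold', '_very_large_object_threshold');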

There seems to be an odd situation with 11.2.0.3 and up: if a table has a size between the small table threshold and five times the small table threshold, it is always read via direct path, while if the table is bigger than five times the small table threshold, the decision depends on the number of blocks in the cache, and switches to buffered reads if the number of blocks in the cache is 99% or more. This seems to be the opposite of what is logical; a smaller table is likely a better candidate for caching than a larger one.

The last thing the previous blogpost showed was the impact of dirty blocks on the buffered/direct path decision. Up to 11.2.0.2 a read that should go via direct path is done buffered when the amount of dirty blocks reaches around 75%. Starting from version 11.2.0.3, a read that should go via direct path is done buffered when the amount of dirty blocks for a table reaches approximately 50%. This is vital information for data warehouses!

A PL/SQL procedure was included so you can measure this for yourself. If you have any doubts, follow the link above and paste the procedure to test it for yourself.
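Although the procedure itself is not reproduced here, the figures it reports (‘physical reads cache/direct’) can be taken from the session statistics; a minimal sketch of that measurement is:

-- hedged sketch: sample these before and after a full scan and take the difference
select sn.name, ms.value
from v$mystat ms, v$statname sn
where ms.statistic# = sn.statistic#
and sn.name in ('physical reads cache', 'physical reads direct');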

The purpose of this blogpost is to further investigate the table direct path or buffered decision in various Oracle database versions when compression is used. I used compression using “COMPRESS FOR OLTP”, in order to see if compression changes the decisions.

The first measurement is reading, doing a full scan of a compressed table:

Version 11.2.0.1 (_STT 3979; _STT*5:19895)

TS@v11201 > @test_direct_path
table: T2_OLTP size: 64512 blocks, 10000000 rows.
cached - physical reads cache/direct: (10)		6419/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (20)		6487/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (30)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (40)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (50)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (60)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (70)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (80)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (90)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (100)		 6382/0
full scan - physical reads cache/direct:			  0/0

This pattern is repeated up to version 12.1.0.2: if the amount of blocks in the cache is 99% or more, the database switches from direct path to buffered reads for table scans. So for simple reading, the decision does not seem to be different from non-compressed tables.

When dirty blocks are included, it is different:

Version 11.2.0.1:

TS@v11201 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  0/63911

If a large number of the blocks, or all of them, in the cache are dirty, a full scan does not flip to buffered reads at 99%.

However, it is interesting to see that this changes with version 11.2.0.2:

TS@v11202 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  461/0

Starting from version 11.2.0.2, the threshold of 99% for a full table scan to switch from direct path to buffered seems to be true for clean blocks and dirty blocks. The same is true for versions 11.2.0.3 and 11.2.0.4.

However, starting from 12.1.0.1, a full scan on a compressed table is done via direct path even if all the blocks are in the cache and dirty, which seems identical to the version 11.2.0.1 behaviour:

TS@v12101 > @test_direct_path
table: T2_OLTP size: 65536 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/65174
full scan - physical reads cache/direct: (20)			 0/65174
full scan - physical reads cache/direct: (30)			 0/65174
full scan - physical reads cache/direct: (40)			 0/65174
full scan - physical reads cache/direct: (50)			 0/65174
full scan - physical reads cache/direct: (60)			 0/65174
full scan - physical reads cache/direct: (70)			 0/65174
full scan - physical reads cache/direct: (80)			 0/65174
full scan - physical reads cache/direct: (90)			 5/65174
full scan - physical reads cache/direct: (100)			  0/65174

So, the overall conclusion is that there is a difference between regular heap tables and compressed heap tables. I tested OLTP compression; the [nsmtio] tracing indicates this is the same for HCC compression (hybrid columnar compression). In general, for simple reading there doesn’t seem to be a difference between normal heap tables and compressed heap tables. However, I tested on idle systems; on “real” systems with more “stress” on buffer cache management there might be more mechanisms in play.

When there are dirty blocks (blocks changed sitting in the buffercache waiting to be written by the database writer), there is different behaviour with respect to the buffered read or direct path read choice the instance makes. In general, with dirty blocks involved, compressed tables are less likely to be read into the buffercache.

It seems that with compressed tables and versions 11.2.0.1 and 12.1.0.1 and up, there is no switch to buffered reads even if all the blocks are in the cache but dirty. The versions in between, 11.2.0.2, 11.2.0.3 and 11.2.0.4, do switch to buffered reads when 99% or more of the blocks are in the cache, regardless of whether they are dirty.

Also, please mind I explicitly mentioned tables. For a full scan on an index, indicated by the ‘FAST FULL INDEX SCAN’ line in an explain plan, entirely different rules are in play.

Since the direct path feature for serial processes was discovered after it became available in Oracle 11.2.0.1 (as far as I know, I haven’t checked Oracle 11.1), there have been a lot of blog posts on when this happens. A lot of these do not specify the Oracle version, which is a shortcoming in my opinion: there are different decisions made in different versions.

The purpose of this blogpost is to show the results of my tests on when the Oracle database engine switches from buffered to direct path reads and vice versa. There probably are decisions made by the database engine for this feature based on internal kept statistics, like general activity and object usage, which means my tests might be different in your database. For that purpose I included an anonymous PL/SQL block in this post so you can replay the same test in your own database, except for the table, which you have to create yourself.

There are a few basics rules that are applied by the database engine for the buffered or direct path read full segment scan decision:
– If the segment size is smaller than 5 * _SMALL_TABLE_THRESHOLD (11.2.0.1 and 11.2.0.2) or _SMALL_TABLE_THRESHOLD (11.2.0.3 and higher) it will be read into the buffercache (also known as a buffered read), any block already read into the buffercache will be read from the cache.
– If the segment size is bigger than 5 * the buffer cache size (set by _VERY_LARGE_OBJECT_THRESHOLD, whose value is 500 by default, which seems to mean 500% of the buffer cache), it will be read via direct path, and any blocks that are in the cache are ignored. If blocks of this segment are dirty, these are checkpointed to disk before the segment is read, which is visible with the events ‘messages’ and ‘enq: KO – fast object checkpoint’.
– If a segment is between the above mentioned sizes, the database engine makes a decision. In other words: you can get different behaviour in different scenarios. In order to give this size a name, the most logical name is a medium table. Any time a direct path read is chosen, any dirty blocks are checkpointed to disk, visible with the events ‘messages’ and ‘enq: KO – fast object checkpoint’.

Starting from 11.2.0.3, the direct path decision starting point changed from 5 * _SMALL_TABLE_THRESHOLD to _SMALL_TABLE_THRESHOLD. When a nsmtio trace is taken, an extra line appears in the trace when a segment is sized between _SMALL_TABLE_THRESHOLD and 5 * _SMALL_TABLE_THRESHOLD:
Partial trace lines for a medium segment sized bigger than 5 * _SMALL_TABLE_THRESHOLD (12.1.0.2):

NSMTIO: kcbism: islarge 1 next 0 nblks 21128 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 21128 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 21128 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

On the first line, nblks indicates the segment size, and kcbstt indicates _SMALL_TABLE_THRESHOLD.

Partial trace lines for a medium segment sized smaller than 5 * _SMALL_TABLE_THRESHOLD (12.1.0.2):

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: hit age_diff adjts last_ts nbuf nblk has_val kcbisdbfc cache_it 898 8181 7283 182931 3668 1 0 1
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

The first line shows the segment size (3668) being just above the _SMALL_TABLE_THRESHOLD (3658).
What is interesting to mention is the line with kcbcmt1. Things like hit, age_diff, adjts, last_ts seem to indicate additional statistics on (recent) activity are included.
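For reference, these NSMTIO lines come from the nsmtio trace; to my knowledge it can be enabled per session like this (12c-style event syntax, which may differ per version):

-- hedged: enable the NSMTIO trace for the current session
alter session set events 'trace[nsmtio]';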

For the results shown below, I created an anonymous PL/SQL block that is included at the end of the blogpost. Also, I flushed the buffer cache so the measurement starts off with an empty cache. If you are running this on a really used database, you shouldn’t flush the buffer cache, and you will probably see small numbers of blocks read during reading.
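For completeness, the flush I am referring to (only do this on a test instance) is:

-- hedged: flush the buffer cache, as SYSDBA, on a test system only
alter system flush buffer_cache;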

Version 11.2.0.1, _SMALL_TABLE_THRESHOLD = 1436 (_STT*5=7180), segment size 7040.

table: T_OVER_STT size: 7040 blocks, 330171 rows.
cached - physical reads cache/direct: (10)		749/0
full scan - physical reads cache/direct:			  6129/0
cached - physical reads cache/direct: (20)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (30)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (40)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (50)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (60)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (70)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (80)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (90)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

First, 10% is read into the buffer cache, then a full scan is initiated. The full scan reads buffered too, because the segment size is smaller than 5 * _SMALL_TABLE_THRESHOLD. Next, all other scans read from the cache because the table has already been read into the cache, which is why the other scans do not cause physical reads.

Version 11.2.0.2, _SMALL_TABLE_THRESHOLD = 1492 (_STT*5=7460), segment size 1920.

table: T_JUST_OVER_STT size: 1920 blocks,  rows.
cached - physical reads cache/direct: (10)		239/0
full scan - physical reads cache/direct:			  1623/0
cached - physical reads cache/direct: (20)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (30)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (40)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (50)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (60)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (70)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (80)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (90)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

Here the same thing is happening: 11.2.0.2 behaves identically to 11.2.0.1 with a segment that is smaller than 5 * _SMALL_TABLE_THRESHOLD. First the first 10% of the blocks is read, then a full scan reads the other blocks into the cache, after which all the other scans read from the cache, so no physical reads are shown.

Version 11.2.0.3, _SMALL_TABLE_THRESHOLD = 1482 (_STT*5=7410), segment size 1664.

table: T_OVER_STT size: 1664 blocks, 74404 rows.
cached - physical reads cache/direct: (10)		251/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (20)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (30)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (40)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (50)		252/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (60)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (70)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (80)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (90)		252/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (100)		 44/0
full scan - physical reads cache/direct:			  0/1548

Here we see the database behave differently! The same pattern as shown earlier is executed: first a percentage of the table is read with a scan forced to be buffered, then a full scan is executed, which now scans via direct path for a segment that is smaller than 5 * _SMALL_TABLE_THRESHOLD. Another interesting thing to see is that despite the last cached/buffered scan reading ALL the blocks into the cache, a full scan keeps scanning via direct path. Please mind we are scanning a segment bigger than _SMALL_TABLE_THRESHOLD, but smaller than 5 * _SMALL_TABLE_THRESHOLD.

Careful testing shows that all the versions up to Oracle 12.1.0.2 behave consistently like this. A final consideration, as mentioned earlier: the testing in my lab is on idle databases, often just started up for the sake of the test, so dynamics and (internally kept) runtime statistics might show something else in a really used database.

How about a segment that is truly bigger than 5 * _SMALL_TABLE_THRESHOLD?
Version 11.2.0.1, _SMALL_TABLE_THRESHOLD=1436, 5*_STT=7180, segment size 21504.

table: T2 size: 21504 blocks, 1000000 rows.
cached - physical reads cache/direct: (10)		2135/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (20)		2142/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (30)		2016/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (40)		2149/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (50)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (60)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (70)		2168/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (80)		2168/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (90)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (100)		 2044/0
full scan - physical reads cache/direct:			  0/0

Here we see a table that is bigger than 5 * _SMALL_TABLE_THRESHOLD, and we see that the size makes a difference! When the number of blocks in the cache is close to 100%, the full scan might choose a buffered scan instead of a direct path read. In fact, if I change the starting percentage to 90% and the increase to 1%, we can see the switching point is at 99%:

table: T2 size: 21504 blocks, 1000000 rows.
cached - physical reads cache/direct: (90)		18898/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (91)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (92)		124/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (93)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (94)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (95)		128/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (96)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (97)		252/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (98)		128/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (99)		256/0
full scan - physical reads cache/direct:			  132/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

When 99% of the blocks are read into the buffer cache, you can see the full scan that follows switch to a buffered scan, reading the remaining 132 blocks into the cache.
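To see how much of a segment actually sits in the buffer cache at a given moment, V$BH can be queried. This is a sketch, assuming access to V$BH and DBA_OBJECTS (for instance as a DBA); TESTUSER and T2 are placeholders for the owner and the test table:

-- number of buffers of the segment currently in the buffer cache
-- note: consistent read clones are counted too, so this can be slightly
-- higher than the number of distinct blocks
select count(*) cached_buffers
from v$bh
where objd = ( select data_object_id
               from dba_objects
               where owner = 'TESTUSER'
               and object_name = 'T2' )
and status != 'free';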

This behaviour is consistent up to Oracle version 12.1.0.2. Just to make sure we are all on the same page again: this is in my test databases. I would love to hear if you see something different in real life.

Another thing that influences direct path reads is the number of dirty blocks. If v_read_update in the PL/SQL block below is set to UPDATE, then instead of reading a certain percentage of the table into the buffer cache, a certain percentage of the blocks is updated, after which the same full scan is executed, and then the update is rolled back. Please mind I took a table that is bigger than 5 * _SMALL_TABLE_THRESHOLD for the tests below. When the block is set to UPDATE, only the full scan is shown, not the update. The number between brackets is the percentage of the table that has been updated prior to the full scan.
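The number of dirty buffers of the segment can be watched in the same way as the cached buffers above; again a sketch, with TESTUSER and T2 as placeholders:

-- dirty versus total buffers of the segment in the buffer cache
select sum(case when dirty = 'Y' then 1 else 0 end) dirty_buffers,
       count(*) total_buffers
from v$bh
where objd = ( select data_object_id
               from dba_objects
               where owner = 'TESTUSER'
               and object_name = 'T2' )
and status != 'free';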

Version 11.2.0.1

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 0/20941
full scan - physical reads cache/direct: (70)			 0/20941
full scan - physical reads cache/direct: (80)			 4084/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.2

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 0/20941
full scan - physical reads cache/direct: (70)			 0/20941
full scan - physical reads cache/direct: (80)			 4084/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.3

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.4

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 12.1.0.1

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 12.1.0.2

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

During repeated tests, the figures above show there is a difference between Oracle versions 11.2.0.1/11.2.0.2 on the one hand and version 11.2.0.3 and above on the other. Versions 11.2.0.1 and 11.2.0.2 seem to have their switching point from direct path to buffered at approximately 75% dirty blocks, while versions 11.2.0.3 and higher have their switching point at approximately 50%.

I am convinced there are more things that influence this behaviour. In fact, I had to re-run the tests a couple of times (flushing the buffer cache before each run, of course) in order to come up with the figures above. This means the results can fluctuate, even in my idle database.
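For completeness, this is how the buffer cache can be flushed between runs (it requires the ALTER SYSTEM privilege, so it is typically done from a DBA session):

alter system flush buffer_cache;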

This is the anonymous PL/SQL block to test this in your own database:

declare
	-- this is a direct path read tester.
	-- please mind the user executing this anonymous PL/SQL block must have the following rights:
	-- - alter session
	-- - select on v$mystat and v$statname
	--
        -- -> because user_tables is used, statistics must be correct!
        --
	-- fill out the following details
	-- table to read -> it's best to flush the buffer cache if you want to start with a low percentage of blocks in the cache!
	v_table varchar2(30) := 'YOUR_TABLE_HERE';
	-- percentage to start at for reading blocks in the cache.
	v_current_pct number := 10;
	-- the amount to add to the percentage after the run has been executed.
	v_step number := 10;
	-- read or update (uppercase).
	v_read_update varchar2(10) := 'READ';
	--
	v_nr_blocks number;
	v_numrows number;
	v_readrows number;
	v_rd_cache number;
	v_rd_direct number;
	v_tmp number;
	v_read_to_cache varchar2(200) := 'select /*+ full('||v_table||') */ count(*) from '||v_table||' where rownum < :1';
	v_update varchar2(200) := 'update '||v_table||' set id=id+1 where rownum < :1';
	v_try_read varchar2(200) := 'select /*+ full('||v_table||') */ count(*) from '||v_table;
	v_ret number;
begin
	select blocks into v_nr_blocks from user_segments where segment_type = 'TABLE' and segment_name = v_table;
	select num_rows into v_numrows from user_tables where table_name = v_table;
	dbms_output.put_line('table: '||v_table||' size: '||v_nr_blocks||' blocks, '||v_numrows||' rows.');
	if v_read_update = 'READ' then
	while v_current_pct <= 100 loop
		v_readrows := (v_nr_blocks / 100) * v_current_pct * (v_numrows / v_nr_blocks);
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate 'alter session set events ''10949 trace name context forever, level 1''';
		execute immediate v_read_to_cache into v_ret using v_readrows;
		execute immediate 'alter session set events ''10949 trace name context forever, level 0''';
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('cached - physical reads cache/direct: ('||v_current_pct||')              '||v_rd_cache||'/'||v_rd_direct);
		--
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate v_try_read into v_ret;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('full scan - physical reads cache/direct:                          '||v_rd_cache||'/'||v_rd_direct);
		v_current_pct := v_current_pct + v_step;
	end loop;
	end if;
	if v_read_update = 'UPDATE' then
	v_current_pct := 10;
	v_step := 10;
	dbms_output.put_line('--update');
	while v_current_pct <= 100 loop
		v_readrows := (v_nr_blocks / 100) * v_current_pct * (v_numrows / v_nr_blocks);
		execute immediate v_update using v_readrows;
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate v_try_read into v_ret;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('full scan - physical reads cache/direct: ('||v_current_pct||')                    '||v_rd_cache||'/'||v_rd_direct);
		execute immediate 'rollback';
		v_current_pct := v_current_pct + v_step;
	end loop;
	end if;
end;
/

This script should be run as the owner of the table you want to investigate for direct path reads.
The Oracle user must have the ALTER SESSION privilege to be able to set the event that disables direct path reads, and it needs select rights on V$MYSTAT and V$STATNAME, and obviously on the table it wants to read. Another important thing is that the table must have accurate statistics. If you gather statistics after you have run the PL/SQL block, you should flush the buffer cache and the shared pool in order to get rid of the cached cursors and statistics.
The block first reads a certain percentage of the table with event 10949 set, which disables direct path reads; then the event is switched off and a full table scan is executed. After that the percentage is increased, the buffered scan is executed again, followed by the full scan, and so on.
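This is a sketch of the preparation steps, assuming a hypothetical test user called TEST and the T2 test table:

-- as a DBA:
grant alter session to test;
grant select on v_$mystat to test;
grant select on v_$statname to test;

-- as the table owner: make sure the statistics are current
exec dbms_stats.gather_table_stats(user, 'T2')

-- as a DBA, if statistics were (re)gathered after a previous run:
alter system flush shared_pool;
alter system flush buffer_cache;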

Change the v_table definition to the table you want to scan.
Additionally, you can change v_current_pct (10 by default), the percentage of the table to read buffered at the start, and v_step (10 by default), the amount by which the percentage is increased after every iteration. One of the things I did was change v_current_pct to 90 and v_step to 1, as shown below.
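This is what the changed declarations look like for the 1%-step run shown earlier:

	-- start reading buffered at 90% of the table and increase in steps of 1%
	v_current_pct number := 90;
	v_step number := 1;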
