In the previous post on the decision between buffered and direct path reads I showed that the decision depends on the version. Up to and including version 11.2.0.2, the size of a segment needs to be five times the small table threshold in order to be considered for direct path reads, while starting from 11.2.0.3 the database considers direct path reads from the small table threshold onwards. The lower limit just discussed is the small table threshold (or five times the small table threshold in the earlier versions); the upper limit is called “very large object threshold” (VLOT) and is five times the size of the buffercache, which is the threshold after which a table scan always goes via direct path.

There seems to be an odd situation with 11.2.0.3 and up: if a table has a size between the small table threshold and five times the small table threshold, it is always read via direct path, while if the table is bigger than five times the small table threshold, the choice depends on how many of its blocks are in the cache, and switches to buffered reads if 99% or more of the blocks are in the cache. This seems to be the opposite of what is logical; a smaller table is likely a better candidate for caching than a larger one.

The last thing the previous blogpost showed was the impact of dirty blocks on the buffered/direct path decision. Up to 11.2.0.2, a read that should go via direct path is done buffered when the percentage of dirty blocks reaches around 75%. Starting from version 11.2.0.3, a read that should go via direct path is done buffered when the percentage of dirty blocks for a table reaches approximately 50%. This is vital information for data warehouses!

A PL/SQL procedure was included so you can run these tests and measure this yourself. If you have any doubts, follow the link above and paste the procedure to test it in your own database.

The purpose of this blogpost is to further investigate the direct path or buffered read decision for tables in various Oracle database versions when compression is used. I compressed the table with “COMPRESS FOR OLTP”, in order to see if compression changes the decisions.
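
For reference, this is roughly how such a test table can be created. The column definitions and row source below are my own sketch, not necessarily identical to what was used for the tests; the dbms_stats call is there because the test procedure relies on accurate statistics:

create table t2_oltp compress for oltp as
select rownum id, rpad('x',100,'x') padding
from (select 1 from dual connect by level <= 10000),
     (select 1 from dual connect by level <= 1000);

exec dbms_stats.gather_table_stats(user,'T2_OLTP')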

The first measurement is reading, doing a full scan of a compressed table:

Version 11.2.0.1 (_STT 3979; _STT*5:19895)

TS@v11201 > @test_direct_path
table: T2_OLTP size: 64512 blocks, 10000000 rows.
cached - physical reads cache/direct: (10)		6419/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (20)		6487/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (30)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (40)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (50)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (60)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (70)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (80)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (90)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (100)		 6382/0
full scan - physical reads cache/direct:			  0/0

This pattern is repeated up to version 12.1.0.2: if 99% or more of the blocks are in the cache, the database switches from direct path to buffered reads for table scans. So for simple reading, the decision does not seem to differ from the decision for non-compressed tables.

When dirty blocks are included, it is different:

Version 11.2.0.1:

TS@v11201 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  0/63911

If a large number of the blocks in the cache, or all of them, are dirty, a full scan does not flip to buffered reads at 99%.

However, it is interesting to see that this changes with version 11.2.0.2:

TS@v11202 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  461/0

Starting from version 11.2.0.2, the threshold of 99% for a full table scan to switch from direct path to buffered seems to apply to both clean and dirty blocks. The same is true for versions 11.2.0.3 and 11.2.0.4.

However, starting from 12.1.0.1, a full scan on a compressed table is done via direct path even if all the blocks are in the cache and dirty, which seems identical to the version 11.2.0.1 behaviour:

TS@v12101 > @test_direct_path
table: T2_OLTP size: 65536 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/65174
full scan - physical reads cache/direct: (20)			 0/65174
full scan - physical reads cache/direct: (30)			 0/65174
full scan - physical reads cache/direct: (40)			 0/65174
full scan - physical reads cache/direct: (50)			 0/65174
full scan - physical reads cache/direct: (60)			 0/65174
full scan - physical reads cache/direct: (70)			 0/65174
full scan - physical reads cache/direct: (80)			 0/65174
full scan - physical reads cache/direct: (90)			 5/65174
full scan - physical reads cache/direct: (100)			  0/65174

So, the overall conclusion is that there is a difference between regular heap tables and compressed heap tables. I tested OLTP compression; the [nsmtio] tracing indicates the same applies to HCC (hybrid columnar compression). In general, for simple reading there doesn’t seem to be a difference between normal heap tables and compressed heap tables. However, I tested on idle systems; on “real” systems with more “stress” on buffercache management there might be more mechanisms in play.

When there are dirty blocks (changed blocks sitting in the buffercache waiting to be written by the database writer), the instance behaves differently with respect to the buffered read or direct path read choice. In general, with dirty blocks involved, compressed tables are less likely to be read into the buffercache.

It seems that with compressed tables in version 11.2.0.1, and in 12.1.0.1 and up, there is no switch to buffered reads even if all the blocks are in the cache but dirty. The versions in between, 11.2.0.2, 11.2.0.3 and 11.2.0.4, do switch to buffered reads when 99% or more of the blocks are in the cache, regardless of whether they are dirty.

Also, please mind I explicitly mentioned tables. For a full scan on an index, indicated by the ‘INDEX FAST FULL SCAN’ line in an explain plan, entirely different rules are in play.

Since the direct path read feature for serial processes was discovered after it became available in Oracle 11.2.0.1 (as far as I know; I haven’t checked Oracle 11.1), there have been a lot of blog posts on when it happens. A lot of these do not specify the Oracle version, which is a shortcoming in my opinion, because different versions make different decisions.

The purpose of this blogpost is to show the results of my tests on when the Oracle database engine switches from buffered to direct path reads and vice versa. The database engine probably also bases this decision on internally kept statistics, like general activity and object usage, which means my tests might show different results in your database. For that reason I included an anonymous PL/SQL block in this post so you can replay the same test in your own database, except for the table, which you have to create yourself.

There are a few basic rules that the database engine applies for the buffered or direct path read full segment scan decision:
– If the segment size is smaller than 5 * _SMALL_TABLE_THRESHOLD (11.2.0.1 and 11.2.0.2) or _SMALL_TABLE_THRESHOLD (11.2.0.3 and higher), it will be read into the buffercache (also known as a buffered read); any block already in the buffercache will be read from the cache.
– If the segment size is bigger than 5 * the buffer cache size (set by _VERY_LARGE_OBJECT_THRESHOLD, which is 500 by default and seems to mean 500% of the buffercache), it will be read via direct path and any blocks that are in the cache are ignored. If blocks of this segment are dirty, these are checkpointed to disk before the segment is read, which is visible with the events ‘messages’ and ‘enq: KO – fast object checkpoint’.
– If a segment is between the above mentioned sizes, the database engine makes a decision. In other words: you can get different behaviour in different scenarios. To give this size a name, the most logical name is a medium table. Anytime a direct path read is chosen, any dirty blocks are checkpointed to disk first, visible with the events ‘messages’ and ‘enq: KO – fast object checkpoint’.
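
The thresholds mentioned above are hidden parameters. A simple way to look at their current values is a query like the following; this must be run as SYS, and querying X$ tables is of course undocumented and unsupported:

select a.ksppinm parameter, b.ksppstvl value
from x$ksppi a, x$ksppcv b
where a.indx = b.indx
and a.ksppinm in ('_small_table_threshold','_very_large_object_threshold');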

Starting from 11.2.0.3, the direct path decision starting point changed from 5 * _SMALL_TABLE_THRESHOLD to _SMALL_TABLE_THRESHOLD. When an nsmtio trace is taken, an extra line appears in the trace when a segment is sized between _SMALL_TABLE_THRESHOLD and 5 * _SMALL_TABLE_THRESHOLD.
Partial trace lines for a medium segment sized bigger than 5 * _SMALL_TABLE_THRESHOLD (12.1.0.2):

NSMTIO: kcbism: islarge 1 next 0 nblks 21128 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 21128 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 21128 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

On the first line, nblks indicates the segment size, and kcbstt indicates _SMALL_TABLE_THRESHOLD.

Partial trace lines for a medium segment sized smaller than 5 * _SMALL_TABLE_THRESHOLD (12.1.0.2):

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: hit age_diff adjts last_ts nbuf nblk has_val kcbisdbfc cache_it 898 8181 7283 182931 3668 1 0 1
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

The first line shows the segment size (3668) being just above the _SMALL_TABLE_THRESHOLD (3658).
What is interesting to mention is the line with kcbcmt1. Things like hit, age_diff, adjts and last_ts seem to indicate that additional statistics on (recent) activity are included in the decision.
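
In case you want to generate such an nsmtio trace yourself, this is a sketch of how I do it; I used this syntax on 12.1 and have not verified it on every version, and the table name is just a placeholder. The NSMTIO lines end up in the session’s trace file in the diag trace directory:

alter session set events 'trace[nsmtio]';
select /*+ full(t) */ count(*) from t;
alter session set events 'trace[nsmtio] off';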

For the results shown below, I created an anonymous PL/SQL block that is included at the end of this blogpost. I also flushed the buffercache, so each measurement starts off with an empty cache. If you are running this on a real, used database, you shouldn’t flush the buffercache, and you will probably see small numbers of read blocks during reading.

Version 11.2.0.1, _SMALL_TABLE_THRESHOLD = 1436 (_STT*5=7180), segment size 7040.

table: T_OVER_STT size: 7040 blocks, 330171 rows.
cached - physical reads cache/direct: (10)		749/0
full scan - physical reads cache/direct:			  6129/0
cached - physical reads cache/direct: (20)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (30)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (40)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (50)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (60)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (70)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (80)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (90)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

First 10% is read into the buffer cache, then a full scan is initiated. The full scan reads buffered too, because the segment size is smaller than 5 * _SMALL_TABLE_THRESHOLD. Next, all other scans read from the cache, because the table has been read into the cache already, which is why the other scans do not cause physical reads.

Version 11.2.0.2, _SMALL_TABLE_THRESHOLD = 1492 (_STT*5=7460), segment size 1920.

table: T_JUST_OVER_STT size: 1920 blocks,  rows.
cached - physical reads cache/direct: (10)		239/0
full scan - physical reads cache/direct:			  1623/0
cached - physical reads cache/direct: (20)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (30)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (40)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (50)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (60)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (70)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (80)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (90)		0/0
full scan - physical reads cache/direct:			  0/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

Here the same thing is happening: 11.2.0.2 behaves identically to 11.2.0.1 with a segment that is smaller than 5 * _SMALL_TABLE_THRESHOLD. First 10% of the blocks is read, then a full scan reads the other blocks into the cache, after which all the other scans read from the cache, so no physical reads are shown.

Version 11.2.0.3, _SMALL_TABLE_THRESHOLD = 1482 (_STT*5=7410), segment size 1664.

table: T_OVER_STT size: 1664 blocks, 74404 rows.
cached - physical reads cache/direct: (10)		251/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (20)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (30)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (40)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (50)		252/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (60)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (70)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (80)		126/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (90)		252/0
full scan - physical reads cache/direct:			  0/1548
cached - physical reads cache/direct: (100)		 44/0
full scan - physical reads cache/direct:			  0/1548

Here we see the database behave differently! The same pattern as shown earlier is executed: first a percentage of the table is read while the session is forced to do buffered reads, then a full scan is executed, which now scans via direct path for a segment that is smaller than 5 * _SMALL_TABLE_THRESHOLD. Another interesting thing to see is that despite the last cached/buffered scan reading ALL the blocks into the cache, a full scan keeps scanning via direct path. Please mind we are scanning a segment bigger than _SMALL_TABLE_THRESHOLD, but smaller than 5 * _SMALL_TABLE_THRESHOLD.

Careful testing shows that all the versions up to Oracle 12.1.0.2 behave consistently like this. A final consideration: as mentioned earlier, the testing in my lab is done on idle databases, often just started up for the sake of the test, so dynamics and (internally kept) runtime statistics might show something else in a really used database.

How about a segment that is truly bigger than 5 * _SMALL_TABLE_THRESHOLD?
Version 11.2.0.1, _SMALL_TABLE_THRESHOLD=1436, 5*_STT=7180, segment size 21504.

table: T2 size: 21504 blocks, 1000000 rows.
cached - physical reads cache/direct: (10)		2135/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (20)		2142/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (30)		2016/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (40)		2149/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (50)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (60)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (70)		2168/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (80)		2168/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (90)		2040/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (100)		 2044/0
full scan - physical reads cache/direct:			  0/0

Here we see a table that is bigger than 5 * _SMALL_TABLE_THRESHOLD, and we see that the size makes a difference! When the percentage of blocks in the cache is close to 100%, the full scan may choose a buffered scan instead of a direct path read. In fact, if I change the starting percentage to 90% and the increment to 1%, we can see the switching point is at 99%:

table: T2 size: 21504 blocks, 1000000 rows.
cached - physical reads cache/direct: (90)		18898/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (91)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (92)		124/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (93)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (94)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (95)		128/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (96)		256/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (97)		252/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (98)		128/0
full scan - physical reads cache/direct:			  0/20941
cached - physical reads cache/direct: (99)		256/0
full scan - physical reads cache/direct:			  132/0
cached - physical reads cache/direct: (100)		 0/0
full scan - physical reads cache/direct:			  0/0

When 99% of the blocks have been read into the buffercache, you can see the full scan that follows choosing a buffered scan method, and reading the remaining 132 blocks into the cache.

This behaviour is consistent up to Oracle version 12.1.0.2. Just to make sure we are all on the same page again: this is in my test databases. I would love to hear if you see something different in real life.

Another thing that influences direct path reads is the number of dirty blocks. If v_read_update in the procedure below is set to UPDATE, then instead of scanning a certain percentage into the buffercache, a certain percentage of blocks is updated, after which the same full scan is executed, and then the update is rolled back. Please mind I took a table that is bigger than 5 * _SMALL_TABLE_THRESHOLD in the tests below. When the procedure is set to UPDATE, only the full scan is shown, not the update. The number between brackets is the percentage of the table that has been updated prior to the full scan.
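
Switching the anonymous block to update mode is just a matter of changing one variable in the declaration section of the block shown at the end of this post:

	-- read or update (uppercase).
	v_read_update varchar2(10) := 'UPDATE';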

Version 11.2.0.1

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 0/20941
full scan - physical reads cache/direct: (70)			 0/20941
full scan - physical reads cache/direct: (80)			 4084/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.2

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 0/20941
full scan - physical reads cache/direct: (70)			 0/20941
full scan - physical reads cache/direct: (80)			 4084/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.3

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 11.2.0.4

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 12.1.0.1

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Version 12.1.0.2

table: T2 size: 21504 blocks, 1000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/20941
full scan - physical reads cache/direct: (20)			 0/20941
full scan - physical reads cache/direct: (30)			 0/20941
full scan - physical reads cache/direct: (40)			 0/20941
full scan - physical reads cache/direct: (50)			 0/20941
full scan - physical reads cache/direct: (60)			 8420/0
full scan - physical reads cache/direct: (70)			 0/0
full scan - physical reads cache/direct: (80)			 0/0
full scan - physical reads cache/direct: (90)			 0/0
full scan - physical reads cache/direct: (100)			  0/0

Repeated tests show the above figures differ between Oracle versions 11.2.0.1/11.2.0.2 and versions 11.2.0.3 and above. Versions 11.2.0.1 and 11.2.0.2 seem to have their switching point from direct path to buffered at approximately 75% dirty blocks, while versions 11.2.0.3 and higher have their switching point at approximately 50%.

I am convinced there are more things that influence this. In fact, I had to re-run the tests a couple of times (flushing the buffercache before each run of course) in order to come up with the figures above. This means that sometimes this can show fluctuations, even in my idle database.

This is the anonymous PL/SQL procedure to test this in your own database:

declare
	-- this is a direct path read tester.
	-- please mind the user executing this anonymous PL/SQL block must have the following rights:
	-- - alter session
	-- - select on v$mystat and v$statname
	--
        -- -> because user_tables is used, statistics must be correct!
        --
	-- fill out the following details
	-- table to read -> it's best to flush the buffer cache if you want to start with a low percentage of blocks in the cache!
	v_table varchar2(30) := 'YOUR_TABLE_HERE';
	-- percentage to start at for reading blocks in the cache.
	v_current_pct number := 10;
	-- the amount to add to the percentage after the run has been executed.
	v_step number := 10;
	-- read or update (uppercase).
	v_read_update varchar2(10) := 'READ';
	--
	v_nr_blocks number;
	v_numrows number;
	v_readrows number;
	v_rd_cache number;
	v_rd_direct number;
	v_tmp number;
	v_read_to_cache varchar2(200) := 'select /*+ full('||v_table||') */ count(*) from '||v_table||' where rownum < :1';
	v_update varchar2(200) := 'update '||v_table||' set id=id+1 where rownum < :1';
	v_try_read varchar2(200) := 'select /*+ full('||v_table||') */ count(*) from '||v_table;
	v_ret number;
begin
	select blocks into v_nr_blocks from user_segments where segment_type = 'TABLE' and segment_name = v_table;
	select num_rows into v_numrows from user_tables where table_name = v_table;
	dbms_output.put_line('table: '||v_table||' size: '||v_nr_blocks||' blocks, '||v_numrows||' rows.');
	if v_read_update = 'READ' then
	while v_current_pct <= 100 loop
		v_readrows := (v_nr_blocks / 100) * v_current_pct * (v_numrows / v_nr_blocks);
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate 'alter session set events ''10949 trace name context forever, level 1''';
		execute immediate v_read_to_cache into v_ret using v_readrows;
		execute immediate 'alter session set events ''10949 trace name context forever, level 0''';
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('cached - physical reads cache/direct: ('||v_current_pct||')              '||v_rd_cache||'/'||v_rd_direct);
		--
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate v_try_read into v_ret;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('full scan - physical reads cache/direct:                          '||v_rd_cache||'/'||v_rd_direct);
		v_current_pct := v_current_pct + v_step;
	end loop;
	end if;
	if v_read_update = 'UPDATE' then
	v_current_pct := 10;
	v_step := 10;
	dbms_output.put_line('--update');
	while v_current_pct <= 100 loop
		v_readrows := (v_nr_blocks / 100) * v_current_pct * (v_numrows / v_nr_blocks);
		execute immediate v_update using v_readrows;
		select value into v_rd_cache from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		select value into v_rd_direct from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		execute immediate v_try_read into v_ret;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads cache';
		v_rd_cache := v_tmp - v_rd_cache;
		select value into v_tmp from v$mystat, v$statname where v$mystat.statistic#=v$statname.statistic# and name = 'physical reads direct';
		v_rd_direct := v_tmp - v_rd_direct;
		dbms_output.put_line('full scan - physical reads cache/direct: ('||v_current_pct||')                    '||v_rd_cache||'/'||v_rd_direct);
		execute immediate 'rollback';
		v_current_pct := v_current_pct + v_step;
	end loop;
	end if;
end;
/

This script should be run as the owner of the table you want to investigate for direct path reads.
The Oracle user must have ALTER SESSION rights to be able to set the event that disables direct path reads, it needs select rights on V$MYSTAT and V$STATNAME, and obviously select rights on the table it reads. Another important thing is that the table must have accurate statistics. If you gather statistics after you have run the PL/SQL block, you should flush the buffer cache and the shared pool, in order to get rid of the cached cursors and statistics.
It first reads a certain percentage of the table with event 10949 set, which disables direct path reads; then the event is disabled and a full table scan is executed. Then the variable for the percentage is increased, the buffered scan is executed again, followed by the full scan, and so on.

Change the v_table definition to the table you want to scan.
You can additionally change v_current_pct (10 by default), the starting percentage to read buffered, and v_step (10 by default), the amount the percentage is increased with after each run. One of the things I did was change v_current_pct to 90 and v_step to 1.
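
For completeness, this is how a run typically looks in SQL*Plus, assuming the block is saved as test_direct_path.sql; dbms_output needs serveroutput switched on, and I flush the buffercache to start from a known state:

SQL> alter system flush buffer_cache;
SQL> set serveroutput on size unlimited
SQL> @test_direct_path.sql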

This blogpost is about how to install and run the Oracle database in docker. Please mind this is not an officially supported virtualisation platform for the Oracle database. This is a proof of concept setup.

Linux host setup.
In my setup, I used a linux host, freshly installed with Oracle Linux 6.7, which is going to be used as the docker server. Please mind you need to leave disk space (or a disk device) unused for the commonly documented docker setup with the btrfs driver; the root filesystem uses an ext4 filesystem by default. For the proof of concept setup, 20G of disk space for the operating system alone is enough. I used the minimal linux installation.

The first step is to add a disk, or use a partition and format it with btrfs and mount it:

# lsblk
# yum install btrfs-progs
# mkfs.btrfs -L btrfs1 /dev/sdb
# vi /etc/fstab: LABEL=btrfs1 /var/lib/docker btrfs defaults 0 1
# mkdir /var/lib/docker
# mount -a

1- list block devices. This showed me the block device I added for docker was /dev/sdb.
2- because of the minimal install, I need to add the btrfs userspace programs.
3- create a btrfs filesystem on /dev/sdb, and label it btrfs1.
4- edit /etc/fstab, make it mount the btrfs filesystem to /var/lib/docker.
5- create the mount point for the btrfs filesystem.
6- mount all the mounts listed in /etc/fstab.

The second step is to install and preliminary configure docker:

# vi /etc/yum.repos.d/public-yum-ol6.repo: enable public_ol6_addons
# yum install docker
# vi /etc/sysconfig/docker: add "-s btrfs" to other_args
# service docker start
# chkconfig docker on
# printf "limit memlock unlimited unlimited\n" > /etc/init/docker.conf

1- enable public_ol6_addons, in which the docker rpm sits.
2- install docker.
3- make docker use the btrfs driver.
4- start docker.
5- enable autostart for docker.
6- reset the memlock ulimit for the containers (needed for the Oracle database SGA later on).

Now we can go into the true docker part of it. Docker lives, as far as I can see at this point, in the /var/lib/docker directory. In order to build my oracle database image, I create a ‘dockerfiles’ directory, and in it a ‘build-oracle-12102’ directory:

mkdir -p /var/lib/docker/dockerfiles/build-oracle-12102

Next up, the dockerfile (aptly named Dockerfile by default) needs to be made in the build-oracle-12102 directory to build an image:

FROM	oraclelinux:6
MAINTAINER frits.hoogland@gmail.com
RUN	groupadd -g 54321 oinstall
RUN	groupadd -g 54322 dba
RUN	useradd -m -g oinstall -G oinstall,dba -u 54321 oracle
RUN	yum -y install oracle-rdbms-server-12cR1-preinstall perl wget unzip
RUN	mkdir /u01
RUN	chown oracle:oinstall /u01
USER	oracle
WORKDIR /home/oracle
ENV	mosUser=you@example.com mosPass=supersecret DownList=1,2
RUN	wget https://dl.dropboxusercontent.com/u/7787450/getMOSPatch.sh
RUN	wget https://dl.dropboxusercontent.com/u/7787450/responsefile_oracle12102.rsp
RUN	echo "226P;Linux x86-64" > /home/oracle/.getMOSPatch.sh.cfg
RUN	sh /home/oracle/getMOSPatch.sh patch=17694377
RUN	unzip p17694377_121020_Linux-x86-64_1of8.zip
RUN	unzip p17694377_121020_Linux-x86-64_2of8.zip
RUN	rm p17694377_121020_Linux-x86-64_1of8.zip p17694377_121020_Linux-x86-64_2of8.zip
RUN	/home/oracle/database/runInstaller -silent -force -waitforcompletion -responsefile /home/oracle/responsefile_oracle12102.rsp -ignoresysprereqs -ignoreprereq
USER	root
RUN	/u01/app/oraInventory/orainstRoot.sh
RUN	/u01/app/oracle/product/12.1.0.2/dbhome_1/root.sh -silent
RUN	rm -rf /home/oracle/responsefile_oracle12102.rsp /home/oracle/getMOSPatch.sh /home/oracle/database
USER	oracle
WORKDIR	/home/oracle
RUN     mkdir -p /u01/app/oracle/data
RUN     wget https://dl.dropboxusercontent.com/u/7787450/manage-oracle.sh
RUN     chmod 700 /home/oracle/manage-oracle.sh
RUN     wget https://dl.dropboxusercontent.com/u/7787450/db_install.dbt
EXPOSE  1521
CMD	/home/oracle/manage-oracle.sh

1- FROM; this is the (operating system) image which is used as a base image. If it is not available locally, it will be pulled from the central docker repository. oraclelinux:6 means Oracle Linux version 6. This serves as the base image on which all the other things are built.
2- MAINTAINER; the maintainer of the new image, obviously.
3/4/5- RUN; the groups and user for the oracle database (oinstall, dba, oracle) are created. This looks like a redundant step, as this is supposed to be done by the oracle-rdbms-server-12cR1-preinstall rpm. However, that does not work here. By creating them manually this way, the assignment is correct.
6- RUN; yum -y installs all the necessary packages. Oddly, it seems unzip is not in the preinstall package?
7- RUN; /u01 is the directory in which the database software is installed.
8- RUN; chown user:group is used to change the user and group permissions of /u01.
9- USER; this changes the current user for execution to ‘oracle’.
10- WORKDIR; this sets the work directory.
11- ENV; this sets a few environment variables needed by the getMOSPatch.sh script. Please mind you need to fill out your own My Oracle Support username (mosUser) and password (mosPass). Leave DownList as it is.
12- RUN; this downloads a modified version of the getMOSPatch.sh script by Maris Elsins (thank you Maris and John Piwowar). The only modification I made is to make the downloads pre-selectable by setting the DownList environment variable.
13- RUN; this downloads a response file for the installation.
14- RUN; this echoes the language for the patches to download.
15- RUN; sh /home/oracle/getMOSPatch.sh patch=17694377 runs getMOSPatch.sh and downloads file 1 and 2 of patch 17694377, which is the patch for the Oracle database version 12.1.0.2.
16/17- RUN; unzips the two files as the result of the patch download in step 15.
18- RUN; cleanup the zip files after they have been extracted.
19- RUN; runs the installer with the downloaded response file, and a few extra switches.
20- USER; here we switch to root, because after installation there are two scripts which need to be run as root.
21/22- RUN; runs the two post-install root scripts.
23- RUN; cleans up the getMOSPatch.sh script, the response file and the database directory, which contains the database installation media.
24/25- USER/WORKDIR; change back to the oracle user.
26- RUN; create a directory to which we attach the persistent storage for the database.
27- RUN; download manage-oracle.sh, the main script that sets up the listener and the database.
28- RUN; sets the mode of the manage-oracle.sh script to 700 (read/write/execute for the oracle user).
29- RUN; download the database installation template.
30- EXPOSE; make port 1521 (the listener port) available externally.
31- CMD; make /home/oracle/manage-oracle.sh the main command (pid 1 in the container).

After creating the Dockerfile, we can build an image with the description in the Dockerfile:

# cd /var/lib/docker/dockerfiles/build-oracle-12102
# docker build -t "oracle-12102" .

(mind the trailing dot)

This will take some time, as it is downloading the operating system base image, the software and does the installation. If all goes well, you end up with two images:

# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
oracle-12102        latest              28c18b083371        42 seconds ago      12.25 GB
oraclelinux         6                   cfc75fa9f295        11 days ago         156.2 MB

Which are the image we started with (oraclelinux), and our freshly created image “oracle-12102”.

In order to start the container, we need to prepare a directory for the database files, which is mounted into the container and is persistent (so we can actually store and keep everything, which is one of the basic things a database is used for). For this I create a directory, and set its owner to the userid and groupid of the oracle user (inside the container!):

# mkdir -p /var/lib/docker/db/frits
# chown 54321:54321 /var/lib/docker/db/frits

Now that we have made all the preparations, we can start a container in which the database will run. The way I created the script (manage-oracle.sh), the container will create or start a database named after the hostname set for the container. At this point I don’t know if that is the best choice, but currently it seems like a good way to automate it. This is how a container is started based on the image we just created. The startup line looks rather long, because a couple of settings need to be made, which are saved by docker:

# docker run --ipc=host --volume=/var/lib/docker/db/frits:/u01/app/oracle/data --name frits --hostname frits --detach=true oracle-12102
25efb5d26aad31e7b06a8e2707af7c25943e2e42ec5c432dc9fa55f0da0bdaef

If you wonder what the long hexadecimal string is: this is the full id of the newly created container. Now the container is starting. There are a couple of important things to see in the docker run line:
--ipc=host This makes the host’s IPC namespace available to the container. Otherwise my container could only see 32MB of shared memory.
--volume=/var/lib/docker/db/frits:/u01/app/oracle/data This makes the docker host’s directory /var/lib/docker/db/frits available in the container at /u01/app/oracle/data (--volume=host:container). This is persistent storage, unlike the container’s image.
--name frits This gives the docker container the name frits.
--hostname frits This makes the container set its hostname to frits.
--detach=true This makes the container not be connected to the starting shell, because otherwise the shell signals will be honoured by the container, meaning CTRL-C will stop the container.
oracle-12102 This is the image on which the container is based.
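
If you want to check afterwards which settings docker has stored for the container, docker inspect dumps the full container configuration as JSON:

# docker inspect frits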

If you want to see what is going on in the container, the first thing to use is the docker logs command. This shows the standard output of the main running process. The container will stop when the main running process terminates (the CMD in our Dockerfile, which starts manage-oracle.sh). In manage-oracle.sh I put a tail on the alert.log of the database after starting the listener, so you can use ‘docker logs’ to see how the database is doing (-f means follow):

# docker logs -f frits

LSNRCTL for Linux: Version 12.1.0.2.0 - Production on 11-AUG-2015 15:33:30

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Starting /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 12.1.0.2.0 - Production
System parameter file is /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=frits)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 12.1.0.2.0 - Production
Start Date                11-AUG-2015 15:33:30
Uptime                    0 days 0 hr. 0 min. 5 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
The listener supports no services
The command completed successfully

SQL*Plus: Release 12.1.0.2.0 Production on Tue Aug 11 15:33:36 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 1048576000 bytes
Fixed Size		    2932336 bytes
Variable Size		  247464336 bytes
Database Buffers	  658505728 bytes
Redo Buffers		  139673600 bytes
Database mounted.
Database opened.
SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Database Characterset is WE8MSWIN1252
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process AQPC
Tue Aug 11 15:33:55 2015
AQPC started with pid=30, OS id=88
Starting background process CJQ0
Tue Aug 11 15:33:57 2015
CJQ0 started with pid=61, OS id=150
Completed: ALTER DATABASE OPEN
Tue Aug 11 15:34:10 2015
===========================================================
Dumping current patch information
===========================================================
No patches have been applied
===========================================================

If you followed the examples in this blogpost, this is not what you would see, because there isn’t a database yet. Instead, if you look with docker logs, you will see:

# docker logs -f frits

LSNRCTL for Linux: Version 12.1.0.2.0 - Production on 11-AUG-2015 15:44:51

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Starting /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 12.1.0.2.0 - Production
System parameter file is /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=frits)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 12.1.0.2.0 - Production
Start Date                11-AUG-2015 15:44:51
Uptime                    0 days 0 hr. 0 min. 5 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/frits/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=frits)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
The listener supports no services
The command completed successfully
Creating and starting Oracle instance
2% complete
3% complete
5% complete

This is dbca working to create a database for you.

If you wonder what is going on inside the container, you can create a shell in it using the ‘docker exec’ command. This cannot be done if the container is stopped, only if the container is running:

# docker exec -ti frits bash
[oracle@frits ~]$

Unlike an attached start, a docker exec bash session can exit without interrupting the container.

In order to see running containers, use docker ps:

# docker ps
CONTAINER ID        IMAGE                 COMMAND                CREATED             STATUS              PORTS               NAMES
304d273d9424        oracle-12102:latest   "/bin/sh -c /home/or   2 minutes ago       Up 2 minutes        1521/tcp            frits

The container can be stopped:

# docker stop frits
frits

It is not visible anymore as a running container with docker ps:

# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

However, you can see stopped containers when you add ‘-a’ to ps:

# docker ps -a
CONTAINER ID        IMAGE                 COMMAND                CREATED             STATUS                     PORTS               NAMES
304d273d9424        oracle-12102:latest   "/bin/sh -c /home/or   25 minutes ago      Exited (0) 2 minutes ago                       frits

You can start a stopped container with docker start:

# docker start -a=false frits
frits

If you want containers to start on boot, you must automate the docker start command on the operating system.
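
On Oracle Linux 6 a simple, somewhat crude way to do that is to append the start command to /etc/rc.local; this is just my own quick approach (newer docker versions also offer restart policies via the --restart flag of docker run):

# echo "docker start frits" >> /etc/rc.local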

Please mind the exposed port on the container (1521, the listener) is local to the container, and not visible outside of the docker host. If you want to make the port reachable from outside the docker host, you must publish it from the container to the host. In order to do that, you must redefine the container, which requires you to drop it first:

# docker stop frits
frits
# docker rm frits
frits
# docker run --ipc=host --volume=/var/lib/docker/db/frits:/u01/app/oracle/data --name frits --hostname frits --detach=true --publish=1521:1521 oracle-12102
84421ae6ee344df7d78aba5f0f5314894137f120a05b94870fcc0f805287b632

Because the database is on persistent storage, it can just startup again.
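
With the port published, a client outside the docker host should be able to connect with an easy connect string. This is a sketch, assuming the service name equals the database name/hostname (frits) and ‘dockerhost’ resolves to the docker server:

$ sqlplus system@//dockerhost:1521/frits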

Oracle DBAs who are so old that they remember the days before Oracle 11.2 probably remember the tuning efforts for latches. I can still recall the latch number for cache buffers chains from the top of my head: number 98. In the older days this was another number, 157.

But it seems latches have become less of a problem in the modern days of Oracle 11.2 and higher. Still, when I generate heavy concurrency I can see some latch waits. (I am talking about you and SLOB, mister Closson.)

I decided to look into latches on an Oracle 12.1.0.2 instance on Oracle Linux 7. This might also be a good time to go through how you think they work; it might be different from what you think or have been taught.

In order to understand how latching works, I searched for Oracle related traces. I could find event 10005, but it did not return anything latching related. My current understanding is that event 10005 is Oracle KST tracing, for which the results are put in X$TRACE.

Luckily, I could get a great headstart by studying the work of Andrey Nikolaev. However, that work seems to be strictly Solaris based.

I decided to take a look at how this works on Linux. In order to do this, I set up a system for the specific purpose of this test. Disclaimer: the techniques below are for educational purposes only, and should never be done on a real database!

The work of Andrey shows kslgetl() as the overall latching function, which does:
kslgetl
– sskgslgf (immediate latch get)
– kslges (wait latch get)
— kskthbwt
— kslwlmod (setup wait list)
— sskgslgf (immediate latch get)
— skgpwwait (sleep latch get)
— semop

In order to do predictable latch gets and generate latch misses in a very predictable way, I used the method that Andrey shows on his website (which he attributed to Tanel), which is using oradebug call to call the latch get function. In order to get latch waits, you need at least two processes doing something with a latch: one holding the latch, and another one requesting it. To facilitate this, I set up two sqlplus / as sysdba sessions.

Taking a latch manually can be done using the kslgetl or the ksl_get_shared_latch functions. Freeing a latch is done using the kslfre function. The kslfre function just takes the latch address as argument. The kslgetl and ksl_get_shared_latch functions take the following arguments:
1-latch address
2-immediate get (0 means yes, 1 means no)
3-where (X$KSLLW.INDX)
4-why (X$KSLWSC.INDX)
5-mode (8=shared,16=exclusive; only for ksl_get_shared_latch function)
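
To find the address of a latch to use as the first argument, the fixed views can be queried; a sketch for the latch that shows up in the examples below (for child latches, v$latch_children has the individual addresses):

SQL> select name, addr from v$latch where name = 'cache table scan latch';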

Immediate mode get for a non shared latch.
An immediate mode get just tries to fetch a latch once. This is how I did a manual immediate latch get:

Session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 1

This means session 1 has gotten latch 0x60023a80. I requested the latch in immediate mode, but that does not really matter here: the session did get the latch.

If you want to check this, simply use V$LATCHHOLDER view to verify (in another session):

SQL> select * from v$latchholder;

        PID        SID LADDR            NAME                                                                   GETS     CON_ID
---------- ---------- ---------------- ---------------------------------------------------------------- ---------- ----------
        38        134 0000000060023A80 cache table scan latch                                                  709          0

Please mind that with a latch manually gotten, you need to free the latch before you try to do anything else with your session, otherwise you will encounter a (non critical) ORA-600. Freeing a latch is done using oradebug call kslfre with one argument: the latch address.
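
For the latch taken above, freeing it looks like this:

SQL> oradebug call kslfre 0x60023a80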

Session 2:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 0

The ‘Function returned 0’ means the immediate latch get failed. As you can see this was an immediate get because the second argument is 0.

In order to understand which functions are involved, I first used the perf record linux utility. However, because the immediate get does not spin, and a latch get by all means has speed/low overhead as one of its principal design criteria, I could not see the function.

This meant I needed to go to one of the tools I have used extensively in the past: gdb (the GNU debugger). You need to attach to the Oracle database server process of the session, locally on the database server. Here is what I did:

# gdb -p 4600
...
(gdb) set pagination off
(gdb) rbreak ^ksl.*
...
Breakpoint 262 at 0x8ea7b0
<function, no debug info> ksl_event_stats_rollup;
(gdb) commands 1-262
type commands for breakpoint(s) 1-262, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

The first gdb command turns off having to press enter after every full screen of gdb output, the second command sets breakpoints on all functions in the oracle executable that start with ‘ksl’. The commands command defines what is executed in gdb when breakpoints 1-262 are encountered, which is ‘c’: continue.

Now, with the debugger set, I executed the kslgetl function again:

Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 247, 0x000000000c8e10a0 in kslws_check_waitstack ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 242, 0x000000000c8ddcb0 in kslgetl ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()

It is important to understand that Oracle does a lot of other stuff besides the latch get via kslgetl. Most of the functions above start with kslwt, which is the Oracle wait interface. A couple of times a shared latch is taken (as can be seen by the function ksl_get_shared_latch) and freed (kslfre). The important part here is: kslgetl is executed once, and did not go into any other function to try to get the latch.

Getting a latch in willing to wait mode for a non shared latch.
Now let’s do something a bit more exciting: getting a latch that is already taken, in willing to wait mode. The first session does exactly the same, it just takes the latch. The second session needs a small change to indicate it is willing to wait:

SQL> oradebug call kslgetl 0x60023a80 1 0 2442

This will call additional functions. In order to understand what these functions are, I used perf record, perf report and perf script.
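
For reference, this is roughly the perf invocation I used against the server process executing the latch get (pid 4600 in my case); record while the oradebug call runs, stop perf with CTRL-C, then look at the results:

# perf record -g -p 4600
# perf report
# perf script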

I was able to create a smaller, more specific gdb script to see what is going on:

break kslgetl
  commands
    silent
    printf "kslgetl laddr:%x, willing:%d, where:%d, why:%d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslges
  commands
    silent
    printf "kslges %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

This is what the gdb output looks like when the latch get in willing to wait mode is executed:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
kslwlmod 13311368, -1780327896, 1610758784, 1
skgpwwait 13311608, -1767360, -1780326976, 0
sskgpwwait 13311608, -1767360, -1780326976, 0
semop 360451, 13310840, 1, -1

Interestingly, if the latch is not taken, this is how the latch get sequence looks like:

kslgetl laddr:60023a80, willing:1, where:0, why:2442

In other words, for getting a non shared latch in willing to wait mode:

1-the function kslgetl is called, which tries to fetch the latch.
If the latch can be taken, the function returns; if not:
2-the function kslges (kernel service latch get spinning) is called, which supposedly also tries to take the same latch.
If the latch still cannot be taken, the next function is:
3-the function kslwlmod (kernel service latch waiting list modify) is entered.
In this function the process registers itself as waiting in the post/wait list.
4-the function skgpwwait (system kernel generic post/wait wait) is entered.
This function sets up the waiting for the process so it can be posted.
5-the function sskgpwwait (system system kernel generic post/wait wait) is entered.
My current understanding is that the ‘ss’ functions contain the platform specific code for database functions.
6-the (operating system) function semop (semaphore operation) is called.
This makes the process sleep waiting on a semaphore (an operating system signalling mechanism). This way, the process is not runnable on the CPU until the semaphore is posted.

The information that is missing here is the spinning. The earlier work of Andrey Nikolaev showed that in the Solaris port of the database, distinct functions (sskgslgf [immediate] and sskgslspin [spin]) were used to get the latch, which made the spinning easy to count.

Some searching around showed that a CPU register holds this information. Add this to the above gdb script:

break *0xc29b51
  commands
    silent
    printf " kslges loop: %d\n", $ecx
    c
  end

And try to get a non shared taken latch in willing to wait mode:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
 kslges loop: 19999
 kslges loop: 19998
...
 kslges loop: 1
 kslges loop: 0
kslwlmod 1208568840, -1780327896, 1610758784, 1
skgpwwait 1208569080, -1281169344, -1780326976, 0
sskgpwwait 1208569080, -1281169344, -1780326976, 0
semop 360451, 1208568312, 1, -1

So…this reveals that getting a non shared latch in willing to wait mode spins 20000 times, which is 10*_spin_count (with _spin_count at its default of 2000), and not _spin_count itself. What is even more interesting is that for this type of latch there is no (short) timed sleep done: the kslges function spins 10*_spin_count times for the latch, then puts the process on the waiting list, and goes to sleep on a semaphore. The way the latching mechanism works has been described (by Oracle itself, as far as I know) as a process spinning _spin_count times trying to get a latch, then sleeping for some time, then spinning again, sleeping again, and so on.

I strace’d the process holding the latch to see if it does a semctl system call to signal the waiting process when the latch is freed, and indeed that is what is happening. This is different from the “old days”, where processes spinning on latches (cache buffers chains latches most of the time) were really hammering the system and would eat up a lot of CPU slices trying to get a latch.
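For reference, this is roughly how the holding and the waiting process can be watched; the PIDs are examples:

# watch the latch holder for semaphore control calls (posting the waiter when the latch is freed)
strace -e trace=semctl,semop,semtimedop -p 11111
# watch the waiting process going to sleep on a semaphore
strace -e trace=semop,semtimedop -p 22222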

Part of the classic latching problem, specifically the cache buffers chains latching problem, quite probably is mitigated by having shared latches, which were introduced in Oracle in different versions for different ports.

Watch out for a second blogpost where I do the same investigation for shared latches.

This is a question that I have played with for a long time: how much time does an Oracle logical IO actually take? There have been statements on logical IO performance (“logical IO is x times faster than physical IO”), but nobody could answer the question what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name “session logical reads”. However, if you divide the number of logical reads by the total time a query took, the logical IO time comes out too high, because that assumes all the time the query took was spent on doing logical IO, which obviously is not the case: there is time spent on parsing, maybe physical IO, etc. Also, when doing that, you calculate an average, and averages are known to hide actual behaviour.

Luckily, Red Hat Enterprise Linux and Oracle Linux version 7 ship kernel version 3.10 as the stock kernel. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain event (the probed event) starts; a return probe triggers an action when that event finishes. The Oracle UEK version 3 kernel at the time of writing is based on version 3.8, which does not support this.

My current knowledge is consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind that all Oracle kernel function name translations, like kcbgtcr (kernel cache buffers get consistent read), are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.

With the 3.10 kernel, we can create a small systemtap script to measure the time between entering and returning from the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself. This is the reason you see /usr/local/bin/stap in the shebang.
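In case you want to do the same, this is a rough sketch of building systemtap from source, which installs into /usr/local by default; the exact prerequisites and paths may differ per system:

# fetch the systemtap source and build it; stap ends up in /usr/local/bin
git clone git://sourceware.org/git/systemtap.git
cd systemtap
./configure
make
sudo make install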

#!/usr/local/bin/stap

global latency

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])
}

Now make the script executable, and run it against a database session:

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

...
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 79542
>kcbgtcr
<kcbgtcr, latency(us): 4
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 13
>kcbgtcr
<kcbgtcr, latency(us): 1
...

This shows the function being executed by the database session. However, it seems the time spent in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

#!/usr/local/bin/stap
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])
}

probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")
		println(@hist_log(latency_histogram))
	}
}

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:

...

==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority is in the 1 and 2 microsecond buckets, but there are quite a few executions that fall well outside of these, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am pretty much turning the squelch down and making a lot of information visible (consider this a warning!). Here is the systemtap script I am using:

#!/usr/local/bin/stap
#
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
	}
	printf("\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiosq0") {
	printf("=PARSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiexe") {
	printf("=EXECUTE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opifch2") {
	printf("=FETCH\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiclo") {
	printf("=CLOSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	printf("=KSLWTBCTX\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	printf("=KSLWTECTX\n")
}
probe process("/lib64/libaio.so.1").function("io_submit") {
	printf("=IO_SUBMIT\n")
}
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
	printf("=IO_GETEVENTS\n")
}
probe process("/lib64/libc.so.6").function("pread64") {
	printf("=PREAD\n")
}

Warning! This script takes a long time to compile into a kernel module!! Systemtap works by dynamically creating a kernel module from the systemtap script and inserting it into the kernel. Because I injected a huge chunk of code to run in the kernel, a lot of things are potentially influenced. I found the Oracle database dropping cores, for example.

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgtcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<” means returning from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is "< kdst_fetch" in other words: returning from kernel data scan table fetch. So this function has performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further we encounter kcbgtcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblgt(), requests a block. This means that blocks read in this code path go into PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process: while previously we saw kcbgtcr() take 2us, here, probably because of all the probes, the kcbgtcr function took 62us.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take excessively more time?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
=IO_SUBMIT
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
=IO_GETEVENTS
=IO_GETEVENTS
169481 oracle_5487_tes(5487):              > kcflwi
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTBCTX
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTECTX
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, up to the point where kcblgt is entered. After kcbgtcr > kcbldrget (the consistent read request function choosing direct path reads), the kcblgt function does not return immediately this time, but starts off a lot of extra code path, beginning with kcblrs.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it: if you want to read blocks, you have to check whether they are available, and if they are not available, you have to fetch them.

As a conclusion: if kcbgtcr(), together with kcbgcur() and a couple of other functions, is considered the logical IO, then a logical IO has a variable time, instead of a consistent one. Because physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means that it is technically incorrect to state that a physical IO is slower than a logical IO, because the physical IO is part of the logical IO request that needed it, and as such a physical IO can not be slower than the logical IO that contains it.

A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the Oracle engine must have made the decision to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication on how much potentially could be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and get an understanding when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides a (non documented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information which the tracing provides is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)
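The trace output ends up in the session’s trace file. If you want to know where that is for your own session, a query along these lines works on version 11 and higher:

select value
from v$diag_info
where name = 'Default Trace File';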

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;

  COUNT(*)
----------
   1000000

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks.
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this means to express. It starts with a minus sign, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo ("Obect") is Oracle's. This line describes MTT, medium table threshold, which is 5 times small table threshold, and it states that the threshold is MTT. This is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the Exadata code base is always enabled. Probably in version 10 this had to be explicitly linked in, much like functionality such as RAC and SDO (spatial), by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.
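If you want to know the value of _small_table_threshold (and the VLOT percentage) in your own instance, the usual way is to query the underscore parameters directly; a sketch of such a query, to be run as SYSDBA because the X$ tables are not visible to normal users:

select a.ksppinm parameter, b.ksppstvl value
from x$ksppi a, x$ksppcv b
where a.indx = b.indx
and a.ksppinm in ('_small_table_threshold', '_very_large_object_threshold');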

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line, which is described above. Here islarge is set to 1, which means the object is considered too large to be a small segment.
The next line is the kcbimd function; based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0: this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1: this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold is 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;
    169999

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different than the total number of blocks from dba_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1’ which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down, the kcbivlo line says it is not large either (is_large 0); ‘large’ in this context means larger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0’.

An important point is that the switch to considering a direct path read, in other words a segment being considered medium sized, happens simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line for segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment being bigger than MTT (5*STT) and smaller than VLOT, which is not true: the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database dependent.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable for me are diomrc (quite probably direct IO multiblock read count), which is set to the multiblock read count value, and dios (quite probably direct IO slots), which shows the starting value of the direct IO slots: the number of IOs the database will issue asynchronously when starting a full segment scan. Fully automatically, Oracle will measure throughput and CPU usage, and determine whether more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on 11.2.0.3 and 12.1.0.2, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless 99% or more of its blocks were in the local cache. This is quite contrary to the popular belief that the threshold for switching to reading blocks via the buffer cache is 50% of the blocks in cache. In all honesty, I have presented the switch point as being 50% too.

When adding writes to the mix, it gets even more interesting. I first did an update of approximately 40% of the blocks, and did not commit. Tracing a simple count(*) on the entire table (this is on 11.2.0.3, which gives less information) shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, this is doing direct path reads, and ckpt is set to 1 (I think indicating the need to checkpoint), which seems logical if my session wants to do a direct path read of modified blocks.

Now this is what it looks like when I update 50% of the table, and then run select count(*) against it twice.
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd… the first time I do a direct path read, and the second time a no-direct, alias buffered, read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so the scan should indeed be buffered.

Actually, a hint is visible in the first run. If we were to do a direct path read, how come ckpt: 0? I can not see how it would be possible to do a direct path scan when there are changes on blocks in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait event ‘cell multiblock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ in the nsmtio trace, this is actually doing a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle can and will not guarantee anything, but this way the tracing could tell the wrong story.

Conclusions

The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read for full table scan into the buffercache.
– A segment that is bigger than 500% of the buffercache is always full table scanned via direct path.
– A segment that is sized between _small_table_threshold and 500% of the buffercache is considered medium, and can be full table scanned using either direct path reads or buffered reads.
– The tracing on 12.1.0.2 gives a hint there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and bigger than it. This is because of the function kcbcmt1 showing aging/timing information on the blocks when a segment is smaller than MTT.
– For a medium sized segment, a full table scan is done via direct path, unless 99% or more of its blocks are in the cache.
– For a medium sized segment, a full table scan is done via the buffercache if 50% or more of its blocks are dirty.

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.

In my previous article I started exploring the memory usage of a process on a recent Linux kernel (2.6.39-400.243.1 (UEK2)); recent means “recent for the Enterprise Linux distributions” in this context, as Linux kernel developers would point out that the mainline kernel is at version 3.19 (the stable version at the time of writing of this blogpost).

The previous article showed that every process has its own address space, and that different allocations exist for execution. These allocations can be seen in the proc pseudo filesystem, in a process specific file called ‘maps’. The process itself needs some administration areas, which are anonymous allocations marked with [heap] and [stack], and a few others called [vdso] and [vsyscall]. Every process executes something (not all people realise that: a process can wait for something, but essentially it is always executing). So there always will be an executable in a (regular) process’ address space. In a lot of cases, the executable uses shared libraries. In that case, the libraries are loaded into the address space of the process too, alongside the executable.

The executable contains (at least) two segments: the code segment, which is read-only and potentially shared, and the data segment, which is read-write and gets truly allocated (instead of shared with the parent) once the process writes to it. In a lot of cases, the executable uses shared libraries, which means it uses functions stored in those libraries. A library also needs to be loaded, and also contains multiple segments, which can be read-only or read-write, and which are shared unless the process needs to write to them.
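These segments are visible in the ELF program headers of the executable and the libraries; for example (output differs per binary and system):

# show the program headers; the LOAD segment with flags 'R E' is the code segment,
# the LOAD segment with flags 'RW' is the data segment
$ readelf -l /bin/cat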

For completeness, here’s the complete maps output of a process executing the ‘cat’ executable again:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

If you look closely, you will see that I didn’t explain one type of allocation yet: the anonymous allocations. The anonymous allocations are visible in lines 4, 11, 13 and 16. The anonymous mappings directly following the data segment of either an executable or a library mapped into a process address space is called the BSS. The data segment and the BSS store static variables, however the data segment stores initialised variables, the BSS stores uninitialised variables. The anonymous mapping for the BSS section might exist for a library, as seen above, or might not exist; not all Oracle database executable libraries use an anonymous memory mapping for example. Actually, there is one other memory allocation visible, /usr/lib/locale/locale-archive, which is a file for locale (multi-language support) functions in the C library, which is out of scope for this article.
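The sizes of the text (code), data and BSS sections of a binary can be listed with the size utility; for example:

# summary of the text, data and bss section sizes
$ size /bin/cat
# detailed per-section listing in System V format
$ size -A /bin/cat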

When a process requests memory to store something, the C library function malloc() (memory allocation) can be called. This function inspects the size of the allocation, and will either allocate memory from the process’ heap (the memory mapping marked [heap], grown using the system call brk()) or allocate space in a new anonymous memory segment, using the system call mmap(). If you follow the link with malloc(), you can read the source code of the malloc() call. There are different malloc() implementations, which fulfil different purposes (embedded devices have different requirements than huge servers); the implementation that Enterprise Linuxes use is called ptmalloc2, which is based on a version written by Doug Lea. If you read the comments in the source code, specifically at ‘Why use this malloc?’, you will see that it tries to be smart with requests up to 128KB (for memory re-usability, to avoid fragmentation and memory wastage), which are allocated from the heap. If an allocation is larger than 128KB, it will use the system memory mapping facilities.
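A quick way to see this behaviour for yourself is to watch the memory related system calls of a process while it allocates; a sketch (the PID is an example):

# small malloc()s typically grow the heap via brk(); allocations over roughly 128KB
# typically show up as anonymous mmap()s (and munmap()s when they are freed)
strace -f -e trace=brk,mmap,munmap -p 4321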

Okay, this brings us back to the original question: how much memory does this process take? I hope you recall from the first blogpost that Linux tries to share as much memory as possible, and that when a new process is created, the allocations in the address space of this new process are pointers to the memory areas of the parent process. Let’s first use a utility a lot of people are using: top.

top - 10:53:40 up 9 days, 14:11,  2 users,  load average: 1.34, 1.36, 1.37
Tasks: 1124 total,   1 running, 1123 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  0.8%sy,  0.1%ni, 98.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98807316k total, 97411444k used,  1395872k free,   400288k buffers
Swap: 25165820k total,  3560852k used, 21604968k free, 27573200k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 16391 oracle    -2   0 2369m 4960 4812 S  0.7  0.0  91:41.37 asm_vktm_+asm1
 16407 oracle    20   0 2388m  20m 8580 S  0.7  0.0  46:16.40 asm_dia0_+asm1

I edited the output a bit to show only two processes from an ASM instance.
The columns that show information on memory are VIRT, RES, SHR, %MEM.

VIRT is described in the man page of top as ‘The total amount of virtual memory used by the task’. This means it is ALL the memory visible in the address space of the process. A useful utility to get the contents of the virtual memory allocations for a process is pmap; let’s use it for process 16391, which is asm_vktm_+asm1:

$ pmap -x 16391
16391:   asm_vktm_+ASM1
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000  246644    2520       0 r-x--  oracle
000000000f6dd000    1952      44      20 rw---  oracle
000000000f8c5000     140       0       0 rw---    [ anon ]
000000001042c000     356       0       0 rw---    [ anon ]
0000000060000000    4096       0       0 rw-s-  SYSV00000000 (deleted)
0000000061000000 2080768       0       0 rw-s-  SYSV00000000 (deleted)
00000000e0000000      12       4       4 rw-s-    [ shmid=0x4e78003 ]
00007fa275589000      72      12       0 r-x--  libnfsodm12.so
00007fa27559b000    2044       0       0 -----  libnfsodm12.so
00007fa27579a000       8       0       0 rw---  libnfsodm12.so
00007fa27579c000    1604       0       0 r-x--  libshpksse4212.so
00007fa27592d000    2044       0       0 -----  libshpksse4212.so
00007fa275b2c000      72       0       0 rw---  libshpksse4212.so
00007fa275b3e000      20       4       0 r-x--  libcxgb3-rdmav2.so
00007fa275b43000    2044       0       0 -----  libcxgb3-rdmav2.so
00007fa275d42000       4       0       0 rw---  libcxgb3-rdmav2.so
...snip...
00007fa27abd5000       8       0       0 r-x--  libodmd12.so
00007fa27abd7000    2044       0       0 -----  libodmd12.so
00007fa27add6000       4       0       0 rw---  libodmd12.so
00007fa27add7000     128     116       0 r-x--  ld-2.12.so
00007fa27adf8000     512      12      12 rw---    [ anon ]
00007fa27ae78000     212       4       4 r--s-  passwd
00007fa27aead000    1260      68      68 rw---    [ anon ]
00007fa27aff3000       4       4       0 rw-s-  hc_+ASM1.dat
00007fa27aff4000       8       0       0 rw---    [ anon ]
00007fa27aff6000       4       0       0 r----  ld-2.12.so
00007fa27aff7000       4       0       0 rw---  ld-2.12.so
00007fa27aff8000       4       0       0 rw---    [ anon ]
00007fff7b116000     132       8       8 rw---    [ stack ]
00007fff7b1ff000       4       4       0 r-x--    [ anon ]
ffffffffff600000       4       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB         2426668    5056     156

The column ‘Kbytes’ represents the full size of the executable, libraries, shared memory, anonymous mappings and other mappings of this process. For completeness’ sake: 2426668/1024=2369.79, which matches the 2369m shown in the top output. This is all the memory this process can see, and could use. Does this tell us anything about what memory process 16391 actually takes? No. (Parts of) the Oracle executable’s allocations are potentially shared, the shared memory (SYSV00000000 (deleted) and [ shmid=0x4e78003 ], which represent the Oracle SGA) is shared, and the memory allocations for the libraries are potentially shared. The anonymous memory mappings have been defined, but the actual allocation is not visible in the Kbytes column. What this column in top does tell me is the approximate SGA size, especially if the SGA is larger (meaning multiple gigabytes).

The second memory column in top is RES. RES is described as: ‘The non-swapped physical memory a task is using’. RES is sometimes referred to as RSS, and called ‘resident set size’. As we can see from the total, the RSS is far smaller than the virtual memory size. One important thing in the RES description of top is that swapped memory pages are not counted in the RES/RSS value. RES/RSS corresponds to the memory actually used (“touched”) by a process, and is directly usable. If you look back at the RSS column of pmap above, you see the oracle executable’s two mappings: one has an RSS size of 2520, and one has an RSS size of 44. But… if you remember that the code/read-only segment is potentially shared with other processes, and then look at the 2520 value (which belongs to the oracle memory segment with the rights r-x--, which means the code segment), I hope you understand this just means this process (vktm) read a subset of the entire executable, and more importantly: the RSS size does not reflect physical memory uniquely allocated by this process.

If we look at the shared memory segments, it’s interesting to see what happens during normal life of a database session. I think it is needless to say that you should calculate shared memory outside of process memory usage, since it’s a distinct memory set that is truly shared by all the processes that are created for the instance.

This is a session which has just started:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
...
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     348     348 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848    2316    2316 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592       0       0 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]
...

This instance has an SGA set to 4G. Because the session has just started, it has only touched 2316 KB of the SGA. Next, I do a big (buffered!) full table scan, requiring the session to put a lot of blocks into the buffercache. After the scan, look at the shared memory segments using pmap again:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
...
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     384     384 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848 2279040 2279040 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592   66564   66564 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]
...

The session has touched half of the SGA shared memory segment (visible in the RSS column of the 6th line). This is logical if you understand what is going on: the process does a buffered table scan, which means the blocks read from disk need to be stored in the buffer cache, which is one of the memory structures in the Linux shared memory segments. However, if you look strictly at the top utility output of a database that has just started up, you see the RSS size of all the processes that work with the buffercache growing. This phenomenon has led to a recurring question on the Oracle Linux forums whether Oracle database processes are leaking memory. Of course the correct answer is that the RSS size just grows because the process touches more and more of the shared memory (=SGA) that has been mapped into its address space. It will stop increasing once it has touched all the memory it can touch.
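A simple way to watch this happening is to sample the resident set size of a server process while it is scanning; a sketch (43853 is the example session used above):

# print the RSS (in KB) of process 43853 every 5 seconds
while true; do ps -o rss= -p 43853; sleep 5; done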

%MEM is the RES/RSS size expressed as a percentage of the total physical memory.

SHR is the amount of shared memory. The man page of top says ‘It simply reflects memory that could be potentially shared with other processes’. Do not confuse this with the shared memory segments mapped into the process’ address space. Empirical tests show the SHR value always seems to be lower than the RSS size, which means it seems to track the RSS value and show the RSS (touched) memory that could be shared (which seems to contain both touched memory from the shared memory segments and the privately mapped executable and libraries). At least from an Oracle performance and troubleshooting perspective, I can’t see much benefit in using this value.

The conclusion of this part on memory usage is that neither the virtual set size (VIRT/VSZ) nor the resident set size (RES/RSS) are figures you can simply add up to indicate the physical memory usage of a single process or a group of processes.

The virtual set size gives you the total amount of virtual memory available to a single process, which includes the executable, and potentially shared libraries, anonymous memory mappings and files mapped into the address space and shared memory. In a lot of cases, you get an indication of the total SGA size of the instance, because the shared memory segments which contain the SGA are entirely mapped into the address space of the process.

The resident set size shows you how much of the memory and files mapped into the address space is actually “touched”, and directly usable. Some of the memory usage results in memory pages private to the process, because of writing to the memory and the copy-on-write mechanism of the Linux memory manager. A lot of other memory mappings can be used, increasing the resident set size, while these are shared with other processes. A third potential component is the actual usage of memory as a result of anonymous memory mappings (versus the total allocation, which can be much more), which are private to the process.
