Tag Archives: trace

Actually, this is a follow-up post to my performance deep dive into tablespace encryption. After having investigated how tablespace encryption works, this blogpost looks at the other encryption option: column encryption. A conclusion that can be shared upfront is that despite the two performing basically the same function, the implementation and performance consequences are quite different.

Column encryption gives you the ability to choose to encrypt per individual column; that’s kind of obvious. However, having to choose which columns to encrypt is what I see as the biggest downside of this encryption option. In most cases, especially with boxed applications, it is quite hard to figure out exactly which columns you want to encrypt in order to protect your sensitive data. Which columns exactly contain your primary sensitive data, and which columns contain secondary sensitive data (data derived from sensitive data)? Do you, when you have to apply encryption, know what EXACTLY is defined as sensitive data, and what isn’t? I bet there isn’t a clear technical description.
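
If you need to assess an existing application, a starting point is listing what is already column encrypted; the dictionary view dba_encrypted_columns shows the encrypted columns and their algorithm (a quick check, assuming DBA privileges):

select owner, table_name, column_name, encryption_alg, salt
from dba_encrypted_columns
order by owner, table_name, column_name;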

A logical reaction then would be: couldn’t I just encrypt all columns? Well, that is what tablespace encryption is for, isn’t it? To summarise: I think the correct use of column encryption is hard to implement in reality and thus very limited in usefulness; in most cases tablespace encryption should be used.
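
For comparison, tablespace encryption is declared once at the tablespace level, after which everything stored in it is encrypted. A minimal sketch of such a declaration (the tablespace name, file name and size are made up, and an open wallet is assumed):

create tablespace encrypted_ts
datafile '/u01/oradata/testdb/encrypted_ts01.dbf' size 100m
encryption using 'AES192'
default storage (encrypt);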

Okay…for this test I created a table with two columns, of which one is encrypted:

SQL> create table column_encryption (id number, a varchar2(10) encrypt);
SQL> insert into column_encryption values (1, 'AAAAAAAAAA');
SQL> commit;

The same table, but without encryption:

SQL> create table no_column_encryption (id number, a varchar2(10) );
SQL> insert into no_column_encryption values (1, 'AAAAAAAAAA');
SQL> commit;

And the same table with a lot of rows:

SQL> create table column_encryption_large (id number, a varchar2(10) encrypt);
SQL> begin
       for counter in 1..32000000 loop
         insert into column_encryption_large values ( counter, dbms_random.string('l',10) );
       end loop;
     end;
/

Let’s follow the path of the previous TDE post, and profile the execution of a SQL statement on the big table to see the impact of column encryption. The first test is a ‘select count(*) from column_encryption_large’ in one session, and ‘perf record -g -p PID’ in another. If you need more explanation on how to run it, please look at the previous blogpost. This is the output of ‘perf report --sort comm --max-stack 2’:

# perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 1418165467
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |--29.21%-- kdstf00000010000100kmP
            |--12.58%-- kdbulk
            |--3.32%-- gup_pte_range
            |--2.58%-- kdst_fetch0
            |--2.54%-- kcbgtcr
            |--2.25%-- __blk_bios_map_sg
            |--2.21%-- kcbhvbo
            |--2.18%-- unlock_page
            |--1.98%-- ktrgcm
            |--1.93%-- do_direct_IO
            |--1.86%-- kcbldrget
            |--1.52%-- kcoapl

This shows IO related functions on both the Oracle and the operating system level; kdstf, for example, is ‘kernel data scan table full’, while gup_pte_range, do_direct_IO, unlock_page and __blk_bios_map_sg are Linux kernel functions. Most notably, there are no encryption related functions, which is a big difference with tablespace encryption!
This is actually very logical if you understand the differences between column encryption and tablespace encryption. First let’s look at a block dump of a data block from a segment in an encrypted tablespace:

Block dump from cache:
Dump of buffer cache at level 4 for pdb=0 tsn=5 rdba=907
Block dump from disk:
Encrypted block <5, 907> content will not be dumped. Dumping header only.
buffer tsn: 5 rdba: 0x0000038b (1024/907)
scn: 0x0.4e9af4 seq: 0x01 flg: 0x16 tail: 0x9af40601
frmt: 0x02 chkval: 0xf23a type: 0x06=trans data

Yes…you read that right: the block is encrypted, so it will not be dumped. Luckily, you can set the undocumented parameter “_sga_clear_dump” to true to make Oracle dump the block:

SQL> alter session set "_sga_clear_dump"=true;
SQL> alter system dump datafile 5 block 907;

The dump then shows the decrypted version of the tablespace level encrypted block:

Block header dump:  0x0000038b
 Object id on Block? Y
 seg/obj: 0x17bc3  csc: 0x00.4e9aed  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x388 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0007.01d.000001d0  0x00000987.0390.27  --U-    1  fsc 0x0000.004e9af4
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x0000038b
data_block_dump,data header at 0x7f140f335374
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f140f335374
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f8a
avsp=0x1f76
tosp=0x1f76
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f8a
block_row_dump:
tab 0, row 0, @0x1f8a
tl: 14 fb: --H-FL-- lb: 0x1  cc: 1
col  0: [10]  41 41 41 41 41 41 41 41 41 41
end_of_block_dump

For the count(*), there is no need to read the actual column data; the only thing needed is to read the row directory to fetch the number of rows (the nrow fields in the dump above). However, even for that, the block must be decrypted.

Now look at a block dump of a column encrypted data block:

Block header dump:  0x0000032b
 Object id on Block? Y
 seg/obj: 0x1821d  csc: 0x00.676d7e  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x328 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.007.000078a9  0x00000117.2246.07  --U-    1  fsc 0x0000.00676d7f
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x0000032b
data_block_dump,data header at 0x7f140f333264
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f140f333264
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f5d
avsp=0x1f49
tosp=0x1f49
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f5d
block_row_dump:
tab 0, row 0, @0x1f5d
tl: 59 fb: --H-FL-- lb: 0x1  cc: 2
col  0: [ 2]  c1 02
col  1: [52]
 fd e0 87 66 55 f7 e6 43 de be 31 f6 71 4f 7f 4e f1 75 fb 88 98 9d 13 ed 8e
 cb 69 02 bc 29 51 bd 21 ea 22 04 6b 70 e9 ec 01 9d d6 e4 5a 84 01 1d 90 b0
 e9 01
end_of_block_dump

The block and the row directory can be read normally without any need for decryption. The only thing encrypted is the column (“a”). That perfectly explains the absence of any functions that indicate decryption, because there isn’t any decryption taking place!
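
In case you want to reproduce the block dumps: the file and block number of a row can be derived from its rowid using the documented dbms_rowid package (a sketch; mind that this returns the relative file number, which in a small database typically equals the absolute one used in the dump command):

select dbms_rowid.rowid_relative_fno(rowid) file#,
       dbms_rowid.rowid_block_number(rowid) block#
from column_encryption;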

Now let’s rewrite the SQL to touch the data, and thus involve decryption: ‘select avg(length(a)) from column_encryption_large’. This way the rows need to be decrypted and read. This is what the output of a perf recording looks like:

# perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 65K of event 'cycles'
# Event count (approx.): 229042607170
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |--24.73%-- ztchsh1h
            |--14.91%-- ztchsh1n
            |--6.10%-- y8_ExpandRijndaelKey
            |--5.90%-- ownGetReg
            |--5.50%-- __intel_ssse3_rep_memcpy
            |--4.99%-- ztchsh1f
            |--4.28%-- ztcxi
            |--2.60%-- ipp_is_GenuineIntel
            |--1.88%-- _intel_fast_memcpy
            |--1.74%-- _intel_fast_memcpy.P
            |--1.52%-- kspgip
            |--1.16%-- kgce_init

The functions starting with ‘ztc’ are probably related to security (“zecurity”), and probably to decryption. The function name “y8_ExpandRijndaelKey” is clearly related to encryption. When you look up the function address of “ownGetReg”, it is close to the “y8_ExpandRijndaelKey” function. The last group are memcpy related functions, which seems consistent with decryption: moving data around.

On the performance side, it’s clear that the majority of the time is spent in the functions ztchsh1h and ztchsh1n. In order to understand more about these functions, let’s expand the stack:

# perf report --sort comm
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 65K of event 'cycles'
# Event count (approx.): 229035032972
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |
            |--25.01%-- ztchsh1h
            |          |
            |          |--99.63%-- ztchsh1n
            |          |          |
            |          |          |--50.85%-- ztchsh1f
            |          |          |          ztchf
            |          |          |          ztcxf
            |          |          |          ztcx
            |          |          |          kztsmohmwl
            |          |          |          kztsmhmwl
            |          |          |          kzekmetc
            |          |          |          kzecsqen
            |          |          |          kzecctex
            |          |          |          evaopn2
            |          |          |          evaopn2
            |          |          |          qesaAggNonDistSS
            |          |          |          kdstf00001010000000km
            |          |          |          kdsttgr
            |          |          |          qertbFetch
            |          |          |          qergsFetch
            |          |          |          opifch2
            |          |          |          kpoal8
------------------------------------------------------
            |--14.90%-- ztchsh1n
            |          |
            |          |--85.25%-- ztchsh1f
            |          |          ztchf
            |          |          ztcxf
            |          |          ztcx
            |          |          kztsmohmwl
            |          |          kztsmhmwl
            |          |          kzekmetc
            |          |          kzecsqen
            |          |          kzecctex
            |          |          evaopn2
            |          |          evaopn2
            |          |          qesaAggNonDistSS
            |          |          kdstf00001010000000km
            |          |          kdsttgr
            |          |          qertbFetch
            |          |          qergsFetch
            |          |          opifch2
            |          |          kpoal8

I fetched the stacks of the two functions in which the most time was spent. The most important thing to see is that the decryption now takes place as part of processing the fetched data (qesaAggNonDistSS probably has something to do with aggregating data, evaopn2 probably is a function to evaluate operands), rather than as part of performing the (logical) IO; mind the absence of the kcbgtcr function.

The reason for doing the decryption during operand evaluation rather than during the IO is that the data is stored encrypted in the block, and thus also in the buffer cache. So at IO time, there is nothing to decrypt. This also has another rather important consequence: decryption needs to take place for every processed row that has an encrypted column. There does not seem to be any caching of decrypted values for column encryption, which is logical from a security point of view, but has a severe performance consequence.

When doing a pin tools debugtrace on the above SQL for the processing of a single row (on the table ‘column_encryption’, rather than ‘column_encryption_large’), applying the sed filters, and then grepping for a selective set of functions, this is what the processing looks like:

 | | | | > qergsFetch(0x294512030, 0x7f871c9fa2f0, ...)
 | | | | | > qeaeAvg(0x7f8717ce9968, 0xe, ...)
 | | | | | < qeaeAvg+0x000000000063 returns: 0
 | | | | | > qertbFetch(0x294512178, 0x7f871ca08a68, ...)
 | | | | | | | | | | > kcbgtcr(0x7ffe2f9b3ae0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > pread64@plt(0x100, 0x1f428c000, ...)
 | | | | | | | | | | < kcbgtcr+0x000000003221 returns: 0x1f428c014
 | | | | | | | | | | | | | | > kcbgtcr(0x7ffe2f9b35d0, 0, ...)
 | | | | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1f428c014
 | | | | | | > kdsttgr(0x7f871c9f9918, 0, ...)
 | | | | | | | > kdstf00001010000000km(0x7f871c9f9918, 0, ...)
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > pread64@plt(0x100, 0x2b1115000, ...)
 | | | | | | | | | | | < kcbgtcr+0x000000003221 returns: 0x1e4aa6014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1e4aa6076
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1e4aa6076
 | | | | | | | | > qesaAggNonDistSS(0x7ffe2f9b45d0, 0x7fff, ...)
 | | | | | | | | | > evaopn2(0x294511ef0, 0x294512030, ...)
 | | | | | | | | | | > evaopn2(0x294511e68, 0x294512030, ...)
 | | | | | | | | | | | | | | | | | | | > ztchsh1n(0x7ffe2f9b1ef8, 0x11c4e8d0, ...)
 | | | | | | | | | | | | | | | | | | | > ztchsh1f(0x7ffe2f9b1ef8, 0x7ffe2f9b3100, ...)
 --> 168 times in total of ztchsh1n or ztchsh1f
 | | | | | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2c0
 | | | | | | | | | | > evalen(0x294511ef0, 0x7f871c9fa2c0, ...)
 | | | | | | | | | | < evalen+0x000000000147 returns: 0x2
 | | | | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2d0
 | | | | | | | | | > qeaeAvg(0x7f8717ce9968, 0xb, ...)
 | | | | | | | | | < qeaeAvg+0x000000000063 returns: 0x7f8717ce99c9
 | | | | | | | | < qesaAggNonDistSS+0x000000000193 returns: 0x7fff
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1dec30014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1dec30072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1dec30072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1deca4014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1deca4072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1deca4072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1e4be0014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1e4be0072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1e4be0072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1dedb2014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1dedb2072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1dedb2072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | < kdst_fetch0+0x0000000011c9 returns: 0
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0
 | | | | | | | < kdstf00001010000000km+0x00000000035d returns: 0x7fff
 | | | | | | < kdsttgr+0x00000000085f returns: 0x7fff
 | | | | | < qertbFetch+0x000000001301 returns: 0x7fff
 | | | | | > qeaeAvg(0x7f8717ce9968, 0x294511f78, ...)
 | | | | | < qeaeAvg+0x000000000063 returns: 0x2
 | | | | | | > evaopn2(0x294511f78, 0, ...)
 | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2e0
 | | | | < qergsFetch+0x000000000f25 returns: 0

This is the explain plan of the ‘select avg(length(a)) from column_encryption’ SQL:

----------------------------------------------------------------------------------------
| Id  | Operation	   | Name	       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |		       |       |       |     3 (100)|	       |
|   1 |  SORT AGGREGATE    |		       |     1 |    53 |	    |	       |
|   2 |   TABLE ACCESS FULL| COLUMN_ENCRYPTION |     1 |    53 |     3	 (0)| 00:00:01 |
----------------------------------------------------------------------------------------

If you look back at the grepped debugtrace (counting its lines), matching it with the execution plan:
Line 1: the sort aggregate rowsource (qergsFetch).
Line 4: the table access full (qertbFetch).
Line 5: here a logical read (kcbgtcr) is issued, and because the block didn’t exist in the cache, was physically read (line 6: pread64). This is the segment header, the “real” scan of data blocks has not been started yet.
Line 11: this is the ultra fast full table scan (kdstf00001010000000km). My guess is this is a full table scan function with certain decisions hard coded, instead of being made at runtime, so the in-CPU execution prediction runs into fewer branch mispredictions.
Line 12: this is the part of the full table scan for fetching (loading) the data (kdst_fetch). What is special here is that a multiblock read is done, the kcbgtcr function triggers a single physical read for multiple logical blocks, which are later fetched per block (kdst_fetch and kcbgtcr functions starting from line 32, 38, etc).
Line 19: this function executes row based functions and aggregates the results per block/fetch (qesaAggNonDistSS).
Line 20: as part of fetching the row and executing functions, the row value is evaluated first (evaopn2).
Line 21/22: here the column is decrypted (made visible by the ztchsh1n/ztchsh1f functions, not necessarily the decryption functions themselves).
Line 26/29: here probably the length (evalen) and average (qeaeAvg) row functions are executed.
Line 32: the next block is processed, but no rows are found, and thus no need to execute rowsource (qe*) functions afterwards.

So, what do we know at this point regarding column encryption?
– Columns that are encrypted are stored encrypted in the block in the buffer cache.
– Which means they have to be decrypted every time the column values are read. This is different from tablespace encryption, where the whole block is stored encrypted and is decrypted whenever it is read into the buffer cache.
– Functions related to column encryption specifically (different functions than seen with tablespace encryption) take roughly 40% of the time in my case.

Can the time spent on column decryption be optimised?
There are multiple ways you can change the way Oracle applies column encryption. There are four encryption types: 3DES168, AES128, AES192 and AES256. The default is AES192.
Here are query timings of doing a select avg(length(a)) from TABLE on my “large” table with 32 million rows:

3DES168 4:53
AES256 1:09
AES192 1:06
AES128 1:03
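
For reference, the algorithm is specified per column in the DDL, and an existing table can be switched to another algorithm with a rekey. A sketch of both, using a hypothetical table t_enc and assuming an open wallet:

SQL> create table t_enc (id number, a varchar2(10) encrypt using 'AES256');
SQL> alter table t_enc rekey using '3DES168';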

A way to optimise column encryption is to skip the extra verification by specifying ‘nomac’ in the encryption definition of the column. This also saves space: by default, extra space is used per column value for the MAC (message authentication code) that the verification uses. These are the timings with the ‘nomac’ addition to disable encryption verification:

3DES168 3:59
AES256 0:26
AES192 0:23
AES128 0:22

This shows a significant reduction of time. However, if no encryption at all is applied to the column, the query timing is 0:03.
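
The ‘nomac’ option is part of the column encryption clause and follows the algorithm specification; a minimal sketch of the DDL (again using a hypothetical table name):

SQL> create table t_enc (id number, a varchar2(10) encrypt using 'AES256' 'NOMAC');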

Internals background information
The functions ztchsh1n/ztchsh1h are related to verification (again: if you read the tablespace encryption blogpost, the most time consuming functions there were verification functions too). Once ‘nomac’ is specified in the encryption definition of the column, the ztchsh* functions vanish, and the top time consuming functions are y8_ExpandRijndaelKey and ownGetReg, which clearly are directly related to decryption. The performance gain of ‘nomac’ is smaller with 3DES168 encryption.

Conclusion
I think tablespace encryption is the encryption method of choice for a normal implementation. In most cases it will be too much work to exactly figure out which columns to encrypt. If you still consider column encryption, you also should be aware that the column value is stored encrypted in the block and (as a consequence) in the cache. Every use of the encrypted column involves encryption or decryption, for which the overhead is significant, even with ‘nomac’ specified to disable (additional) verification.

In my previous post, I introduced Intel Pin. If you are new to pin, please read that post first to see how to set it up and how to run it.

One of the things you can do with Pin, is profile memory access. Profiling memory access using the pin tool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is the pid of an Oracle database foreground process. Now execute something in the session you attached pin to, and you will find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is what the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the address of the instruction performing the access, the second field is R or W (read or write, obviously), the third field is the memory location read or written, the fourth field is the size of the access (8 meaning 8 bytes here), and the fifth field is the prefetched memory.

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of such usage is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we could translate the function addresses to function names, and the memory addresses to named memory locations wherever possible?

That’s what I created the pinatrace annotate oracle tool for. This is a little scriptset that generates memory information from the instance, after which the instruction pointer addresses and memory locations of a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look at what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

In this form the usefulness is limited. The only thing I could derive is that the big numbers in the memory address column (like ‘0x00007fff6fd2ac60’) are probably PGA related, and the numbers between roughly 0x000000006000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, the reason I picked a seemingly arbitrary range of lines is that this range actually is the memory accesses of the ksl_get_shared_latch function. This annotated version shows a shared latch get for the ‘session idle bit’ latch. It’s also visible that the function uses PGA memory, some of it annotated, some of it not, and that most of the shared pool access is for the latch itself (a latch essentially is a memory range used to serialise access to a resource), which is in the shared pool because it’s a child latch. It’s also visible that memory belonging to X$KSUPR is read and written (X$KSUPR is the x$ table behind V$PROCESS; the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? PGA memory allocations are very transient in nature. The PGA memory snapshot is made at a certain point in time and represents the layout of that moment, so there is a high probability that memory has since been deallocated and freed to the operating system. A lot of the SGA/shared pool allocations, on the other hand, are meant to be reused, and thus are not freed immediately after use, which gives the SGA memory snapshot a good chance of capturing most of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell; it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script outputs to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt), or if you want to watch it and save it at the same time, pipe it through tee (| tee file.txt).

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

In the previous post on the decision between buffered and direct path reads I showed that the decision is dependent on the version. Up to and including version 11.2.0.2, the size of a segment needs to be five times the small table threshold in order to be considered for direct path reads; starting from 11.2.0.3, the database starts considering direct path reads from the small table threshold onwards. That is the lower limit: small table threshold, or five times small table threshold in the lower versions. The upper limit is called “very large object threshold” (VLOT) and is five times the size of the buffercache; above that threshold a table scan always goes via direct path.
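
To see these thresholds on your own instance, the hidden parameters can be queried from x$ksppi/x$ksppcv when connected as SYS (a sketch, use at your own risk):

select a.ksppinm parameter, b.ksppstvl value
from x$ksppi a, x$ksppcv b
where a.indx = b.indx
and a.ksppinm in ('_small_table_threshold', '_very_large_object_threshold');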

There seems to be an odd situation with 11.2.0.3 and up: if a table has a size between small table threshold and five times small table threshold, it is always read via direct path, while if the table is bigger than five times small table threshold, the decision depends on the number of blocks in the cache, and switches to buffered reads when 99% or more of the blocks are in the cache. This seems to be the opposite of what is logical; a smaller table is a more likely candidate for caching than a larger one.

The last thing the previous blogpost showed was the impact of dirty blocks on the buffered/direct path decision. Up to 11.2.0.2 a read that should go via direct path is done buffered when the amount of dirty blocks reaches around 75%. Starting from version 11.2.0.3, a read that should go via direct path is done buffered when the amount of dirty blocks for a table reaches approximately 50%. This is vital information for data warehouses!

A PL/SQL procedure was included so you can run these tests and measure this for yourself. If you have any doubts, follow the link above and paste the procedure to test it for yourself.
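
If you don’t have that procedure at hand, here is a minimal sketch of the same idea (my own illustration, not the original procedure): run a full scan and diff the session statistics ‘physical reads cache’ and ‘physical reads direct’. It assumes select privileges on v$mystat and v$statname, and a test table named t2_oltp:

declare
  l_cache_start  number;
  l_direct_start number;
  l_rows         number;
  -- helper: fetch the current value of a session statistic by name
  function stat_val(p_name in varchar2) return number is
    l_val number;
  begin
    select ms.value into l_val
    from v$mystat ms, v$statname sn
    where ms.statistic# = sn.statistic#
    and sn.name = p_name;
    return l_val;
  end;
begin
  l_cache_start  := stat_val('physical reads cache');
  l_direct_start := stat_val('physical reads direct');
  -- the full segment scan to measure
  select count(*) into l_rows from t2_oltp;
  dbms_output.put_line('physical reads cache/direct: '||
    to_char(stat_val('physical reads cache') - l_cache_start)||'/'||
    to_char(stat_val('physical reads direct') - l_direct_start));
end;
/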

The purpose of this blogpost is to further investigate the direct path or buffered decision in various Oracle database versions when table compression is used. I used ‘COMPRESS FOR OLTP’ compression, in order to see if compression changes the decisions.

The first measurement is reading, doing a full scan of a compressed table:

Version 11.2.0.1 (_STT 3979; _STT*5:19895)

TS@v11201 > @test_direct_path
table: T2_OLTP size: 64512 blocks, 10000000 rows.
cached - physical reads cache/direct: (10)		6419/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (20)		6487/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (30)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (40)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (50)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (60)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (70)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (80)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (90)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (100)		 6382/0
full scan - physical reads cache/direct:			  0/0

This pattern is repeated up to version 12.1.0.2: if the amount of blocks is 99% or more, the database switches from direct path to buffered reads for table scans. So for simple reading, the decision does not seem different than for non-compressed tables.

When dirty blocks are included, it is different:

Version 11.2.0.1:

TS@v11201 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  0/63911

If a large number of the blocks in the cache, or all of them, are dirty, a full scan does not flip to buffered reads at 99%.
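
The number of dirty buffers of the test table can be verified via v$bh, in which objd is the data object id (a sketch):

select dirty, count(*)
from v$bh
where objd = (select data_object_id from user_objects where object_name = 'T2_OLTP')
group by dirty;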

However, it is interesting to see that this changes with version 11.2.0.2:

TS@v11202 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  461/0

Starting from version 11.2.0.2, the threshold of 99% for a full table scan to switch from direct path to buffered seems to be true for clean blocks and dirty blocks. The same is true for versions 11.2.0.3 and 11.2.0.4.

However, starting from 12.1.0.1, a full scan on a compressed table is done via direct path again, even if all the blocks are in the cache and dirty, which seems identical to the version 11.2.0.1 behaviour:

TS@v12101 > @test_direct_path
table: T2_OLTP size: 65536 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/65174
full scan - physical reads cache/direct: (20)			 0/65174
full scan - physical reads cache/direct: (30)			 0/65174
full scan - physical reads cache/direct: (40)			 0/65174
full scan - physical reads cache/direct: (50)			 0/65174
full scan - physical reads cache/direct: (60)			 0/65174
full scan - physical reads cache/direct: (70)			 0/65174
full scan - physical reads cache/direct: (80)			 0/65174
full scan - physical reads cache/direct: (90)			 5/65174
full scan - physical reads cache/direct: (100)			  0/65174

So, the overall conclusion is that there is a difference between regular heap tables and compressed heap tables. I tested OLTP compression; the nsmtio tracing indicates this is the same for HCC compression (hybrid columnar compression). In general, for simple reading there doesn’t seem to be a difference between normal heap tables and compressed heap tables. However, I tested on idle systems; on “real” systems with more stress on buffercache management there might be more mechanisms in play.

When there are dirty blocks (changed blocks sitting in the buffercache waiting to be written by the database writer), the instance behaves differently with respect to the buffered read or direct path read choice. In general, with dirty blocks involved, compressed tables are less likely to be read into the buffercache.

It seems that with compressed tables on version 11.2.0.1 and on 12.1.0.1 and up, there is no switch to buffered reads even if all the blocks are in the cache but dirty. The versions in between, 11.2.0.2, 11.2.0.3 and 11.2.0.4, do switch to buffered reads when 99% or more of the blocks are in the cache, regardless of whether they are dirty.

Also, please mind I explicitly mentioned tables. For a full scan on an index, indicated by the ‘FAST FULL INDEX SCAN’ line in an explain plan, entirely different rules are in play.

A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the engine must have decided to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication of how much could potentially be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and getting an understanding of when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides an (undocumented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information the tracing provides is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)
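
The trace is written to the session’s trace file; its location can be asked from the database itself (v$diag_info is available from 11g onwards):

select value from v$diag_info where name = 'Default Trace File';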

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;

  COUNT(*)
----------
   1000000

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks (see the cross-check query below this list).
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.
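
The kcbnbh and kcbstt values can be cross-checked against the instance: v$buffer_pool externalises the number of buffers per pool, and _small_table_threshold is roughly 2% of that number (3658 is 2% of 182931 here). A sketch:

select name, block_size, buffers from v$buffer_pool;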

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this is meant to express. It starts with a minus sign, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo here is by Oracle. Also this describes MTT, medium table threshold, which is 5 times small table threshold. It says that the threshold is MTT. This is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the Exadata code base is always enabled. Probably in version 10 this had to be explicitly compiled in, akin to linking in functionality such as RAC and SDO (spatial) by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line, which is described above. Here islarge is set to 1, which means the object is considered too large to be a small segment.
The next line is the kcbimd function; based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0: this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1: this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold is 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;
    169999

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different than the total number of blocks from dba_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1’ which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down, the kcbivlo line says it is not large either (is_large 0), meaning it is not bigger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0’.

An important point is that the switch to considering a direct path read, alias a segment being considered medium sized, happens simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line for segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment is bigger than MTT (5*STT) and smaller than VLOT, which is not true: the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database dependent.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable to me are diomrc (quite probably direct IO multiblock read count), which is set to the multiblock read count value, and dios (quite probably direct IO slots), which shows the starting number of direct IO slots: the number of IOs the database will issue asynchronously when starting a full segment scan. Fully automatically, Oracle will measure throughput and CPU usage and determine whether more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on 11.2.0.3 and 12.1.0.2, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless 99% or more of the blocks were in the local cache. This is quite contrary to the popular belief that the threshold for switching to reading blocks into the buffer cache is 50% of the blocks in cache. In all honesty, I have presented the switch point value as being 50% too.

Adding writes to the mix makes it even more interesting. I first did an update of approximately 40% of the blocks, and did not commit. When tracing a simple count(*) on the entire table (this is on 11.2.0.3, which gives less information), it shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, this is doing direct path reads, and ckpt is set to 1 (I think indicating the need to checkpoint), which seems logical if my session wants to do a direct path read of modified blocks.

Now this is what it looks like when I update 50% of the table, and then run a select count(*) on the table twice.
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd… the first time I get a direct path read, and the second time a no-direct, alias buffered, read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so it should indeed be buffered.

Actually, a hint is visible in the first run: if we were truly doing a direct path read, how come ckpt: 0? I cannot see how it would be possible to do a direct path scan when there are changed blocks in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait event ‘cell multiblock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ in the nsmtio trace, this is actually a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle can and will not guarantee anything, but this way the tracing could tell the wrong story.

Conclusions

The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read into the buffer cache for a full table scan.
– A segment bigger than 500% of the buffer cache is always read via direct path for a full table scan.
– A segment sized between _small_table_threshold and 500% of the buffer cache is ‘medium’ and could be full table scanned using either direct path reads or buffered reads.
– The tracing on 12.1.0.2 gives a hint that there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and one sized bigger than it. This is because the function kcbcmt1 shows aging/timing information on the blocks when a segment is smaller than MTT.
– For a medium sized segment, a full table scan is done via direct path, unless 99% or more of the blocks are in the cache.
– For a medium sized segment, a full table scan is done via the buffer cache if the amount of blocks that is ‘dirty’ is 50% or more. (A sketch for putting numbers on these thresholds follows below.)
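To put numbers on these thresholds for your own instance: STT is the value of _small_table_threshold (in blocks), MTT is 5 times STT, and VLOT is 5 times the buffer cache size expressed in blocks. A minimal sketch for the VLOT figure; the view and component name are standard, the times-five multiplication is my interpretation of the 500% boundary:

SQL> select current_size /
            (select value from v$parameter where name = 'db_block_size') * 5 vlot_blocks
     from v$sga_dynamic_components
     where component = 'DEFAULT buffer cache';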

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.

I guess everybody who is working with Oracle databases and has been involved with Oracle Exadata in any way knows about smartscans. It is the smartscan that makes the magic happen of full segment scans with sometimes enormously reduced scan times. The Oracle database does smartscans via something that is referred to as ‘offloading’. This is all generally known information.

But how does that work? I assume more people are like me, eager to understand how that works exactly. But the information on smartscans is extremely scarce. Of course there is the Oracle public material, which looks technical, but is little/nothing more than marketing. On My Oracle Support, I can’t find anything on the inner workings. Even in the ‘Expert Oracle Exadata’ book (which I still regard as the best source of Exadata related information) there is no material on the mechanics of smartscans.

I’ve written a couple of articles on smartscans, of which this article already lays some groundwork; it describes the phases I could see with the information available at that time: oss_open, oss_ioctl followed by oss_wait, and oss_cread followed by oss_wait. This actually is a summary of a smartscan, but a very brief one. In that article I described that a smartscan can only happen with a full segment scan (kdstf* functions, Oracle’s fast full scan routines) and if the database chooses to use direct path access (kcbld* functions, direct path loader), which is actually mandatory to get smartscans.

The following investigation is done on an Exadata X2-2 Quarter rack, with Image version: 12.1.1.1.0.131219, and database version 12.1.0.1.3.

In order to get more understanding of smartscans, we can use Oracle’s new debugging syntax. The part we are going to look at is called ‘KXD’:

SYS@db12c2 AS SYSDBA> oradebug doc component kxd

  KXD			       Exadata specific Kernel modules (kxd)
    KXDAM		       Exadata Disk Auto Manage (kxdam)
    KCFIS		       Exadata Predicate Push (kcfis)
    NSMTIO		       Trace Non Smart I/O (nsmtio)
    KXDBIO		       Exadata Block level Intelligent Operations (kxdbio)
    KXDRS		       Exadata Resilvering Layer (kxdrs)
    KXDOFL		       Exadata Offload (kxdofl)
    KXDMISC		       Exadata Misc (kxdmisc)

In order to get the tracing of the Exadata (database-) kernel modules, along with regular SQL tracing with waits (to understand when something is happening), use the following events:

FRITS@db12c2 > alter session set events 'trace[kxd.*]';

Session altered.

FRITS@db12c2 > alter session set events 'sql_trace level 8';

Session altered.

Now execute a SQL that does smartscans. I’ve made sure table ‘T’ is big enough to invoke a direct path full table scan:

FRITS@db12c2 > select count(*) from t;

  COUNT(*)
----------
   1000000

Now take a peek in the trace file! The first part is normal looking:

PARSING IN CURSOR #139755290955216 len=22 dep=0 uid=201 oct=3 lid=201 tim=1593707470857 hv=2763161912 ad='224d632b8' sqlid='cyzznbykb509s'
select count(*) from t
END OF STMT
PARSE #139755290955216:c=2000,e=2417,p=0,cr=1,cu=0,mis=1,r=0,dep=0,og=1,plh=2966233522,tim=1593707470856
EXEC #139755290955216:c=0,e=25,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=1593707470920
WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14 tim=1593707470968
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550 2=131242 0=2 obj#=14 tim=1593707471374
WAIT #139755290955216: nam='reliable message' ela= 1035 channel context=10126085744 channel handle=10164799536 broadcast message=10179010104 obj#=14 tim=1593707472530
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550 2=131242 0=1 obj#=14 tim=1593707472684
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 101 name|mode=1263468545 2=131242 0=2 obj#=14 tim=1593707472829

We see the parsing of the simple select statement and the execution, which yields some waits that are always there (the SQL*Net message wait), and then some ‘enq: KO – fast object checkpoint’ waits, indicating a checkpoint, which is a sign of a direct path read.

The next part is interesting, because this is the smartscan-specific tracing:

Caching: Global context initialized 0x7f1b50ca6d20
kcfis_alloc_so 0x24d8994d8
In kcfis initialize: new init: app_state: 0x7f1b50ca67b0 app_type: 1
kcfis rcv update : op: 7 val: 1 so_numses 1 ovhdmem 0 mdmem 0 bufmem 0
kcfis_reinitialize: initializing queues
Set work des: global_ctx: 0x7f1b50ca6d20 app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
Automem enabled: app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
No match found for mmwds. Allocated wds 0x7f1b50ca6768, mmwds 0x22446a1a8
Cache version is 1 start cache version is 1
kcfis rcv update : op: 1 val: 1496 so_numses 1 ovhdmem 0 mdmem 1496 bufmem 0
kcfis rcv update : op: 1 val: 68 so_numses 1 ovhdmem 0 mdmem 1564 bufmem 0
oss_state->oss_context is 0x109d2db0
kcfis rcv update : op: 3 val: 69656 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 0
kcfis_initialize done

What we see here is the kcfis layer initialising memory. It’s interesting to see where the allocations are done. In general, on this system, the allocations in the 0x7f1bxxxxxxxx range are in the PGA heap, kcfis_alloc_so/0x24d8994d8 is in the SGA, and the mmwds/0x22446a1a8 is in the SGA too.
This information can be obtained by dumping heaps. Dumping the heaps at level 7 will show the sga, session, pga, call and uga heaps. Another way to get insight into a memory location is using Tanel Poder’s fcha script (Find Chunk Address). Please mind that if you are reading the kxd trace file and want to look up the addresses in a second window in a sqlplus / as sysdba session, this will allow you to see the SGA chunks, but probably not the PGA chunks, because these are private to the traced session.
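For completeness, this is how such a heap dump at level 7 can be taken for your own session (oradebug tracefile_name shows where the dump is written):

SQL> oradebug setmypid
SQL> oradebug dump heapdump 7
SQL> oradebug tracefile_name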

Let’s get on to the next chunk of trace lines:

In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=(nil)
kcfis_translate: source: 1
kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 reqid=1 cver=1 source=1
OSSIPC:SKGXP:[109be210.0]{0}: (25797 <- 13487)SKGXPDOAINVALCON: connection 0x109c8c30 admno 0x5ac93f1d scoono 0x321bdad5 acconn 0x63cd0678 getting closed. inactive: 0
OSSIPC:SKGXP:[109be210.9]{obj}: SKGXPCNH: 0x109c8390 SKGXPCON_OPEN (2) sconno 321bdad4 accono 1206dea3 admno 75f4dfcd ospid 13487 ANT
OSSIPC:SKGXP:[109be210.10]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.11]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.12]{obj}:        SSKGXPT 0x109c83d0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 22774 lerr 0
OSSIPC:SKGXP:[109be210.13]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.14]{obj}:        SSKGXPT 0x109c84a0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 51931 lerr 0
OSSIPC:SKGXP:[109be210.15]{obj}:   next seqno 32768 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
OSSIPC:SKGXP:[109be210.16]{obj}: SKGXPCNH: 0x109c8c30 SKGXPCON_CLOSED (1) sconno 321bdad5 accono 63cd0678 admno 7b54ac1c ospid 13487 ANT
OSSIPC:SKGXP:[109be210.17]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.18]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.19]{obj}:        SSKGXPT 0x109c8c70 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 29832 lerr 0
OSSIPC:SKGXP:[109be210.20]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.21]{obj}:        SSKGXPT 0x109c8d40 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 65116 lerr 0
OSSIPC:SKGXP:[109be210.22]{obj}:   next seqno 32765 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1
kcfis_get_disk_for_translation: Appliance 192.168.12.8/ does not exist
kcfis rcv update : op: 5 val: 8192 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 8192
kcfis_open_appliance: 
throttle: initialized for appliance 0x7f1b5078b3a8
kcfis_find_appliance_fd: appliance fd not found. appliance 192.168.12.8/ Cached appliance fd count 0
kcfis_open_appliance_fd: 0x7f1b5078b3a8
WAIT #139755290955216: nam='cell smart table scan' ela= 120 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707489569
Predicate device intelligent IO opened. fd 5
kcfis_init_appliance_fd: appliance fd 0x7f1b50da2b88 initialized for appliance 0x7f1b5078b3a8
kcfis_add_disk: Adding disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 under appliance = 0x7f1b5078b3a8
initialize disk for disk o/192.168.12.8/DATA_CD_06_enkcel04
kcfis_initialize_disk_fd: Disk initialized. appliance: 192.168.12.8/ disk: o/192.168.12.8/DATA_CD_06_enkcel04 fd: 4 disknumber: 1 incarnation: 6 prev_disknumber: 0 num_inits: 1 init_cache_ver: 1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188499, numblks = 13, disk_handle = 4, diskoffs = 513723752448, len = 106496, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a000 reqid=1 cver=1 source=1
Default: calc numbufs mem 1048576
Final: calc numbufs mem 1048576 buflen 1048576
Num buffers: 1 buf per appliance: 1 num active appliance: 1
Appliance 0x7f1b5078b3a8 active. Active count 1

There’s a lot to see here. The next step in doing a smartscan is the translation of the data dictionary information on the segment to be smartscanned to cell server and grid disk extents. The data dictionary information is shown in line 3 (‘kcfis_get_new_request’); in line 4 (‘kcfis_get_translation:before’) the session tries to translate the data dictionary information to cell server and grid disk. Because the kcfis context is just initialised, there is no cell related information yet. For that reason, the information is requested from the cell server (the OSSIPC:SKGXP lines). Please mind this request will send all disk related information to the kcfis context of the process. With this information, the process can make the translation, as can be seen in line 20 (‘kcfis_get_translation:after’): it shows where the extent is located in the well known Exadata notation ‘o/cell ip/grid disk name’, together with offset and chunk length. Next the disk needs to be initialised (a disk is a combination of cell server or ‘appliance’ and grid disk) and given a hash value, as is indicated by ‘kcfis_get_disk_for_translation’ in line 21. However, in order for a disk to be initialised, the cell server or appliance must be initialised too, which is what the process is indicating in the same line: Appliance 192.168.12.8/ does not exist. The appliance (cell server) is initialised/opened, which is what is shown in lines 23-29. The wait here is initialising a connection with the cell server, waiting for an acknowledgement. Now that the appliance is initialised, the disk is initialised, as indicated in lines 30-32. Next, line 33 shows that the translation which was started earlier has now finally been done. Lines 34-37 show something about the buffering, which seems to be arranged per appliance, and is 1MB.
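A small aside on the rdba values in these lines: a relative data block address encodes the relative file number and block number, and can be decoded with dbms_utility. This matches the Fno and Bno fields shown in the trace:

SQL> select dbms_utility.data_block_address_file(55714451) fno,
            dbms_utility.data_block_address_block(55714451) bno
     from dual;

       FNO        BNO
---------- ----------
        13    1188499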

Now that the process has initialised the appliance and the disk, the next translation is done. The translations are done per extent of the segment, and the disk depends on the placement of the extent. Please mind the maximum size of the allocation depends on the AU (allocation unit) size of ASM, which is set to 4MB by default with Exadata. If a non-initialised appliance is encountered, it is initialised and opened, and if a non-initialised disk is encountered, it is initialised.
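The AU size can be checked per diskgroup; a quick look (run on the ASM instance):

SQL> select name, allocation_unit_size from v$asm_diskgroup;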

The translation looks like this if everything is initialised:

kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 reqid=2 cver=1 source=1
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188513, numblks = 15, disk_handle = 4, diskoffs = 513723867136, len = 122880, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a220 reqid=2 cver=1 source=1

Another thing which is important to notice is the ‘reqid’, which obviously means ‘request id’. This process is repeated until the complete segment is translated into requests.

Once the requests are translated, the next step in the smartscan is to send (‘push’) the requests to the appliances. This apparently is called a ‘payload map’.

kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160
kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d0a000
Pushing request : disknumber = 1, offset = 513723752448, len = 106496 rdba: 55714451 version 0 reqid=1 cver=1
disk=0x7f1b507908e0 state=1

The first line is the start of the pushing of the payload maps. The next line shows a specific appliance being chosen. What is shown next is a line showing ‘disk’ and the hash value of the disk. At the start of a maps push, the disk lines are followed by two lines saying ‘trans_req_element’ and ‘Pushing request’. These two lines probably are some kind of state object for the request, and the actual pushing of the request. Here we see the request id again which we saw in the translation phase.

This is repeated, until some of the disk lines start to be followed immediately by another disk line:

disk=0x7f1b50cfdf80 state=1
trans_req_element = 0x7f1b50d12c40
Pushing request : disknumber = 7, offset = 513817985024, len = 1032192 rdba: 55716738 version 0 reqid=67 cver=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
trans_req_element = 0x7f1b50d2ffa0
Pushing request : disknumber = 9, offset = 513890353152, len = 4161536 rdba: 55729668 version 0 reqid=174 cver=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1

Further down in the trace file, the trans_req_element and Pushing request lines are becoming scarce:

disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d31920
Pushing request : disknumber = 1, offset = 498446925824, len = 4161536 rdba: 41208836 version 0 reqid=186 cver=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1

Inspection of these lines shows that the process is going through a strict sequence of disks of that appliance, and picks up one request per disk which (obviously) belongs to that disk. If the requests are not evenly divided between the disks, some disks will have had all their requests pushed already, while other disks still need additional requests. In that case, the disks which already have their requests pushed will not get a request, so no trans_req_element/Pushing request combination. The process goes through this until all the requests for that appliance are pushed.

After the push of all the requests for that appliance, the following is happening:

kcfis_create_maps_payload. alloc_len 4088 num maps 55
throttle: mappayload: maps_to_push 7 iosize_being_pushed 150953984
kcfis_metadata_payload_len: app_state 0x7f1b50ca67b0 appliance 0x7f1b5078b3a8 payload_len 5968 payload_hdr_len 96 sessiondata_payload_len 144
metadata_payload_len 1536 fmetadata_payload_len 100 maps_len 3992 exthdr_len 8 planpayload_len 48 oflgrppayload_plen 40
kcfis_create_metadata_payload. appliance 0x7f1b5078b3a8 payload 0x7f1b50da3190 payload_memlen 5968 maps 0x7f1b50d07fe0 mapslen 3992
kcfis_create_metadata_payload: pushing sessiondata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing capability payload: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: dop: 1kcfis_create_metadata_payload: pushing metadata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing fast metadata: appliance 0x7f1b5078b3a8
kcfis_push: pushing metadata to appliance 0x7f1b5078b3a8. metadata 0x7f1b50da3190
kcfis_issue_ioctl: payload_type 1
WAIT #139755290955216: nam='cell smart table scan' ela= 178 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707571673
Ioctl completed. Payload type 1
Ioctl quarantine response 1 for appliance 0x7f1b5078b3a8
appliance 0x7f1b5078b3a8 : cellsrv pid: 13487: predicate: /box/predicate735745

The first line shows 55 maps have been pushed to the appliance. The other lines show various memory locations which are needed for gathering the reads which will be sent back by the appliances. Further things which seem important: line 11 issues an ioctl (IO control) request to the appliance and waits for acknowledgement. Mind the wait is always ‘cell smart table scan’. Line 15 shows this request gets a predicate, which is ‘/box/predicate735745’.

This is repeated for every appliance.

Then the next thing happens:

kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
throttle: allowing map push: appliance 0x7f1b5078b3a8, maps_to_push 7
disk=0x7f1b50c953f0 state=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
kcfis_create_maps_payload. alloc_len 200 num maps 0

The maps payload push is done again for all the appliances, without any request being pushed. The last line confirms no maps/requests having been pushed: num maps 0. However, there is one line which hasn’t been there before: line 2, ‘throttle’; more specifically important in this line is ‘allowing map push’, where the previous throttle line during ‘kcfis_create_maps_payload’ had the remark ‘mappayload’.

This means that at this point the physical extents to be scanned on the appliances (cell servers) have been identified, translated to appliance, grid disk, offset and size, and the requests for these extents have been sent to the appliances. The last snippet actually means that the appliance is notified to start preparing for sending results back.

After the appliances have been notified, memory is initialised again.

Default: calc numbufs mem 4194304
Final: calc numbufs mem 4194304 buflen 1048576
Alloc buffer: target_freebufs 4 allocated_freebufs 0
Get additional mem: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768
Starting work area: app_state: 0x7f1b50ca67b0 wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Started work area: wds: 0x7f1b50ca6768 mmwds 0x22446a1a8
Get additional mem for pga_aggregate_target: max 4195552 min 1048888, wds 0x7f1b50ca6768 mmwds 0x22446a1a8
cur size 0
Change req: expected size 4196352 cur size 0 max_mem (KB) 4098
Max memory allocation ok: max 4195552, expected 4196352, cur 0
Memlen allowed 4195552 io_buflen 1048576 chunk_len 1048888
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 4212028
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 4244796
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 4212028
Calling oss_cread: appliance 0x7f1b5078b3a8 app_buffer: 0x7f1b50659000 databuf: 0x7f1b50559000 buflen: 1048576 (posted to 192.168.12.8/)
appliance 0x7f1b5078b3a8 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50c94e78 app_buffer: 0x7f1b50549000 databuf: 0x7f1b50449000 buflen: 1048576 (posted to 192.168.12.11/)
appliance 0x7f1b50c94e78 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d059a0 app_buffer: 0x7f1b50439000 databuf: 0x7f1b50339000 buflen: 1048576 (posted to 192.168.12.9/)
appliance 0x7f1b50d059a0 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d02948 app_buffer: 0x7f1b50a7a000 databuf: 0x7f1b5097a000 buflen: 1048576 (posted to 192.168.12.10/)
appliance 0x7f1b50d02948 total creads 1 new creads 1 read seqno 0 pending reads 0
kcfis wait: buf: 0x7f1b50549000 app_state: 0x7f1b50ca67b0 err: (0) Success
kcfis wait: buf: 0x7f1b50659000 app_state: 0x7f1b50ca67b0 err: (0) Success
WAIT #139755290955216: nam='cell smart table scan' ela= 59 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707578411
kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160

First, memory areas are initialised (lines 1-24), then we see lines showing ‘Calling oss_cread’. The oss_cread call is the call to the appliances to start sending a result set back. Please mind that despite the calls addressing specific extents, this is a smartscan, so result sets are sent back instead of Oracle blocks. Also, since this is Exadata using the iDB/RDS protocol over InfiniBand, the appliances can use RDMA to send the results back, which means the cells can fill the buffers in the server process’ memory directly.

After oss_cread has been called, the ‘kcfis_create_maps_payload’ routine (shown in the snippet above the last snippet) is executed for every appliance, apparently to indicate all the disks being enabled, possibly to throttle activity, and/or to indicate requests will be called for from this session. This seems to be repeated for every roundtrip during the entire smartscan, for all the appliances that are still needed.

Whenever a result (result set) is ready to be processed, the following sequence happens:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

This shows the reaping of a result set returned by an appliance. Lines 5-6 show appliance specific information. Line 7 shows important information: ‘nelem’ is the number of extents (called ‘elements’ in this context) for which the result or results are returned; of course ‘nelem’ means ‘number of elements’. In this case it’s 1 (a result set from one specific element/extent). Line 15 shows the actual extent from which the result set came back, because the reqid is exposed; the reqid was defined during the translation phase. This snippet ends with a WAIT line (again: all the waits are ‘cell smart table scan’). I consider this a cyclic process: first ‘kcfis_create_maps_payload’, then calling oss_cread for one or multiple cells, then a wait, or the above processing of results for one or multiple appliances, and a wait.

I’ve created this snippet to be as simple as possible; in real life, result sets of multiple appliances could be reaped at once (in my case I had to remove a second result). The processing of the result set is done in a few stages, so a result set is not processed per appliance, but the processing stages are done for all the result sets of all the appliances.

Also, this example shows only one request in the reaped result. There can be multiple requests (reqids/extents) returned.

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50439000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50d059a0 read seqno 1 pending reads 1
appliance 0x7f1b50d059a0 total creads 0 re-adjusted: read seqno 1 pending reads 1
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50439000, databuf: 0x7f1b50339000, nelem: 2
Dump of memory from 0x00007F1B5033B000 to 0x00007F1B5033B040
7F1B5033B000 0000A23C 02748C02 5BB31713 0402099A  [<.....t....[....]
7F1B5033B010 00988672 001E0001 0000007E 0000007E  [r.......~...~...]
7F1B5033B020 0000007E 00000000 00000000 00000000  [~...............]
7F1B5033B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50339080
kcfis_validate_translation: request 0x7f1b50d0b980
Req completed is : err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1 block_id=41192450
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b5033b000 afn 9 rdba 41192450 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 41192450 good: 1
Request 0x7f1b50d0b980 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b503390c8
kcfis_validate_translation: request 0x7f1b50d114e0
Req completed is : err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1 block_id=53758466
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b50342e98 afn 12 rdba 53758466 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 53758466 good: 1
Request 0x7f1b50d114e0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1
memptr (nil) len 0, blockid = 41192450
kcfis_get_next_data: elem = 1, err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1
memptr 0x7f1b5033b000 len 32408, blockid = 53758466
kcfis_get_next_data: dptr: 0x7f1b5033b000 len: 64816 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50439000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50439000
WAIT #139755290955216: nam='cell smart table scan' ela= 10 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707616790

(example with ‘nelem’=2)

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b5078b3a8 read seqno 1 pending reads 0
appliance 0x7f1b5078b3a8 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 1
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03522293 5BB316ED 0402099A  [<...."R....[....]
7F1B5055B010 009849CB 001E0001 0000000D 0000000D  [.I..............]
7F1B5055B020 0000000D 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d0a000
Req completed is : err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1 block_id=55714451
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55714451 dlen: 3480 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 55714451 good: 1
Request 0x7f1b50d0a000 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1
memptr (nil) len 0, blockid = 55714451
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 3480 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

(example of result sets returned of two appliances, for which the returned results are processed)

Also, when the extent to be scanned is larger, the result set is processed in multiple steps:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 2 pending reads 1
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 2 pending reads 1
throttle: received: maps_to_push 0 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 3
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03523080 5BB3178D 0402099A  [<....0R....[....]
7F1B5055B010 0098FBC9 001E0001 00000080 00000080  [................]
7F1B5055B020 00000080 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d145c0
Req completed is : err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1 block_id=55718016
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_validate_translation: REQ3: splitting from bottom
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55718016 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 55718016 good: 1
Request 0x7f1b50d145c0 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b505590c8
kcfis_validate_translation: request 0x7f1b50d13d40
Req completed is : err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1 block_id=41195392
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_oss_block_verify: bp: 0x7f1b50563098 afn 9 rdba 41195392 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 41195392 good: 1
Request 0x7f1b50d13d40 done
kcfis_process_completed_buffer: 2 elem: 0x7f1b50559110
kcfis_validate_translation: request 0x7f1b50d13f60
Req completed is : err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1 block_id=59912064
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5056b130 afn 14 rdba 59912064 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 59912064 good: 1
Request 0x7f1b50d13f60 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1
memptr (nil) len 0, blockid = 55718016
kcfis_get_next_data: elem = 1, err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1
memptr 0x7f1b5055b000 len 32920, blockid = 41195392
kcfis_get_next_data: elem = 2, err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1
memptr 0x7f1b5055b000 len 65840, blockid = 59912064
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 98760 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707638819

(example with nelem=3, with bigger extents, for which the result is split)

Once the result is reaped for a certain appliance, a new oss_cread call must be done in order for another result to be pushed to the server process. Requests (reqids or extents) are not sent back in order.

Once the requests are exhausted for a certain appliance, the appliance is excluded from the ‘kcfis_create_maps_payload’ procedure.

Once the smartscan is complete, the memory areas are cleaned up and the sessions are closed. Apparently a session or some session state information is kept per grid disk, and these are all closed (closing of one appliance is shown in this snippet):

kcfis_read DONE - ret NULL
kcfis_finalize: app_state 0x7f1b50ca67b0 permflags 0x9000 clnflags 0x1f
Appliance 0x7f1b5078b3a8 in-active. Active count 3
Appliance 0x7f1b50c94e78 in-active. Active count 2
Appliance 0x7f1b50d059a0 in-active. Active count 1
Appliance 0x7f1b50d02948 in-active. Active count 0
Caching: session at the end of scan 1
Work area gc: 0x7f1b50ca6768, app_state: 0x7f1b50ca67b0
Freeing read buffer chunk (GC) 0x7f1b50448eb0. count 4
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
Freeing read buffer chunk (GC) 0x7f1b50338eb0. count 3
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
Freeing read buffer chunk (GC) 0x7f1b50558eb0. count 2
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
Freeing read buffer chunk (GC) 0x7f1b50979eb0. count 1
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 0
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 32768
Freeing memory to top level heap: num_freed 4
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 0
Work area cleanup start: global_ctx: 0x7f1b50ca6d20
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 app_state: 0x7f1b50ca67b0 refcnt: 0
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 refcnt: 0
Closed work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Deallocating kcfis wds: 0x7f1b50ca6768
Cleanup work area for app_state: 0x7f1b50ca67b0
kcfis_finalize_cached_sessions: global_ctx 0x7f1b50ca6d20
Caching: in kcfis_finalize_cached_sessions global ctx 0x7f1b50ca6d20 total cached 1 cached in ctx 1
Deallocating session. app state 0x7f1b50ca67b0 num cached sessions 0
In kcfis_deallocate_session: app_state: 0x7f1b50ca67b0
KCFIS: [NSMTIO]:SQL for this (non)Smart I/O session is: 
select count(*) from t
kcfis_cache_appliance_fd: Cached appliance fd 0x7f1b50da2b88  for appliance 0x7f1b5078b3a8
Appliance 0x7f1b5078b3a8 was not active. Active count 0
kcfis_deallocate_session: Freeing disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507908e0 name = o/192.168.12.8/DATA_CD_01_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507903f0 name = o/192.168.12.8/DATA_CD_05_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50d04fc0 name = o/192.168.12.8/DATA_CD_10_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfe960 name = o/192.168.12.8/DATA_CD_03_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfdf80 name = o/192.168.12.8/DATA_CD_09_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfc930 name = o/192.168.12.8/DATA_CD_00_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc8118 name = o/192.168.12.8/DATA_CD_11_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc7ae0 name = o/192.168.12.8/DATA_CD_08_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc5ab0 name = o/192.168.12.8/DATA_CD_02_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc3a80 name = o/192.168.12.8/DATA_CD_04_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc30a0 name = o/192.168.12.8/DATA_CD_07_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis rcv update : op: 6 val: 8192 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 24576

And the appliances are notified the smartscan has ended:

kcfis_close_all_appliance_fds. Cache count 4
kcfis_close_appliance_fd: appliance_fd 0x7f1b50da2b88
WAIT #139755290955216: nam='cell smart table scan' ela= 241 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707978094
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5db00
WAIT #139755290955216: nam='cell smart table scan' ela= 237 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707978356
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5df58
WAIT #139755290955216: nam='cell smart table scan' ela= 280 cellhash#=674246789 p2=0 p3=0 obj#=61458 tim=1593707978668
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5e3b0
WAIT #139755290955216: nam='cell smart table scan' ela= 242 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707978935

Summary.
This blogpost tries to summarise the different steps in an Exadata smartscan. One of the most important things it shows is that everything that is done is covered by a single wait event (‘cell smart table scan’). In other words: profiling this wait event tells you little to nothing about what is actually happening, except that a smartscan is being processed. If you get high ‘cell smart table scan’ waits, either for all the waits or for a few, the only way to pinpoint where the wait time is spent is to execute this trace again and see which step it is. Of course you can pinpoint whether the waiting is happening for a specific cell/appliance by looking at the cellhash# in the wait line.
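To translate the cellhash# from the wait lines to an actual cell, the Exadata specific view v$cell can be queried; a minimal sketch:

SQL> select cell_hashval, cell_path from v$cell;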

Disclaimer:
Please mind I tried to use my knowledge of Oracle and Exadata processing together with the information the trace provided to build this description. If you encounter anything which is incorrect, please comment on this post, and I will try to get it fixed. No bits were harmed during testing.
