Last week I got a question about how storage indexes (SI) behave when the table for which the SI is holding data is changed. Based on logical reasoning, it can be one of two things: the SI is invalidated because the data it’s holding has changed, or the SI is updated to reflect the change. Think about this for yourself, and pick a choice. I would love to hear whether you picked the correct one.

First, let’s take a step back and lay some groundwork. The tests in this blogpost are done on an actual Exadata (V2 hardware), with Oracle version 11.2.0.4.6 (meaning bundle patch 6). The Exadata “cellos” (Cell O/S) version is 11.2.3.3.1.140529.1 on both the compute nodes and the storage nodes.

A storage index is a memory structure used by the cell daemon, which is the storage server process on the storage layer of Exadata. By default (although I’ve never seen it configured differently yet) a SI contains minimum and maximum values for up to eight columns of the table it is describing. The memory structure is transient: it describes a region of one megabyte of a table. A storage index is populated during the fetching of data for a smart scan, based on the filter predicates of the query causing the smart scan. In essence the whole SI management is done automatically by the cell daemon (I say “in essence” because you can play around with some undocumented settings on the database level, and there are some undocumented settings and events you can set on the cell server level).

Okay, back to the original question. Let’s test this. First we need a table large enough to be smartscanned. I’ve got a table called ‘BIGTAB_NOHCC’ (as you probably guessed, there is also a table ‘BIGTAB’ in the schema I am using, which is hybrid columnar compressed). This table consists of 2671736 blocks of 8KB, which means it has a total size of 21G. This is big enough for my database instance (“big enough” is relative to the buffer cache size) to get a smart scan.
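
If you want to verify such numbers for your own table, dba_segments provides them; a quick sketch using my schema and table name:

select blocks, bytes/1024/1024/1024 "GB"
from dba_segments
where owner = 'MARTIN'
and segment_name = 'BIGTAB_NOHCC';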

For this test, I use the session statistics. In v$sesstat there is a statistic called ‘cell physical IO bytes saved by storage index’, which tells us how many bytes we saved from being scanned because of the use of storage indexes. I also show some output of a (single) storage server, although a normal Exadata rack typically has 3, 7 or 14 storage servers in use.
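
I use a small ‘mystats’ script below to show the current session’s statistics; in case you don’t have such a script, a minimal equivalent against v$mystat and v$statname looks like this (a sketch, using the statistic name mentioned above):

select n.name, s.value
from v$mystat s, v$statname n
where s.statistic# = n.statistic#
and n.name = 'cell physical IO bytes saved by storage index';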

First of all, to get a clean starting point and be reasonably sure my query will populate fresh storage indexes, I stop and start the cell daemon on the storage servers (because a storage index is a transient memory structure, a restart of the cell daemon flushes it). I use ‘service celld restart’ for that.

Next, in order to get storage index information for the table I use, I need some metadata. The metadata I need are:
a) data object id

select data_object_id from dba_objects where owner = 'MARTIN' and object_name = 'BIGTAB_NOHCC';

b) tablespace number

select ts# from v$tablespace t, dba_segments s, dba_objects o where t.name=s.tablespace_name and s.segment_name=o.object_name and o.data_object_id = 18716;

c) database unique id from the x$ksqdn

select ksqdngunid from x$ksqdn;

(thanks to the progeeking website for this information)
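
If you prefer, these three pieces of metadata can be fetched in one go; a sketch of such a combined query (please mind x$ tables are normally only accessible when connected as sysdba, and x$ksqdn contains a single row, so it can simply be joined without a join condition):

select o.data_object_id, t.ts#, d.ksqdngunid
from dba_objects o, dba_segments s, v$tablespace t, x$ksqdn d
where o.owner = 'MARTIN'
and o.object_name = 'BIGTAB_NOHCC'
and s.owner = o.owner
and s.segment_name = o.object_name
and t.name = s.tablespace_name;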

Next, I run the query on my BIGTAB_NOHCC table. Please mind it’s fundamental to have a filter predicate, and that the execution plan shows we are doing a full table scan. Only with a filter predicate does the storage server have a reason to build a storage index:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

Now let’s see if the storage layer built a storage index:

CellCLI> alter cell events = "immediate cellsrv.cellsrv_storidx('dumpridx','all',18716,5,1126366144)";
Dump sequence #1 has been written to /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Cell enkcel01 successfully altered

This is why it’s handy to have the aforementioned data on the table: you can dump the specific storage indexes for a single table. The three numbers after ‘all’ are data_object_id, tablespace number and the ksqdngunid from the x$ksqdn table.

Let’s look in the file which the cell event produced:

# less /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Trace file /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
ORACLE_HOME = /opt/oracle/cell
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.39-400.128.17.el5uek
Version:        #1 SMP Tue May 27 13:20:24 PDT 2014
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.3.1_LINUX.X64_140529.1

*** 2014-07-12 13:16:17.030

*** 2014-07-12 13:16:17.030
UserThread: LWPID: 27455 userId: 58 kernelId: 58 pthreadID: 0x7f2863fe7940
2014-07-12 13:21:36.499123 :0005C9DE: $$$ Dumping storage idx summary for griddisk DATA_CD_08_enkcel01:
2014-07-12 13:21:36.499217 :0005C9E0: Dump sequence #1:

*** 2014-07-12 13:21:36.499
2014-07-12 13:21:36.499212 :0005C9DF: 
***************************
2014-07-12 13:21:36.499249 :0005C9E1: Dumping RIDX summary for objd 18716, tsn 5, dbid 1126366144

2014-07-12 13:21:36.499249*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-12 13:21:36.499249*: RIDX: strt 0 end 2048 offset 6215958528 size 1048576 rgnIdx 5928 RgnOffset 0 scn: 0x099a.5cc294b8 hist: 2
2014-07-12 13:21:36.499249*: RIDX validation history: 
2014-07-12 13:21:36.499249*: 0:FullRead 1:Undef 2:Undef 3:Undef 4:Undef 5:Undef 6:Undef 7:Undef 8:Undef 9:Undef
2014-07-12 13:21:36.499249*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-12 13:21:36.499249*: lo: c3 c 34 11 0 0 0 0
2014-07-12 13:21:36.499249*: hi: c3 5a 5 18 0 0 0 0

Here you see the typical heading of an Oracle trace file, then the announcement of the dump (“Dumping storage idx summary”, “Dumping RIDX summary”). The real storage index information starts at “RIDX(0x7f27b4ddab64)”. From that line up to and including the “lo” and “hi” values, you are looking at an actual storage index which holds data for a single column. You can see which column by looking at the number in square brackets after “Col id”: 1. It’s interesting to note that there is a scn (system change number) included. The storage index shows whether there are NULL values in the column (in this case it says HASNONNULLVALUES, so we don’t have any NULL values in the 1MB chunk of the column this storage index describes), and, of course, the low and high values in the Oracle internal data format.
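
A side note on the ‘lo’ and ‘hi’ byte strings: this Oracle internal format for a NUMBER is an exponent byte followed by base-100 digits, each incremented by one. You can generate the same representation for a known value with the dump() function, which makes it possible to check a lo/hi pair by hand; a quick sketch for the filter value used later in this blogpost (the output should look like this):

select dump(906259, 16) from dual;

DUMP(906259,16)
-------------------------
Typ=2 Len=4: c3,5b,3f,3c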

So, despite there being no indication on the database layer, the query built storage indexes! That should mean that executing the same query again will result in actually using the storage indexes which were just built:

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

Yes! This query used the storage indexes we caused to be built previously! 5746368512 bytes is roughly 5.4G, so about a quarter of the 21G segment did not have to be scanned.

With this being established, we can try to get an answer to our question: how does the storage index react to DML on the table it is describing?

We built the storage index, and used it. Now let’s update the ‘id’ field for which the storage index was built, and redo our query test:

MARTIN:dbm011> update bigtab_nohcc set id = id + 1;

16000000 rows updated.

MARTIN:dbm011> commit;

Commit complete.

Okay, now let’s redo the select again, and take stats before and after on storage index usage!

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

What we see here in the storage index session statistic is that the update statement didn’t use the storage index (which is obvious: the update happened on the database layer, so no smartscan was involved), because this number is the same as the last time we looked at it, before the update statement.
When we executed the select query with a filter on the id column again, there was no storage index usage either, because the storage index session statistic didn’t increase.

Actually, this query built new storage indexes. When this query is executed again, we can use these:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				   14534426624

(actually, it sometimes can take more than one execution for the storage indexes to be created again; my guess would be that some heuristics are used to try to come up with the best candidates for storage indexes)

Let’s try another case: update a column other than the one the storage indexes were created on.
First make sure storage indexes are built:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

For completeness’ sake, I verified the storage indexes for this table by using the dump shown previously. A simple grep on the dump shows this query created storage indexes for only column 1, the id column:

grep -e RIDX_VALID -e 'Col id' /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_4.trc
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddac40) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddad1c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddadf8) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddaed4) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddb08c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
...etc.

Now let’s update another field:

MARTIN@dbm011 > update bigtab_nohcc set spcol = spcol + 1;

16000000 rows updated.

MARTIN@dbm011 > commit;

Commit complete.

And query the storage index use, and do our query with filter predicate again:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

So… this strongly suggests the update invalidated our storage index, even though the updated column was not in the storage index.
Just for completeness’ sake, let’s issue the select statement again to see if the storage index is built up again:

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    9808936960

Yes, there it is again: the previous statement didn’t benefit from the storage indexes, but built them, and now we take advantage of them again.

Conclusion.
I ran these tests on a quarter rack without any other usage. The method I used for measuring how storage indexes behave was to execute statements on the database layer, and look at the results on the storage layer and in the database layer session statistics. It’s important to realise that despite this being quite strong evidence, it isn’t definitive proof of my conclusion. Also, the behaviour described can change in future storage server versions.

My conclusion is that the storage server uses a system change number to validate whether storage indexes are still valid. If they are not, the storage indexes are not used. If blocks covered by a storage index progress their system change number (in other words: are updated), the storage index is invalidated right away (during my tests, when the update started, no or few storage indexes remained).

So, the ones that picked storage indexes being invalidated were right. As far as I could detect, there is no updating of storage indexes.

At the Accenture Enkitec Group we have a couple of Exadata racks for proofs of concept (PoC), performance validation, research and experimenting. This means the databases on the racks appear and vanish more often than they (should) do on an average customer Exadata rack (to be honest, most people use a fixed few existing databases rather than creating and removing a database for every test).

Nevertheless, we got into a situation where the /etc/oratab file was not in sync with the databases registered in the cluster registry. This situation can happen for a number of reasons. For example, if you clone a database (RMAN duplicate), you end up with a cloned database (I sincerely hope), but this database needs to be manually registered in the cluster registry. The same goes for creating a standby database (for which one of the most used methods is the clone procedure with a couple of changes).

The reasons above are quite obvious. But there is another reason which is far less obvious: bug 17172091 (Oracle restart may update oratab incorrectly after a node reboot) can also cause your oratab to get out of sync with the databases/instances in the cluster registry. Additional information: Oracle versions confirmed as being affected are 11.2.0.3 and 11.2.0.2. This bug is reported to be fixed with Grid Infra PSU 11.2.0.3.9 and 11.2.0.4.2 (yes, I am aware of the inconsistency between the affected versions and the fixed versions; this is from the Oracle bug information available; thanks to Tanel Põder for finding this bug).

In order to recreate the oratab, you need to go through the cluster registry information, and compare it with your oratab. Especially if you’ve got a lot of databases, and/or single instance databases on different nodes, this can be quite some work. To reduce the work in that situation, I created a little shell script that parses the cluster registry database information, gets the db_unique_name (which is what the first field in the oratab file actually is, thanks to Randy Johnson) and the oracle home path, and outputs this in “oratab format” (db_unique_name:oracle home path:N). Needless to say, this script just outputs it to STDOUT. If you want to use this information, redirect it to a file, or copy and paste it into oratab yourself.

# loop over all database resources last started on this node
for resource in $(crsctl status resource -w "((TYPE = ora.database.type) AND (LAST_SERVER = $(hostname -s)))" | grep ^NAME | sed 's/.*=//'); do 
	# fetch the full attribute listing of this database resource
	full_resource=$(crsctl status resource -w "((NAME = $resource) AND (LAST_SERVER = $(hostname -s)))" -f)
	# extract the db_unique_name and the oracle home path
	db_name=$(echo "$full_resource" | grep ^DB_UNIQUE_NAME | awk -F= '{ print $2 }')
	ora_home=$(echo "$full_resource" | grep ^ORACLE_HOME= | awk -F= '{ print $2 }')
	# print in oratab format: db_unique_name:oracle_home:N
	printf "%s:%s:N\n" "$db_name" "$ora_home"
done
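
The output consists of one oratab-style line per database; a hypothetical example line (the db_unique_name and oracle home will obviously differ per environment):

dbm01:/u01/app/oracle/product/11.2.0.4/dbhome_1:N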

I guess everybody who is working with Oracle databases and has been involved with Oracle Exadata in any way knows about smartscans. It is the smartscan that makes the magic happen: full segment scans with sometimes enormously reduced scan times. The Oracle database does smartscans using something that is referred to as ‘offloading’. This is all generally known information.

But how does that work? I assume more people are like me, and are eager to understand how exactly that works. But the information on smartscans is extremely scarce. Of course there is the Oracle public material, which looks technical, but is little/nothing more than marketing. On My Oracle Support, I can’t find anything on the inner workings. Even in the ‘Expert Oracle Exadata’ book (which I still regard as the best source of Exadata related information) there is no material on the mechanics of smartscans.

I’ve written a couple of articles on smartscans, of which this article already lays some groundwork: it describes the phases I could see with the information available at that time: oss_open, oss_ioctl followed by oss_wait, and oss_cread followed by oss_wait. This is actually a summary of a smartscan, but a very brief one. In that article I also described that a smartscan can only happen with a full segment scan (kdstf* functions, Oracle’s fast full scan routines), and only if it chooses to use direct path access (kcbld* functions, direct path loader), which is actually mandatory to get smart scans.

The following investigation is done on an Exadata X2-2 Quarter rack, with Image version: 12.1.1.1.0.131219, and database version 12.1.0.1.3.

In order to get more understanding of smartscans, we can use Oracle’s new debugging syntax. The part we are going to look at is called ‘KXD’:

SYS@db12c2 AS SYSDBA> oradebug doc component kxd

  KXD			       Exadata specific Kernel modules (kxd)
    KXDAM		       Exadata Disk Auto Manage (kxdam)
    KCFIS		       Exadata Predicate Push (kcfis)
    NSMTIO		       Trace Non Smart I/O (nsmtio)
    KXDBIO		       Exadata Block level Intelligent Operations (kxdbio)
    KXDRS		       Exadata Resilvering Layer (kxdrs)
    KXDOFL		       Exadata Offload (kxdofl)
    KXDMISC		       Exadata Misc (kxdmisc)

In order to get the tracing of the Exadata (database-) kernel modules, along with regular sql tracing including waits (to understand when something is happening), use the following events:

FRITS@db12c2 > alter session set events 'trace[kxd.*]';

Session altered.

FRITS@db12c2 > alter session set events 'sql_trace level 8';

Session altered.

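A side note: to be certain a serial session chooses direct path reads regardless of segment and buffer cache size, the undocumented parameter _serial_direct_read can be set to ‘always’. Please mind this is an undocumented parameter, so this is a sketch for test systems only:

FRITS@db12c2 > alter session set "_serial_direct_read" = always;

Session altered.
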
Now execute a SQL that does smartscans. I’ve made sure table ‘T’ is big enough to invoke a direct path full table scan:

FRITS@db12c2 > select count(*) from t;

  COUNT(*)
----------
   1000000

Now take a peek in the trace file! The first part is normal looking:

PARSING IN CURSOR #139755290955216 len=22 dep=0 uid=201 oct=3 lid=201 tim=1593707470857 hv=2763161912 ad='224d632b8' sqlid='cyzznbykb509s'
select count(*) from t
END OF STMT
PARSE #139755290955216:c=2000,e=2417,p=0,cr=1,cu=0,mis=1,r=0,dep=0,og=1,plh=2966233522,tim=1593707470856
EXEC #139755290955216:c=0,e=25,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=1593707470920
WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14 tim=1593707470968
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550 2=131242 0=2 obj#=14 tim=1593707471374
WAIT #139755290955216: nam='reliable message' ela= 1035 channel context=10126085744 channel handle=10164799536 broadcast message=10179010104 obj#=14 tim=1593707472530
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550 2=131242 0=1 obj#=14 tim=1593707472684
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 101 name|mode=1263468545 2=131242 0=2 obj#=14 tim=1593707472829

We see the parsing of the simple select statement, and the execution, which yields some waits that are always there (the SQL*Net message wait), and then some ‘enq: KO - fast object checkpoint’ waits, indicating a checkpoint, which is a sign of a direct path read.
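
As an aside: the direct path read decision itself can also be traced, using the NSMTIO (‘Trace Non Smart I/O’) component shown in the oradebug listing above, with the same event syntax (a sketch):

FRITS@db12c2 > alter session set events 'trace[nsmtio]';

Session altered.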

The next part is interesting, because this is the smartscan-specific tracing:

Caching: Global context initialized 0x7f1b50ca6d20
kcfis_alloc_so 0x24d8994d8
In kcfis initialize: new init: app_state: 0x7f1b50ca67b0 app_type: 1
kcfis rcv update : op: 7 val: 1 so_numses 1 ovhdmem 0 mdmem 0 bufmem 0
kcfis_reinitialize: initializing queues
Set work des: global_ctx: 0x7f1b50ca6d20 app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
Automem enabled: app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
No match found for mmwds. Allocated wds 0x7f1b50ca6768, mmwds 0x22446a1a8
Cache version is 1 start cache version is 1
kcfis rcv update : op: 1 val: 1496 so_numses 1 ovhdmem 0 mdmem 1496 bufmem 0
kcfis rcv update : op: 1 val: 68 so_numses 1 ovhdmem 0 mdmem 1564 bufmem 0
oss_state->oss_context is 0x109d2db0
kcfis rcv update : op: 3 val: 69656 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 0
kcfis_initialize done

What we see here is the kcfis layer initialising memory. It’s interesting to see where the allocations are done. In general, on this system, the allocations in the 0x7f1bxxxxxxxx range are in the PGA heap, kcfis_alloc_so/0x24d8994d8 is in the SGA, and the mmwds/0x22446a1a8 is in the SGA too.
This information can be obtained by dumping heaps. Dumping the heaps at level 7 will show the sga, session, pga, call and uga heaps. Another way to get insight into a memory location is using Tanel Poder’s fcha script (Find Chunk Address). Please mind that if you are reading the kxd trace file and want to look up the addresses in a second window in a sqlplus / as sysdba session, this will allow you to see the SGA chunks, but probably not the PGA chunks, because these are private to the traced session.
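
For reference, this is how a level 7 heap dump of your own session can be made with standard oradebug commands (‘oradebug tracefile_name’ shows which trace file the dump was written to):

SYS@db12c2 AS SYSDBA> oradebug setmypid
Statement processed.
SYS@db12c2 AS SYSDBA> oradebug dump heapdump 7
Statement processed.
SYS@db12c2 AS SYSDBA> oradebug tracefile_name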

Let’s get on to the next chunk of trace lines:

In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=(nil)
kcfis_translate: source: 1
kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 reqid=1 cver=1 source=1
OSSIPC:SKGXP:[109be210.0]{0}: (25797 <- 13487)SKGXPDOAINVALCON: connection 0x109c8c30 admno 0x5ac93f1d scoono 0x321bdad5 acconn 0x63cd0678 getting closed. inactive: 0
OSSIPC:SKGXP:[109be210.9]{obj}: SKGXPCNH: 0x109c8390 SKGXPCON_OPEN (2) sconno 321bdad4 accono 1206dea3 admno 75f4dfcd ospid 13487 ANT
OSSIPC:SKGXP:[109be210.10]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.11]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.12]{obj}:        SSKGXPT 0x109c83d0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 22774 lerr 0
OSSIPC:SKGXP:[109be210.13]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.14]{obj}:        SSKGXPT 0x109c84a0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 51931 lerr 0
OSSIPC:SKGXP:[109be210.15]{obj}:   next seqno 32768 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
OSSIPC:SKGXP:[109be210.16]{obj}: SKGXPCNH: 0x109c8c30 SKGXPCON_CLOSED (1) sconno 321bdad5 accono 63cd0678 admno 7b54ac1c ospid 13487 ANT
OSSIPC:SKGXP:[109be210.17]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.18]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.19]{obj}:        SSKGXPT 0x109c8c70 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 29832 lerr 0
OSSIPC:SKGXP:[109be210.20]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.21]{obj}:        SSKGXPT 0x109c8d40 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 65116 lerr 0
OSSIPC:SKGXP:[109be210.22]{obj}:   next seqno 32765 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1
kcfis_get_disk_for_translation: Appliance 192.168.12.8/ does not exist
kcfis rcv update : op: 5 val: 8192 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 8192
kcfis_open_appliance: 
throttle: initialized for appliance 0x7f1b5078b3a8
kcfis_find_appliance_fd: appliance fd not found. appliance 192.168.12.8/ Cached appliance fd count 0
kcfis_open_appliance_fd: 0x7f1b5078b3a8
WAIT #139755290955216: nam='cell smart table scan' ela= 120 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707489569
Predicate device intelligent IO opened. fd 5
kcfis_init_appliance_fd: appliance fd 0x7f1b50da2b88 initialized for appliance 0x7f1b5078b3a8
kcfis_add_disk: Adding disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 under appliance = 0x7f1b5078b3a8
initialize disk for disk o/192.168.12.8/DATA_CD_06_enkcel04
kcfis_initialize_disk_fd: Disk initialized. appliance: 192.168.12.8/ disk: o/192.168.12.8/DATA_CD_06_enkcel04 fd: 4 disknumber: 1 incarnation: 6 prev_disknumber: 0 num_inits: 1 init_cache_ver: 1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188499, numblks = 13, disk_handle = 4, diskoffs = 513723752448, len = 106496, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a000 reqid=1 cver=1 source=1
Default: calc numbufs mem 1048576
Final: calc numbufs mem 1048576 buflen 1048576
Num buffers: 1 buf per appliance: 1 num active appliance: 1
Appliance 0x7f1b5078b3a8 active. Active count 1

There’s a lot to see here. The next step in doing a smartscan is the translation of the data dictionary information on the segment to be smartscanned to cell server and grid disk extents. The data dictionary information is shown in line 3, ‘kcfis_get_new_request’; in line 4, ‘kcfis_get_translation:before’, the session tries to translate the data dictionary information to cell server and grid disk. Because the kcfis context has just been initialised, there is no cell related information yet. For that reason, the information is requested from the cell server (the OSSIPC:SKGXP lines). Please mind this request will send all disk related information to the kcfis context of the process. With this information, the process can make the translation, as can be seen in line 20, ‘kcfis_get_translation:after’: it shows where the extent is located in the well known exadata notation (‘o/cell ip/grid disk name’), together with offset and chunk length. Next the disk needs to be initialised (a disk is a combination of cell server or ‘appliance’ and grid disk), and given a hash value, as is indicated by ‘kcfis_get_disk_for_translation’ in line 21. However, in order for a disk to be initialised, the cell server or appliance must be initialised too. That is what the process is indicating in the same line 21: ‘Appliance 192.168.12.8/ does not exist’. The appliance (cell server) is initialised/opened, which is what is shown in lines 23-29. The wait here is initialising a connection with the cell server, waiting for an acknowledgement. Now that the appliance is initialised, the disk is initialised, as indicated in lines 30-32. Next, line 33 shows that the translation which was started earlier has now finally been done. Lines 34-37 show something about the buffering, which seems to be arranged per appliance, and is 1MB.

Now that the process has initialised the appliance and the disk, the next translation is done. The translations are done per extent of the segment, and the disk depends on the placement of the extent. Please mind the maximum size of the allocation depends on the AU (allocation unit) size of ASM, which is set to 4MB by default with Exadata. If a non-initialised appliance is encountered, it is initialised and opened, and if a non-initialised disk is encountered, it is initialised.
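
To verify the AU size on your own system: v$asm_diskgroup can be queried from the database instance too, not just from the ASM instance (on Exadata this typically shows 4194304):

select name, allocation_unit_size from v$asm_diskgroup;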

The translation looks like this if everything is initialised:

kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 reqid=2 cver=1 source=1
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188513, numblks = 15, disk_handle = 4, diskoffs = 513723867136, len = 122880, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a220 reqid=2 cver=1 source=1

Another thing which is important to notice is the ‘reqid’, which obviously means ‘request id’. This process is repeated until the complete segment is translated into requests.

Once the requests are translated, the next step in the smartscan is to send (‘push’) the requests to the appliances. This apparently is called a ‘payload map’.

kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160
kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d0a000
Pushing request : disknumber = 1, offset = 513723752448, len = 106496 rdba: 55714451 version 0 reqid=1 cver=1
disk=0x7f1b507908e0 state=1

The first line is the start of the pushing of the payload maps. The next line shows a specific appliance being chosen. What is shown next is a line showing ‘disk’ and the hash value of the disk. At the start of a maps push, the disk lines are followed by two lines saying ‘trans_req_element’ and ‘Pushing request’. These two lines probably represent some kind of state object for the request, and the actual pushing of the request. Here we see the request id from the translation phase again.

This is repeated, until some of the disk lines start to be followed immediately by another disk line:

disk=0x7f1b50cfdf80 state=1
trans_req_element = 0x7f1b50d12c40
Pushing request : disknumber = 7, offset = 513817985024, len = 1032192 rdba: 55716738 version 0 reqid=67 cver=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
trans_req_element = 0x7f1b50d2ffa0
Pushing request : disknumber = 9, offset = 513890353152, len = 4161536 rdba: 55729668 version 0 reqid=174 cver=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1

Further down in the trace file, the trans_req_element and Pushing request lines become scarce:

disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d31920
Pushing request : disknumber = 1, offset = 498446925824, len = 4161536 rdba: 41208836 version 0 reqid=186 cver=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1

Inspection of these lines shows that the process goes through a strict sequence of the disks of that appliance, and picks up one request per disk which (obviously) belongs to that disk. If the requests are not evenly divided between the disks, some disks will already have had all their requests pushed, while other disks still need additional requests. In that case, the disks which already have had all their requests pushed will not get a request, so no trans_req_element/Pushing request combination. The process goes through this until all the requests for that appliance are pushed.

After the push of all the requests for that appliance, the following is happening:

kcfis_create_maps_payload. alloc_len 4088 num maps 55
throttle: mappayload: maps_to_push 7 iosize_being_pushed 150953984
kcfis_metadata_payload_len: app_state 0x7f1b50ca67b0 appliance 0x7f1b5078b3a8 payload_len 5968 payload_hdr_len 96 sessiondata_payload_len 144
metadata_payload_len 1536 fmetadata_payload_len 100 maps_len 3992 exthdr_len 8 planpayload_len 48 oflgrppayload_plen 40
kcfis_create_metadata_payload. appliance 0x7f1b5078b3a8 payload 0x7f1b50da3190 payload_memlen 5968 maps 0x7f1b50d07fe0 mapslen 3992
kcfis_create_metadata_payload: pushing sessiondata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing capability payload: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: dop: 1kcfis_create_metadata_payload: pushing metadata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing fast metadata: appliance 0x7f1b5078b3a8
kcfis_push: pushing metadata to appliance 0x7f1b5078b3a8. metadata 0x7f1b50da3190
kcfis_issue_ioctl: payload_type 1
WAIT #139755290955216: nam='cell smart table scan' ela= 178 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707571673
Ioctl completed. Payload type 1
Ioctl quarantine response 1 for appliance 0x7f1b5078b3a8
appliance 0x7f1b5078b3a8 : cellsrv pid: 13487: predicate: /box/predicate735745

The first line shows 55 maps have been pushed to the appliance. The other lines show various memory locations which are needed for gathering the reads which will be sent back by the appliances. Further things which seem important: line 11 issues an ioctl (IO control) request to the appliance, and waits for acknowledgement. Mind the wait is always ‘cell smart table scan’. Line 15 shows this request gets a predicate, which is ‘/box/predicate735745’.

This is repeated for every appliance.

Then the next thing happens:

kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
throttle: allowing map push: appliance 0x7f1b5078b3a8, maps_to_push 7
disk=0x7f1b50c953f0 state=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
kcfis_create_maps_payload. alloc_len 200 num maps 0

The maps payload push is done again for all the appliances, without any request being pushed. The last line confirms no maps/requests have been pushed: num maps 0. However, there is one line which wasn’t there before: line 2, ‘throttle’. More specifically, the important part of this line is ‘allowing map push’; the previous throttle line during ‘kcfis_create_maps_payload’ had the remark ‘mappayload’.

This means that at this point the physical extents to be scanned on the appliances (cell servers) have been identified, translated to appliance, grid disk, offset and size, and the requests for these extents have been sent to the appliances. The last snippet actually means that the appliances are notified to start preparing for sending results back.

After the appliances have been notified, memory is initialised again.

Default: calc numbufs mem 4194304
Final: calc numbufs mem 4194304 buflen 1048576
Alloc buffer: target_freebufs 4 allocated_freebufs 0
Get additional mem: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768
Starting work area: app_state: 0x7f1b50ca67b0 wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Started work area: wds: 0x7f1b50ca6768 mmwds 0x22446a1a8
Get additional mem for pga_aggregate_target: max 4195552 min 1048888, wds 0x7f1b50ca6768 mmwds 0x22446a1a8
cur size 0
Change req: expected size 4196352 cur size 0 max_mem (KB) 4098
Max memory allocation ok: max 4195552, expected 4196352, cur 0
Memlen allowed 4195552 io_buflen 1048576 chunk_len 1048888
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 4212028
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 4244796
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 4212028
Calling oss_cread: appliance 0x7f1b5078b3a8 app_buffer: 0x7f1b50659000 databuf: 0x7f1b50559000 buflen: 1048576 (posted to 192.168.12.8/)
appliance 0x7f1b5078b3a8 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50c94e78 app_buffer: 0x7f1b50549000 databuf: 0x7f1b50449000 buflen: 1048576 (posted to 192.168.12.11/)
appliance 0x7f1b50c94e78 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d059a0 app_buffer: 0x7f1b50439000 databuf: 0x7f1b50339000 buflen: 1048576 (posted to 192.168.12.9/)
appliance 0x7f1b50d059a0 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d02948 app_buffer: 0x7f1b50a7a000 databuf: 0x7f1b5097a000 buflen: 1048576 (posted to 192.168.12.10/)
appliance 0x7f1b50d02948 total creads 1 new creads 1 read seqno 0 pending reads 0
kcfis wait: buf: 0x7f1b50549000 app_state: 0x7f1b50ca67b0 err: (0) Success
kcfis wait: buf: 0x7f1b50659000 app_state: 0x7f1b50ca67b0 err: (0) Success
WAIT #139755290955216: nam='cell smart table scan' ela= 59 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707578411
kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160

First memory areas are initialised (lines 1-24), then we see lines showing ‘Calling oss_cread’. The oss_cread call is the call to the appliances to start sending a resultset back. Please mind that despite the calls addressing specific extents, this is a smartscan, so resultsets are sent back instead of Oracle blocks. Also, since this is Exadata using the iDB/RDS protocol over infiniband, the appliances can use RDMA to send the results back, which means the cells can fill the server process’ memory directly.

After oss_cread has been called, the ‘kcfis_create_maps_payload’ routine (shown in the snippet before the last one) is executed for every appliance, apparently to indicate all the disks are enabled, possibly to throttle activity, and/or to indicate that requests will be fetched from this session. This seems to be repeated for every roundtrip during the entire smartscan, for all the appliances that are still needed.

Whenever a result (result set) is ready to be processed, the following sequence happens:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

This shows the reaping of a result set returned by an appliance. Lines 5-6 show appliance specific information. Line 7 shows important information: ‘nelem’ shows the number of extents (called ‘elements’ in this context) for which results are returned; of course ‘nelem’ means ‘number of elements’. In this case it’s 1 (a resultset from one specific element/extent). Line 15 shows the actual extent from which the result set came back, because the reqid, which was defined during the translation phase, is exposed. This snippet ends with a WAIT line (again: all the waits are ‘cell smart table scan’). I consider this a cyclic process: first the ‘kcfis_create_maps_payload’, then calling oss_cread for one or multiple cells, then a wait, or the above processing of results for one or multiple appliances, and a wait.

I’ve created this snippet to be as simple as possible; in real life, result sets of multiple appliances can be reaped at once (in my case I had to remove a second result). The processing of a result set is done in a few stages, so a resultset is not processed per appliance; rather, each processing stage is done for all the result sets of all the appliances.

Also, this example shows only one request in the reaped result; there can be multiple requests (reqids/extents) returned:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50439000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50d059a0 read seqno 1 pending reads 1
appliance 0x7f1b50d059a0 total creads 0 re-adjusted: read seqno 1 pending reads 1
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50439000, databuf: 0x7f1b50339000, nelem: 2
Dump of memory from 0x00007F1B5033B000 to 0x00007F1B5033B040
7F1B5033B000 0000A23C 02748C02 5BB31713 0402099A  [<.....t....[....]
7F1B5033B010 00988672 001E0001 0000007E 0000007E  [r.......~...~...]
7F1B5033B020 0000007E 00000000 00000000 00000000  [~...............]
7F1B5033B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50339080
kcfis_validate_translation: request 0x7f1b50d0b980
Req completed is : err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1 block_id=41192450
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b5033b000 afn 9 rdba 41192450 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 41192450 good: 1
Request 0x7f1b50d0b980 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b503390c8
kcfis_validate_translation: request 0x7f1b50d114e0
Req completed is : err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1 block_id=53758466
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b50342e98 afn 12 rdba 53758466 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 53758466 good: 1
Request 0x7f1b50d114e0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1
memptr (nil) len 0, blockid = 41192450
kcfis_get_next_data: elem = 1, err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1
memptr 0x7f1b5033b000 len 32408, blockid = 53758466
kcfis_get_next_data: dptr: 0x7f1b5033b000 len: 64816 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50439000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50439000
WAIT #139755290955216: nam='cell smart table scan' ela= 10 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707616790

(example with ‘nelem’=2)

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b5078b3a8 read seqno 1 pending reads 0
appliance 0x7f1b5078b3a8 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 1
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03522293 5BB316ED 0402099A  [<...."R....[....]
7F1B5055B010 009849CB 001E0001 0000000D 0000000D  [.I..............]
7F1B5055B020 0000000D 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d0a000
Req completed is : err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1 block_id=55714451
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55714451 dlen: 3480 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 55714451 good: 1
Request 0x7f1b50d0a000 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1
memptr (nil) len 0, blockid = 55714451
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 3480 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

(example of result sets returned by two appliances, for which the returned results are processed)

Also, when the extent to be scanned is larger, the result set is processed in multiple steps:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 2 pending reads 1
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 2 pending reads 1
throttle: received: maps_to_push 0 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 3
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03523080 5BB3178D 0402099A  [<....0R....[....]
7F1B5055B010 0098FBC9 001E0001 00000080 00000080  [................]
7F1B5055B020 00000080 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d145c0
Req completed is : err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1 block_id=55718016
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_validate_translation: REQ3: splitting from bottom
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55718016 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 55718016 good: 1
Request 0x7f1b50d145c0 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b505590c8
kcfis_validate_translation: request 0x7f1b50d13d40
Req completed is : err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1 block_id=41195392
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_oss_block_verify: bp: 0x7f1b50563098 afn 9 rdba 41195392 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 41195392 good: 1
Request 0x7f1b50d13d40 done
kcfis_process_completed_buffer: 2 elem: 0x7f1b50559110
kcfis_validate_translation: request 0x7f1b50d13f60
Req completed is : err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1 block_id=59912064
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5056b130 afn 14 rdba 59912064 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 59912064 good: 1
Request 0x7f1b50d13f60 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1
memptr (nil) len 0, blockid = 55718016
kcfis_get_next_data: elem = 1, err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1
memptr 0x7f1b5055b000 len 32920, blockid = 41195392
kcfis_get_next_data: elem = 2, err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1
memptr 0x7f1b5055b000 len 65840, blockid = 59912064
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 98760 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707638819

(example with nelem=3, with bigger extents, for which the results are split)

Once the result is reaped for a certain appliance, a new oss_cread call must be done in order for another result to be pushed to the server process. Requests (reqids or extents) are not sent back in order.

Once the requests are exhausted for a certain appliance, the appliance is excluded from the ‘kcfis_create_maps_payload’ procedure.

Once the smartscan is complete, the memory areas are cleaned up, and the sessions are closed. Apparently a session or some session state information is kept per grid disk; these are all closed (the closing of one appliance is shown in this snippet):

kcfis_read DONE - ret NULL
kcfis_finalize: app_state 0x7f1b50ca67b0 permflags 0x9000 clnflags 0x1f
Appliance 0x7f1b5078b3a8 in-active. Active count 3
Appliance 0x7f1b50c94e78 in-active. Active count 2
Appliance 0x7f1b50d059a0 in-active. Active count 1
Appliance 0x7f1b50d02948 in-active. Active count 0
Caching: session at the end of scan 1
Work area gc: 0x7f1b50ca6768, app_state: 0x7f1b50ca67b0
Freeing read buffer chunk (GC) 0x7f1b50448eb0. count 4
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
Freeing read buffer chunk (GC) 0x7f1b50338eb0. count 3
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
Freeing read buffer chunk (GC) 0x7f1b50558eb0. count 2
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
Freeing read buffer chunk (GC) 0x7f1b50979eb0. count 1
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 0
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 32768
Freeing memory to top level heap: num_freed 4
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 0
Work area cleanup start: global_ctx: 0x7f1b50ca6d20
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 app_state: 0x7f1b50ca67b0 refcnt: 0
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 refcnt: 0
Closed work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Deallocating kcfis wds: 0x7f1b50ca6768
Cleanup work area for app_state: 0x7f1b50ca67b0
kcfis_finalize_cached_sessions: global_ctx 0x7f1b50ca6d20
Caching: in kcfis_finalize_cached_sessions global ctx 0x7f1b50ca6d20 total cached 1 cached in ctx 1
Deallocating session. app state 0x7f1b50ca67b0 num cached sessions 0
In kcfis_deallocate_session: app_state: 0x7f1b50ca67b0
KCFIS: [NSMTIO]:SQL for this (non)Smart I/O session is: 
select count(*) from t
kcfis_cache_appliance_fd: Cached appliance fd 0x7f1b50da2b88  for appliance 0x7f1b5078b3a8
Appliance 0x7f1b5078b3a8 was not active. Active count 0
kcfis_deallocate_session: Freeing disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507908e0 name = o/192.168.12.8/DATA_CD_01_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507903f0 name = o/192.168.12.8/DATA_CD_05_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50d04fc0 name = o/192.168.12.8/DATA_CD_10_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfe960 name = o/192.168.12.8/DATA_CD_03_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfdf80 name = o/192.168.12.8/DATA_CD_09_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfc930 name = o/192.168.12.8/DATA_CD_00_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc8118 name = o/192.168.12.8/DATA_CD_11_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc7ae0 name = o/192.168.12.8/DATA_CD_08_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc5ab0 name = o/192.168.12.8/DATA_CD_02_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc3a80 name = o/192.168.12.8/DATA_CD_04_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc30a0 name = o/192.168.12.8/DATA_CD_07_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis rcv update : op: 6 val: 8192 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 24576

And the appliances are notified the smartscan has ended:

kcfis_close_all_appliance_fds. Cache count 4
kcfis_close_appliance_fd: appliance_fd 0x7f1b50da2b88
WAIT #139755290955216: nam='cell smart table scan' ela= 241 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707978094
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5db00
WAIT #139755290955216: nam='cell smart table scan' ela= 237 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707978356
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5df58
WAIT #139755290955216: nam='cell smart table scan' ela= 280 cellhash#=674246789 p2=0 p3=0 obj#=61458 tim=1593707978668
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5e3b0
WAIT #139755290955216: nam='cell smart table scan' ela= 242 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707978935

Summary.
This blogpost tries to summarise the different steps in an Exadata smartscan. One of the most important things it shows is that everything that is done is covered by a single wait event ('cell smart table scan'). In other words: profiling this wait event tells you little to nothing about what is actually happening, except that a smartscan is being processed. So if you get high 'cell smart table scan' waits, either for all of the waits or for a few, the only way to pinpoint what the wait time actually covers is to run this trace again and see at which step the time is spent. Of course you can pinpoint whether the waiting is happening on a specific cell/appliance by looking at cellhash# in the wait line.

Disclaimer:
Please mind I tried to use my knowledge of Oracle and Exadata processing together with the information the trace provided to build this description. If you encounter anything which is incorrect, please comment on this post, and I will try to get it fixed. No bits were harmed during testing.

This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking at the "devio_stats" dump, there are a few other things worth noticing. The lines with statistics do not have a timestamp or any other time indicator; it's only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it covers the IO device statistics the cell keeps for the last 1800 seconds (30 minutes), but if you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of 3 seconds.
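
To verify the sample count yourself, you can count the statistics lines for a single device in the dump tracefile (a sketch, assuming a single dump in the file; the tracefile name is the one from the grep example above, and the count should come out at 599):

# count the 3-second samples kept for /dev/sda
awk '/Device - \/dev\/sda$/ {f=1; next} /Device -/ {f=0} f && /^[0-9]+\./ {n++} END {print n}' /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc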

Two other things are listed in the statistics. The first is 'trigerConfine' (which probably should be "triggerConfine"), a mechanism for Oracle to manage underperforming disks.
The other is "DMWG". At this moment all I am aware of is that DMWG means "Disk Media Working Group", and that it works with the concept of peers.

To get a better understanding of the difference between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. You can exchange ServiceTime for svctm of iostat, or "storage wait" as Bart calls it, and Latency for await, or "host wait" as Bart calls it.
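
On the cell operating system itself, the same two figures can be watched live with iostat, where svctm roughly corresponds to ServiceTime and await to Latency (a sketch, using a 3 second interval to match the cell's own snapshot interval):

iostat -x 3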

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata brings (part of the) processing potentially closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally can be flash.

But with Exadata, you either do normal, alias regular, IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads are hopefully coming from the flash cache, which can only be seen by looking at the statistic "cell flash cache read hits" in v$sysstat/v$sesstat, not directly in the IO related views. To understand the composition of the response time of a smartscan, there is even less instrumentation in the database (for background, look at this blogpost, where it is shown that the smartscan wait does not detail any of the steps done in a smartscan). In other words: if you experience performance differences on Exadata, and the waits point towards IO, there's not much analysis that can be done to dig deeper.
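
To check whether your session's single block reads are served from the flash cache, the statistic can be queried with the usual v$mystat/v$statname join (a sketch for your own session):

select s.name, m.value
from v$statname s, v$mystat m
where m.statistic# = s.statistic#
and s.name = 'cell flash cache read hits';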

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

- IO size by number of reads and writes
- IO size versus latency for reads and writes
- IO size versus pending IO count for reads and writes
- IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, the cellcli provides the name of the trace file where the requested dump has been written to. If we look inside this trace file, this is what an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd is that the name of the celldisk (CD_02_enkcel01 here) is printed below the first table (IO lengths) about that celldisk(!)

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan and took the 'iolstats' dump again:

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!

When you are administering one or more Exadatas, you probably have multiple databases running on different database or "computing" nodes. In order to understand what kind of IO you are doing, you can look inside the statistics of your database, and look in the data dictionary what that instance or instances (in case of RAC) have been doing. When using Exadata there is a near 100% chance you are using either normal redundancy or high redundancy, and most people know about the "write amplification" of both: the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy. This means there might be a difference between the IOs you measure or assume your database is doing, and what is actually done at the storage level.

But what if you want to know what is happening on the storage level, so on the level of the cell, or actually "cellsrv", which is the process which makes IO flow to your databases? One option is to run "iostat -x", but that gives a list that is quite hard to read (too many disk devices); and it doesn't show you what the reason for the IO was: redo write? controlfile read? archivelog? This would especially be great if you want to understand what is happening if your IO behaves differently than you expect, and you've ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell. To generate a dump with an overview of what kind of IOs are done, use "cellcli" locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for searching your dump in the thread-logfile: search ("/" when you use "less" for it) by entering the following (using the above example, with sequence #18): "/sequence\ #18", without the double quotes.
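
Alternatively, grep can lift the dump straight out of the tracefile (a sketch; "-A 35" is an arbitrary number of trailing lines, adjust to taste):

grep -A 35 'Dump sequence #18' /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc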

This is an example from a cell in the Enkitec lab, which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are numbers of IOs; they don't say anything about the size of the IOs. Also please mind these are the numbers of a single cell; you probably have 3, 7 or 14 cells.

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these numbers may grow very big. In order to make an easy baseline, the IO reason numbers can be reset, so you can start off your test or proof-of-concept run and measure what actually has happened on the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.
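
A baseline workflow for a test could then look like this (a sketch using the cellcli "-e" option; depending on your shell, the quoting may need tweaking):

# reset the IO reason table on this cell
cellcli -e "alter cell events = \"immediate cellsrv.cellsrv_resetstats(ioreasons)\""
# ...run your test or proof-of-concept workload...
# dump the IO reason table, then read the dump in the thread-logfile it reports
cellcli -e "alter cell events = \"immediate cellsrv.cellsrv_dump('ioreasons',0)\""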

PS1: Thanks to Nikolay Kovachev for pointing out the 'ioreasons' resetstats parameter. Indeed 'all' is way too blunt.
PS2: The IO numbers seem to be the number of IO requests the cell has gotten from its clients (ASM and database) for data, not for metadata. During a smartscan, metadata flows between the database and the cell server before data is actually served.

Exadata gets its performance by letting the storage (the Exadata storage server) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage server in query processing means that a storage grid can process a smart scan request massively in parallel (depending on the number of storage servers participating).

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the 'passthrough' functionality in the storage server. In the Exadata book, it is explained that this functionality came with storage server version 11.2.2.3.0 and Oracle database version 11.2.0.3 with Exadata bundle patch 7. The 'passthrough' functionality means that the storage server deliberately starts sending non-storage-processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data and needs to do more processing itself. The database-side statistic to know this is happening is "cell physical IO bytes sent directly to DB node to balance CPU usage", which is at the database level in v$sysstat and at the session level in v$sesstat.
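
This statistic can be watched for your own session with the v$mystat/v$statname join (a sketch; a nonzero and growing value means the cells are pushing data back unprocessed):

select s.name, m.value
from v$statname s, v$mystat m
where m.statistic# = s.statistic#
and s.name = 'cell physical IO bytes sent directly to DB node to balance CPU usage';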

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few “underbar” or “undocumented” parameters in the storage server. In order to get the (current) values of the Exadata storage server, the following command can be used on the “cellcli”:

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following parameters:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50
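
Instead of paging through the whole tracefile, these parameters can also be lifted out directly (a sketch; the tracefile name comes from the echo above):

grep _cell_mpp /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc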

The "MPP" function is responsible for the passthrough functionality. I can't find anywhere what "MPP" means; my guess is "Measurement (of) Performance (for) Passthrough". These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU: "2" means every 200ms.
_cell_mpp_threshold seems to be the CPU usage threshold (in percent) above which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (I am unsure what the exact granularity is) which are sent to the database in passthrough mode.

In order to get a good understanding of what MPP does, there is an MPP-specific dump which can be very helpful in diagnosing MPP related matters. This dump is available on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header which is specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in this timestamp (10:54:05.994) is roughly 3 hours and 40 minutes earlier than the timestamp with the next messages (14:36:04.910). The line with the three asterisks is the creation timestamp of the tracefile, which is when the thread we are using was created. The next line was also created just after that time; it lists the LWPID, userId, etc.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line tells us important information: during the approximately 30 minutes covered by this dump, 4232 blocks were processed by this cell, and 0 blocks were "executed" in "passthru" mode.

Next, the measurements which were taken every 200ms are printed, to get an exact overview of the measured CPU busyness and the rate at which "pushback", alias "passthru", was applied.

To see what this means, let's generate CPU busyness on the cell, and see if we can get the storage server to invoke "passthru". There is a simple trick to make a process take 100% of its CPU thread with common Linux shell tools: 'yes > /dev/null &'. The storage server I use has 16 CPU threads, so I start 16 of these processes, to effectively monopolise every CPU thread.
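
In shell terms, that looks like this (a sketch; the 16 should match the number of CPU threads on your storage server, and do not forget the killall afterwards):

# start 16 CPU-burning processes
for i in $(seq 1 16); do yes > /dev/null & done
# ...run the scan and take the mpp_stats dump...
# stop them again
killall yes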

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described in this blog.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sent 2728 blocks out of the total of 13287 blocks in passthrough mode. Further, in the lines which contain the historical measurements, the "pushback rate" can be seen climbing up to 50%, because the CPU usage was above 90%.

Please mind the techniques I've described here are done on one storage server, while a normal Exadata setup has 3 (eighth/quarter rack), 7 (half rack) or 14 (full rack) storage servers.

The purpose of this post is to show what the wait event ‘cell smart table scan’ means, based on reproducible investigation methods.

First of all, if you see the 'cell smart table scan' event: congratulations! This means you are using your Exadata the way it's intended to be used, which means your full table scan is offloaded to the cells (storage layer), and potentially all kinds of optimisations are happening, like column filtering, predicate filtering, storage indexing, etc.

But what exactly is happening when you see the wait event 'cell smart table scan'? Can we say anything based on this wait, like you can with other wait events?

If you have access to an exadata database machine, and have a table that is big enough to be smartscanned by the database, issue the following statements:
(Please mind that I've found these statements to be harmless, but execute them at your own risk. As with any execution of SQL: this uses resources on the Exadata, and could influence performance.)

SQL> alter session set tracefile_identifier='frits'; --make it easy to find the tracefile
SQL> alter session set events 'trace[libcell.client_library.*]';
SQL> alter session set events 'sql_trace level 8';
SQL> --- now issue a full table scan on a table 
SQL> select /*+ noparallel */ count(*) from bigtable1;

  COUNT(*)
----------
  63938560
SQL> exit

Now take the tracefile and look inside!

WAIT #139739525441360: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752114437651

*** 2013-10-14 07:02:01.516
WAIT #139739525441360: nam='SQL*Net message from client' ela= 7078360 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121516483
CLOSE #139739525441360:c=0,e=13,dep=0,type=1,tim=1381752121516628
=====================
PARSING IN CURSOR #139739525440176 len=57 dep=0 uid=0 oct=3 lid=0 tim=1381752121516762 hv=2246504030 ad='85b3e090' sqlid='gmm4k0y2ydvky'
select /*+ noparallel */ count(*) from bigtable1
END OF STMT
PARSE #139739525440176:c=0,e=83,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516760
EXEC #139739525440176:c=0,e=87,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516992
WAIT #139739525440176: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121517053
WAIT #139739525440176: nam='reliable message' ela= 2053 channel context=2700762248 channel handle=2700440152 broadcast message=2669087352 obj#=-1 tim=1381752121519402
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 330 name|mode=1263468545 2=65571 0=2 obj#=-1 tim=1381752121520594
WAIT #139739525440176: nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=6 filetype=2 obj#=-1 tim=1381752121520664

*** 2013-10-14 07:02:01.525
initializing OSS
ossnet_optip_trace: Host is not NUMA. Returning no IP preference

This is the start of my tracefile (but yours could be a bit different). First the usual SQL*Net waits, then the query which gets parsed and executed, then the wait for ‘enq: KO – fast object checkpoint’, which is an indication we are going to do direct path IO, because the KO enqueue indicates this object is checkpointed to make sure that the version of the blocks on disk is current.

Then there's 'initializing OSS'. This is where the communication with the Exadata storage servers is started. OSS probably means 'Oracle Storage Server' in this context. The next line tells us something nice: apparently, the code in the Oracle database that behaves as a client of the Exadata storage server is NUMA (Non Uniform Memory Access) aware. However, on Exadata versions "dash two" (X2-2, X3-2), NUMA is turned off on the compute layer (the kernel boot line in /etc/grub.conf explicitly disables NUMA with "numa=off").

Now let’s look in the tracefile, but to get a better idea of what is happening, let’s filter the tracefile information. Issue the following command on the tracefile:

grep -e ^oss tracefile.trc | grep -v done | less

The first part looks something like this:

ossnet_optip_trace: Host is not NUMA. Returning no IP preference
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_09_enkcel01 with guid 217e470e-d3f5298-1ffed28e-a00133d2
oss_open called for o/192.168.12.5/DATA_CD_00_enkcel03 with guid 78804872-173010d6-2c12c6d2-b7bc6780
oss_open called for o/192.168.12.5/DATA_CD_08_enkcel03 with guid fd1f4fce-15180efd-83fbde2f-8465401c
oss_open called for o/192.168.12.4/DATA_CD_00_enkcel02 with guid ead44d03-e64725e4-c25a79b5-bff05885
oss_open called for o/192.168.12.4//box/predicate
oss_open called for o/192.168.12.5/DATA_CD_05_enkcel03 with guid 20c344ef-53a0387b-c2dbd306-ad971476
oss_open called for o/192.168.12.4/DATA_CD_08_enkcel02 with guid e2294076-cce51fa-edaecc3f-a966ca8e
oss_open called for o/192.168.12.4/DATA_CD_09_enkcel02 with guid 3073459f-8f6928a0-c70f77dc-9ed9e840
oss_open called for o/192.168.12.5/DATA_CD_10_enkcel03 with guid 73b44c5a-a44544b2-82933066-8cf0410a

I've included the NUMA line again. Next we see 'oss_open' lines. For every griddisk there's a line to indicate that the 'disk' is identified. Also, when a storage server is encountered for the first time during a smart scan, we see a line with "/box/predicate", which probably sets up a communication channel to the storage server to push the predicate information, so the storage server can use it. It seems logical to me that information about the storage environment is gathered first by the foreground process that is going to use it.
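
To see how the opens are distributed over the storage servers, the oss_open lines can be counted per cell IP address (a sketch, run against the same tracefile as the grep above; the second field of the "o/" path is the cell's IP):

grep '^oss_open called for o/' tracefile.trc | awk -F/ '{ print $2 }' | sort | uniq -c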

The next phase looks like this:

oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 4, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 2, opcode: 99
oss_ioctl open file: 4, opcode: 99
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_wait called for request: 0xdcb52e0
oss_wait called for request: 0xdcb59f0

This phase seems to execute "oss_ioctl", and after every ioctl there seems to be a corresponding "oss_wait" line. "ioctl" probably means Input Output Control. I am not sure what happens during the ioctl phase. The number of ioctl requests (19 in my case) is too low for one ioctl per griddisk, and the number of extents is much higher than that, too.

The next phase is executing “oss_cread”:

oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38

My guess (again) is that "cread" means compressed read, and it seems to be the indication that the database is reading the results of an offloaded query. When a non-offloaded query is executed, the OSS read function is "oss_read", which is referred to as "BREAD" in the tracefile (block(mode) read). The offset of oss_cread always seems to be "0", and the size seems to be "1048576" (1MB) in all cases. The line with 'failed' does not seem to indicate a problem.

The IOCTL and CREAD’s seem to be mixed for a little while, after which the CREAD’s are the only ones left.

So, an offloaded query has an open, ioctl and cread phase. Now let's add in the waits:

grep -e ^oss -e ^WAIT tracefile.trc | grep -v done | less

This is an example with different snippets to indicate what the ‘cell smart table scan’ means:

oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 526 cellhash#=3176594409 p2=0 p3=0 obj#=15909 tim=1381752121558781
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152
oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
WAIT #139739525440176: nam='cell smart table scan' ela= 9249 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122240642
oss_wait called for request: (nil)
oss_wait called for request: (nil)
WAIT #139739525440176: nam='cell smart table scan' ela= 197 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122250860

What we see is that during the entire "life" of a smartscan, throughout all three phases, the same wait event 'cell smart table scan' is used. This means that there's little that can be derived from the wait event, except for the fact that a smartscan is used.

This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table are already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let's take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM).

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011
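
The value shown above can be obtained with a query like this (a sketch; underscore parameters are exposed via x$ksppi/x$ksppcv, which requires connecting as sysdba):

select i.ksppinm, v.ksppstvl
from x$ksppi i, x$ksppcv v
where i.indx = v.indx
and i.ksppinm = '_small_table_threshold';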

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take small table threshold (1011) and multiply it by five, we get 5055. This means the size of table T2 is more than enough, so it should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I've created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called 'id'. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which read in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at small table threshold times five (5055), this index is much bigger than that. Also, it is bigger than table T2. Let's execute select count(id) from bigtable, and look at the statistic 'index fast full scans (direct read)':

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn't been a direct path read! This means that this read has been done in the "traditional way". This is a bit… counterintuitive. I've traced the session, and indeed it's doing the traditional multiblock reads, visible via the 'db file scattered read' waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter "_very_large_object_threshold". The information found on the internet reports this value is in megabytes. A quick stroll through a number of different databases (all on 11.2.0.3) shows this parameter is quite probably statically set at "500".

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.
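
For completeness, changing such an underscore parameter would look like the statement below (a sketch; I have not verified at which level this parameter is modifiable, and underscore parameters should only be changed on test systems or at the direction of Oracle support):

alter system set "_very_large_object_threshold" = 100;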

This is interesting, because it looks like this parameter does the same for full scans on index segments as "_small_table_threshold" does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times the parameter value.

There are also differences: small table threshold is set in blocks, while (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), while very large object threshold seems to be fixed at 500. If my finding is correct, it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It's unknown to me if the 50% limit for blocks in the cache and the 25% limit for dirty blocks apply here too.

When an Oracle process starts executing a query and needs to do a full segment scan, it needs to decide whether it's going to use 'blockmode', the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either "cached" (read from disk and put in the database buffercache) or "direct" (read from disk and put in the process' PGA), or 'offloaded mode', where part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is 'kcfis'; an educated guess is that it means Kernel Cache File Intelligent Storage. Does a "normal", alias non-Exadata, database ever use the 'kcfis' layer? My first guess would be 'no', but we all know guessing takes you nowhere (right?). Let's see if a "normal" database uses the 'kcfis' functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way I am aware of to profile functions inside the Oracle executable is using 'gdb' and breaking on those functions:
(the process id shown below ought to be the process id of the oracle database process you are going to execute in. Do not, I repeat: do not, do this with other processes, especially ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we got the debugger set, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me state upfront that I do not have source code of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (Exadata's internal name is/was 'sage'). Nice, so we hit the layer for determining if the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function 'kcfis_tablespace_is_on_sage', and a call to 'kcfis_capability_tab_get'. The latter function probably probes the table (but it could be the tablespace) to get its capabilities. This could be the function which checks the requirements for hybrid columnar compression, but I am not sure about that.

At what point during the processing of the full segment scan does kcfis_tablespace_is_on_sage occur? One way of investigating this is profiling some functions we know a (direct path) full scan does, and seeing where kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A 'SQL*Net message to client' wait
- A 'db file sequential read' wait (for reading the segment header)
- Potentially an 'asynch descriptor resize' wait
- The full scan is done asynchronously (potentially revealing some 'direct path read' waits)

So if we profile the start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that.
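
Setting these breakpoints could look like this (a sketch, continuing the gdb session shown earlier; the breakpoint numbers in the output below follow the order in which the breakpoints were set):

(gdb) break pread64
(gdb) break kslwtbctx
(gdb) break kslwtectx
(gdb) break kcfis_tablespace_is_on_sage
(gdb) break io_submit
(gdb) break io_getevents_0_4
(gdb) c

With these breakpoints in place, executing the full scan shows: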

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1: the wait ending for 'SQL*Net message from client'.
Rows 2-3: the 'SQL*Net message to client' wait.
Rows 5-6-7: the begin wait, the pread64 for reading the segment header, and the end wait.
Rows 8-9: the session probes for Exadata.
Rows 10-11: start and end of a wait: 'asynch descriptor resize' (verified by the trace file).
Row 12 onwards: here the full scan takes off.

So… a query starts (I've made it unique by adding a dummy hint, so it's parsed again), the SQL*Net roundtrip occurs, the segment header is read, then the process checks if it's on Exadata, which it isn't here, and starts an asynchronous full scan.

So let's repeat this check on a database which is on Exadata! In order to do so, we need to be aware that several calls are not done on Exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue it over infiniband. So I only set breakpoints on kslwtbctx, kslwtectx and kcfis_tablespace_is_on_sage. This is what that looks like:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of exadata is checked. This is different from the non-exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see is line 23, backtrace layer number 21: kksParseCursor. So the detection of storage servers actually already happens during parsing, not at the point where the full segment scan starts and the decision to do a smartscan or not needs to be made.
