The Exadata wait event ‘cell smart table scan’

The purpose of this post is to show what the wait event ‘cell smart table scan’ means, based on reproducible investigation methods.

First of all, if you see the ‘cell smart table scan’ event: congratulations! This means you are using your Exadata the way it’s intended to be used, which means your full table scan is offloaded to the cells (storage layer), and potentially all kinds of optimisations are happening, like column filtering, predicate filtering, storage indexing, etc.
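
If you want a quick check whether a statement really got offloaded, the offload related columns in v$sql are an easy starting point. A minimal example, assuming 11.2 or later (fill in the sql_id of your own statement):

SQL> select sql_id, io_cell_offload_eligible_bytes, io_interconnect_bytes
     from v$sql where sql_id = '&sql_id';

A non-zero io_cell_offload_eligible_bytes indicates the IO for this statement was eligible to be offloaded to the cells.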

But what exactly is happening when you see the wait event ‘cell smart table scan’? Can we say anything based on this wait, like we can with other wait events?

If you have access to an Exadata database machine, and have a table that is big enough to be smartscanned by the database, issue the following statements:
(Please mind that I’ve found these statements to be harmless, but execute them at your own risk. As with any execution of SQL: this uses resources on the Exadata, and could influence performance.)

SQL> alter session set tracefile_identifier='frits'; --make it easy to find the tracefile
SQL> alter session set events 'trace[libcell.client_library.*]';
SQL> alter session set events 'sql_trace level 8';
SQL> --- now issue a full table scan on a table 
SQL> select /*+ noparallel */ count(*) from bigtable1;

  COUNT(*)
----------
  63938560
SQL> exit
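
If you don’t know where the tracefile is written, you can ask the database itself before exiting; on 11g and later the following query shows the current session’s tracefile (which will also contain the ‘frits’ identifier we set):

SQL> select value from v$diag_info where name = 'Default Trace File';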

Now take the tracefile and look inside!

WAIT #139739525441360: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752114437651

*** 2013-10-14 07:02:01.516
WAIT #139739525441360: nam='SQL*Net message from client' ela= 7078360 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121516483
CLOSE #139739525441360:c=0,e=13,dep=0,type=1,tim=1381752121516628
=====================
PARSING IN CURSOR #139739525440176 len=57 dep=0 uid=0 oct=3 lid=0 tim=1381752121516762 hv=2246504030 ad='85b3e090' sqlid='gmm4k0y2ydvky'
select /*+ noparallel */ count(*) from bigtable1
END OF STMT
PARSE #139739525440176:c=0,e=83,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516760
EXEC #139739525440176:c=0,e=87,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516992
WAIT #139739525440176: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121517053
WAIT #139739525440176: nam='reliable message' ela= 2053 channel context=2700762248 channel handle=2700440152 broadcast message=2669087352 obj#=-1 tim=1381752121519402
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 330 name|mode=1263468545 2=65571 0=2 obj#=-1 tim=1381752121520594
WAIT #139739525440176: nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=6 filetype=2 obj#=-1 tim=1381752121520664

*** 2013-10-14 07:02:01.525
initializing OSS
ossnet_optip_trace: Host is not NUMA. Returning no IP preference

This is the start of my tracefile (yours could look a bit different). First the usual SQL*Net waits, then the query which gets parsed and executed, then the wait for ‘enq: KO – fast object checkpoint’, which is an indication we are going to do direct path IO: the KO enqueue is used to checkpoint the object, to make sure that the version of the blocks on disk is current before they are read directly from disk.
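
As a small aside: the enqueue type and mode are encoded in the ‘name|mode’ value of the wait line. Decoding it is plain arithmetic, nothing Exadata specific; for the first KO wait above (1263468550) this returns ‘KO’ and mode 6:

SQL> select chr(bitand(1263468550, 4278190080)/16777216) ||
            chr(bitand(1263468550, 16711680)/65536) as lock_type,
            bitand(1263468550, 65535) as lock_mode
     from dual;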

Then there’s ‘initializing OSS’. This is where the communication with the Exadata storage servers is started. OSS probably means ‘Oracle Storage Server’ in this context. The next line tells us something nice: apparently, the code in the Oracle database that acts as a client of the Exadata storage server is NUMA (Non Uniform Memory Access) aware. However, on Exadata versions “dash two” (X2-2, X3-2), NUMA is turned off on the compute layer (the kernel boot line in /etc/grub.conf explicitly disables NUMA with “numa=off”).
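
This is easy to verify on a compute node; the boot line of the running kernel is visible in /proc/cmdline (a generic Linux check, nothing Exadata specific):

grep -o numa=off /proc/cmdline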

Now let’s look further into the tracefile. To get a better idea of what is happening, filter the tracefile information with the following command:

grep -e ^oss tracefile.trc | grep -v done | less

The first part looks something like this:

ossnet_optip_trace: Host is not NUMA. Returning no IP preference
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_09_enkcel01 with guid 217e470e-d3f5298-1ffed28e-a00133d2
oss_open called for o/192.168.12.5/DATA_CD_00_enkcel03 with guid 78804872-173010d6-2c12c6d2-b7bc6780
oss_open called for o/192.168.12.5/DATA_CD_08_enkcel03 with guid fd1f4fce-15180efd-83fbde2f-8465401c
oss_open called for o/192.168.12.4/DATA_CD_00_enkcel02 with guid ead44d03-e64725e4-c25a79b5-bff05885
oss_open called for o/192.168.12.4//box/predicate
oss_open called for o/192.168.12.5/DATA_CD_05_enkcel03 with guid 20c344ef-53a0387b-c2dbd306-ad971476
oss_open called for o/192.168.12.4/DATA_CD_08_enkcel02 with guid e2294076-cce51fa-edaecc3f-a966ca8e
oss_open called for o/192.168.12.4/DATA_CD_09_enkcel02 with guid 3073459f-8f6928a0-c70f77dc-9ed9e840
oss_open called for o/192.168.12.5/DATA_CD_10_enkcel03 with guid 73b44c5a-a44544b2-82933066-8cf0410a

I’ve included the NUMA line again. Next we see ‘oss_open’ lines. For every griddisk there’s a line to indicate that the ‘disk’ is identified. Also, when a storage server is encountered for the first time during a smart scan, we see a line with “/box/predicate”, which probably sets up a communication channel to the storage server to push the predicate information, so the storage server can use it. It seems logical to me that the foreground process first gathers information about the storage environment it is going to use.
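
A simple way to verify this against your own tracefile is to count the oss_open lines per cell IP, and the predicate channels separately (this is just text processing on the tracefile):

grep ^oss_open tracefile.trc | grep -v predicate | awk -F/ '{ print $2 }' | sort | uniq -c
grep -c "box/predicate" tracefile.trc

The first command should show the number of griddisks opened per storage server, the second the number of storage servers a predicate channel was set up to.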

The next phase looks like this:

oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 4, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 2, opcode: 99
oss_ioctl open file: 4, opcode: 99
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_wait called for request: 0xdcb52e0
oss_wait called for request: 0xdcb59f0

This phase seems to execute “oss_ioctl”, and after every ioctl there seems to be a corresponding “oss_wait” line. “ioctl” probably means Input Output Control. I am not sure what happens during the ioctl phase. The number of ioctl requests (19 in my case) is too low for one ioctl per griddisk, and the number of extents of the table is much higher, too (so it is not one ioctl per extent either).
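
If you want to do the same comparison for your own tracefile and table, counting the ioctls and the extents is enough (adjust the segment name; dba_extents is just the standard dictionary view):

grep -c ^oss_ioctl tracefile.trc

SQL> select count(*) from dba_extents where segment_name = 'BIGTABLE1';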

The next phase is executing “oss_cread”:

oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38

My guess (again) is that “cread” means compressed read, and it seems to indicate that the database is reading the results of an offloaded query. When a non-offloaded query is executed, the OSS read function is “oss_read”, but it is referred to as “BREAD” in the tracefile (block(mode) read). The offset of oss_cread seems to always be “0”, and the size seems to be “1048576” (1MB) in all cases. The line with ‘failed’ does not seem to indicate a problem.
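
To verify the offset and size observation for your own tracefile, printing the distinct offset/size combinations of all cread lines is enough (again just text processing):

grep ^oss_cread tracefile.trc | awk '{ print $6, $8 }' | sort | uniq -c

If offset and size are indeed constant, this returns a single line, with the total number of creads in front of it.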

The IOCTLs and CREADs seem to be mixed for a little while, after which only CREADs are left.
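
The easiest way to see this interleaving is to look at the relative position of both calls in the tracefile, for example by adding the tracefile line numbers:

grep -n -e ^oss_ioctl -e ^oss_cread tracefile.trc | less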

So, an offloaded query has an open, ioctl and cread phase. Now let’s add in the waits:

grep -e ^oss -e ^WAIT tracefile.trc | grep -v done | less

These are a few snippets from the tracefile that show what the wait event ‘cell smart table scan’ covers:

oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 526 cellhash#=3176594409 p2=0 p3=0 obj#=15909 tim=1381752121558781
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152
oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
WAIT #139739525440176: nam='cell smart table scan' ela= 9249 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122240642
oss_wait called for request: (nil)
oss_wait called for request: (nil)
WAIT #139739525440176: nam='cell smart table scan' ela= 197 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122250860

What we see is that during the entire “life” of a smartscan, in all three phases (open, ioctl and cread), the same wait event ‘cell smart table scan’ is used. This means that there is little that can be derived from the wait event itself, except for the fact that a smartscan is being used.
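
This also means that if you want to know what a smartscan actually did for your session, the wait event is not the place to look; the smartscan related session statistics are. A minimal example, assuming 11.2 statistic names (v$statname on your version may contain slightly different or additional cell statistics):

SQL> select sn.name, ms.value
     from v$mystat ms, v$statname sn
     where ms.statistic# = sn.statistic#
     and sn.name in ('cell physical IO bytes eligible for predicate offload',
                     'cell physical IO interconnect bytes returned by smart scan');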
