The Exadata wait event ‘cell smart table scan’
The purpose of this post is to show what the wait event ‘cell smart table scan’ means, based on reproducible investigation methods.
First of all, if you see the ‘cell smart table scan’ event: congratulations! This means you are using your Exadata as it’s intended to be used: your full table scan is offloaded to the cells (storage layer), and potentially all kinds of optimisations are happening, like column filtering, predicate filtering, storage indexing, etc.
But what exactly is happening when you see the wait event ‘cell smart table scan’? Can we say anything based on these waits, like we can with other wait events?
If you have access to an Exadata database machine, and have a table that is big enough to be smartscanned by the database, issue the following statements:
(Please mind that I’ve found these statements to be harmless, but execute them at your own risk. As with any execution of SQL: this uses resources on the Exadata, and could influence performance.)
SQL> alter session set tracefile_identifier='frits'; --make it easy to find the tracefile
SQL> alter session set events 'trace[libcell.client_library.*]';
SQL> alter session set events 'sql_trace level 8';
SQL> --- now issue a full table scan on a table
SQL> select /*+ noparallel */ count(*) from bigtable1;

  COUNT(*)
----------
  63938560

SQL> exit
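The tracefile_identifier set in the first statement makes the resulting tracefile easy to spot. A quick way to locate it from the shell is something like the following (this assumes the default ADR trace directory underneath $ORACLE_BASE; the exact path depends on your diagnostic_dest setting, so adjust it for your environment):

# newest tracefile containing the identifier 'frits'
ls -lt $ORACLE_BASE/diag/rdbms/*/*/trace/*frits*.trc | head -1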
Now take the tracefile and look inside!
WAIT #139739525441360: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752114437651

*** 2013-10-14 07:02:01.516
WAIT #139739525441360: nam='SQL*Net message from client' ela= 7078360 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121516483
CLOSE #139739525441360:c=0,e=13,dep=0,type=1,tim=1381752121516628
=====================
PARSING IN CURSOR #139739525440176 len=57 dep=0 uid=0 oct=3 lid=0 tim=1381752121516762 hv=2246504030 ad='85b3e090' sqlid='gmm4k0y2ydvky'
select /*+ noparallel */ count(*) from bigtable1
END OF STMT
PARSE #139739525440176:c=0,e=83,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516760
EXEC #139739525440176:c=0,e=87,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516992
WAIT #139739525440176: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121517053
WAIT #139739525440176: nam='reliable message' ela= 2053 channel context=2700762248 channel handle=2700440152 broadcast message=2669087352 obj#=-1 tim=1381752121519402
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 330 name|mode=1263468545 2=65571 0=2 obj#=-1 tim=1381752121520594
WAIT #139739525440176: nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=6 filetype=2 obj#=-1 tim=1381752121520664

*** 2013-10-14 07:02:01.525
initializing OSS
ossnet_optip_trace: Host is not NUMA. Returning no IP preference
This is the start of my tracefile (yours could look a bit different). First the usual SQL*Net waits, then the query, which gets parsed and executed, then the wait for ‘enq: KO - fast object checkpoint’, which is an indication we are going to do direct path IO: the KO enqueue indicates the object is checkpointed to make sure that the version of the blocks on disk is current.
Then there’s ‘initializing OSS’. This is where the communication with the Exadata storage servers is started. OSS probably means ‘Oracle Storage Server’ in this context. The next line tells us something nice: apparently, the code in the Oracle database that acts as a client of the Exadata storage servers is NUMA (Non Uniform Memory Access) aware. However, on Exadata versions “dash two” (X2-2, X3-2), NUMA is turned off on the compute layer (the kernel boot line in /etc/grub.conf explicitly disables NUMA with “numa=off”).
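If you want to verify this on your own compute nodes, you can look at the boot parameters the running kernel was started with, and at the NUMA layout the OS reports. This is just a quick sketch; it assumes numactl is installed on the node, which is usually, but not necessarily, the case:

# does the running kernel have NUMA explicitly disabled?
grep -o numa=off /proc/cmdline
# show the NUMA layout as seen by the OS; with numa=off only a single node shows up
numactl --hardware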
Now let’s look in the tracefile, but to get a better idea of what is happening, let’s filter the tracefile information. Issue the following command on the tracefile:
grep -e ^oss tracefile.trc | grep -v done | less
The first part looks something like this:
ossnet_optip_trace: Host is not NUMA. Returning no IP preference
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_09_enkcel01 with guid 217e470e-d3f5298-1ffed28e-a00133d2
oss_open called for o/192.168.12.5/DATA_CD_00_enkcel03 with guid 78804872-173010d6-2c12c6d2-b7bc6780
oss_open called for o/192.168.12.5/DATA_CD_08_enkcel03 with guid fd1f4fce-15180efd-83fbde2f-8465401c
oss_open called for o/192.168.12.4/DATA_CD_00_enkcel02 with guid ead44d03-e64725e4-c25a79b5-bff05885
oss_open called for o/192.168.12.4//box/predicate
oss_open called for o/192.168.12.5/DATA_CD_05_enkcel03 with guid 20c344ef-53a0387b-c2dbd306-ad971476
oss_open called for o/192.168.12.4/DATA_CD_08_enkcel02 with guid e2294076-cce51fa-edaecc3f-a966ca8e
oss_open called for o/192.168.12.4/DATA_CD_09_enkcel02 with guid 3073459f-8f6928a0-c70f77dc-9ed9e840
oss_open called for o/192.168.12.5/DATA_CD_10_enkcel03 with guid 73b44c5a-a44544b2-82933066-8cf0410a
I’ve included the NUMA line again. Next we see ‘oss_open’ lines. For every griddisk there’s a line to indicate that the ‘disk’ is identified. Also, when a storage server is encountered for the first time during a smart scan, we see a line with “/box/predicate”, which probably sets up a communication channel to the storage server to push the predicate information, so the storage server can use it. It seems logical to me that the foreground process first gathers information about the storage environment it is going to use.
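To get an idea of how many griddisks are opened per storage server, you can aggregate the oss_open lines by cell IP address. This is just a sketch that relies on the line layout shown above (the IP address is the second field when splitting on “/”):

# count the griddisk opens per cell, leaving out the /box/predicate opens
grep ^oss_open tracefile.trc | grep -v box/predicate | awk -F/ '{ print $2 }' | sort | uniq -c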
The next phase looks like this:
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 4, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 2, opcode: 99
oss_ioctl open file: 4, opcode: 99
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_wait called for request: 0xdcb52e0
oss_wait called for request: 0xdcb59f0
This phase seems to execute “oss_ioctl”, and after every ioctl there seems to be a corresponding “oss_wait” line. “ioctl” probably means Input Output Control. I am not sure what happens during the ioctl phase. The number of ioctl requests (19 in my case) is too low for one ioctl per griddisk, and the number of extents is much higher than that.
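To see how many ioctl requests there are, and how they are distributed over the open files, you can simply count the oss_ioctl lines. Again, this is only a sketch based on the line layout shown above:

# total number of ioctl requests in the trace
grep -c ^oss_ioctl tracefile.trc
# ioctl requests grouped per open file (the lines are identical per file, so uniq -c groups them)
grep ^oss_ioctl tracefile.trc | sort | uniq -c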
The next phase is executing “oss_cread”:
oss_cread open file: 2, offset: 0, size: 1048576
oss_cread open file: 4, offset: 0, size: 1048576
oss_cread open file: 9, offset: 0, size: 1048576
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
My guess (again) is that “cread” means compressed read, and it seems to be the indication that the database is reading the results of an offloaded query. When a non-offloaded query is executed, the OSS read function is “oss_read”, but it is referred to as “BREAD” in the tracefile (block(mode) read). The offset of oss_cread always seems to be “0”, and the size seems to be “1048576” (1MB) in all cases. The line with ‘failed’ does not seem to indicate a problem.
The IOCTLs and CREADs seem to be mixed for a little while, after which only CREADs are left.
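You can check the claim about the request sizes by pulling the size field out of the oss_cread lines. This sketch assumes the exact line layout shown above, where the size is the last field:

# distribution of the oss_cread request sizes
grep ^oss_cread tracefile.trc | awk '{ print $NF }' | sort | uniq -c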
So, an offloaded query has an open, an ioctl and a cread phase. Now let’s add in the waits:
grep -e ^oss -e ^WAIT tracefile.trc | grep -v done | less
This is an example with different snippets to indicate what the ‘cell smart table scan’ means:
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 526 cellhash#=3176594409 p2=0 p3=0 obj#=15909 tim=1381752121558781

oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152

oss_cread open file: 2, offset: 0, size: 1048576
oss_cread open file: 4, offset: 0, size: 1048576
oss_cread open file: 9, offset: 0, size: 1048576
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
WAIT #139739525440176: nam='cell smart table scan' ela= 9249 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122240642
oss_wait called for request: (nil)
oss_wait called for request: (nil)
WAIT #139739525440176: nam='cell smart table scan' ela= 197 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122250860
What we see is that during the entire “life” of a smartscan, throughout all three phases, the same wait event ‘cell smart table scan’ is used. This means that there’s little that can be derived from the wait event itself, except for the fact that a smartscan is being used.
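Still, the waits are not completely useless: the cellhash# parameter tells you which cell you were waiting for, and the ela values tell you how long you waited. A small awk sketch to summarise the ‘cell smart table scan’ waits per cell from the tracefile (the field numbers assume the exact WAIT line layout shown above, where ela is field 8 and cellhash# is field 9):

# sum and count the 'cell smart table scan' waits per cellhash#
grep "nam='cell smart table scan'" tracefile.trc | \
  awk '{ gsub("cellhash#=","",$9); ela[$9]+=$8; cnt[$9]++ }
       END { for (c in ela) printf "cellhash# %-12s waits %6d total ela %10d us\n", c, cnt[c], ela[c] }'

This will not tell you what the cell was actually doing during the wait, but at least it shows how much of the smartscan wait time is attributed to each cell.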