
I guess everybody who works with Oracle databases and has been involved with Oracle Exadata in any way knows about smartscans. It is the smartscan that makes the magic happen: full segment scans with sometimes enormously reduced scan times. The Oracle database performs smartscans through something that is referred to as ‘offloading’. This is all generally known information.

But how does that work? I assume more people are like me, and are eager to understand exactly how it works. But information on smartscans is extremely scarce. Of course there is the public Oracle material, which looks technical but is little or nothing more than marketing. On My Oracle Support, I can’t find anything on the inner workings. Even in the ‘Expert Oracle Exadata’ book (which I still regard as the best source of Exadata related information) there is no material on the mechanics of smartscans.

I’ve written a couple of articles on smartscans, one of which already lays some groundwork: it describes the phases I could see with the information available at that time: oss_open, oss_ioctl followed by oss_wait, and oss_cread followed by oss_wait. This is actually a summary of a smartscan, but a very brief one. In that article I described that a smartscan can only happen with a full segment scan (kdstf* functions, Oracle’s fast full scan routines) and only if the database chooses to use direct path access (kcbld* functions, direct path loader), which is mandatory to get smartscans.

The following investigation is done on an Exadata X2-2 Quarter rack, with Image version: 12.1.1.1.0.131219, and database version 12.1.0.1.3.

In order to get a better understanding of smartscans, we can use Oracle’s new debugging syntax. The part we are going to look at is called ‘KXD’:

SYS@db12c2 AS SYSDBA> oradebug doc component kxd

  KXD			       Exadata specific Kernel modules (kxd)
    KXDAM		       Exadata Disk Auto Manage (kxdam)
    KCFIS		       Exadata Predicate Push (kcfis)
    NSMTIO		       Trace Non Smart I/O (nsmtio)
    KXDBIO		       Exadata Block level Intelligent Operations (kxdbio)
    KXDRS		       Exadata Resilvering Layer (kxdrs)
    KXDOFL		       Exadata Offload (kxdofl)
    KXDMISC		       Exadata Misc (kxdmisc)

In order to trace the Exadata (database) kernel modules, along with regular SQL tracing with waits (to understand when something is happening), use the following events:

FRITS@db12c2 > alter session set events 'trace[kxd.*]';

Session altered.

FRITS@db12c2 > alter session set events 'sql_trace level 8';

Session altered.

Now execute a SQL statement that performs a smartscan. I’ve made sure table ‘T’ is big enough to invoke a direct path full table scan:

FRITS@db12c2 > select count(*) from t;

  COUNT(*)
----------
   1000000

Now take a peek in the trace file! The first part is normal looking:

PARSING IN CURSOR #139755290955216 len=22 dep=0 uid=201 oct=3 lid=201 tim=1593707470857 hv=2763161912 ad='224d632b8' sqlid='cyzznbykb509s'
select count(*) from t
END OF STMT
PARSE #139755290955216:c=2000,e=2417,p=0,cr=1,cu=0,mis=1,r=0,dep=0,og=1,plh=2966233522,tim=1593707470856
EXEC #139755290955216:c=0,e=25,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=1593707470920
WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14 tim=1593707470968
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550 2=131242 0=2 obj#=14 tim=1593707471374
WAIT #139755290955216: nam='reliable message' ela= 1035 channel context=10126085744 channel handle=10164799536 broadcast message=10179010104 obj#=14 tim=1593707472530
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550 2=131242 0=1 obj#=14 tim=1593707472684
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 101 name|mode=1263468545 2=131242 0=2 obj#=14 tim=1593707472829

We see the parsing of the simple select statement and its execution, which yields a wait that is always there (the SQL*Net message wait), followed by some ‘enq: KO - fast object checkpoint’ waits. These indicate an object checkpoint, which is a sign of a direct path read.
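As a rough aid for reading such a trace, the WAIT lines can be tallied per event. The following is a minimal Python sketch and not part of the original investigation; the function name `summarise_waits` and the regular expression are assumptions based only on the WAIT lines shown above.

```python
import re
from collections import defaultdict

# Assumed layout of a WAIT line in an extended SQL trace (sql_trace level 8):
#   WAIT #<cursor>: nam='<event>' ela= <microseconds> ...
WAIT_RE = re.compile(r"^WAIT #\d+: nam='(?P<event>[^']+)' ela= *(?P<ela>\d+)")

def summarise_waits(lines):
    """Return {event name: (number of waits, total elapsed microseconds)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            entry = totals[match.group("event")]
            entry[0] += 1
            entry[1] += int(match.group("ela"))
    return {event: tuple(values) for event, values in totals.items()}

# Three of the WAIT lines shown above, shortened after the first parameter.
sample = [
    "WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232",
    "WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550",
    "WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550",
]

summary = summarise_waits(sample)
for event, (count, ela) in summary.items():
    print(f"{event}: {count} waits, {ela} microseconds")
```

Feeding the complete trace file to the same function (for example via `open(path)`) would also pick up the ‘cell smart table scan’ waits that appear later in the trace.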

The next part is interesting, because this is the smartscan-specific tracing:

Caching: Global context initialized 0x7f1b50ca6d20
kcfis_alloc_so 0x24d8994d8
In kcfis initialize: new init: app_state: 0x7f1b50ca67b0 app_type: 1
kcfis rcv update : op: 7 val: 1 so_numses 1 ovhdmem 0 mdmem 0 bufmem 0
kcfis_reinitialize: initializing queues
Set work des: global_ctx: 0x7f1b50ca6d20 app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
Automem enabled: app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
No match found for mmwds. Allocated wds 0x7f1b50ca6768, mmwds 0x22446a1a8
Cache version is 1 start cache version is 1
kcfis rcv update : op: 1 val: 1496 so_numses 1 ovhdmem 0 mdmem 1496 bufmem 0
kcfis rcv update : op: 1 val: 68 so_numses 1 ovhdmem 0 mdmem 1564 bufmem 0
oss_state->oss_context is 0x109d2db0
kcfis rcv update : op: 3 val: 69656 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 0
kcfis_initialize done

What we see here is the kcfis layer initialising memory. It’s interesting to see where the allocations are done. In general, on this system, the allocations at 0x7f1bxxxxxxxx are in the PGA heap, kcfis_alloc_so/0x24d8994d8 is in the SGA, and mmwds/0x22446a1a8 is in the SGA too.
This information can be obtained by dumping heaps. Dumping the heaps at level 7 will show the SGA, session, PGA, call and UGA heaps. Another way to get insight into a memory location is using Tanel Poder’s fcha script (Find Chunk Address). Please mind that if you are reading the kxd trace file and want to look up the addresses from a second window in a sqlplus / as sysdba session, you will be able to see the SGA chunks, but probably not the PGA chunks, because these are private to the traced session.

Let’s get on to the next chunk of trace lines:

In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=(nil)
kcfis_translate: source: 1
kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 reqid=1 cver=1 source=1
OSSIPC:SKGXP:[109be210.0]{0}: (25797 <- 13487)SKGXPDOAINVALCON: connection 0x109c8c30 admno 0x5ac93f1d scoono 0x321bdad5 acconn 0x63cd0678 getting closed. inactive: 0
OSSIPC:SKGXP:[109be210.9]{obj}: SKGXPCNH: 0x109c8390 SKGXPCON_OPEN (2) sconno 321bdad4 accono 1206dea3 admno 75f4dfcd ospid 13487 ANT
OSSIPC:SKGXP:[109be210.10]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.11]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.12]{obj}:        SSKGXPT 0x109c83d0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 22774 lerr 0
OSSIPC:SKGXP:[109be210.13]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.14]{obj}:        SSKGXPT 0x109c84a0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 51931 lerr 0
OSSIPC:SKGXP:[109be210.15]{obj}:   next seqno 32768 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
OSSIPC:SKGXP:[109be210.16]{obj}: SKGXPCNH: 0x109c8c30 SKGXPCON_CLOSED (1) sconno 321bdad5 accono 63cd0678 admno 7b54ac1c ospid 13487 ANT
OSSIPC:SKGXP:[109be210.17]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.18]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.19]{obj}:        SSKGXPT 0x109c8c70 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 29832 lerr 0
OSSIPC:SKGXP:[109be210.20]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.21]{obj}:        SSKGXPT 0x109c8d40 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 65116 lerr 0
OSSIPC:SKGXP:[109be210.22]{obj}:   next seqno 32765 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1
kcfis_get_disk_for_translation: Appliance 192.168.12.8/ does not exist
kcfis rcv update : op: 5 val: 8192 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 8192
kcfis_open_appliance: 
throttle: initialized for appliance 0x7f1b5078b3a8
kcfis_find_appliance_fd: appliance fd not found. appliance 192.168.12.8/ Cached appliance fd count 0
kcfis_open_appliance_fd: 0x7f1b5078b3a8
WAIT #139755290955216: nam='cell smart table scan' ela= 120 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707489569
Predicate device intelligent IO opened. fd 5
kcfis_init_appliance_fd: appliance fd 0x7f1b50da2b88 initialized for appliance 0x7f1b5078b3a8
kcfis_add_disk: Adding disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 under appliance = 0x7f1b5078b3a8
initialize disk for disk o/192.168.12.8/DATA_CD_06_enkcel04
kcfis_initialize_disk_fd: Disk initialized. appliance: 192.168.12.8/ disk: o/192.168.12.8/DATA_CD_06_enkcel04 fd: 4 disknumber: 1 incarnation: 6 prev_disknumber: 0 num_inits: 1 init_cache_ver: 1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188499, numblks = 13, disk_handle = 4, diskoffs = 513723752448, len = 106496, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a000 reqid=1 cver=1 source=1
Default: calc numbufs mem 1048576
Final: calc numbufs mem 1048576 buflen 1048576
Num buffers: 1 buf per appliance: 1 num active appliance: 1
Appliance 0x7f1b5078b3a8 active. Active count 1

There’s a lot to see here. The next step in doing a smartscan is the translation of the data dictionary information on the segment to be smartscanned to cell server and grid disk extents. The data dictionary information is shown in line 3 (‘kcfis_get_new_request’). In line 4 (‘kcfis_get_translation:before’) the session tries to translate the data dictionary information to cell server and grid disk. Because the kcfis context has just been initialised, there is no cell-related information yet. For that reason, the information is requested from the cell server (the OSSIPC:SKGXP lines). Please mind this request will send all disk-related information to the kcfis context of the process.

With this information, the process can make the translation, as can be seen in line 20 (‘kcfis_get_translation:after’). It shows where the extent is located in the well-known Exadata notation ‘o/cell ip/grid disk name’, together with offset and chunk length. Next the disk needs to be initialised (a disk is a combination of cell server or ‘appliance’ and grid disk) and given a hash value, as is indicated by ‘kcfis_get_disk_for_translation’ in line 21. However, in order for a disk to be initialised, the cell server or appliance must be initialised too. That is what the process is indicating in line 21: ‘Appliance 192.168.12.8/ does not exist’. The appliance (cell server) is then initialised/opened, which is what is shown in lines 23-29. The wait here is the initialisation of a connection with the cell server, waiting for an acknowledgement.

Now that the appliance is initialised, the disk is initialised, as indicated in lines 30-32. Line 33 shows that the translation which was started earlier has now finally been done. Lines 34-37 show something about the buffering, which seems to be arranged per appliance and is 1MB.

Now that the process has initialised the appliance and the disk, the next translation is done. The translations are done per extent of the segment, and the disk depends on the placement of the extent. Please mind the maximum size of the allocation depends on the AU (allocation unit) size of ASM, which is set to 4MB by default with Exadata. If a non-initialised appliance is encountered, it is initialised and opened, and if a non-initialised disk is encountered, it is initialised.

The translation looks like this if everything is initialised:

kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 reqid=2 cver=1 source=1
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188513, numblks = 15, disk_handle = 4, diskoffs = 513723867136, len = 122880, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a220 reqid=2 cver=1 source=1

Another thing which is important to notice is the ‘reqid’, which obviously means ‘request id’. This process is repeated until the complete segment is translated into requests.
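To make the translation step a bit more tangible, the ‘kcfis_get_translation:after’ lines can be pulled apart into cell, grid disk, offset and length. This is a minimal Python sketch under stated assumptions: the regular expression is derived only from the two translation lines shown above, the helper name `parse_translation` is hypothetical, and the block size of 8192 is taken from the ‘blksz: 8192’ field that appears in the kcfis_oss_block_verify lines of this trace.

```python
import re

BLOCK_SIZE = 8192  # assumption: 'blksz: 8192' as reported elsewhere in the trace

# Assumed layout of a 'kcfis_get_translation:after' line.
TRANSLATION_RE = re.compile(
    r"Translation \(disk,off,len of \((?P<fob>0x[0-9a-f]+), (?P<startblk>\d+), (?P<blkcnt>\d+)\) "
    r"to \(o/(?P<cell>[^/]+)/(?P<griddisk>[^,]+), (?P<offset>\d+), (?P<length>\d+)\)"
    r".*reqid=(?P<reqid>\d+)"
)

def parse_translation(line):
    """Return the translation fields as a dict, or None if the line does not match."""
    match = TRANSLATION_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("startblk", "blkcnt", "offset", "length", "reqid"):
        fields[key] = int(fields[key])
    return fields

# The two translation lines from the trace snippets above.
trace_lines = [
    "kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1",
    "kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1",
]

translations = [parse_translation(line) for line in trace_lines]
for t in translations:
    # The length on the grid disk should equal the block count times the block size.
    consistent = t["blkcnt"] * BLOCK_SIZE == t["length"]
    print(t["reqid"], t["cell"], t["griddisk"], t["offset"], t["length"], consistent)
```

For both requests the length on the grid disk equals the number of blocks multiplied by the block size (13 x 8192 = 106496 and 15 x 8192 = 122880), which supports reading ‘len’ as the extent size in bytes.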

Once the requests are translated, the next step in the smartscan is to send (‘push’) the requests to the appliances. This apparently is called a ‘payload map’.

kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160
kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d0a000
Pushing request : disknumber = 1, offset = 513723752448, len = 106496 rdba: 55714451 version 0 reqid=1 cver=1
disk=0x7f1b507908e0 state=1

The first line is the start of the pushing of the payload maps. The next line shows a specific appliance being chosen. What is shown next is a line showing ‘disk’ and the hash value of the disk. At the start of a maps push, the disk lines are followed by two lines saying ‘trans_req_element’ and ‘Pushing request’. These two lines probably are some kind of state object for the request, and the actual pushing of the request. Here the request id which we saw in the translation phase shows up again.

This is repeated, until some of the disk lines are starting to get followed immediately by another disk line:

disk=0x7f1b50cfdf80 state=1
trans_req_element = 0x7f1b50d12c40
Pushing request : disknumber = 7, offset = 513817985024, len = 1032192 rdba: 55716738 version 0 reqid=67 cver=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
trans_req_element = 0x7f1b50d2ffa0
Pushing request : disknumber = 9, offset = 513890353152, len = 4161536 rdba: 55729668 version 0 reqid=174 cver=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1

Further down in the trace file, the trans_req_element and Pushing request lines are becoming scarce:

disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d31920
Pushing request : disknumber = 1, offset = 498446925824, len = 4161536 rdba: 41208836 version 0 reqid=186 cver=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1

Inspection of these lines shows that the process goes through a strict sequence of the disks of that appliance, and picks up one request per disk which (obviously) belongs to that disk. If the requests are not evenly divided between the disks, some disks will have all their requests already pushed, while other disks still need additional requests. In that case, the disks which already have their requests pushed will not get a request, so there is no trans_req_element/Pushing request combination for them. The process goes through this until all the requests for that appliance are pushed.
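The observed pattern can be mimicked with a small simulation. This is only a sketch of the behaviour as it appears in the trace, not Oracle’s actual implementation: the function `push_order` is hypothetical, the reqids are taken from the ‘Pushing request’ lines above, and the assumption that every pass visits all disks (including the ones without remaining requests) is an interpretation of the trace.

```python
from collections import deque

def push_order(requests_per_disk):
    """Simulate passes over the disks of one appliance.

    requests_per_disk maps a disk label to the reqids queued for that disk.
    Each pass visits every disk in a fixed order and pushes at most one
    request per disk; a disk without remaining requests yields None, which
    corresponds to a bare 'disk=' line in the trace.
    """
    queues = {disk: deque(reqids) for disk, reqids in requests_per_disk.items()}
    visits = []
    while any(queues.values()):  # stop once every queue is empty
        for disk, queue in queues.items():
            reqid = queue.popleft() if queue else None
            visits.append((disk, reqid))
    return visits

# Illustrative distribution: disk 1 has two requests, disks 7 and 9 have one each.
visits = push_order({"disk 1": [1, 186], "disk 7": [67], "disk 9": [174]})
for disk, reqid in visits:
    if reqid is None:
        print(f"{disk}: nothing left to push")
    else:
        print(f"{disk}: pushing reqid={reqid}")
```

In the second pass only disk 1 still has a request, so the other disks show up without a push, similar to the runs of consecutive ‘disk=’ lines further down in the trace.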

After the push of all the requests for that appliance, the following is happening:

kcfis_create_maps_payload. alloc_len 4088 num maps 55
throttle: mappayload: maps_to_push 7 iosize_being_pushed 150953984
kcfis_metadata_payload_len: app_state 0x7f1b50ca67b0 appliance 0x7f1b5078b3a8 payload_len 5968 payload_hdr_len 96 sessiondata_payload_len 144
metadata_payload_len 1536 fmetadata_payload_len 100 maps_len 3992 exthdr_len 8 planpayload_len 48 oflgrppayload_plen 40
kcfis_create_metadata_payload. appliance 0x7f1b5078b3a8 payload 0x7f1b50da3190 payload_memlen 5968 maps 0x7f1b50d07fe0 mapslen 3992
kcfis_create_metadata_payload: pushing sessiondata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing capability payload: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: dop: 1kcfis_create_metadata_payload: pushing metadata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing fast metadata: appliance 0x7f1b5078b3a8
kcfis_push: pushing metadata to appliance 0x7f1b5078b3a8. metadata 0x7f1b50da3190
kcfis_issue_ioctl: payload_type 1
WAIT #139755290955216: nam='cell smart table scan' ela= 178 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707571673
Ioctl completed. Payload type 1
Ioctl quarantine response 1 for appliance 0x7f1b5078b3a8
appliance 0x7f1b5078b3a8 : cellsrv pid: 13487: predicate: /box/predicate735745

The first line shows 55 maps have been pushed to the appliance. The other lines show various memory locations which are needed for gathering the reads which will be sent back by the appliances. Another thing which seems important is line 11, which issues an ioctl (I/O control) request to the appliance and waits for acknowledgement. Mind the wait is always ‘cell smart table scan’. Line 15 shows this request gets a predicate, which is ‘/box/predicate735745’.

This is repeated for every appliance.

Then the next thing happens:

kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
throttle: allowing map push: appliance 0x7f1b5078b3a8, maps_to_push 7
disk=0x7f1b50c953f0 state=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
kcfis_create_maps_payload. alloc_len 200 num maps 0

The maps payload push is done again for all the appliances, without any request being pushed. The last line confirms that no maps/requests have been pushed: num maps 0. However, there is one line which hasn’t been there before: line 2, ‘throttle’. The important part of this line is ‘allowing map push’; the previous throttle line during ‘kcfis_create_maps_payload’ had the remark ‘mappayload’.

This means that at this point the physical extents to be scanned on the appliances (cell servers) have been identified and translated to appliance, grid disk, offset and size, and the requests for these extents have been sent to the appliances. The last snippet actually means that the appliance is notified to start preparing for sending results back.

After the appliances have been notified, memory is initialised again.

Default: calc numbufs mem 4194304
Final: calc numbufs mem 4194304 buflen 1048576
Alloc buffer: target_freebufs 4 allocated_freebufs 0
Get additional mem: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768
Starting work area: app_state: 0x7f1b50ca67b0 wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Started work area: wds: 0x7f1b50ca6768 mmwds 0x22446a1a8
Get additional mem for pga_aggregate_target: max 4195552 min 1048888, wds 0x7f1b50ca6768 mmwds 0x22446a1a8
cur size 0
Change req: expected size 4196352 cur size 0 max_mem (KB) 4098
Max memory allocation ok: max 4195552, expected 4196352, cur 0
Memlen allowed 4195552 io_buflen 1048576 chunk_len 1048888
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 4212028
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 4244796
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 4212028
Calling oss_cread: appliance 0x7f1b5078b3a8 app_buffer: 0x7f1b50659000 databuf: 0x7f1b50559000 buflen: 1048576 (posted to 192.168.12.8/)
appliance 0x7f1b5078b3a8 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50c94e78 app_buffer: 0x7f1b50549000 databuf: 0x7f1b50449000 buflen: 1048576 (posted to 192.168.12.11/)
appliance 0x7f1b50c94e78 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d059a0 app_buffer: 0x7f1b50439000 databuf: 0x7f1b50339000 buflen: 1048576 (posted to 192.168.12.9/)
appliance 0x7f1b50d059a0 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d02948 app_buffer: 0x7f1b50a7a000 databuf: 0x7f1b5097a000 buflen: 1048576 (posted to 192.168.12.10/)
appliance 0x7f1b50d02948 total creads 1 new creads 1 read seqno 0 pending reads 0
kcfis wait: buf: 0x7f1b50549000 app_state: 0x7f1b50ca67b0 err: (0) Success
kcfis wait: buf: 0x7f1b50659000 app_state: 0x7f1b50ca67b0 err: (0) Success
WAIT #139755290955216: nam='cell smart table scan' ela= 59 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707578411
kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160

First memory areas are initialised (lines 1-24), then we see lines showing ‘Calling oss_cread’. The oss_cread call is the call to the appliances to start sending a result set back. Please mind that despite the calls addressing specific extents, this is a smartscan, so result sets are sent back instead of Oracle blocks. Also, since this is Exadata using the iDB/RDS protocol over InfiniBand, the appliances can use RDMA to send the results back, which means the cells can fill the server process’ memory directly.
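The numbers in the memory initialisation lines can be cross-checked with a little arithmetic. The following Python sketch uses only values from the trace above; the helper name `total_memsize` is hypothetical, and the reading that the four buffers correspond to the four appliances is an assumption based on ‘num-appliances 4’ and ‘buf per appliance: 1’.

```python
import re

# Assumed layout of the 'incr_kcfis_mem_wds' lines.
MEM_RE = re.compile(r"incr_kcfis_mem_wds: .* size (?P<size>\d+) refcnt \d+ memsize (?P<memsize>\d+)")

def total_memsize(lines):
    """Add up the 'size' fields and check them against the reported running 'memsize'."""
    running = 0
    for line in lines:
        match = MEM_RE.search(line)
        if match is None:
            continue
        running += int(match.group("size"))
        if running != int(match.group("memsize")):
            raise ValueError(f"running total {running} differs from trace: {line}")
    return running

# The four incr_kcfis_mem_wds lines from the snippet above.
mem_lines = [
    "incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007",
    "incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014",
    "incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021",
    "incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 4212028",
]

total = total_memsize(mem_lines)
buffers_mem = 4 * 1048576  # assumption: one 1 MB buffer (buflen 1048576) per appliance

print("work area size:", total)        # compare with 'Set workarea size ... size: 4212028'
print("buffer memory :", buffers_mem)  # compare with 'Default: calc numbufs mem 4194304'
```

The running total ends at the value reported by ‘Set workarea size’, and four 1 MB buffers add up to the ‘calc numbufs mem’ figure, so each allocated chunk carries some overhead on top of the 1 MB I/O buffer.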

After oss_cread has been called, the ‘kcfis_create_maps_payload’ routine (shown in the trace snippet before the last one) is executed for every appliance, apparently to indicate that all the disks are enabled, possibly to throttle activity, and/or to indicate that requests will be made from this session. This seems to be repeated for every roundtrip during the entire smartscan, for all the appliances that are still needed.

Whenever a result (result set) is ready to be processed, the following sequence happens:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

This shows the reaping of a result set returned by an appliance. Lines 5-6 show appliance-specific information. Line 7 shows important information: ‘nelem’ is the number of extents (called ‘elements’ in this context) for which results are returned. Of course ‘nelem’ means ‘number of elements’. In this case it’s 1 (a result set from a single element/extent). Line 15 shows the actual extent from which the result set came back, because the reqid is exposed; the reqid was defined during the translation phase. This snippet ends with a WAIT line (again: all the waits are ‘cell smart table scan’). I consider this a cyclic process: first the ‘kcfis_create_maps_payload’, then calling oss_cread for one or multiple cells, then a wait, or the above processing of results for one or multiple appliances, and a wait.

I’ve made this snippet as simple as possible; in real life result sets of multiple appliances could be reaped at once (in my case I had to remove a second result). The processing of the result sets is done in a few stages, so a result set is not processed per appliance; instead each processing stage is done for all the result sets of all the appliances.

Also, this example shows only one request in the reaped result. There can be multiple requests (reqids/extents) returned.

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50439000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50d059a0 read seqno 1 pending reads 1
appliance 0x7f1b50d059a0 total creads 0 re-adjusted: read seqno 1 pending reads 1
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50439000, databuf: 0x7f1b50339000, nelem: 2
Dump of memory from 0x00007F1B5033B000 to 0x00007F1B5033B040
7F1B5033B000 0000A23C 02748C02 5BB31713 0402099A  [<.....t....[....]
7F1B5033B010 00988672 001E0001 0000007E 0000007E  [r.......~...~...]
7F1B5033B020 0000007E 00000000 00000000 00000000  [~...............]
7F1B5033B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50339080
kcfis_validate_translation: request 0x7f1b50d0b980
Req completed is : err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1 block_id=41192450
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b5033b000 afn 9 rdba 41192450 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 41192450 good: 1
Request 0x7f1b50d0b980 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b503390c8
kcfis_validate_translation: request 0x7f1b50d114e0
Req completed is : err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1 block_id=53758466
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b50342e98 afn 12 rdba 53758466 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 53758466 good: 1
Request 0x7f1b50d114e0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1
memptr (nil) len 0, blockid = 41192450
kcfis_get_next_data: elem = 1, err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1
memptr 0x7f1b5033b000 len 32408, blockid = 53758466
kcfis_get_next_data: dptr: 0x7f1b5033b000 len: 64816 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50439000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50439000
WAIT #139755290955216: nam='cell smart table scan' ela= 10 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707616790

(example with ‘nelem’=2)
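The ‘Req completed is’ lines of such a reaped buffer can be parsed to see how the elements are laid out in the returned buffer and how little data comes back compared to the size of the extents. This is a minimal Python sketch; the regular expression is an assumption derived from the lines shown above, and the helper name `parse_completed` is hypothetical.

```python
import re

# Assumed layout of a 'Req completed is' line.
REQ_RE = re.compile(
    r"Req completed is : err = (?P<err>\d+), disknumber = (?P<disknumber>\d+), "
    r"off = (?P<off>\d+), len = (?P<len>\d+) data_len = (?P<data_len>\d+) "
    r"bufoff = (?P<bufoff>\d+) .*reqid=(?P<reqid>\d+)"
)

def parse_completed(lines):
    """Return one dict of integer fields per 'Req completed is' line."""
    completed = []
    for line in lines:
        match = REQ_RE.search(line)
        if match:
            completed.append({key: int(value) for key, value in match.groupdict().items()})
    return completed

# The two completed requests from the 'nelem: 2' example above.
reaped = [
    "Req completed is : err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1 block_id=41192450",
    "Req completed is : err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1 block_id=53758466",
]

requests = parse_completed(reaped)
returned_bytes = sum(r["data_len"] for r in requests)
extent_bytes = sum(r["len"] for r in requests)

# The second element starts right after the first one in the buffer.
contiguous = requests[0]["bufoff"] + requests[0]["data_len"] == requests[1]["bufoff"]

print("reqids          :", [r["reqid"] for r in requests])
print("returned bytes  :", returned_bytes)
print("extent bytes    :", extent_bytes)
print("contiguous data :", contiguous)
```

The sum of the two data_len values (64816) is the same number that the final ‘kcfis_get_next_data: dptr’ line reports as ‘len’, and the second bufoff (40600) equals the first bufoff plus the first data_len, which suggests the results of the elements are packed back to back in the buffer.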

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b5078b3a8 read seqno 1 pending reads 0
appliance 0x7f1b5078b3a8 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 1
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03522293 5BB316ED 0402099A  [<...."R....[....]
7F1B5055B010 009849CB 001E0001 0000000D 0000000D  [.I..............]
7F1B5055B020 0000000D 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d0a000
Req completed is : err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1 block_id=55714451
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55714451 dlen: 3480 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 55714451 good: 1
Request 0x7f1b50d0a000 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1
memptr (nil) len 0, blockid = 55714451
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 3480 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

(example of result sets returned by two appliances, for which the returned results are processed)

Also, when the extent to be scanned is larger, the result set is processed in multiple steps:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 2 pending reads 1
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 2 pending reads 1
throttle: received: maps_to_push 0 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 3
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03523080 5BB3178D 0402099A  [<....0R....[....]
7F1B5055B010 0098FBC9 001E0001 00000080 00000080  [................]
7F1B5055B020 00000080 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d145c0
Req completed is : err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1 block_id=55718016
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_validate_translation: REQ3: splitting from bottom
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55718016 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 55718016 good: 1
Request 0x7f1b50d145c0 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b505590c8
kcfis_validate_translation: request 0x7f1b50d13d40
Req completed is : err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1 block_id=41195392
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_oss_block_verify: bp: 0x7f1b50563098 afn 9 rdba 41195392 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 41195392 good: 1
Request 0x7f1b50d13d40 done
kcfis_process_completed_buffer: 2 elem: 0x7f1b50559110
kcfis_validate_translation: request 0x7f1b50d13f60
Req completed is : err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1 block_id=59912064
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5056b130 afn 14 rdba 59912064 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 59912064 good: 1
Request 0x7f1b50d13f60 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1
memptr (nil) len 0, blockid = 55718016
kcfis_get_next_data: elem = 1, err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1
memptr 0x7f1b5055b000 len 32920, blockid = 41195392
kcfis_get_next_data: elem = 2, err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1
memptr 0x7f1b5055b000 len 65840, blockid = 59912064
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 98760 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707638819

(example with nelem=3, with bigger extents, for which the result is split)
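The bufoff values in the nelem=3 example look like they follow directly from the data_len values: each element seems to start where the previous one ends. Below is a minimal sketch that checks this against the numbers in the trace. That the payloads are packed back to back after an initial 8192-byte offset is an assumption drawn from these three lines only, not documented behaviour.

```python
# Values copied from the three "Req completed is" lines in the trace above.
elements = [
    {"reqid": 79, "data_len": 32920, "bufoff": 8192},
    {"reqid": 75, "data_len": 32920, "bufoff": 41112},
    {"reqid": 76, "data_len": 32920, "bufoff": 74032},
]


def expected_offsets(elems, first_offset=8192):
    """Assumption: payloads are packed back to back, starting at first_offset."""
    offsets = []
    position = first_offset
    for elem in elems:
        offsets.append(position)
        position += elem["data_len"]
    return offsets


offsets = expected_offsets(elements)
total_len = sum(e["data_len"] for e in elements)

print(offsets)    # [8192, 41112, 74032], the bufoff values in the trace
print(total_len)  # 98760, the len reported by the final kcfis_get_next_data line
```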

Once the result has been reaped for a certain appliance, a new oss_cread call must be issued before another result can be pushed to the server process. Requests (reqid, or extents) are not returned in order.
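To see the completion order for yourself, the reqid values can be pulled out of the "Req completed is" lines of a trace. A small sketch follows; the function name and the regular expression are mine and only match the line layout shown in this post.

```python
import re

# \b keeps "reqid=" from matching inside "preqid=" on other trace lines.
REQ_RE = re.compile(r"Req completed is : .*?disknumber = (\d+).*?\breqid=(\d+)")


def completed_requests(trace_lines):
    """Return (disknumber, reqid) tuples in the order they appear in the trace."""
    found = []
    for line in trace_lines:
        match = REQ_RE.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


# The three completion lines from the nelem=3 example above.
sample = [
    "Req completed is : err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1 block_id=55718016",
    "Req completed is : err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1 block_id=41195392",
    "Req completed is : err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1 block_id=59912064",
]

reqids = [reqid for _, reqid in completed_requests(sample)]
print(reqids)                    # [79, 75, 76]
print(reqids == sorted(reqids))  # False: not returned in reqid order
```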

Once the requests are exhausted for a certain appliance, the appliance is excluded from the ‘kcfis_create_maps_payload’ procedure.

Once the entire smartscan is complete, the memory areas are cleaned up and the sessions are closed. Apparently a session, or some session state information, is kept per grid disk; these are all closed (the closing for one appliance is shown in this snippet):

kcfis_read DONE - ret NULL
kcfis_finalize: app_state 0x7f1b50ca67b0 permflags 0x9000 clnflags 0x1f
Appliance 0x7f1b5078b3a8 in-active. Active count 3
Appliance 0x7f1b50c94e78 in-active. Active count 2
Appliance 0x7f1b50d059a0 in-active. Active count 1
Appliance 0x7f1b50d02948 in-active. Active count 0
Caching: session at the end of scan 1
Work area gc: 0x7f1b50ca6768, app_state: 0x7f1b50ca67b0
Freeing read buffer chunk (GC) 0x7f1b50448eb0. count 4
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
Freeing read buffer chunk (GC) 0x7f1b50338eb0. count 3
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
Freeing read buffer chunk (GC) 0x7f1b50558eb0. count 2
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
Freeing read buffer chunk (GC) 0x7f1b50979eb0. count 1
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 0
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 32768
Freeing memory to top level heap: num_freed 4
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 0
Work area cleanup start: global_ctx: 0x7f1b50ca6d20
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 app_state: 0x7f1b50ca67b0 refcnt: 0
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 refcnt: 0
Closed work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Deallocating kcfis wds: 0x7f1b50ca6768
Cleanup work area for app_state: 0x7f1b50ca67b0
kcfis_finalize_cached_sessions: global_ctx 0x7f1b50ca6d20
Caching: in kcfis_finalize_cached_sessions global ctx 0x7f1b50ca6d20 total cached 1 cached in ctx 1
Deallocating session. app state 0x7f1b50ca67b0 num cached sessions 0
In kcfis_deallocate_session: app_state: 0x7f1b50ca67b0
KCFIS: [NSMTIO]:SQL for this (non)Smart I/O session is: 
select count(*) from t
kcfis_cache_appliance_fd: Cached appliance fd 0x7f1b50da2b88  for appliance 0x7f1b5078b3a8
Appliance 0x7f1b5078b3a8 was not active. Active count 0
kcfis_deallocate_session: Freeing disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507908e0 name = o/192.168.12.8/DATA_CD_01_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507903f0 name = o/192.168.12.8/DATA_CD_05_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50d04fc0 name = o/192.168.12.8/DATA_CD_10_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfe960 name = o/192.168.12.8/DATA_CD_03_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfdf80 name = o/192.168.12.8/DATA_CD_09_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfc930 name = o/192.168.12.8/DATA_CD_00_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc8118 name = o/192.168.12.8/DATA_CD_11_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc7ae0 name = o/192.168.12.8/DATA_CD_08_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc5ab0 name = o/192.168.12.8/DATA_CD_02_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc3a80 name = o/192.168.12.8/DATA_CD_04_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc30a0 name = o/192.168.12.8/DATA_CD_07_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis rcv update : op: 6 val: 8192 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 24576
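The memory bookkeeping in this snippet can be followed by reading the decr_kcfis_mem_wds lines: every freed read buffer chunk appears to lower memsize by exactly the reported size. A sketch that checks this on the four lines above is shown below. The interpretation of the size and memsize fields is my assumption, based only on their names and values.

```python
import re

DECR_RE = re.compile(r"decr_kcfis_mem_wds: .*? size (\d+) refcnt \d+ memsize (\d+)")


def memsize_steps(trace_lines):
    """Return (size, memsize) tuples for every decr_kcfis_mem_wds line."""
    steps = []
    for line in trace_lines:
        match = DECR_RE.search(line)
        if match:
            steps.append((int(match.group(1)), int(match.group(2))))
    return steps


# The four decr_kcfis_mem_wds lines from the trace above.
sample = [
    "decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021",
    "decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014",
    "decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007",
    "decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 0",
]

steps = memsize_steps(sample)
# Each step should equal the previous memsize minus the chunk size.
consistent = all(prev[1] - cur[0] == cur[1] for prev, cur in zip(steps, steps[1:]))
# Implied work area size before the first chunk was freed.
initial = steps[0][0] + steps[0][1]

print(consistent)  # True
print(initial)     # 4212028, i.e. four chunks of 1053007 bytes
```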

And the appliances are notified that the smartscan has ended:

kcfis_close_all_appliance_fds. Cache count 4
kcfis_close_appliance_fd: appliance_fd 0x7f1b50da2b88
WAIT #139755290955216: nam='cell smart table scan' ela= 241 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707978094
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5db00
WAIT #139755290955216: nam='cell smart table scan' ela= 237 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707978356
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5df58
WAIT #139755290955216: nam='cell smart table scan' ela= 280 cellhash#=674246789 p2=0 p3=0 obj#=61458 tim=1593707978668
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5e3b0
WAIT #139755290955216: nam='cell smart table scan' ela= 242 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707978935

Summary.
This blogpost tries to summarise the different steps in an Exadata smartscan. One of the most important things it shows is that everything that is done is covered by a single wait event (‘cell smart table scan’). In other words: profiling this wait event tells you little to nothing about what is actually happening, except that a smartscan is being processed. If you get high ‘cell smart table scan’ waits, either for all of the waits or just for a few, the only way to pinpoint what the waiting time is spent on is to run this trace again and see which step it is. You can, of course, determine whether the waiting happens for a specific cell/appliance by looking at cellhash# in the wait line.
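Grouping the waits by cellhash# is easy to script. The sketch below sums the elapsed time (ela, in microseconds) per cell for the ‘cell smart table scan’ lines of a trace file. The function name is mine, and the sample lines are the WAIT lines shown in this post.

```python
import re
from collections import defaultdict

WAIT_RE = re.compile(
    r"WAIT #\d+: nam='cell smart table scan' ela= (\d+) cellhash#=(\d+)"
)


def waits_per_cell(trace_lines):
    """Count waits and sum elapsed microseconds per cellhash#."""
    totals = defaultdict(lambda: {"waits": 0, "ela_us": 0})
    for line in trace_lines:
        match = WAIT_RE.search(line)
        if match:
            ela, cellhash = int(match.group(1)), match.group(2)
            totals[cellhash]["waits"] += 1
            totals[cellhash]["ela_us"] += ela
    return dict(totals)


sample = [
    "WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132",
    "WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707638819",
    "WAIT #139755290955216: nam='cell smart table scan' ela= 241 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707978094",
    "WAIT #139755290955216: nam='cell smart table scan' ela= 237 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707978356",
    "WAIT #139755290955216: nam='cell smart table scan' ela= 280 cellhash#=674246789 p2=0 p3=0 obj#=61458 tim=1593707978668",
    "WAIT #139755290955216: nam='cell smart table scan' ela= 242 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707978935",
]

totals = waits_per_cell(sample)
# Print the cells with the most waiting time first.
for cellhash, stats in sorted(totals.items(), key=lambda kv: kv[1]["ela_us"], reverse=True):
    print(cellhash, stats["waits"], stats["ela_us"])
```

This only tells you which cell the time was spent on; which step of the smartscan was waiting still has to come from the KXD trace lines around the WAIT line.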

Disclaimer:
Please mind that I have tried to combine my knowledge of Oracle and Exadata processing with the information the trace provided to build this description. If you encounter anything that is incorrect, please comment on this post, and I will try to get it fixed. No bits were harmed during testing.

This is the fourth post in a series of posts on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking at the “devio_stats” dump, a few other things are worth noticing. The lines with statistics do not carry a timestamp or any other time indicator; they contain only statistics. The lines are displayed per device, with the newest line on top. The dump header says it contains the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). If you count the number of lines which are (apparently) kept by the cell server, the count is 599, not 1800. Dividing the time by the number of samples (1800 / 599 ≈ 3) suggests the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind that the cell measures the difference between two points in time, which means the numbers are averages over a period of 3 seconds.
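Because the lines carry no timestamp, a rough time can only be reconstructed from the dump time and the sampling interval. A minimal sketch is shown below. It assumes that the newest line coincides with the dump timestamp and that the interval is a fixed 3 seconds; neither is confirmed by the dump itself.

```python
from datetime import datetime, timedelta

WINDOW_SECONDS = 1800  # from the "[IOSTAT] Dump IO device stats for the last 1800 seconds" header
SAMPLE_COUNT = 599     # number of statistics lines counted in the dump

interval = WINDOW_SECONDS / SAMPLE_COUNT
print(round(interval, 3))  # 3.005


def approximate_sample_times(dump_time, n_samples, interval_seconds=3):
    """Newest line is on top, so line i is roughly i * interval older than the dump."""
    return [dump_time - timedelta(seconds=i * interval_seconds) for i in range(n_samples)]


# Dump timestamp taken from the "Dump sequence #34" line, truncated to seconds.
dump_time = datetime(2013, 10, 28, 4, 57, 39)
times = approximate_sample_times(dump_time, 3)
for t in times:
    print(t.strftime("%H:%M:%S"))
```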

Two other things are listed in the statistics. The first is ‘trigerConfine’ (which probably should be “triggerConfine”), which is a mechanism for Oracle to manage underperforming disks.
The other is “DMWG”. At this moment, all I am aware of is that DMWG means “Disk Media Working Group”, and that it works with the concept of peers.

To get a better understanding of what the difference is between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. ServiceTime corresponds to svctm in iostat (or “storage wait”, as Bart calls it), and Latency corresponds to await (or “host wait”, as Bart calls it).
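As an illustration of the two columns, here is a sketch of how iostat-style svctm and await are derived from two /proc/diskstats snapshots. How the cell server computes its own ServiceTime and Latency is not shown in the dump, so treating them the same way is an assumption. The two snapshot lines are invented so that the deltas match the second statistics line of the dump above (7 reads, 38 writes, 0.111111).

```python
def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into the counters needed here."""
    fields = line.split()
    name = fields[2]
    values = [int(v) for v in fields[3:14]]
    return name, {
        "reads": values[0],        # reads completed
        "ms_reading": values[3],   # milliseconds spent reading
        "writes": values[4],       # writes completed
        "ms_writing": values[7],   # milliseconds spent writing
        "ms_doing_io": values[9],  # milliseconds the device was busy
    }


def svctm_and_await(before, after):
    """Return (svctm, await) in milliseconds per IO between two snapshots."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if ios == 0:
        return 0.0, 0.0
    busy_ms = after["ms_doing_io"] - before["ms_doing_io"]
    wait_ms = (after["ms_reading"] - before["ms_reading"]) + (
        after["ms_writing"] - before["ms_writing"]
    )
    return busy_ms / ios, wait_ms / ios


# Invented snapshots, 3 seconds apart: 7 reads, 38 writes, 5 ms busy, 5 ms waited.
_, before = parse_diskstats_line("8 0 sda 1000 0 8000 500 2000 0 16000 900 0 1200 1400")
_, after = parse_diskstats_line("8 0 sda 1007 0 8056 503 2038 0 16304 902 0 1205 1405")

svctm, await_ms = svctm_and_await(before, after)
print(round(svctm, 6), round(await_ms, 6))  # 0.111111 0.111111
```

On a busy device the two values diverge: await includes the time requests spend queued, while svctm only spreads the device busy time over the number of IOs.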

Recently we upgraded an Exadata to what is currently the latest version, 11.2.3.2.0. The Exadata software itself consists of an image for the storage servers (the storage servers are essentially re-imaged), and a set of updates for the database/computing nodes, including: firmware for ILOM (lights out adapter), BIOS, LSI RAID adapter, Infiniband adapter, Linux kernel, drivers, mandatory packages, to name some.

One of the exceptional things this upgrade does, is remove the hot-spare out of the RAID set on the database/compute nodes. This is documented in MOS note: 1468877.1, as ‘known issue 5: hotspare removed for compute nodes’. For some sites, this actually can be a good thing, if they are really tight on disk space on the compute nodes of Exadata. I must say that we have not encountered this situation. What this means, is that the actual HDD configuration on the compute node is left to the customer, instead of having one mandatory configuration (having 3 disks in a RAID-5 configuration, and one hot-spare).

So if you decide to use the former hot-spare disk as an active part of the RAID configuration, you are effectively trading availability for disk space. Please mind the RAID set itself already provides redundancy, even without the hot-spare!

On the other hand, I think in most configurations, it makes sense to convert the disk back to being hot-spare.

This is done in the following way:

a) Get an overview of the current disk configuration:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70

This shows the disk in slot number 3 being left “unconfigured”, but in “good” state (if a disk has gone bad because of errors, it will be removed from the RAID set, and will show up as “unconfigured(bad)”!).
This is the state in which the upgrade to 11.2.3.2.0 leaves your system.

Now let’s make the disk a hot-spare again!

b) Get the enclosure id:
/opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252

This means we know the enclosure id (252) and the slot number (3), which is the information needed for the MegaCli utility to revert the unconfigured disk to hot-spare again!

c) Revert the unconfigured disk back to hot-spare
/opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[252:3] -a0
Adapter: 0: Set Physical Drive at EnclId-252 SlotId-3 as Hot Spare Success.
Exit Code: 0x00

d) Check the disk configuration again:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Hotspare, Spun Up
Device Firmware Level: 0B70
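The check in steps a) and d) can be scripted if you have more than a few compute nodes. The sketch below parses the grep output shown above and reports the slots that are in the “Unconfigured(good)” state. The function name is mine, and the parsing assumes exactly the three-line layout per disk that is shown in this post.

```python
def parse_pdlist(output):
    """Map slot number to firmware state, from the grep'ed -pdlist output."""
    states = {}
    slot = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Slot Number:"):
            slot = int(line.split(":", 1)[1])
        elif line.startswith("Firmware state:") and slot is not None:
            states[slot] = line.split(":", 1)[1].strip()
    return states


# Output of step a), as left behind by the upgrade.
sample = """Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70
"""

states = parse_pdlist(sample)
unconfigured = [
    slot for slot, state in states.items()
    if state.lower().startswith("unconfigured(good)")
]
print(unconfigured)  # [3]
```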

It appears that the removal of the disk from the RAID set by the update to 11.2.3.2.0 generates an ASR message. At least it did at our site, despite this being an undocumented bug (7161023, ‘ASR generating false errors in relation to disks’) which is marked resolved in 11.2.3.1.0 (?). Most sites I encounter have ASR set up, but do not have all messages additionally sent to local, onsite monitoring. I want to stress it’s very important to have the ASR messages sent to your own monitoring too!

Oracle Support does not list all the specifications from an ASR message it has gotten. Instead, a Service Request is made with enough information for Oracle itself (!!). In our case, the exact error message was NOT specified, only ‘compute server hard disk predictive failure’ and the node name.

Where do you look on an Exadata for that information? The first logical point is the ASR daemon. I didn’t spend too much time on it, but it seems that it’s more a proxy for messages than a database. I wasn’t able to find useful information about the systems which were using this daemon.

What are the sources for ASR with an Exadata? These are:

Computing node:
– “compmon daemon” / Linux level monitoring
– ILOM
Storage node:
– “cell daemon” / Linux level monitoring
– ILOM

For the computing node, it’s quite easy to see if there are any detected failed devices from the viewpoint of the ILOM:
(please mind ipmitool -I open only works on the local system)
# ipmitool -I open sunoem cli "show /SP/logs/event/list Severity==(Major,Critical,Down)"
Connected. Use ^D to exit.
-> show /SP/logs/event/list Severity==(Major,Critical,Down)

ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------

-> Session closed
Disconnected

This shows that no messages with the severity Major, Critical or Down are in the event log of the ILOM. Please mind that logons to the ILOM have severity “Minor”. On most systems these make up the vast majority of the messages, and they are not of interest for this investigation. If you want to know if something has failed, there is an even simpler command:
# ipmitool -I open sunoem cli "show faulty"

For the “compmon daemon”, grep the processlist for “compmon”:
# ps -ef | grep compmon
root 12812 1 0 Oct22 ? 00:00:11 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server

The most important part here is the directory: /opt/oracle.cellos/compmon
If you navigate to that directory, you will see a number of “state files”: asrs.state, traps.state and disks.state.
The disks.state file lists the disk status (the firmware state) as shown in a).
The most important file for the ASR message investigation is the traps.state file. This file lists traps it has sent to ASR. In our case:
1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:
Yes, this is pasted correctly; the Physicaldisk, Make Model and Disk Serial Number information is missing. This has not been omitted for safety, it just is not listed.
So, the failure which was sent was HALRT-02008 in our case.
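If you want to pull the fault id out of traps.state with a script, something like the following sketch can be used. The field layout (sequence, timestamp, id, description, action, separated by “ ; ”) is inferred from the single line shown above, and the field names are hypothetical; the real file format may differ.

```python
import re
from datetime import datetime

# The traps.state line shown above, split over two string literals for readability.
TRAP_LINE = (
    "1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; "
    "Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; "
    "Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:"
)


def parse_trap(line):
    """Split one traps.state line into (hypothetically named) fields."""
    sequence, timestamp, trap_id, description, action = [
        part.strip() for part in line.split(" ; ")
    ]
    fault = re.search(r"Raised fault id: (\S+)", description)
    return {
        "sequence": int(sequence),
        "timestamp": datetime.strptime(timestamp, "%a %b %d %H:%M:%S %Y"),
        "trap_id": trap_id,
        "fault_id": fault.group(1) if fault else None,
        "action": action,
    }


trap = parse_trap(TRAP_LINE)
print(trap["timestamp"].isoformat(), trap["fault_id"])  # 2012-10-22T14:39:10 HALRT-02008
```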

For completeness: on the storage node, the ILOM layer can be investigated in the same way as described for the ILOM on the computing node. The Linux layer messages can be investigated with:
# cellcli -e list alerthistory
32 2012-10-17T02:00:27+02:00 info "HDD disk controller battery on disk contoller at adapter 0 is going into a learn cycle. This is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours. The disk controller cache might go into WriteThrough caching mode during the learn cycle. Disk write throughput might be temporarily lower during this time. The message is informational only, no action is required."
33 2012-10-22T11:43:21+02:00 info "Factory defaults restored for Adapter 0"
34 2012-10-22T11:43:23+02:00 info "Factory defaults restored for Adapter 0"
