Observing Oracle Exadata SmartScan internals, part 1

To see how Exadata smartscans are different, let’s first take a look, as brief as possible, at how Oracle full segment/multiblock reads have evolved:

a) Traditional multiblock reads, visible via the event ‘db file scattered read’
The essence is: Multiple adjacent blocks are read from disk, and put in the buffercache. Because every read is sequentially processed, IO latency is a performance penalty for every physical read. This works roughly this way: get a set of adjacent blocks from the segment header, fetch these blocks from disk, process these blocks, then get the next set of adjacent blocks, fetch these blocks from disk, process these blocks, etc.
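To make the loop above concrete, here is a minimal sketch in Python. It is a model only, not Oracle code: read_extent and process are hypothetical stand-ins, the extents are made up, and the sleep merely imitates IO latency. The point is that the process waits for each read to finish before it processes anything and before it asks for the next set of blocks.

```python
import time

def read_extent(start_block, count, latency_s=0.001):
    """Hypothetical stand-in for one blocking multiblock read."""
    time.sleep(latency_s)  # the process waits here for the whole IO
    return list(range(start_block, start_block + count))

def process(blocks):
    """Hypothetical stand-in for row processing; returns the blocks handled."""
    return len(blocks)

def scattered_read_scan(extents):
    """Read and process one set of adjacent blocks at a time."""
    handled = 0
    for start_block, count in extents:
        blocks = read_extent(start_block, count)  # latency is paid on every read
        handled += process(blocks)                # processing starts only afterwards
    return handled

# (first block, number of adjacent blocks); invented values for illustration
extents = [(0, 8), (8, 8), (128, 8)]
total_blocks = scattered_read_scan(extents)
```

With three sets of blocks the latency is paid three times, one after the other.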

b) Direct path multiblock reads, visible via the event ‘direct path read’
The essence is: Multiple IOs are issued asynchronously, one or more IOs are reaped and processed, after which the number of IOs is brought back to the number of IOs the process wants to keep in flight. Blocks are read into the process’ PGA (which means the IO result is not shared with other processes). Because of the asynchronous way of issuing multiple requests, the process does not suffer from the IO latency penalty of every single IO. This works roughly this way: get a set of adjacent blocks from the segment header, issue an asynchronous IO request for these, get a next set of adjacent blocks from the segment header, issue another asynchronous IO request, process one or more of the IO requests which are ready, issue new IO requests to replace the ones that were reaped, process one or more of the IO requests which are ready, etc. During processing, Oracle measures CPU and IO times, and can decide to add one or more concurrent requests to the two IOs which it tries to keep in flight.
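The "keep a number of IOs in flight" idea can be sketched as follows. This is a model under loud assumptions: Python threads stand in for asynchronous IO requests, read_extent is a hypothetical helper with a simulated latency, and the fixed target of two requests ignores the adaptive part described above. It is not how the Oracle executable implements it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def read_extent(start_block, count, latency_s=0.001):
    """Hypothetical stand-in for one IO request; the sleep imitates latency."""
    time.sleep(latency_s)
    return list(range(start_block, start_block + count))

def direct_path_scan(extents, in_flight_target=2):
    """Keep in_flight_target reads outstanding: reap, process, top up again."""
    pending = iter(extents)
    handled = 0
    max_in_flight = 0
    with ThreadPoolExecutor(max_workers=in_flight_target) as pool:
        in_flight = set()
        # Issue the initial requests without waiting for any of them.
        for start_block, count in pending:
            in_flight.add(pool.submit(read_extent, start_block, count))
            if len(in_flight) == in_flight_target:
                break
        while in_flight:
            max_in_flight = max(max_in_flight, len(in_flight))
            # Reap whatever is ready; the other requests stay outstanding.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for request in done:
                handled += len(request.result())  # "process" the blocks
                next_extent = next(pending, None)
                if next_extent is not None:       # replace the reaped request
                    in_flight.add(pool.submit(read_extent, *next_extent))
    return handled, max_in_flight

extents = [(i * 8, 8) for i in range(5)]  # invented extents for illustration
handled_blocks, max_in_flight = direct_path_scan(extents)
```

While one request is being processed, another is already on its way, so the latency of the individual IOs overlaps instead of adding up.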

Back to smartscans

It’s not very hard to understand that direct path multiblock reads can perform much better than traditional multiblock reads. Probably at this time you think: yes, I know, but what does this have to do with Exadata? This is all information about regular processing! Well, Exadata uses the regular Oracle database executable. This means that part of the codepath of smartscans is shared with the normal/non-Exadata Oracle database. Obviously, there is a part that is unique to Exadata.

This is best viewed with a backtrace of the call to submit an IO request. This is a full backtrace of the submit of an IO request for a full table scan on Linux, against a database on ASM:

io_submit
skgfqio
ksfd_skgfqio
ksfdgo
ksfdaio
kfk_ufs_async_io
kfk_submit_ufs_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

Of course the functions used internally in the executable are not documented. But it’s very useful to look at them to gain a better understanding of what is happening. First look at the function on line 24 of the backtrace (counting from the top), kcbgtcr (Kernel Cache Buffers GeT Consistent Read). This is the function to perform a logical IO. One line up, on line 23, is the function kcbldrget (Kernel Cache Buffers direct path LoaDeR GET). This function indicates that the execution chose the direct path read codepath. In fact, the kcbl prefixed functions are believed to belong to the Oracle direct path read codepath. After that, execution passes roughly through the ksfd, kfio, kfk and ksfd layers, and lastly skgfqio is executed, which performs the actual submit of an IO using io_submit().
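One way to make these layers visible is to group the frames by their name prefix. The sketch below does that in Python for the top 24 frames of the backtrace above. The list of prefixes is only what this post names, and like the layer names themselves it is a guess based on function names, not documented Oracle structure; anything else is labelled "other".

```python
from itertools import groupby

# Prefixes named in this post; grouping functions by prefix is an assumption.
LAYER_PREFIXES = ("kcbl", "ksfd", "kfio", "kfk", "skgf")

def layer_of(function_name):
    """Return the first known prefix the function name starts with."""
    for prefix in LAYER_PREFIXES:
        if function_name.startswith(prefix):
            return prefix
    return "other"

def layer_sequence(backtrace):
    """Collapse a backtrace (innermost call first) into the order of layers entered."""
    call_order = reversed(backtrace)  # backtraces are read from bottom to top
    return [layer for layer, _ in groupby(layer_of(f) for f in call_order)]

# Frames io_submit .. kcbgtcr of the non-Exadata backtrace, innermost first.
frames = """io_submit skgfqio ksfd_skgfqio ksfdgo ksfdaio kfk_ufs_async_io
kfk_submit_ufs_io kfk_submit_io kfk_io1 kfkRequest kfk_transitIO kfioSubmitIO
kfioRequestPriv kfioRequest ksfd_kfioRequest ksfd_osmgo ksfdgo ksfdaio kcflbi
kcbldio kcblrs kcblgt kcbldrget kcbgtcr""".split()

layers = layer_sequence(frames)
```

For these frames the sequence after kcbgtcr is kcbl, one unclassified frame (kcflbi), ksfd, kfio, kfk, ksfd again, skgf, and finally io_submit itself.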

Now let’s look at what the equivalent submit of an IO request looks like on Exadata with smartscan turned off:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_issue_read
ossdisk_read
oss_read
kfk_submit_one_oss_io
kfk_submit_oss_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

First locate the kcbgtcr function, which is on line 31 (forget about trying to find logic in the line numbers; backtraces are read from bottom to top, whilst the numbering logic is from top to bottom). One line up on number 30 is the function kcbldrget again. Okay, that looks the same. If we read the backtrace up, it’s easy to spot the same layers, in fact the same functions: ksfd, kfio up to the kfk layer.

In the kfk layer there is a slight difference, which is understandable: on line 7 of the non-Exadata backtrace we see the function kfk_submit_ufs_io, while on Exadata the function in the same position is kfk_submit_oss_io, on line 14. I think this deserves a little explanation. Exadata is the marketing name of the database machine, which internally was called ‘Sage’ in Oracle, and this name still surfaces sometimes, like in Exadata naming, or in patch descriptions. Quite probably OSS means ‘Oracle Sage Software’. So, this means that the process is aware it needs to read something from an Exadata storage server, and chooses a function that is meant to set that up.

The next two functions (kfk_ufs_async_io on line 6 and kfk_submit_one_oss_io on line 13) probably do logically the same thing, but are different because the infrastructure is different.

The next function up in both backtraces is where it gets really interesting, because now the code has to do something entirely different: on the non-Exadata system the ksfd layer is entered again, in order to get to the function skgfqio, which submits the IO request using the io_submit call. On the Exadata system, we see a call which I have not encountered outside of Exadata: oss_read. With the knowledge gained above, we can tell this quite probably is an Exadata specific call, which is part of an entire layer: oss. From the function names we can guess it prepares the IO request, and then issues it. Once the oss layer is crossed, we enter another layer: skgxp. The skgxp (System Kernel Generic inter-process Communication (Xfer?) Protocol) layer is NOT unique to Exadata, it’s the communication layer which is used by an Oracle RAC database for inter-process communication. So it seems like Oracle re-used the knowledge gained with RAC inter-process communication for the communication with the (Exadata) storage server.
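The comparison made by eye above can also be done mechanically: walk both backtraces from the outermost call inwards and stop at the first frame that differs. The Python sketch below does this for the upper parts of the two backtraces (everything from kfk_io1 down to main is identical in both, so it is left out for brevity). The function name split_at_divergence is my own, purely for illustration.

```python
def split_at_divergence(trace_a, trace_b):
    """Traces are innermost-first. Return (shared outer frames, rest of a, rest of b)."""
    shared = 0
    for frame_a, frame_b in zip(reversed(trace_a), reversed(trace_b)):
        if frame_a != frame_b:
            break
        shared += 1
    cut_a, cut_b = len(trace_a) - shared, len(trace_b) - shared
    return trace_a[cut_a:], trace_a[:cut_a], trace_b[:cut_b]

# Upper part of the non-Exadata backtrace, innermost call first.
non_exadata = """io_submit skgfqio ksfd_skgfqio ksfdgo ksfdaio kfk_ufs_async_io
kfk_submit_ufs_io kfk_submit_io kfk_io1""".split()

# Upper part of the Exadata backtrace with smartscan turned off.
exadata = """sskgxp_sndmsg skgxpfragsnd skgxp_send_next_fragment skgxpxmit
skgxpivsnd skgxpivrpc skgxpvrpc ossnet_issue_vrpc ossnet_queue_vrpc
ossdisk_issue_read ossdisk_read oss_read kfk_submit_one_oss_io
kfk_submit_oss_io kfk_submit_io kfk_io1""".split()

shared, only_non_exadata, only_exadata = split_at_divergence(non_exadata, exadata)
first_difference = (only_non_exadata[-1], only_exadata[-1])
```

Here first_difference holds the pair kfk_submit_ufs_io and kfk_submit_oss_io, the point where the two codepaths part ways.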

Can we now please get to smartscans?

True, we haven’t encountered a smartscan yet; what I showed was a backtrace of a regular direct path read on Exadata. The reason is to show the difference between a traditional system (meaning disks which are presented to the system as local disks) and an Exadata system, which has to use infiniband and has to fetch the information it needs from Exadata storage servers. This is what a smartscan read call backtrace looks like:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_cread
oss_cread
kcfis_read
kcbl_predpush_get
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010010kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

The first thing that struck me is that the number of functions decreased. On the other hand, this does not say much (you can make functions as long or as short as you wish). Also, a smartscan is done in a number of steps, some of which have already been done and are not visible in this backtrace. After the submit of a smartscan there are a number of further steps; of course reaping the submitted scan requests, but also verification of the reaped requests.

It’s now interesting to see how different the codepath looks. The kcbgtcr function is still present, at line 15. One line up there’s the kcbldrget function, which reveals the process chose the direct path read codepath during execution. If we go up one line we see a function in the kcbl layer, which is Exadata specific as far as I know: kcbl_predpush_get. Again: everything has been set up for doing a smartscan prior to the point where the process enters the point of this backtrace: submitting a request to an Exadata storage server. This means the process has the information needed (what information to ask from what storage server) prior to arriving at this point.

One layer up is a call to the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, then approximately the same calls in the oss layer, but there’s a difference: it’s oss_cread and ossdisk_cread instead of oss_read and ossdisk_read, which makes the call an Exadata request, instead of a request for database blocks. After the oss layer, there’s the skgxp layer again, which consists of exactly the same calls for both the non-smartscan and the smartscan.
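To check which parts of the two Exadata backtraces are really identical, the frames can be aligned as sequences. The sketch below uses Python’s difflib for that, on the frames from the top of each backtrace down to kcbgtcr. It only compares function names; it says nothing about what the functions do.

```python
from difflib import SequenceMatcher

# Shared top of both Exadata backtraces (skgxp and ossnet calls), innermost first.
SEND_FRAMES = """sskgxp_sndmsg skgxpfragsnd skgxp_send_next_fragment skgxpxmit
skgxpivsnd skgxpivrpc skgxpvrpc ossnet_issue_vrpc ossnet_queue_vrpc""".split()

# Smartscan turned off: frames down to kcbgtcr.
plain_read = SEND_FRAMES + """ossdisk_issue_read ossdisk_read oss_read
kfk_submit_one_oss_io kfk_submit_oss_io kfk_submit_io kfk_io1 kfkRequest
kfk_transitIO kfioSubmitIO kfioRequestPriv kfioRequest ksfd_kfioRequest
ksfd_osmgo ksfdgo ksfdaio kcflbi kcbldio kcblrs kcblgt kcbldrget
kcbgtcr""".split()

# Smartscan: frames down to kcbgtcr.
smart_scan = SEND_FRAMES + """ossdisk_cread oss_cread kcfis_read
kcbl_predpush_get kcbldrget kcbgtcr""".split()

matcher = SequenceMatcher(None, plain_read, smart_scan, autojunk=False)
# Runs of consecutive frames that occur in both backtraces, in order.
shared_runs = [plain_read[m.a:m.a + m.size]
               for m in matcher.get_matching_blocks() if m.size]

plain_frames = set(plain_read)
smartscan_only = [f for f in smart_scan if f not in plain_frames]
```

The two shared runs are the nine skgxp and ossnet frames at the top, and kcbldrget plus kcbgtcr at the bottom. The frames that only appear in the smartscan backtrace are ossdisk_cread, oss_cread, kcfis_read and kcbl_predpush_get.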

Conclusion

The purpose of this blogpost is to show the differences between submitting an IO request on a traditional system, on an Exadata system with smartscans turned off, and with smartscans turned on.

There is a lot of ground to cover on this. It’s not doable to cover this all in one blogpost. That’s the reason I try to pick a specific part, and work from that. Any comments are welcome.

From the backtraces it’s clearly visible that regular IOs are processed the same way on non-Exadata and Exadata; on Exadata the request is submitted via the skgxp layer to a storage server, instead of via io_submit. This makes it very probable that single block IOs are done the same way too, which means no ‘magic’ performance enhancement is possible: the processing is the same on Exadata as on non-Exadata, and the only difference is how the IO request is submitted.

If we look at the comparison between smartscan and non-smartscan requests, it becomes apparent there is something different happening. But the basic processing is the same, by which I mean the process is doing exactly the same as non-Exadata processes. From the smartscan backtrace it becomes visible that a process has to travel through the direct path code layer (kcbl) in order to get a smartscan, because otherwise it would be impossible to issue kcbl_predpush_get, which is the call for issuing a smartscan.

Disclaimer: I am not an Oracle employee. This information has all been obtained by profiling execution (using perf/nm/gdb). I have never seen any Oracle database and Exadata source code. There is a chance some assumptions are wrong.

5 comments
  1. Nice post, Frits… people would do well to study the software on their system deeply like you do. One can learn a lot for simple debugging and tracing (especially strace/truss). Now, about the post…

    If others examine their non-Exadata backtrace under the same conditions they might see different output. Yours appears to be non-Exadata but non-ASMLib as well?

    For what it’s worth I *can* tell you what OSS stands for but only god knows that Oracle’s lawyers consider disclosure of privileged information. So… I won’t say :-)

  2. Thank you Kevin!

    Indeed, the non-Exadata is a VM on my laptop, OL6u3, X86_64, Oracle 11.2.0.3 (no patches), ASM and udev for setting up the block devices.

  3. dba said:

    Hi Frits and Kevin

    You rocks guys!

    And Frits, regarding this post, it’s striking. I never had read something similar.

    Even that we can not have “privileged information” :(

    it sounds to me a great approach ;)

    I have no idea until now about oracle trace “mnemonics” and now I’m felling myself more inspired by your post to keep studying and researching (at Brazil \o/)…

    Thank you

    My best regards

    .:. Teracomp IT Consulting .:.
