How long does a logical IO take?
This is a question I have played with for a long time. There have been statements on logical IO performance ("logical IO is x times faster than physical IO"), but nobody could answer the question of what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name "session logical reads". However, if you divide the total time a query took by the number of logical reads, the resulting logical IO time is too high, because that assumes all the time the query took was spent doing logical IO, which obviously is not the case: there is time spent on parsing, possibly physical IO, and so on. For example, a query that runs for 10 seconds and does 1,000,000 logical reads would suggest 10 microseconds per logical read, even though a lot of those 10 seconds may have been spent elsewhere. Also, when doing that, you calculate an average, and averages are known to hide actual behaviour.
Luckily, Red Hat Enterprise Linux 7 and Oracle Linux 7 ship kernel version 3.10 as the stock Red Hat kernel. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain event (the probed event) starts; a return probe triggers an action when that event finishes. The Oracle UEK kernel version 3, which at the time of writing is at version 3.8, does not support this.
To my current knowledge, consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.
Please mind that all kernel function name translations, like kcbgtcr (kernel cache buffers get consistent read), are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.
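Because these translations are guesses, a quick sanity check is possible: count the calls to these functions for a session and compare the totals with that session's "consistent gets" and "db block gets" statistics. Below is a minimal sketch of such a cross-check (an illustration, not one of the measurement scripts used later in this post; it assumes the same oracle executable path and the same -x usage as those scripts):

#!/usr/local/bin/stap
#
# Sketch: count kcbgtcr and kcbgcur entries for the process given with -x,
# so the totals can be compared with the "consistent gets" and "db block gets"
# statistics of that session.
global cr_gets, cur_gets

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	if ( pid() == target() ) cr_gets++
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgcur") {
	if ( pid() == target() ) cur_gets++
}
probe end {
	printf("kcbgtcr (consistent read) calls: %d\n", cr_gets)
	printf("kcbgcur (current read) calls:    %d\n", cur_gets)
}

Run it the same way as the scripts below (attach it to a session with -x, execute a statement, press CTRL-c), and compare the printed totals with the session statistics.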
With the 3.10 kernel, we can create a small systemtap script to measure the time between entering and returning from the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself. This is the reason you see /usr/local/bin/stap in the shebang.
#!/usr/local/bin/stap

global latency

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])
}
Now make the script executable, and run it against a database session:
# ./lio.stap -x 3877
Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)
...
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 79542
>kcbgtcr
<kcbgtcr, latency(us): 4
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 13
>kcbgtcr
<kcbgtcr, latency(us): 1
...
This shows the function being executed by the database session. However, it seems the time spent in the kcbgtcr() function is not consistent.
In order to get a better overview, we can add a histogram of the kcbgtcr latencies:
#!/usr/local/bin/stap

global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])
}
probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")
		println(@hist_log(latency_histogram))
	}
}
Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:
...
==kcbgtcr latency==
 value |-------------------------------------------------- count
     0 |                                                        0
     1 |@@@@@@@@@@@@@@@@@@@@                                 5998
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   14284
     4 |                                                      268
     8 |                                                       58
    16 |                                                      137
    32 |                                                       25
    64 |                                                        1
   128 |                                                       15
   256 |                                                       69
   512 |                                                        1
  1024 |                                                        0
  2048 |                                                        0
  4096 |                                                        1
  8192 |                                                        5
 16384 |                                                       35
 32768 |                                                       39
 65536 |                                                        8
131072 |                                                        0
262144 |                                                        1
Okay, the majority takes 1 to 2 microseconds, but there are also quite a few executions that fall far outside these buckets, up to 262144 microseconds (262 milliseconds)!
What could cause these huge changes in logical IO time?
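Before tracing everything, one intermediate step could be to report only the slow executions. The following sketch (an illustration with an arbitrary threshold of 1000 microseconds, not a script used for the investigation in this post) prints just the kcbgtcr() calls that fall outside the 1 to 2 microsecond majority:

#!/usr/local/bin/stap
#
# Sketch: only print kcbgtcr() executions slower than an (arbitrary) threshold,
# so the outliers can be looked at in isolation.
global start

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	start[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	elapsed = gettimeofday_us() - start[pid()]
	if ( elapsed > 1000 )
		printf("slow kcbgtcr in pid %d: %d us\n", pid(), elapsed)
}

This tells you which calls are slow, but not why; for that, the call path around the slow calls needs to be made visible.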
At this point I am pretty much turning the squelch all the way down and making a lot of information visible (consider this a warning!). Here is the systemtap script I am using:
#!/usr/local/bin/stap
#
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
		latency[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
	}
	printf("\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiosq0") {
	printf("=PARSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiexe") {
	printf("=EXECUTE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opifch2") {
	printf("=FETCH\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiclo") {
	printf("=CLOSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	printf("=KSLWTBCTX\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	printf("=KSLWTECTX\n")
}
probe process("/lib64/libaio.so.1").function("io_submit") {
	printf("=IO_SUBMIT\n")
}
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
	printf("=IO_GETEVENTS\n")
}
probe process("/lib64/libc.so.6").function("pread64") {
	printf("=PREAD\n")
}
Warning! This script will run for a long time before it is compiled into a kernel module! Systemtap works by dynamically creating a kernel module from the systemtap script and inserting it into the kernel. Because I injected a huge chunk of code to run in the kernel, a lot of things are potentially influenced. I found the Oracle database dropping cores, for example.
After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgtcr() function was performed. Here it is:
166149 oracle_5487_tes(5487): < kdst_fetch
166155 oracle_5487_tes(5487): > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487): < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487): > kdst_fetch
166164 oracle_5487_tes(5487): > kdst_fetch0
166167 oracle_5487_tes(5487): > kcbipnns
166170 oracle_5487_tes(5487): < kcbipnns
166173 oracle_5487_tes(5487): > kcbrls
166177 oracle_5487_tes(5487): > kcbrls_direct
166181 oracle_5487_tes(5487): < kcbrls_direct
166183 oracle_5487_tes(5487): < kcbrls
166185 oracle_5487_tes(5487): > kdstsnb
166188 oracle_5487_tes(5487): < kdstsnb
166191 oracle_5487_tes(5487): > ktrget2
166194 oracle_5487_tes(5487): > ktsmg_max_query
166197 oracle_5487_tes(5487): < ktsmg_max_query
166200 oracle_5487_tes(5487): > kcbgtcr
166204 oracle_5487_tes(5487): > kcbldrget
166209 oracle_5487_tes(5487): > kcblgt
166211 oracle_5487_tes(5487): < kcblgt
166213 oracle_5487_tes(5487): > kcbzvb
166216 oracle_5487_tes(5487): > kcbhvbo
166219 oracle_5487_tes(5487): > kcbhxoro
166222 oracle_5487_tes(5487): < kcbhxoro
166224 oracle_5487_tes(5487): < kcbhvbo
166226 oracle_5487_tes(5487): < kcbzvb
166228 oracle_5487_tes(5487): > kcbztek_trace_blk
166230 oracle_5487_tes(5487): < kcbztek_trace_blk
166233 oracle_5487_tes(5487): > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487): > kcbtgobj
166239 oracle_5487_tes(5487): > kd4obj
166251 oracle_5487_tes(5487): < kd4obj
166254 oracle_5487_tes(5487): < kcbtgobj
166257 oracle_5487_tes(5487): < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487): < kcbldrget
166262 oracle_5487_tes(5487): < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487): > ktrgcm
166268 oracle_5487_tes(5487): > kcbcge
166271 oracle_5487_tes(5487): < kcbcge
166273 oracle_5487_tes(5487): > ktcckv
166275 oracle_5487_tes(5487): < ktcckv
166285 oracle_5487_tes(5487): < ktrgcm
166286 oracle_5487_tes(5487): < ktrget2
166289 oracle_5487_tes(5487): > kdr9ir2blk
166291 oracle_5487_tes(5487): < kdr9ir2blk
166293 oracle_5487_tes(5487): < kdst_fetch0
166294 oracle_5487_tes(5487): < kdst_fetch
166297 oracle_5487_tes(5487): > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487): < qeaeCn1SerialRowsets
How to read: ">" means entering a function, "<" means returning from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.
The first row is "< kdst_fetch", in other words: returning from kernel data scan table fetch. So this function has just performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.
After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.
A few functions further we encounter kcbgtcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblgt(), requests a block. This means that blocks read in this code path are from PGA memory, not SGA (buffer cache) memory.
The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.
The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.
Here we can see that probing a huge number of functions does influence the performance of a process: where previously we saw kcbgtcr() take around 2us, it now took 62us, probably because of all the probes firing inside it.
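As an aside: if the goal is only the latency distribution and not the call tree, a lower-overhead variant would probe nothing but kcbgtcr() and aggregate in the kernel without any per-call printf, essentially the second script above minus the per-call output. This keeps the probe effect on the measured times much smaller (still a sketch, not a calibrated measurement):

#!/usr/local/bin/stap
#
# Lower-overhead sketch: no per-call printf, only in-kernel aggregation.
global start, latency_histogram

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	start[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency_histogram <<< gettimeofday_us() - start[pid()]
}
probe end {
	if ( @count(latency_histogram) > 0 )
		println(@hist_log(latency_histogram))
}

Back to the trace.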
The next piece of execution is done by the ktrgcm function, which handles undo and buffer cleanout. After that function, we pass kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.
Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take so much more time?
After a little searching, I was able to find a very good reason:
169252 oracle_5487_tes(5487): > kdst_fetch
169254 oracle_5487_tes(5487): > kdst_fetch0
169256 oracle_5487_tes(5487): > kcbipnns
169258 oracle_5487_tes(5487): < kcbipnns
169259 oracle_5487_tes(5487): > kcbrls
169262 oracle_5487_tes(5487): > kcbrls_direct
169264 oracle_5487_tes(5487): < kcbrls_direct
169265 oracle_5487_tes(5487): < kcbrls
169267 oracle_5487_tes(5487): > kdstsnb
169269 oracle_5487_tes(5487): < kdstsnb
169270 oracle_5487_tes(5487): > ktrget2
169272 oracle_5487_tes(5487): > ktsmg_max_query
169274 oracle_5487_tes(5487): < ktsmg_max_query
169275 oracle_5487_tes(5487): > kcbgtcr
169278 oracle_5487_tes(5487): > kcbldrget
169280 oracle_5487_tes(5487): > kcblgt
169283 oracle_5487_tes(5487): > kcblrs
169286 oracle_5487_tes(5487): > kdsdrcbk
169288 oracle_5487_tes(5487): < kdsdrcbk
169291 oracle_5487_tes(5487): > kcbldio
169296 oracle_5487_tes(5487): > kcfaioe
169299 oracle_5487_tes(5487): < kcfaioe
169301 oracle_5487_tes(5487): > kcflbi
169314 oracle_5487_tes(5487): > kcf_hard_ftype_check
169317 oracle_5487_tes(5487): < kcf_hard_ftype_check
=IO_SUBMIT
169416 oracle_5487_tes(5487): < kcflbi
169420 oracle_5487_tes(5487): < kcbldio
169425 oracle_5487_tes(5487): > kcblcffln
169429 oracle_5487_tes(5487): < kcblcffln
169432 oracle_5487_tes(5487): < kcblrs
169434 oracle_5487_tes(5487): > kcblsinc
169438 oracle_5487_tes(5487): < kcblsinc
169440 oracle_5487_tes(5487): > kcblcio
169443 oracle_5487_tes(5487): > kcblci
169447 oracle_5487_tes(5487): > kcflci
=IO_GETEVENTS
=IO_GETEVENTS
169481 oracle_5487_tes(5487): > kcflwi
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTBCTX
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTECTX
179477 oracle_5487_tes(5487): < kcflwi
179484 oracle_5487_tes(5487): < kcflci
179488 oracle_5487_tes(5487): < kcblci
179491 oracle_5487_tes(5487): < kcblcio
179494 oracle_5487_tes(5487): < kcblgt
179497 oracle_5487_tes(5487): > kcbzvb
179509 oracle_5487_tes(5487): > kcbhvbo
179513 oracle_5487_tes(5487): > kcbhxoro
179516 oracle_5487_tes(5487): < kcbhxoro
179518 oracle_5487_tes(5487): < kcbhvbo
179520 oracle_5487_tes(5487): < kcbzvb
179539 oracle_5487_tes(5487): > kcbztek_trace_blk
179544 oracle_5487_tes(5487): < kcbztek_trace_blk
179549 oracle_5487_tes(5487): > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487): > kcbtgobj
179559 oracle_5487_tes(5487): > kd4obj
179562 oracle_5487_tes(5487): < kd4obj
179563 oracle_5487_tes(5487): < kcbtgobj
179565 oracle_5487_tes(5487): < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487): < kcbldrget
179571 oracle_5487_tes(5487): < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487): > ktrgcm
179580 oracle_5487_tes(5487): > kcbcge
179582 oracle_5487_tes(5487): < kcbcge
179585 oracle_5487_tes(5487): > ktcckv
179587 oracle_5487_tes(5487): < ktcckv
179589 oracle_5487_tes(5487): < ktrgcm
179591 oracle_5487_tes(5487): < ktrget2
179593 oracle_5487_tes(5487): > kdr9ir2blk
179606 oracle_5487_tes(5487): < kdr9ir2blk
179609 oracle_5487_tes(5487): < kdst_fetch0
179611 oracle_5487_tes(5487): < kdst_fetch
179616 oracle_5487_tes(5487): > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487): < qeaeCn1SerialRowsets
If you go through the calls, you will see that the start is exactly the same, until line 17. After kcbgtcr > kcbldrget (the consistent read request function choosing the direct path read code), the kcblgt function does not return immediately, but rather starts off a lot of extra code path.
This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and the physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it: if you want to read blocks, you have to check whether they are available, and if they are not, you have to fetch them.
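To make that visible in numbers, a sketch like the following (an illustration I did not use for this post) could keep two separate histograms: one for kcbgtcr() executions during which an io_submit() or pread64() call was seen, and one for executions satisfied without physical IO:

#!/usr/local/bin/stap
#
# Sketch: split the kcbgtcr latency histogram into executions with and without
# a physical IO request (io_submit for asynchronous IO, pread64 for synchronous
# IO) issued while inside kcbgtcr.
global start, did_pio, pure_lio, lio_with_pio

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	start[pid()] = gettimeofday_us()
	did_pio[pid()] = 0
}
probe process("/lib64/libaio.so.1").function("io_submit"),
      process("/lib64/libc.so.6").function("pread64") {
	did_pio[pid()] = 1
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	elapsed = gettimeofday_us() - start[pid()]
	if ( did_pio[pid()] )
		lio_with_pio <<< elapsed
	else
		pure_lio <<< elapsed
}
probe end {
	if ( @count(pure_lio) > 0 ) {
		printf("==kcbgtcr without physical IO==\n")
		println(@hist_log(pure_lio))
	}
	if ( @count(lio_with_pio) > 0 ) {
		printf("==kcbgtcr with physical IO==\n")
		println(@hist_log(lio_with_pio))
	}
}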
As a conclusion: if kcbgtcr(), together with kcbgcur() and a couple of other functions, is considered to be the logical IO, then a logical IO takes a variable amount of time instead of a consistent one. Because physical IO is requested inside the logical IO request, the physical IO is technically part of the logical IO. This means it is technically incorrect to state that a physical IO is slower than a logical IO: a physical IO is part of the logical IO request that needed it, and as such a physical IO cannot be slower than a logical IO.
Very nice for a “true or false” quiz… Nice work!
Hi Frits,
thanks for yet another great post 🙂
I’ve booted into the 3.10, non-UEK kernel and downloaded systemtap 2.8, but still I get a parse error when I include the return probe:
[root@node5 ~]# uname -a
Linux node5.keynodes.com 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 04:05:24 PST 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@node5 ~]# /usr/local/bin/stap -V
Systemtap translator/driver (version 2.8/0.160, non-git sources)
[root@node5 ~]# ./s1.stp
parse error: expected statement
saw: ./s1.stp EOF
1 parse error.
Pass 1: parse failed. [man error::pass1]
When I comment out the return probe it doesn't look good either (and hangs):
[root@node5 ~]# ./s1.stp
WARNING: probe process("/u00/app/oracle/product/12.1.0.2/db_1/bin/oracle").function("kcbgtcr") inode-offset 000000000c565e80 registration error (rc 0)
Begin.
Am I missing something?
Actually it's the first time I'm switching kernels; I wonder whether I have to do special things for perf too, or just download a newer perf?
^C[root@node5 ~]# perf record -p $(pgrep ora_lgwr)
/sbin/perf: line 6: /usr/libexec/perf.3.10.0-229.el7.x86_64: No such file or directory
Many thanks in advance!
Sigrid
Thank you for reading, Sigrid!
1. Inode offset error: I get this error too with the latest Red Hat compatible kernel. To work around it, I switched to an earlier kernel.
2. Download the latest systemtap source using git and compile it yourself. I fought with systemtap oddities until I decided I'd had enough and just got the very latest version.
From what I've seen of the stock Red Hat kernel with OL7, it feels way more experimental than the UEK kernels.
Thanks Frits! With the latest version from github (2.9) it compiles fine, but what exact kernel version are you running? I’ve tested several 3.10.0-xxx ones and they all hang with inode offset error…
BTW just saw this about the 3.10 kernel
11. Tracing/perf
perf
Add new “perf mem” command for memory access profiling (commit 1, 2, 3, 4, 5)
… doesn’t look bad either 😉
I don’t have access to my test system at the moment. I think it was the xxx.227 kernel.
Yes, I know the memory access profiling functions are in kernel 3.10. I tested a 32 NUMA node system a little while ago, and was in dire need of that functionality. However, that was on Oracle Linux 6, with the UEK3 kernel.
just tried 3.10.0-123, it works, thank you!
Hi Frits, this topic gets exceedingly important since many of our customers can afford big buffer caches. Thanks for researching it. Lothar
Pingback: Peeking into Oracle using Linux tracing tools – five tools in five minutes | recurrent null
Hi,
My systemtap script returns the following error message:
[root@ocp st]# ./st4.stp -x 4620
WARNING: cannot find module /u01/app/oracle/product/11.2.0/bin/oracle debuginfo: No DWARF information found
semantic error: no match while resolving probe point process("/u01/app/oracle/product/11.2.0/bin/oracle").function("kcbgtcr")
WARNING: cannot find module /u01/app/oracle/product/11.2.0/bin/oracle debuginfo: No DWARF information found
semantic error: no match while resolving probe point process("/u01/app/oracle/product/11.2.0/bin/oracle").function("kcbgtcr").return
My Linux is Red Hat 5:
[root@ocp st]# uname -a
Linux ocp 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
And the systemtap version is:
[root@ocp st]# /usr/bin/stap -v
A script must be specified.
SystemTap translator/driver (version 0.7.2/0.137 non-git sources)
Copyright (C) 2005-2008 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
Hi Haibo.
There are two things here: you need a recent kernel to allow userspace probing. Recent here means 3.x, I think.
For return probes (which I use to measure the time spent in the kcbgtcr function), you need at least kernel version 3.10.
Luca Canali has more information on the specifics of the kernel versions, but when you use OL7, install the 3.10 kernel, version 123 (that is not the UEK version!).
Also, you need a version of systemtap that has the ability to do userspace probing. I don't know the exact versions; my advice is to pull the latest version from the systemtap git repository and compile it.