Getting to know Oracle wait events on Linux.

May 4th: some updates after discussion with Jeff Holt, Cary Millsap and Ron Crisco of Method R.

There's all the documentation, and there are all the terrific blogs about performance and Oracle wait events. But if you are more or less starting out with this stuff, or encounter a wait event that is not documented extensively enough, or an event turns up with a response time you don't expect, you need to understand what that wait event means. If you don't, it's easy to get stuck at this point.

If you are familiar with waits and just want to dig further, skip ahead to "Get to know wait events"; if you want to get up to speed with waits, read on.

What is a wait event?

This is the definition from the Performance Tuning Guide in the Oracle documentation:

Statistics that are incremented by a server process/thread to indicate that it had to wait for an event to complete before being able to continue processing. Wait events are one of the first places for investigation when performing reactive performance tuning.

This description is quite good, I think. If I had to come up with a better description, I would say a wait event is essentially the registration of time spent executing a part of the Oracle database server code, mostly doing something elementary. The granularity of wait events can differ between Oracle versions. Wait events can describe interaction with the operating system (reading and writing are good examples of this), but also interaction between processes (enqueues and ipc messaging are good examples of that). Needless to say, both interactions require operating system calls.
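
If you want to see which wait event names exist in your Oracle version, and (from 10g onwards) which wait class they belong to, you can list them from the v$event_name view, for example (assuming you have access to that view):

SQL> select name, wait_class from v$event_name where name like 'db file%';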

Also, a wait event is not necessarily a bad thing. Every Oracle database process is either running on CPU or waiting. If your SQL*Plus session is not running anything, it is waiting on 'SQL*Net message from client', which means the database is telling you it is waiting for your next instruction. Another interesting thing to note about wait events is that, per kernel developer decision, the execution of certain instrumented (covered by a wait event) parts of Oracle server code may not report a wait at all. This means, for example, that if a latch get went efficiently, you will not get a 'latch free' or 'latch: latch name' wait event, because the wait registration would take exponentially more time than the actual latch get.

How to get the wait events during execution

(skip this section if you are familiar with enabling and disabling sql trace at level 8)

The way to get all the waits for the execution of one or more SQL statements for a given process is to enable SQL trace, also known as event 10046, at level 8. This can be done at the system level (meaning all processes) too, but unless you want to reverse engineer the interaction between ALL the processes, this is not very useful. There is much written about all the ways you can enable SQL trace at level 8. For the sake of simplicity, I set event 10046 at level 8 in the session itself. The session requires the 'alter session' privilege to be able to set it. This is how it's done:

SQL> alter session set events '10046 trace name context forever, level 8';
Session altered.

And can be turned off the following way:

SQL> alter session set events '10046 trace name context off';
Session altered.
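
The wait lines end up in the trace file of the server process. A quick way to find out which file that is for your own session (assuming Oracle 11g or later and access to v$diag_info) is:

SQL> select value from v$diag_info where name = 'Default Trace File';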

Get to know wait events: strace

At this point we know how to enable tracing, but want to dig deeper and understand the operating system interaction inside the waits.

To see what a process (for example your Oracle foreground process) is doing from the perspective of the operating system, there's the 'strace' executable. 'strace' can display all the system calls an operating system process is executing, provided you have enough privileges to attach to that process. It resembles SQL trace a bit, but at the operating system/Linux layer. First, you determine the process ID of the server process of the session you want to use 'strace' on, then use another session to execute 'strace'. This is what it looks like:

1. Set up a sqlplus session
$ sqlplus ts/ts@v11203

SQL*Plus: Release 11.2.0.3.0 Production on Wed Apr 25 23:03:22 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

TS@v11203 >
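
To find the process ID of the dedicated server process serving this session, one option is to run a query like the following in the sqlplus session itself (assuming the user can select from v$process and v$session; alternatively, look the process up with ps on the database server):

SQL> select p.spid from v$process p, v$session s where p.addr = s.paddr and s.sid = sys_context('userenv', 'sid');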

2. Find the process ID of the server process belonging to the sqlplus session (for example with the query above, or with ps) and attach to it with strace, in a session on the linux host where the database is running:
$ strace -p 14128
Process 14128 attached - interrupt to quit
read(14,

3. Now execute something in the sqlplus session, for example ‘select * from dual’:
TS@v11203 > select * from dual;

D
-
X

TS@v11203 >

4. This results in the following list of system calls:

read(14, "\1/\6\21i \376\377\377\377\377\377\377\377\2\2"..., 8208) = 303
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372954
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\1r\6\20\27\2\374T\rD@\255\262t\t\313\242\1\247-8x"..., 370) = 370
read(14, "\25\6\3\5\"\2\17", 8208) = 21
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\254\6\4\1 \1\1{\5\2"..., 172) = 172
read(14, "\1\337\6\3^#)\4\4\376\377\377\377\377\377\377\3774"..., 8208) = 479
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "010\6\v\1\5\314\2\1p\5\207y"..., 304) = 304
read(14,

Wow! Nice! But what does it mean? Well, these are all the system calls used by the Oracle foreground process in order to execute the (simple) SQL. It's hard to tell much about the execution of the SQL from this list alone. And be aware, this is only a 'select * from dual', so a 'real' SQL statement will probably give a list so long you do not dare to begin studying it to get something useful out of it.
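
If you only want an overview of which system calls a process executes, how often, and how much time is spent in them, strace also has a summary mode, which prints a table when strace exits instead of listing every individual call:

$ strace -c -p 14128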

Well, how about the next chapter:

Get to know wait events: combining waits and strace

How about combining the sql trace output with the system calls? The 'strace' executable has quite a few options, which can be very helpful:

– list the data written by the process (‘-e write=all’)
This option dumps the data the process writes to its file descriptors. The trace file is written by the server process, so the strace output will show what gets written to it.
– list a timestamp for every system call (‘-tt’)
– list the time spent in the system call (‘-T’)
– save the output to a file (‘-o /path/to/file.txt’)

These are all helpful to get more information about the system calls.
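
Put together, an invocation could look like this (the process ID and the output file name are just examples):

$ strace -tt -T -e write=all -o /tmp/oracle_fg_strace.txt -p 14128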

Now set up a session in the database, enable SQL trace at level 8, fetch the process ID of the session's server process, run 'strace -ttT -e write=all -p PID' in another session, and execute something in the database session. Yes, this generates a massive amount of output. But, as I use it, it's to investigate what certain waits are doing. Here are some examples:


23:46:20.145793 pread(257, "\6\242\202\3\1\356O\6\2049\245\2\3226\264O\6"..., 1032192, 1996505088) = 1032192
23:46:20.148986 lseek(8, 0, SEEK_CUR) = 46126
23:46:20.149030 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 31 39 38 20 66 69 ad' ela= 3198 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 34 le#=5 bl ock#=664 |
| 00050 33 34 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 34 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 34 38 39 36 31 53903801 48961 |
23:46:20.149121 write(9, "J?e83~1z1\n", 10) = 10
| 00000 4a 3f 65 38 33 7e 31 7a 31 0a J?e83~1z 1. |
23:46:20.149163 write(8, "\n", 1) = 1
| 00000 0a . |
23:46:20.157813 pread(256, "\6\242\2\4\1\360O\6\2\4\217T\2\3226\264O\6"..., 1032192, 2000699392) = 1032192
23:46:20.161124 lseek(8, 0, SEEK_CUR) = 46252
23:46:20.161169 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 33 31 37 20 66 69 ad' ela= 3317 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 35 le#=5 bl ock#=665 |
| 00050 36 32 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 62 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 36 31 31 30 30 53903801 61100 |
23:46:20.161260 write(9, "J?Hv2~1z1\n", 10) = 10
| 00000 4a 3f 48 76 32 7e 31 7a 31 0a J?Hv2~1z 1. |
23:46:20.161624 lseek(8, 0, SEEK_CUR) = 46377
23:46:20.161659 write(8, "\n", 1) = 1
| 00000 0a

The pread() call is the actual IO call done by the database.
The lseek() call is the seek towards the end of the trace file, and sets the file pointer for the upcoming write.
The write() call is the write to the trace file. You can see the multiblock read (scattered read in the trace file) was done using a pread() call, the time measured by strace is in line with the time measured by Oracle (ela= 3317), and the number of blocks in the trace (126) matches the number of bytes read, 1032192 (126*8192); see the quick check below.
The time spent in each system call (measured with -T) is normally shown between 'smaller than' and 'bigger than' signs (angle brackets), which get removed by wordpress, so if you were searching for it in the above example, you wouldn't find it.
The write() to file descriptor 9 with the garbage in it is the write to the .trm file.
The lseek() sets the file pointer after the writes, and a write() is done to put a newline (\n) in the trace file.
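
A quick sanity check of the blocks-to-bytes arithmetic on the command line:

$ echo $((126*8192))
1032192
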
This was a simple example to show how a synchronous multiblock read is reported; now let's look at an asynchronous one!


00:17:27.557053 pread(256, "#\242\202\251\243O\6\1\4[\4"..., 8192, 1190150144) = 8192
00:17:27.558973 lseek(8, 0, SEEK_CUR) = 2659
00:17:27.559009 write(8, "WAIT #140244789954888: nam='db f"..., 124) = 124
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 62 20 66 54888: n am='db f |
| 00020 69 6c 65 20 73 65 71 75 65 6e 74 69 61 6c 20 72 ile sequ ential r |
| 00030 65 61 64 27 20 65 6c 61 3d 20 31 39 30 33 20 66 ead' ela = 1903 f |
| 00040 69 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 34 33 ile#=5 b lock#=43 |
| 00050 33 39 34 20 62 6c 6f 63 6b 73 3d 31 20 6f 62 6a 394 bloc ks=1 obj |
| 00060 23 3d 31 34 30 33 33 20 74 69 6d 3d 31 33 33 35 #=14033 tim=1335 |
| 00070 33 39 32 32 34 37 35 35 38 39 34 33 39224755 8943 |
00:17:27.559088 write(9, "N?Rb~22y1\n", 10) = 10
| 00000 4e 3f 52 62 7e 32 32 79 31 0a N?Rb~22y 1. |
00:17:27.559128 write(8, "\n", 1) = 1
| 00000 0a . |
00:17:27.559334 mmap(0x7f8d48e41000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48e41000
00:17:27.559450 mmap(NULL, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x1ec000) = 0x7f8d48c81000
00:17:27.559491 mmap(0x7f8d48c81000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48c81000
00:17:27.559576 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x2fc000) = 0x7f8d48a81000
00:17:27.559614 mmap(0x7f8d48a81000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48a81000
00:17:27.559834 io_submit(140244822339584, 1, {{0x7f8d48a861f8, 0, 0, 0, 256}}) = 1
00:17:27.560008 mmap(0x7f8d48ad1000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48ad1000
00:17:27.560080 io_submit(140244822339584, 1, {{0x7f8d48a86450, 0, 0, 0, 256}}) = 1
00:17:27.560272 io_getevents(140244822339584, 1, 128, {{0x7f8d48a86450, 0x7f8d48a86450, 122880, 0}}, {600, 0}) = 1
00:17:27.560363 io_getevents(140244822339584, 1, 128, {{0x7f8d48a861f8, 0x7f8d48a861f8, 106496, 0}}, {600, 0}) = 1
00:17:27.560865 lseek(8, 0, SEEK_CUR) = 2784
00:17:27.560902 write(8, "WAIT #140244789954888: nam='dire"..., 129) = 129
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 69 72 65 54888: n am='dire |
| 00020 63 74 20 70 61 74 68 20 72 65 61 64 27 20 65 6c ct path read' el |
| 00030 61 3d 20 35 38 36 20 66 69 6c 65 20 6e 75 6d 62 a= 586 f ile numb |
| 00040 65 72 3d 35 20 66 69 72 73 74 20 64 62 61 3d 34 er=5 fir st dba=4 |
| 00050 33 33 39 35 20 62 6c 6f 63 6b 20 63 6e 74 3d 31 3395 blo ck cnt=1 |
| 00060 33 20 6f 62 6a 23 3d 31 34 30 33 33 20 74 69 6d 3 obj#=1 4033 tim |
| 00070 3d 31 33 33 35 33 39 32 32 34 37 35 36 30 38 34 =1335392 24756084 |
| 00080 35 5 |

This more advanced example shows how the segment header is first fetched under the 'db file sequential read' wait. Despite the database being on ASM (this is ASM!), this clearly shows the segment header is read synchronously using the pread() call.
The next system calls and waits are even more interesting!
First, memory is allocated using the mmap() calls. This is the beginning of a full scan, and the data read from disk needs to be buffered somewhere. Please mind the memory allocation is not part of any wait. That is no problem, the allocation of memory should be very fast.
Next, TWO asynchronous reads are initiated using the two io_submit() calls; note the iocb (IO control block) addresses, 0x7f8d48a861f8 and 0x7f8d48a86450.
See the man page of the io_submit call on your linux system using 'man io_submit', and use the man page of io_getevents to learn more about that call.
Next, the asynchronous reads are reaped using the io_getevents() calls. What is noteworthy here is that the two requests are reaped in REVERSE order (this is consistent for the first two asynchronous call pairs for every segment; this example is Oracle version 11.2.0.3). So first request 0x7f8d48a86450 is fetched, then 0x7f8d48a861f8.
After 0x7f8d48a861f8 is reaped, the wait is reported for the FIRST IO call (the io_submit() & io_getevents() combination). The second request (0x7f8d48a86450) at the start of a full segment scan is not reported. Never. Further requests are mostly not reaped in reverse order/'encapsulated' (it can happen a few more times).
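
If you want to zoom in on just the asynchronous IO system calls of a process, strace can filter on specific calls, for example (the process ID is again just an example):

$ strace -tt -T -e trace=io_submit,io_getevents -p 14128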

I hope you get the idea of digging deeper into waits by examining the system calls.

I am not sure if the above 'encapsulation' of IO calls is a bug. But think of this: if you issue multiple IO requests (and as we saw above, multiple IO requests are issued!), you cannot report the waits of all these distinct IO requests individually, because you would end up reporting more time than has actually passed. This has some consequences for the above 'direct path read' waits: they do not report individual IO times (like 'db file sequential read' does). They are merely an indication of the time the process spent waiting for results from its IO requests, rather than of disk IO latency.

The above conclusions are the result of observation and logical reasoning from my side. If you have anything to add, please respond!

14 comments
  1. Martin Bach said:

    Glad to see you blogged about it – it was well worth reading!

  2. Ron Crisco said:

    I’m fairly certain this is not true:
    “if the execution is not waiting/stuck in a certain piece of Oracle server kernel code, it doesn’t report a wait, because it didn’t wait.”
    Do you have an example of this?

    As you said, “wait” events are simply instrumentation of the Oracle database kernel making OS calls. 10046 instructs the kernel to write this instrumentation to a log file, and it does so, as far as I have ever seen, regardless of the time elapsed.

    • Hi Ron, thanks for reading!
      Before I had an SSD and enough memory in my laptop I would have had the same opinion. However, with the configuration above I get almost no waits when reading a table physically (cold cache).
      The same is true for latch waits: in a normal reading situation there is plenty of latching going on, but you don't get the waits; you only get them if your session experiences contention on them.

      • Ron Crisco said:

        I would be extremely interested in an actual trace file and strace output with an example of this, whether on your blog or in email. I don't have an SSD to be able to reproduce this behavior.

  3. Hi Ron, here's a 10046/8 output of a full table scan on an Oracle 11.2.0.3 database, for which I've flushed the buffer cache prior to issuing the SQL. The VM disk is a file on my laptop, and is now cached. So the physical read of the VM is actually served from the cache of my laptop:

    =====================
    PARSING IN CURSOR #140034543180104 len=23 dep=0 uid=34 oct=3 lid=34 tim=1335731777622244 hv=1020534364 ad='7f45aaf8' sqlid='94dwfa8yd87kw'
    select count(*) from t2
    END OF STMT
    PARSE #140034543180104:c=0,e=88,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1335731777622243
    EXEC #140034543180104:c=0,e=47,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1335731777622338
    WAIT #140034543180104: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj#=14033 tim=1335731777622543
    WAIT #140034543180104: nam='db file sequential read' ela= 233 file#=5 block#=43394 blocks=1 obj#=14033 tim=1335731777622958
    WAIT #140034543180104: nam='direct path read' ela= 73 file number=5 first dba=47874 block cnt=126 obj#=14033 tim=1335731777667408
    FETCH #140034543180104:c=179973,e=186244,p=20942,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1335731777808818
    STAT #140034543180104 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=20944 pr=20942 pw=0 time=186237 us)'
    STAT #140034543180104 id=2 cnt=1000000 pid=1 pos=1 obj=14033 op='TABLE ACCESS FULL T2 (cr=20944 pr=20942 pw=0 time=85699 us cost=5736 size=0 card=1000000)'
    WAIT #140034543180104: nam='SQL*Net message from client' ela= 153 driver id=1650815232 #bytes=1 p3=0 obj#=14033 tim=1335731777809134
    FETCH #140034543180104:c=0,e=2,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=3321871023,tim=1335731777809167
    WAIT #140034543180104: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14033 tim=1335731777809184
    WAIT #140034543180104: nam='SQL*Net message from client' ela= 224 driver id=1650815232 #bytes=1 p3=0 obj#=14033 tim=1335731777809421
    PARSE #140034543180976:c=0,e=33,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=0,tim=1335731777809493
    WAIT #140034543180976: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14033 tim=1335731777809722
    EXEC #140034543180976:c=0,e=201,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=0,tim=1335731777809751

    *** 2012-04-29 22:36:23.198
    WAIT #140034543180976: nam='SQL*Net message from client' ela= 5388238 driver id=1650815232 #bytes=1 p3=0 obj#=14033 tim=1335731783198035
    CLOSE #140034543180104:c=0,e=12,dep=0,type=0,tim=1335731783198154
    CLOSE #140034543180976:c=0,e=18,dep=0,type=3,tim=1335731783198195
    =====================

    I didn't edit this output in any way. I think it's just that we are so used to IO calls being served by rotating disks, or remotely (SAN/NAS). But think about what was brought to this discussion earlier: latches. Even though direct path read full scans go to the PGA (and thus need a lot less latching), latches are still used intensely by processes. Latch waits are only reported when a process really waits for them, not when there is no waiting.

    • Ron Crisco said:

      To prove what you’re saying, we need to see a trace file without the latch events, and strace output of the same session showing the latch events occurred, but were not recorded in the trace file. I don’t want to sound like a pest, but what you are suggesting goes against everything I know about trace files and Oracle’s instrumentation method.

      I do understand that “physical IO” from the kernel’s perspective does not mean a physical read from a spinning disk. The read could be fulfilled by any number of cache layers in between, or from SSD, and Oracle has no way of knowing that. But that’s not the issue. The issue is your assertion that Oracle performs code path that is instrumented, and sometimes it chooses not to show that code path in the trace file. That’s quite an important thing for those who depend on trace files for performance analysis.

  4. Ron, you are absolutely right that for this kind of claim the method for investigating trace files as described in this blog is useful, and should be included. I have one now (I don't have access to the internet at night), and include it with this reply:

    This is the start of the full table scan (the pread(), then some mmap() calls, and then io_submit() and io_getevents() calls). Probably the timing of the system calls is gone (encapsulated in smaller than/bigger than characters, which are removed by wordpress):

    5380 12:30:53.949418 pread(257, "#\242\202\251\243O\6\1\4[\4"..., 8192, 1190150144) = 8192
    5380 12:30:53.949641 write(8, "WAIT #140557241782656: nam='db f"..., 123) = 123
    | 00000 57 41 49 54 20 23 31 34 30 35 35 37 32 34 31 37 WAIT #14 05572417 |
    | 00010 38 32 36 35 36 3a 20 6e 61 6d 3d 27 64 62 20 66 82656: n am='db f |
    | 00020 69 6c 65 20 73 65 71 75 65 6e 74 69 61 6c 20 72 ile sequ ential r |
    | 00030 65 61 64 27 20 65 6c 61 3d 20 32 31 38 20 66 69 ead' ela = 218 fi |
    | 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 34 33 33 le#=5 bl ock#=433 |
    | 00050 39 34 20 62 6c 6f 63 6b 73 3d 31 20 6f 62 6a 23 94 block s=1 obj# |
    | 00060 3d 31 34 30 33 33 20 74 69 6d 3d 31 33 33 35 37 =14033 t im=13357 |
    | 00070 38 31 38 35 33 39 34 39 36 32 32 81853949 622 |
    5380 12:30:53.949702 write(8, "\n", 1) = 1
    | 00000 0a . |
    5380 12:30:53.949856 mmap(0x7fd608605000, 458752, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7fd608605000
    5380 12:30:53.949950 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.950100 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x4f5000) = 0x7fd608365000
    5380 12:30:53.950140 mmap(0x7fd608365000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7fd608365000
    5380 12:30:53.950207 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 257}}) = 1
    5380 12:30:53.950475 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa450, 0x7fd6087aa450, 106496, 0}, {0x7fd6087aa1f8, 0x7fd6087aa1f8, 122880, 0}}, {0, 0}) = 2
    5380 12:30:53.950591 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 257}}) = 1
    5380 12:30:53.950809 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.951020 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa1f8, 0x7fd6087aa1f8, 122880, 0}, {0x7fd6087aa450, 0x7fd6087aa450, 122880, 0}}, {0, 0}) = 2
    5380 12:30:53.951125 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.951318 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 257}}) = 1
    5380 12:30:53.951467 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa450, 0x7fd6087aa450, 122880, 0}, {0x7fd6087aa1f8, 0x7fd6087aa1f8, 122880, 0}}, {0, 0}) = 2
    5380 12:30:53.951568 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 257}}) = 1
    5380 12:30:53.951758 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.951905 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa1f8, 0x7fd6087aa1f8, 122880, 0}, {0x7fd6087aa450, 0x7fd6087aa450, 122880, 0}}, {0, 0}) = 2
    5380 12:30:53.952030 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 256}}) = 1
    5380 12:30:53.952835 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 257}}) = 1
    5380 12:30:53.955167 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa450, 0x7fd6087aa450, 1032192, 0}, {0x7fd6087aa1f8, 0x7fd6087aa1f8, 1032192, 0}}, {0, 0}) = 2
    5380 12:30:53.955618 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 256}}) = 1
    5380 12:30:53.956583 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.957238 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa1f8, 0x7fd6087aa1f8, 1032192, 0}}, {0, 0}) = 1
    5380 12:30:53.957691 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 256}}) = 1
    5380 12:30:53.958300 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa450, 0x7fd6087aa450, 1032192, 0}}, {0, 0}) = 1
    5380 12:30:53.958744 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1

    As you see, there are no wait events. A few IO calls are slower, and do get a wait event:


    5380 12:30:53.989996 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.990895 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa1f8, 0x7fd6087aa1f8, 1032192, 0}}, {0, 0}) = 1
    5380 12:30:53.991773 io_submit(140557273686016, 1, {{0x7fd6087aa1f8, 0, 0, 0, 256}}) = 1
    5380 12:30:53.993018 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa450, 0x7fd6087aa450, 1032192, 0}}, {0, 0}) = 1
    5380 12:30:53.993883 io_submit(140557273686016, 1, {{0x7fd6087aa450, 0, 0, 0, 257}}) = 1
    5380 12:30:53.995297 io_getevents(140557273686016, 2, 128, {{0x7fd6087aa1f8, 0x7fd6087aa1f8, 1032192, 0}}, {0, 0}) = 1
    5380 12:30:53.996028 io_getevents(140557273686016, 1, 128, {{0x7fd6087aa450, 0x7fd6087aa450, 1032192, 0}}, {600, 0}) = 1
    5380 12:30:53.996207 lseek(8, 0, SEEK_CUR) = 2664
    5380 12:30:53.996269 write(8, "WAIT #140557241782656: nam='dire"..., 130) = 130
    | 00000 57 41 49 54 20 23 31 34 30 35 35 37 32 34 31 37 WAIT #14 05572417 |
    | 00010 38 32 36 35 36 3a 20 6e 61 6d 3d 27 64 69 72 65 82656: n am='dire |
    | 00020 63 74 20 70 61 74 68 20 72 65 61 64 27 20 65 6c ct path read' el |
    | 00030 61 3d 20 31 36 37 20 66 69 6c 65 20 6e 75 6d 62 a= 167 f ile numb |
    | 00040 65 72 3d 35 20 66 69 72 73 74 20 64 62 61 3d 34 er=5 fir st dba=4 |
    | 00050 37 37 34 36 20 62 6c 6f 63 6b 20 63 6e 74 3d 31 7746 blo ck cnt=1 |
    | 00060 32 36 20 6f 62 6a 23 3d 31 34 30 33 33 20 74 69 26 obj#= 14033 ti |
    | 00070 6d 3d 31 33 33 35 37 38 31 38 35 33 39 39 36 31 m=133578 18539961 |
    | 00080 37 36 76 |

    I think this is proof the Oracle database only reports waits if it’s actually waiting.

    But let me emphasize this is true for all other waits too. As you probably know, statistics are written in the simulator heap (when STATISTICS_LEVEL is set to TYPICAL, which is the default) during execution of SQL. This memory area is protected by latches, which need to be taken in order to write. Because there is no contention, you don't see waits.

    • Ron Crisco said:

      Frits – thanks for the time on Skype. I think we all learned some things. I’ll summarize what I learned here in hopes that some other reader may benefit.

      * Frits and Cary and Jeff know far more about the Oracle kernel and extended SQL tracing than I do. Actually, that’s not something I learned, since I already knew that, but it’s worth mentioning 🙂
      * Trace files (specifically event 10046) are simply instrumentation of Oracle kernel code path. This occurs primarily in two types of data written to the file: database calls and OS calls. Database calls are things like PARSE/EXEC/FETCH and OS calls are everything else, prefixed (unfortunately) with the keyword WAIT.
      * The label WAIT doesn’t really mean “waiting”. It is simply the time spent performing a system call, like pread or semtimedop. Since it’s not code in the Oracle kernel itself, the developers thought of it as waiting on the OS to finish a call. This is a critical distinction, because a performance analyst who tries to “eliminate waiting” is really chasing the wrong animal.
      * Database calls can be nested, calling other database calls or OS calls.

      And now to some of the detailed bits:
      * Some code path in the kernel is not instrumented, because
      a) the developer considered the code to be likely to execute faster than would be wise to instrument. In the case of latches, there’s no reason for Oracle to instrument latches in most cases, because it would not be helpful to know that it spun 125 times for a successful latch acquisition. But the developers decided to instrument any time a latch spun X times unsuccessfully, and a sleep had to be issued, or any time the select() or semtimedop() calls were used for latching.
      b) the developer had no way of communicating detailed information in the existing trace file format, evidenced by the discussion above regarding direct path reads. In 11g, significant parallelization can happen during the DPR event, and the single trace file line doesn’t report all that CPU time. Especially in Exadata, with massive parallelism in the storage array, the kernel just doesn’t have (currently) a mechanism for reporting all the information we would like to know. We hope that will improve with later releases. The kernel team has always been responsive to suggestions about trace file bugs and new feature requests.
      c) the developer just neglected to instrument new code – it’s important to remember always that Oracle is just code, a lot of it, but still, just code written by developers 🙂

      The key point that started my discussion with Frits was clarified by something Cary Millsap explained. The trace file reports events as they actually happen, without conditional checks on elapsed times. The conditional checks are done during the development of the code – developers decide whether or not to instrument specific calls.

  5. Andre Gomes said:

    Hi, I have a performance issue and I am using strace in order to try to figure out what could be the offender. I have 2 Oracle databases on 2 different servers; they have the same Linux kernel, same CPU, memory and so on. The Oracle server version and patch level are the same, and the dataset I am dealing with is the same too. I noticed that the same query runs very fast on one box and slow on the other. When I compare the strace output I found that on the slow box the strace call looks like io_getevents(47070404960256, 1, 1024, {…}{0, 0}) = 1 most of the time and a few times like this: io_getevents(47070404960256, 1, 1024, {…}{600, 0}) = 1. I also noticed that on the fast box it is the other way around, I mean most of the time I have {600,0}. I am just wondering if this could mean the requested IO event on the slow box is not ready most of the time, but this is just a guess. I have researched io_getevents but I did not reach any conclusion until now. Any ideas?

  6. Albert said:

    Excellent article, still valid in 2021 🙂

    Could you please mention what’s the meaning of the following event in strace?

    getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=35103, tv_usec=84809}, ru_stime={tv_sec=107, tv_usec=613388}

    I can see a lot of these lines in a process.

    Thanks!

    • strace performs a trace of system calls executed by the operating system (hence ‘strace’).
      System calls can be different per operating system, and actually have really decent documentation via the man pages: https://linux.die.net/man/3/getrusage

      In fact, just google ‘man getrusage’. You will find system calls are actually fully documented, the Oracle wait events not so much.
