When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:

WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957

The current documentation has no description for it. Let’s see what V$EVENT_NAME says:

SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 
  2  from v$event_name where name = 'PGA memory operation';

EVENT# NAME                                  PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
------ ------------------------------------- ---------- ---------- ---------- ---------------
   524 PGA memory operation                                                   Other

Well, that doesn’t help…

Since Oracle provides no clue, let’s look a bit deeper, starting with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffer cache; table contents are scanned into the PGA and processed from there).

TS@fv122b2 > alter session set events 'sql_trace level 8';

Session altered.

Now use strace to look at the system calls in another session:

# strace -e write=all -e all -p 9426
Process 9426 attached
read(9,

Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:

io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
lseek(7, 0, SEEK_CUR)                   = 164639
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 37 38  30 20 70 31 3d 32 30 39   ela= 780 p1=209 |
 | 00040  37 31 35 32 20 70 32 3d  31 31 31 34 31 31 32 20  7152 p2=1114112  |
 | 00050  70 33 3d 30 20 6f 62 6a  23 3d 32 32 38 33 33 20  p3=0 obj#=22833  |
 | 00060  74 69 6d 3d 31 39 35 31  37 30 32 30 35 36 36 0a  tim=19517020566. |
...
munmap(0x7f5ba8fbd000, 2097152)         = 0
munmap(0x7f5ba91bd000, 2097152)         = 0
mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000
lseek(7, 0, SEEK_CUR)                   = 183409
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 35 39  32 20 70 31 3d 30 20 70   ela= 592 p1=0 p |
 | 00040  32 3d 30 20 70 33 3d 30  20 6f 62 6a 23 3d 32 32  2=0 p3=0 obj#=22 |
 | 00050  38 33 33 20 74 69 6d 3d  31 39 35 32 30 36 33 33  833 tim=19520633 |
 | 00060  36 37 34 0a                                       674.             |

Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look at the name of the wait event. Let’s look a bit more specifically using a systemtap script:

global wait_event_nr=524
probe begin {
	printf("begin.\n")
}

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthbwt") {
	if ( pid() == target() && register("rdx") == wait_event_nr )
		printf("kskthbwt - %d\n", register("rdx"))
}
probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr )
		printf("kskthewt - %d\n", register("rsi"))
}
probe syscall.mmap2 {
	if ( pid() == target() )
		printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6))
}
probe syscall.mmap2.return {
	if ( pid() == target() )
		printf("return value: %x\n", $return)
}
probe syscall.munmap {
	if ( pid() == target() )
		printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2))
}

Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if statement on line 7 checks whether the process specified with -x to the systemtap executable is the process calling this function, and whether the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to 524, the number of the ‘PGA memory operation’ event, the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when a wait event ends.
Lines 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call: any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Lines 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Lines 22-25: This is a probe on the munmap() system call, which releases mmap’ed memory back to the operating system.
Please note there are no curly braces following the if statements, which means the code executed when the if evaluates to true is the single line following the if. Systemtap and C are not indentation-sensitive (unlike Python); I indented for the sake of clarity.

I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:

# stap -x 9426 mmap.stp
begin.
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000
 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000
kskthewt - 524
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000
 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000
kskthewt - 524
kskthbwt - 524
 munmap, addr 7f5ba8fbd000, size 2097152
 munmap, addr 7f5ba91bd000, size 2097152
kskthewt - 524

This makes it quite clear! The event ‘PGA memory operation’ is triggered when mmap() and munmap() are called, which are the calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped, but anonymous memory.

Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?

To answer that, we need to look at the flags of the two calls. Here is an example:

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000

The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.

The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy is to make it easier for the Oracle database to manage its memory allocations inside a contiguous chunk of (virtual) memory.

The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the reserved range. There is no MAP_NORESERVE, which means swap space is reserved for this allocation in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any existing mapping in the address range is silently unmapped, so the overlapping part of the first (“throw away”) mapping is replaced.
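
To make the two phases easier to spot, the mmap2 probe from the systemtap script above can be extended to decode the flags argument. This is a minimal sketch; the hexadecimal values are the Linux x86_64 flag values (MAP_PRIVATE 0x02, MAP_FIXED 0x10, MAP_ANONYMOUS 0x20, MAP_NORESERVE 0x4000), which add up to the decimal flags 16418 and 50 seen in the output above:

probe syscall.mmap2 {
	if ( pid() == target() ) {
		map_flags = int_arg(4)
		if ( u64_arg(1) == 0 && (map_flags & 0x4000) ) {
			# addr 0 plus MAP_NORESERVE: the "throw away"/reservation mapping
			printf(" mmap reserve, size %d\n", u64_arg(2))
		} else if ( map_flags & 0x10 ) {
			# MAP_FIXED at a given address: the real allocation inside the reservation
			printf(" mmap commit, addr %x, size %d\n", u64_arg(1), u64_arg(2))
		} else {
			printf(" mmap other, addr %x, size %d\n", u64_arg(1), u64_arg(2))
		}
	}
}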

Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr ) {
		ksuse = register("r13")-4672
		ksuseopc = user_uint16(ksuse + 2098)
		ksusep1 = user_uint64(ksuse + 2104)
		ksusep2 = user_uint64(ksuse + 2112)
		ksusep3 = user_uint64(ksuse + 2120)
		ksusetim = user_uint32(ksuse + 2128)
		printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3)
	}
}

When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is what the output looks like:

kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000
 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0

It looks like the size of the memory allocated with the first mmap call, the PGA memory reservation, is put in p1, and the size of the second, “real”, memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird is that the memory is not unmapped/deallocated (this is a full execution of a SQL statement; shouldn’t the allocated buffers be deallocated?).

Let’s look what happens when I execute the same SQL again:

kskthbwt - 524
 munmap, addr 7f1562130000, size 2097152
 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0
kskthbwt - 524
 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0

Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution: the execution starts with a munmap() call, followed by a mmap() call. The first munmap() call deallocates a 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far; we have seen a “throw away”/reservation mmap() call with the memory address set to 0 to let the operating system pick an address for the requested memory chunk, and a mmap() call to truly allocate the reserved memory for usage, which had a memory address set. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory can not be read or written. Also the flags number is different: flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. As part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called. As we have just seen, memory could also be reserved. And when p1/2/3 are all zero, there is no way to tell how much memory is freed, nor which memory allocation is affected.

The next wait times a single mmap() call. This mmap() call re-maps the previously mmap’ed memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmap’ed.

The last wait is a familiar one: a mmap() call with the memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previously “reserved” memory. However, the p1/2/3 values are now NOT set in the same way as we saw earlier: only p1 is non-zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.

Conclusion:
With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. So far I have only seen the system calls mmap() and munmap() being issued under the ‘PGA memory operation’ event.


This is the second blogpost on using PL/SQL inside SQL. If you landed on this page and have not read the first part, click this link and read that first. I got some reactions on the first article, one of which was: what does this look like with ‘pragma udf’ in the function?

Pragma udf (user defined function) is a way to speed up calling PL/SQL functions from SQL, starting from version 12. If you want to know more about the use of pragma udf, when it helps and when it doesn’t, please google for it.

create or replace function add_one( value number ) return number is
        pragma udf;
        l_value number(10):= value;
begin
        return l_value+1;
end;
/

select sum(add_one(id)) from t2;

As you can see, really the only thing you have to do is add ‘pragma udf’ in the declaration section of PL/SQL.

Here is what the flamegraph looks like:

What is visible is that between the PL/SQL interpreter (pfrrun) and the function that makes the operand evaluation switch to PL/SQL (evapls) there is now only one function, peidxrex. However, inside the evapls function two additional functions are called (kkxmsagof, kkxmsagif, not readable) which take noticeable time. The conclusion at this point is that pragma udf takes yet another code path than both a native PL/SQL function and subquery factoring.

Profiling this using the systemtap script:

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

Shows:

# stap -x 92509 plsql.stap
Begin.
^C
evapls time:      2211412
pfrrun time:       804178

So, that’s very close to using this function with subquery factoring, just a bit longer (2192685). This strongly depends on what is actually done, so your mileage may vary.

While we are at it, let’s have a look at what this looks like when no PL/SQL is used at all, so:

select sum(id+1) from t2;

Here it is:

The function used for adding is now evaaddrset. From the size of the kdst_fetch function it can be seen that the addition takes far less time. Let’s measure it with a changed version of the systemtap script:

global evaaddrset_time, evaaddrset_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset") {
	if ( pid() == target() )
		evaaddrset_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset").return {
	if ( pid() == target() )
		evaaddrset_tot+=local_clock_us()-evaaddrset_time
}

probe end {
	printf("\nevaaddrset time: %12d\n", evaaddrset_tot)
}

This is what the output looks like:

# stap -x 92509 plsql.stap
Begin.
^C
evaaddrset time:        43389

A simple calculation shows that doing the addition natively in SQL only takes 43389/2211412*100=2% of the runtime of PL/SQL with pragma udf.

Whenever you use PL/SQL in SQL statements, the Oracle engine needs to switch from doing SQL to doing PL/SQL, and switch back after it is done. Generally, this is called a “context switch”. This is an example of that:

-- A function that uses PL/SQL 
create or replace function add_one( value number ) return number is
        l_value number(10):= value;
begin
        return l_value+1;
end;
/
-- A SQL statement that uses the PL/SQL function
select sum(add_one(id)) from t2;

Of course the functionality of the function is superfluous, it can easily be done in ‘pure’ SQL with ‘select sum(id+1) from t2’. But that is not the point.
Also, I added a sum() function, for the sake of preventing output to screen per row.

The first thing to check is whether there is a difference in performance between executing with sum(id+1) and sum(add_one(id)). If there isn’t, we can stop here 🙂

TS@frits > set timing on
TS@frits > select sum(id+1) from t2;

 SUM(ID+1)
----------
5.0000E+11

Elapsed: 00:00:00.19
TS@frits > select sum(add_one(id)) from t2;

SUM(ADD_ONE(ID))
----------------
      5.0000E+11

Elapsed: 00:00:01.13

Both statements execute a full table scan; I verified that the IO times were alike.
But it looks like there is a whopping difference between including PL/SQL and not: 113/19*100=595%, or differently worded: almost six times slower.

How does this work? Let’s have a look using stapflames. The idea behind flame graphs in general is to understand in which (C code) functions (user mode and kernel mode) the time is spent. Because of the full backtrace (all the called functions on top of each other), it gives insight into how a program is working.

The first thing we need to establish is what PL/SQL looks like from the perspective of C functions. For that reason, I created a bogus PL/SQL program to profile:

declare
t number:=0;
begin
while t < 1000000 loop
t:=t+1;
end loop;
end;
/

Yes, that is right: the only thing this anonymous PL/SQL block does is declare a number variable named ‘t’, and then loop, adding one to the variable until it reaches 1000000. Again, what this program does is neither interesting nor functional; the only thing it needs to do is run, so when we profile the program we are sure it is doing PL/SQL.

I ran this anonymous PL/SQL block using my stapflame utility to generate a flamegraph, and this is what that looks like:
I have taken the flamegraph of all time considered on cpu by the Oracle database.

First of all, one important property of flamegraphs is shown: the horizontal sequence does not represent execution order. If you look at the kpoal8 function, you see there are two different paths taken from this function: opiexe (oracle program interface execute) and opiosq0 (oracle program interface prepare to parse). Of course the PL/SQL block is first parsed and then executed, so the actual order is different from what is shown.

What is also very visible is that almost all execution time is spent in a function ‘pfrrun’, which seems to be the main function executing PL/SQL. On top of that we can see some functions which roughly resemble the functionality used in the PL/SQL block: pfrinstr_ADDN (addition, t:=t+1) and pfrinstr_RELBRNCH (the loop). This also gives a fair indication PL/SQL is interpreted rather than compiled. Anyway, what is important is that from looking at the little test above, we can distinguish running PL/SQL from SQL by the pfrrun function.

Now let’s look at at a flamegraph of running a PL/SQL function in the SQL statement:
The flamegraph shows all the time considered running on CPU for executing the statement ‘select sum(add_one(id)) from t2’. There is a lot to see!

When we look on top of the function kpoal8, we see the function opifch2. This means the vast majority of the time is spent in the fetch phase of the SQL statement. On top of the opifch2 function we see two functions which start with qer. ‘qer’ probably means Query Execute Rowsource. ‘qertbFetch’ is the fetch procedure for table scans. Then we see the kdsttgr function (kernel data scan table get row), and then the ultra fast table scan function (kdstf; kernel data scan table full) followed by a lot of zeros and/or ones and ending with ‘km’ or ‘kmP’. There are a lot of kdstf functions in the Oracle executable; I assume the zeros and ones after ‘kdstf’ are a bitmap of potentially needed functions during the full scan, so the full table scan function can omit a lot of choices at runtime, leading to better CPU efficiency. See an article by Tanel on what this means.

In the full table scan function, there are two main functions which consume time, kdst_fetch and qesaFastAggNonDistSS. ‘kdst_fetch’ and deeper functions are functions related to doing IO for reading the data from the data file. ‘qesaFastAggNonDistSS’ is the main function for actually processing the data. The function qesaFastAggNonDistSS has two main functions consuming its time, evaopn2 and a function visible as ‘sl..’, which is actually a function called slnxsum, in other words, the sum() function. The function evaopn2 is a function to evaluate operands. This evaluation is the path towards executing the PL/SQL function.

Actually, when carefully looking at the evaopn2 function, we see the slnxsum function, and ‘evapls’, which is the function to switch to PL/SQL. The two main functions in ‘evapls’ are kgmexec and opiomc. Again here the order is switched; what happens here is first a cursor must be mapped for executing PL/SQL (opiomc function), after which it can be executed (kgmexec function).

In order to understand what the time taken by “switching” to PL/SQL is, we can take the total time the query engine is processing everything PL/SQL related, which is the total time taken by the ‘evapls’ function, and measure the time actually running PL/SQL, which is the time taken by the ‘pfrrun’ function. This can be accomplished by a simple systemtap script:
(please mind you need a recent systemtap version, preferably gotten from https://sourceware.org/git/systemtap.git, and kernel version 3.10 to be able to use return probes)

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

This is what it looks like on my machine:

# stap -x 29680 plsql.stap
Begin.
^C
evapls time:      3203637
pfrrun time:       951830

So, the overhead or context switching time, which must be Oracle server code executing between the evapls function (where it determines it needs to execute PL/SQL) and the pfrrun function (the PL/SQL interpreter), is on my machine:
(1-951830/3203637)*100=70%
Wow!

One way of reducing this problem, is using subquery factoring, alias the ‘with clause’. To use the function that way, this is how the SQL should be written:

with
function add_one( value number ) return number is
	l_value number(10):= value;
begin
	return l_value+1;
end;
select sum(add_one(id)) from t2;
/

Let’s have a look at the flamegraph of this construction:
It becomes apparent that with subquery factoring, there are far fewer functions between the evapls and pfrrun functions.
Normal PLSQL: kgmexec, kkxmpexe, kkxdexe, peidxexe, peidxr_run, plsql_run
Subquery factoring: kkxmss_speedy_stub, peidxrex
Also note there is no code path for mapping a cursor.

Let’s have a look at the timing:

# stap -x 29680 plsql.stap
Begin.
^C
evapls time:      2192685
pfrrun time:       880230

The time spent in PL/SQL, measured by the total time spent in the evapls function, was reduced by 32% ((1-2192685/3203637)*100).
However, if you calculate the overhead, it is still pretty significant: (1-880230/2192685)*100=60%

Conclusion
The simplest conclusion I can make is: do not use PL/SQL if you can solve it in SQL, like in the example above. This does not mean you should never use PL/SQL; on the contrary: in a lot of cases processing should be done where the data is, and sometimes you need a procedural language for that.

I made a lot of assumptions in this little investigation. The function name translations (from the Oracle C function name to the functionality it is supposed to deliver) are speculations.

The context switch between SQL mode and PL/SQL mode looks like it is technically setting up the execution environment for PL/SQL. Indeed this takes time, and the true PL/SQL execution time when repeatedly executing PL/SQL is very low in my case. Please mind actual times will differ on different systems. However, the main conclusion is that using PL/SQL in SQL execution probably is not the most performant thing to do, including using subquery factoring.

There’s been a lot of work in the area of profiling. One of the things I have recently fallen in love with is Brendan Gregg’s flamegraphs. I work mainly on Linux, which means I use perf for generating stack traces. Luca Canali put a lot of effort into extended stack profiling methods, including kernel-only stack traces and CPU state, reading the wait interface via direct SGA access combined with kernel stack traces, and getting userspace stack traces using libunwind and ptrace plus kernel stack and CPU state. I was inspired by the last method, but wanted more information, like process CPU state including runqueue time.

I started playing around with systemtap, and was able to read a process’ CPU state including run queue time. This involves using the kernel scheduler tapset, which unfortunately needs the kernel debuginfo packages (kernel-uek-debuginfo and kernel-uek-debuginfo-common, available via oss.oracle.com). It is not hard to include wait interface information; this is work Luca and I collaborated on in the past. I created a systemtap script called cpu_and_wait_profile.stap, which shows the oracle database state transition between on cpu and in a wait, as well as kernel CPU state information. This is how it should be executed and what it outputs:

# stap -x 6641 cpu_and_wait_profile.stap
Begin.
w     - 388 (no begin)
c     1    	tot:         334	on:         333	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     2 384	tot:           5	on:           5	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
c     3    	tot:         644	on:         644	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     4 212	tot:          58	on:          41	off:          17	q:           5	ti:           9	tu:           0	#slices:    1
c     5    	tot:         371	on:         371	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     6 212	tot:         146	on:          58	off:          88	q:          14	ti:          69	tu:           0	#slices:    1
c     7    	tot:        1787	on:        1745	off:          42	q:          37	ti:           0	tu:           0	#slices:    2
w     8 212	tot:         265	on:          30	off:         234	q:          12	ti:         218	tu:           0	#slices:    1

The first column indicates if the process is inside an Oracle wait event (w), or is considered running on cpu (c) by the database.
The second column is a serial number. The third column is the wait event number if the process is considered inside a wait, or empty if on CPU.
The column ‘tot’ is the total time (in microseconds) spent on cpu or inside a wait event.
The column ‘on’ is the time spent truly running on CPU. Obviously, ‘off’ is all the time not spent running on the CPU.
The columns ‘q’, ‘ti’ and ‘tu’ are off CPU states. ‘q’ is time spent in the CPU runqueue. This is not a kernel state; a process gets the state ‘TASK_RUNNING’ to indicate it is willing to run, after which it is the task of the scheduler to manage the runnable processes and get them onto a CPU. ‘ti’ is a kernel state, which means ‘TASK_INTERRUPTIBLE’. This is a state in which the process is taken off the CPU, because it is waiting for something to complete. ‘Something’ means a disk IO, a timer to expire, etc. ‘tu’ means ‘TASK_UNINTERRUPTIBLE’, which is used if a process should only continue when a specific condition is met, and reacting to signals would be problematic.
The last column ‘#slices’ is the number of times the process has gotten on cpu.
If you look at the example output above, you see that the process started running, and remained running until sequence number 4 (the sequence number is the second column). Sequence number 4 is an Oracle wait event, number 212 (direct path read). The earlier wait event number 384 was passed without actually waiting; the total time is 5us, and on cpu was 5us too (!). The total time spent in the wait event in sequence #4 is 58us, of which 41us was spent on cpu, and 17us off cpu. The off cpu time is composed of 5us run queue time (q) and 9us ‘TASK_INTERRUPTIBLE’ time, which totals 14us and leaves 3us off cpu/unaccounted for. This is time taken by the state transitions and context switches. The actual sequence of events of the CPU state is: TASK_RUNNING (on cpu), then TASK_INTERRUPTIBLE is entered, which is actually waiting for IOs in this case (wait event ‘direct path read’, remember?). The ‘TASK_INTERRUPTIBLE’ state means the process is stopped from processing by the kernel (taken off cpu), which is logical, because it means the process is deliberately waiting for something before it can continue. Once the condition is met (IO(s) ready in this case), the process can continue. To continue, the process state is set to ‘TASK_RUNNING’, and it is put on a runqueue. This means there is no explicit process state ‘in run queue’. This state (state set to ‘TASK_RUNNING’ but not running on CPU yet) is shown with ‘q’. Once the process has a high enough priority, the scheduler puts the process back on a CPU.
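
To give an impression of how the on and off CPU time can be captured, below is a minimal sketch using the systemtap scheduler tapset. It assumes the scheduler.cpu_on and scheduler.cpu_off probes are available (which is why the kernel debuginfo packages mentioned above are needed); the full cpu_and_wait_profile.stap script does considerably more bookkeeping per Oracle wait event:

global on_ts, off_ts, on_tot, off_tot

probe scheduler.cpu_on {
	if ( pid() == target() ) {
		on_ts = local_clock_us()
		# the process was off cpu from off_ts until now
		if ( off_ts ) off_tot += on_ts - off_ts
	}
}
probe scheduler.cpu_off {
	if ( pid() == target() ) {
		off_ts = local_clock_us()
		# the process was on cpu from on_ts until now
		if ( on_ts ) on_tot += off_ts - on_ts
	}
}
probe end {
	printf("on cpu (us): %d, off cpu (us): %d\n", on_tot, off_tot)
}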

Okay, so at this point we have a (systemtap) script that can very precisely account the time spent by a process. Wouldn’t it be great if we could see a flame graph per sequence number? I spent a great deal of time trying to figure out a way to combine the two. Until I learned about the ‘-T’ switch of perf record:

    -T, --timestamp       Sample timestamps

Great!! The way this works, is that perf includes ‘timestamps’ during recording (perf record), which are printed when the perf recording is externalised with the ‘perf script’ command:

oracle_92213_fv 92213 34075.900988: cycles:
        ffffffff810483da native_write_msr_safe ([kernel.kallsyms])
        ffffffff8102bf91 intel_pmu_enable_all ([kernel.kallsyms])
        ffffffff810263cc x86_pmu_enable ([kernel.kallsyms])
        ffffffff811221db perf_pmu_enable ([kernel.kallsyms])
        ffffffff81124d09 perf_event_context_sched_in ([kernel.kallsyms])
        ffffffff811252c5 __perf_event_task_sched_in ([kernel.kallsyms])
        ffffffff810962ce finish_task_switch ([kernel.kallsyms])
        ffffffff8159f81d __schedule ([kernel.kallsyms])
        ffffffff8159fec9 schedule ([kernel.kallsyms])
        ffffffff8119e56b pipe_wait ([kernel.kallsyms])
        ffffffff8119f030 pipe_read ([kernel.kallsyms])
        ffffffff81195c37 do_aio_read ([kernel.kallsyms])
        ...etc...

‘34075.900988’ is the timestamp. However, what is this number?? I searched for quite some time, and there is no clear description to be found. It clearly is not epoch time.

Some experimentation showed that -apparently- the number is seconds since startup with microsecond granularity. Further experimentation using systemtap showed that exactly the same number can be fetched with the systemtap local_clock_us() function. This makes it possible to link perf stacktraces with systemtap output!! I created a script (run_stapflame.sh) that runs perf record -g and systemtap at the same time, then combines the information from both tools (meaning the systemtap runtime data is pushed into the stack trace information), after which flame graphs are created.
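
As a small illustration of the linking, a probe that prints local_clock_us() at the Oracle wait interface boundaries produces timestamps in the same clock domain as the timestamps in the ‘perf script’ output, so the two can be matched. This is a minimal sketch; kslwtbctx and kslwtectx are assumed to be the wait interface begin and end functions, based on how they are used elsewhere on this page:

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	if ( pid() == target() )
		printf("%d wait begin\n", local_clock_us())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	if ( pid() == target() )
		printf("%d wait end\n", local_clock_us())
}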

When a process is not running, there will be no perf stack traces, because there is no process for perf to take the stack trace from. So only when running on CPU (TASK_RUNNING state excluding run queue time) should there be perf data. Also, the systemtap times are accurately measured, but the stack traces of perf are sampled. This means it is missing data (by definition: sampling means you are going to look up something at a certain interval, which means you are not looking between the intervals). What I have done is extrapolate the perf samples found for an Oracle CPU or wait interval relative to the time in the measured interval. This means that if the time in the interval is 100us, and I get two collapsed stack traces with 1 and 3 samples, the extrapolated times will be: 1: 1/(1+3)*100us=25us, and 3: 3/(1+3)*100us=75us. This is not scientific, but it is meant to give an idea. Also, in order to show something useful in flame graphs, all the data needs to be based on the same data type, so I need to transform the number of stack traces to time.

I created a github project stapflame for my files.

First, you need to install the debuginfo kernel packages, as has been mentioned in this blogpost.
Then, you need to execute eventsname.sql in order to generate eventsname.sed, which is used to translate wait event numbers to wait event names. Wait event numbers change between Oracle database versions, and could potentially change after PSU apply. eventsname.sed must be in the same directory as the run_stapflame.sh script.
Then, you need to fetch stackcollapse-perf.pl and flamegraph.pl from Brendan Gregg’s github flamegraph repository. These need to be in the same directory as the run_stapflame.sh script too, and have the execute bit set.

Once the requirements are met, you can use the run_stapflame.sh script:

# ./run_stapflame.sh 123

The first argument must be the PID of an existing Oracle foreground process.
This will compile and run the systemtap script. Once both systemtap and perf are running, the following message is displayed:

profiling started, press enter to stop

Now execute what you want to be profiled in the Oracle session. Once you are done, return to the session where you ran run_stapflame.sh, and press enter to stop the profiling.
Depending on how much systemtap and perf information was generated, the script will process for some time (it is coded in bash; it’s not highly optimised, in fact it’s quite inefficient with larger sets(!)).
There will be some messages from perf indicating how many samples it captured:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.213 MB perf.data (~9286 samples) ]

And systemtap complaining about missing symbols for the kernel module it just created (for itself :-)):

No kallsyms or vmlinux with build-id 532f482ae316500802639d35de6c302fdb685afa was found
[stap_b6a486219fd483305d4991b483feb397_16565] with build id 532f482ae316500802639d35de6c302fdb685afa not found, continuing without symbols

The stapflames in the example mentioned here are of a simple ‘select count(*) from table’ in oracle, resulting in a direct path read.

This is the resulting flame graph of my original idea. In short, this does not look very useful in this case, and might never be useful once you get more than approximately 20-100 cpu-wait combinations.

However, other flame graphs are more useful; look at this flame graph, it’s about process CPU state (stack traces grouped by process CPU state, which means that Oracle events (ON CPU and the wait events) can be found for every CPU state). It’s nice to see that the kernel state TASK_RUNNING (which is on CPU; RUNQUEUE is specified independently) is 46.86% of the total time. Of this time, almost all time (40.35% of the total time) is in the Oracle state ‘ON CPU’. After TASK_RUNNING, TASK_INTERRUPTIBLE time is 49.56% of the total time. Almost all time in TASK_INTERRUPTIBLE has gone to the wait event direct path read. There is not a great deal of time spent in the run queue (1.74% of the total time). If you look up through the stacktraces in TASK_RUNNING and ON CPU, you see that of the 40.35% of running on CPU, 15.33% is spent executing the io_submit function. This means 15.33/40.35*100=37.99% of the time on CPU is spent on submitting IO requests. Also in TASK_RUNNING and ON CPU, 7.57% is spent in the function sxorchk, which is block checksumming (db_block_checksum; set to ‘typical’ by default).

Another interesting view is this flamegraph; this one is only the wait event ‘direct path read’. As you can see, most of the wait event is not spent running on CPU. That is quite logical for a wait event :). 96.03% of the time spent in the wait event ‘direct path read’ is in the ‘TASK_INTERRUPTIBLE’ state. 2.68% of the time in the wait event ‘direct path read’ is spent in the TASK_RUNNING on CPU state. Despite being inside wait event time, it’s very logical to have some time spent running on CPU, because you need to run on the CPU to reap IO requests.

I love to hear comments, additions, corrections or any other feedback!

Credits and other references:
# Brendan Gregg and his work on FlameGraphs.
# Luca Canali and his work on stack profiling, which inspired me to create this tool, and proofreading.
# Tanel Poder and his work on Oracle internals in general.
# Klaas-Jan Jongsma for proofreading.

Footnotes:
– The systemtap script contains the full path to the Oracle database executable in the userspace function probes. This obviously must be changed to reflect the path to the Oracle executable of the process the systemtap script is profiling. Alternatively, it can just list the executable name (“oracle”), but then the executable must be in $PATH.
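
For example, a wait interface probe written against the executable name only would look like the following sketch (it requires the oracle executable to be resolvable via $PATH of the user running systemtap):

probe process("oracle").function("kslwtbctx") {
	if ( pid() == target() )
		printf("=KSLWTBCTX\n")
}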

This is a question that I played with for a long time. There have been statements on logical IO performance (“Logical IO is x times faster than Physical IO”), but nobody could answer the question of what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name “session logical reads”. However, if you divide the total time a query took by the number of logical reads, the calculated logical IO time is too high, because that assumes all the time the query took was spent doing logical IO, which obviously is not the case: there is time spent on parsing, maybe physical IO, etc. Also, when doing that, you calculate an average, and averages are known to hide actual behaviour.

Luckily, with Red Hat Enterprise Linux and Oracle Linux version 7, kernel version 3.10 is the stock kernel version. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain event (the probed event) starts; a return probe is an action triggered when that event finishes. The Oracle UEK kernel version 3 at the time of writing is version 3.8, which does not support this.

My current knowledge is that consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind all Oracle kernel function name translations, like kcbgtcr (kernel cache buffers get consistent read), are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.
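
Before timing individual calls, a quick sanity check is to simply count executions of both functions during a statement and roughly compare the counts with the ‘consistent gets’ and ‘db block gets’ statistics of the session. The following is a minimal sketch; the path to the oracle executable and the kcbgcur function name follow the assumptions above:

global cr_gets, cur_gets

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	# consistent mode block get
	if ( pid() == target() ) cr_gets++
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgcur") {
	# current mode block get
	if ( pid() == target() ) cur_gets++
}
probe end {
	printf("kcbgtcr: %d, kcbgcur: %d\n", cr_gets, cur_gets)
}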

With the 3.10 version kernel, we can create a small systemtap script to measure the time between the start and the end of the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself. This is the reason you see /usr/local/bin/stap in the shebang.

#!/usr/local/bin/stap

global latency

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])
}

Now make the script executable, and run it against a database session:

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

...
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 79542
>kcbgtcr
<kcbgtcr, latency(us): 4
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 13
>kcbgtcr
<kcbgtcr, latency(us): 1
...

This shows the function being executed by the database session. However, it seems the time spent in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

#!/usr/local/bin/stap
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])
}

probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")
		println(@hist_log(latency_histogram))
	}
}

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:

...

==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority is 2 microseconds, but there are a lot of executions that fall totally outside the 1 and 2 microsecond buckets, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am very much turning the squelch down and making a lot of information visible (this is a warning!). Here is the systemtap script I am using:

#!/usr/local/bin/stap
#
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
	}
	printf("\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiosq0") {
	printf("=PARSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiexe") {
	printf("=EXECUTE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opifch2") {
	printf("=FETCH\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiclo") {
	printf("=CLOSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	printf("=KSLWTBCTX\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	printf("=KSLWTECTX\n")
}
probe process("/lib64/libaio.so.1").function("io_submit") {
	printf("=IO_SUBMIT\n")
}
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
	printf("=IO_GETEVENTS\n")
}
probe process("/lib64/libc.so.6").function("pread64") {
	printf("=PREAD\n")
}

Warning! This script will take a long time to compile into a kernel module!! Systemtap works by dynamically creating a kernel module from the systemtap script, and inserting it into the kernel. Because I injected a huge chunk of code to run in the kernel, a lot of things are potentially influenced. I found the Oracle database dropping cores, for example.

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgtcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<” means returning from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is “< kdst_fetch”, in other words: returning from kernel data scan table fetch. So this function has just performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further we encounter kcbgtcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblgt(), requests a block. This means that blocks read in this code path go into PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process. While previously we saw kcbgtcr() take 2us, the kcbgtcr function now took 62us, probably because of all the probes.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take excessively more time?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
=IO_SUBMIT
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
=IO_GETEVENTS
=IO_GETEVENTS
169481 oracle_5487_tes(5487):              > kcflwi
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTBCTX
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTECTX
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, up to line 17. After kcbgtcr > kcbldrget (the consistent read request function choosing direct path reads), the kcblgt function does not return immediately, but instead kicks off a lot of additional code path.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and the physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it: if you want to read blocks, you first have to check whether they are available, and if they are not, you have to fetch them.

As a conclusion: if kcbgtcr(), together with kcbgcur() and a couple of other functions, is considered the logical IO, then a logical IO has a variable time, instead of a consistent one. Because the physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means that it is technically incorrect to state that a physical IO is slower than a logical IO, because the physical IO is part of the logical IO request that needed it, and as such a physical IO cannot be slower than a logical IO.
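
To make that variable logical IO time visible, the latency of kcbgtcr() itself could be measured directly. The following is a minimal sketch only: it assumes a systemtap and kernel combination that supports userspace entry and return probes, that the kcbgtcr symbol is resolvable in the oracle executable, and the oracle binary path is just an example.

global kcbgtcr_start

probe process("/u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle").function("kcbgtcr") {
	# example path to the oracle binary; record the entry time per thread
	kcbgtcr_start[tid()] = gettimeofday_us()
}

probe process("/u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle").function("kcbgtcr").return {
	# only report if the corresponding entry was seen
	if ( kcbgtcr_start[tid()] > 0 ) {
		printf("kcbgtcr latency: %d us\n", gettimeofday_us() - kcbgtcr_start[tid()])
		delete kcbgtcr_start[tid()]
	}
}

A logical IO that is satisfied without physical IO should report in the low microseconds, while one that needs to fetch blocks will show times like the 10295 us visible at the kcbgtcr return in the trace above.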

Some time back, I investigated the options for profiling processes in Linux. One of the things I looked at was systemtap. After careful investigation I came to the conclusion that systemtap was not really useful for my purposes, because it only worked in kernelspace and had only very limited userspace support. The userspace limitation was that you had to define your own markers in the source code of the program you wanted to profile and compile them in. Since my investigations are mostly around Oracle products, which are closed source, this doesn’t help me at all.

Some time ago, Frank Eigler responded to an article I posted on my blog about using gdb (GNU debugger) for userspace profiling, indicating that systemtap could do userspace function profiling too. I was quite shocked, because I had carefully investigated that option, and had come to the conclusion that exactly this did not work. After some communication about this, the conclusion was that this indeed does NOT work with the version of systemtap that is included with current versions of RHEL (and therefore Oracle Linux), but that the current source version of systemtap does include userspace ‘probing’.

But that is not all… in order for systemtap to do userspace probing, the kernel needs userspace ‘trace hooks’. With the current stock kernels these are only available if the kernel source is patched with the ‘utrace patch’, the option is enabled, and the kernel is compiled. That means a custom compiled kernel. By itself a custom compiled kernel is fine, but in many environments where you work with closed source products, those products are certified against stock kernels, and supported only on stock kernels. From a support point of view I very much understand this, and from my viewpoint as a consultant too. To put it in a different way: encountering an environment where people compile their own kernel on Linux raises an enormous red flag for me.

But there is good news. Since Linux kernel version 3.5, userspace probing support is included in the kernel, which means no patch against the kernel source is needed to be able to profile in userspace. If you take a look at the kernels Oracle provides (for Red Hat: I am sorry, I know of no way to obtain RHEL online for free for testing, which for me rules out using it; I know about the merger with CentOS, but haven’t looked at whether that makes it attractive for me again), we can see that Oracle provides UEK (2.6.32), UEK2 (2.6.39) and UEK3 (3.8.13). Yes! That means that I can hook up a yum repo and install a kernel that allows userspace probing!
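
A quick way to check whether a running kernel was built with this userspace probing support is to look at its build configuration. The exact config option names differ a bit between kernel versions, so treat this as a sketch:

# userspace probing (uprobes) went into the mainline kernel in 3.5
grep -i uprobe /boot/config-$(uname -r)
# on a kernel that supports it, expect output along the lines of:
# CONFIG_UPROBES=y
# CONFIG_UPROBE_EVENT=y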

I installed a test machine with Oracle Linux 6.5, installed the UEK3 kernel, and installed systemtap. When testing the primary desired functionality (profiling userland functions without debug symbols), I encountered this problem:

[root@ol6-uekbeta ~]# /usr/bin/stap -e 'probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }'
WARNING: cannot find module /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv debuginfo: No DWARF information found [man warning::debuginfo]
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }
                      ^

semantic error: no match
Pass 2: analysis failed.  [man error::pass2]

This strongly suggests that this version of systemtap does not understand the ‘process’ probe, which is exactly what Frank warned about. So. Is this the end of the journey? No!

Userland function probing is documented on the systemtap website, which means it should be available. Let’s clone the systemtap source and build systemtap ourselves. This has a few implications. For starters, it rules out using systemtap for userland functions on “real” systems. With “real” I mean systems that serve a function, and need to be supported and stable. On this kind of system no beta or preview software can or should be installed, no matter how much we want or need it. But for an investigation system where we can mimic one of the most desired functions of dtrace, this is fine!

So. I have got an x86_64 Oracle Linux 6.5 installation (default install, with the meta-rpm oracle-rdbms-server-11gR2-preinstall.x86_64 installed), installed the UEK3 kernel on it (using the UEKR3 repo on Oracle Linux public yum), added the git version control executables using ‘yum install git’, and then cloned the systemtap git repository using ‘git clone git://sourceware.org/git/systemtap.git’. What needs to be done next is to compile and install it. This can be done in a quite standardised way:

./configure
make
make install
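
Please mind configure will stop if the usual build dependencies are not present; on a fresh Oracle Linux 6 installation I would expect something along the following lines to be needed first (the package names are an assumption on my side, not part of the original setup):

yum install -y gcc gcc-c++ elfutils-devel gettext-devel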

If all goes well, you end up with the latest version of systemtap (version 2.5/0.152), which should be able to do userspace probing, and a kernel capable of providing the information for userspace probing.
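
To verify the freshly built systemtap actually works before pointing it at userspace functions, a simple smoke test could look like this (the begin probe runs as soon as the script starts and does not need any userspace support):

# show the version of the stap executable that is now first in the PATH
stap -V
# trivial test: the begin probe fires immediately, prints, and exits
stap -e 'probe begin { printf("systemtap works\n"); exit() }'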

Now let’s test this, and create a systemtap script to profile where dbv (db verify) spends its time, just by running it:
(please mind this is a proof-of-concept script; any additions or remarks are welcome!)

global time, function_times, prev_func, function_count

probe begin {
	printf("Begin.\n");
	time=0
	prev_func="begin"
}

# fires on entry of every function in the dbv executable; the time since the previous
# function entry is attributed to the previously entered function
probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") {
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	time=gettimeofday_us()
	prev_func=probefunc()
}

# on ctrl-c: account the last function, drop a compiler-generated helper from the report, and print the totals
probe end {
	printf("End.\n")
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	delete function_times["__do_global_dtors_aux"]
	printf("Function\t\ttime (us)\tcount\tavg (us)\n")
	foreach( tm = [ fn ] in function_times+ ) {
		printf("%s: \t\t%d\t\t%d\t%d\n", fn, tm, function_count[fn],tm/function_count[fn])
		tot_time += tm
	}
	printf("Total time: %d\n", tot_time)
}
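
As a usage sketch (the script filename is just my choice, and the dbv path must match the one in the process() probe above):

# session 1, as root: start the profiler and wait for it to print "Begin."
stap dbv_profile.stp
# session 2, as the oracle user: simply run dbv
/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv
# session 1: press ctrl-c to end the stap session and get the "End." report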

The systemtap script is run from one (root) session, and dbv from another session. Please mind to wait with running dbv until the systemtap session notifies you it is ready by printing “Begin.”. This is the result:

Function		time (us)	count	avg (us)
frame_dummy: 		3		1	3
lxplget: 		3		1	3
lxpsset: 		3		1	3
call_gmon_start: 		4		1	4
lxplset: 		4		1	4
lxpcset: 		4		1	4
lxptget: 		4		1	4
lxptset: 		4		1	4
lxhLaToId: 		5		1	5
kudbvcCreate: 		5		1	5
_fini: 		6		1	6
__do_global_ctors_aux: 		7		1	7
lxldini: 		7		1	7
lxhenvquery: 		7		1	7
kudbvhlp: 		7		1	7
lxldlbb: 		8		2	4
lxldLoadBoot: 		8		2	4
lxpname: 		12		3	4
kudbvcCreateMsg: 		12		1	12
lxlfOpen: 		13		4	3
lmsapop: 		13		2	6
lxldLoadObject: 		14		4	3
lxpdload: 		14		2	7
lxldlod: 		15		4	3
lxladjobj: 		15		4	3
lxlchkobj: 		15		4	3
__libc_csu_init: 		16		1	16
lxlgsz: 		16		4	4
lxfgnb: 		20		2	10
lxoCnvCase: 		22		2	11
lxhLangEnv: 		24		3	8
_init: 		27		1	27
lxpe2i: 		31		9	3
slmsbfn: 		31		2	15
lxdlobj: 		34		4	8
lxmopen: 		36		5	7
lxlfrd: 		40		4	10
_start: 		41		1	41
lmsagb1: 		46		14	3
lxhchtoid: 		47		6	7
lmsapts: 		47		14	3
lxpcget: 		48		7	6
lxgratio: 		48		14	3
slxldgnv: 		49		11	4
lmsapsb: 		49		14	3
lmsagbcmt: 		50		14	3
lmsapsc: 		50		14	3
lmsapnm: 		51		14	3
lxldalc: 		54		6	9
main: 		63		1	63
kudbvmal: 		63		1	63
lmsaprb: 		67		7	9
kudbvexit: 		68		1	68
lmsapfc: 		71		7	10
slxcfct: 		72		5	14
lxpmclo: 		81		13	6
slmscl: 		88		1	88
slxdfsync: 		91		1	91
lmsapic: 		91		7	13
lxhci2h: 		97		28	3
lxpendian: 		107		13	8
kudbvcml: 		116		1	116
lxgu2t: 		119		16	7
lmsagbf: 		120		14	8
kudbvmai: 		151		1	151
lxdgetobj: 		225		44	5
lxinitc: 		247		6	41
kudbvcpf: 		254		27	9
slmsrd: 		256		9	28
lxhh2ci: 		350		34	10
slxcfot: 		514		5	102
lxlinit: 		688		6	114
kudbvini: 		798		1	798
slmsop: 		1005		2	502
kudbvvpf: 		4102		27	151
Total time: 10993

Of course the result itself is not very useful. The time spent in dbv is measured at 10,993 microseconds (us). The function the most time was spent in is kudbvvpf(), at 4102 us, but that function was executed 27 times, which makes the time per execution 151 us. The function with the longest single execution is kudbvini(), at 798 us.
