Archive

Tag Archives: profile

During investigating I ran once again into statistics in the Oracle database that still provide a useful details, but the actual naming of the statistic is describing a situation that in reality does not exist anymore. The statistics I am talking about are ‘calls to kcmgcs’, ‘calls to kcmgrs’, ‘calls to kcmgas’ and ‘calls to get snapshot scn: kcmgss’.

Disclaimer: this is research. Any of these techniques potentially can crash your instance or leave your database in a corrupted state. Test the techniques used in this article severely before applying it in an actual situation. Use at your own risk.

Back to the ‘calls to’ statistics. To see what I mean here, you can look up the functions in symbol table in the Oracle executable. There are several ways to do that, one way is using gdb:

(gdb) info function ^kcmg
All functions matching regular expression "^kcmg":

Non-debugging symbols:
0x0000000002d12410  kcmgtsUTC
0x0000000002d12be0  kcmgts
0x0000000002d139a0  kcmgtsRSL
0x000000000cc924b0  kcmglscn
0x000000000cc92560  kcmgtsf

See? There are no functions kcmgcs, kcmgrs, kcmgas and kcmgss. Are these statistics of any use then? I think so, although the only really relevant statistic probably is ‘calls to kcmgas’ to indicate an approximate count of the number of calls for a new SCN (gas probably means ‘get and advance’). The meaning of them is documented (at least up to version 12.1) in the online Oracle documentation, Statistics Description:
– ‘calls to get snapshot scn: kcmgss’: Number of times a snapshot system change number (SCN) was allocated. The SCN is allocated at the start of a transaction.
– ‘calls to kcmgas’: Number of calls to routine kcmgas to get a new SCN
– ‘calls to kcmgcs’: Number of calls to routine kcmgcs to get a current SCN
– ‘calls to kcmgrs’: Number of calls to routine kcsgrs to get a recent SCN
(I think ‘routine kcsgrs’ is a typo or a partial improvement and should be either kcmgrs to stay in line with the other (wrong) descriptions, or it should be kcsgrsn (added ‘n’) to be correct)

Would there be a way to find out how this now works? We can’t work from the function names, since I have just proven the functions do not exist anymore (I have no sufficient old enough database to prove this actually, feel free to test this on your old 7/8i/9i/10g database!). However, we can work this out when starting from the other end as seen from what the functions are supposed to do. The functions are supposed to provide a SCN for the session to work with. Since we know the source of SCN values in the SGA (kcsgscn, the global SCN), we can watch which functions do call this variable for a new SCN value or a reference value!

Obtain the address of kcsgscn:

SQL> select ksmfsnam, ksmfsadr from x$ksmfsv
2  where ksmfsnam = 'kcsgscn_';

KSMFSNAM                                                         KSMFSADR
---------------------------------------------------------------- ----------------
kcsgscn_                                                         000000006001FBB0

Now put a access watchpoint on the kcsgscn address for a session using gdb:

$ gdb -p PID
...
(gdb) awatch *0x6001fbb0
Hardware access (read/write) watchpoint 1: *0x6001fbb0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now let’s test something very simple, like a ‘select * from dual’:

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 331907
New value = 332367
0x000000000cca0f74 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0423 in kcscur3 ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0f74 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0423 in kcscur3 ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()

That’s my session accessing address 0x6001fbb0, alias the global SCN, eight times.
The functions that access 0x6001fbbo in this case are kcsgssn, kcscur3 and kcsgcsn.

When we look at the session statistics of the above executed SQL (please mind there is a difference between the first execution and subsequent executions), and focus on the ‘calls to’ statistics this is what this shows (partial output from Tanel’s snapper):

130, SYS       , STAT, calls to kcmgcs                                           ,             4,
130, SYS       , STAT, calls to get snapshot scn: kcmgss                         ,             2,

If carefully combine the snapper output with the function names the watchpoint gives, it becomes apparent that the amount of the statistic with kcmgcs is equal to the times the kcsgcsn function was called, and the amount of the statistic with kcmgss is equal to the times the kcsgssn function was called.

This is only a very simple example, when you try different SQL’s, especially DML, you will notice a lot of calls to the global SCN are made, because the metadata is protected by SCN’s too. I assume Oracle kept the statistics names the same for the sake of not breaking products and scripts using these statistic names. However in reality the functions using the global SCN are different from how the database truly works.

There’s been a lot of work in the area of profiling. One of the things I have recently fallen in love with is Brendan Gregg’s flamegraphs. I work mainly on Linux, which means I use perf for generating stack traces. Luca Canali put a lot of effort in generating extended stack profiling methods, including kernel (only) stack traces and CPU state, reading the wait interface via direct SGA reading and kernel stack traces and getting userspace stack traces using libunwind and ptrace plus kernel stack and CPU state. I was inspired by the last method, but wanted more information, like process CPU state including runqueue time.

I started playing around with systemtap, and was able to read a process’ CPU state including run queue time. This involves using kernel tapset scheduler, which unfortunately needs the kernel debug info packages (kernel-euk-debuginfo and kernel-uek-debuginfo-common, available via oss.oracle.com). It is not hard to include wait interface information, this is work Luca and I collaborated on in the past. I created a systemtap script called cpu_and_wait_profile.stap, which shows the oracle database state transition between on cpu and in a wait, as well as kernel CPU state information. This is how that should be executed and what it outputs:

# stap -x 6641 cpu_and_wait_profile.stap
Begin.
w     - 388 (no begin)
c     1    	tot:         334	on:         333	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     2 384	tot:           5	on:           5	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
c     3    	tot:         644	on:         644	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     4 212	tot:          58	on:          41	off:          17	q:           5	ti:           9	tu:           0	#slices:    1
c     5    	tot:         371	on:         371	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     6 212	tot:         146	on:          58	off:          88	q:          14	ti:          69	tu:           0	#slices:    1
c     7    	tot:        1787	on:        1745	off:          42	q:          37	ti:           0	tu:           0	#slices:    2
w     8 212	tot:         265	on:          30	off:         234	q:          12	ti:         218	tu:           0	#slices:    1

The first column indicates if the process is inside an Oracle wait event (w), or is considered running on cpu (c) by the database.
The second column is a serial number. The third column is the wait event number if the process is considered inside a wait, or empty if on CPU.
The column ‘tot’ is the total time (in microseconds) spent on cpu or inside a wait event.
The column ‘on’ is the time spent truly running on CPU. Obviously, ‘off’ is all the time not spent running on the CPU.
The columns ‘q’, ‘ti’ and ‘tu’ are off CPU states. ‘q’ is time spend in the CPU runqueue. This is not a kernel state, a process gets the state ‘TASK_RUNNING’ to indicate it is willing to run, after which it’s the task of the scheduler to manage willing to run processes and get them onto a CPU. ‘ti’ is a kernel state, which means ‘TASK_INTERRUPTABLE’. This is a state after which the process is taken off the CPU, because it is waiting for something to complete. ‘Something’ means a disk IO, a timer to expire, etc. ‘tu’ means ‘TASK_UNINTERRUPTIBLE’, which is used if a process should only continue when a specific condition is met, and reacting to signals would be problematic.
The last column ‘#slices’ is the number of times the process has gotten on cpu.
If you look at the example output above, you see that the process started running, and remained running until sequence number 4 (sequence number is the second column). Sequence number 4 is an Oracle wait event, number 212 (direct path read). The earlier wait event number 384 was passed without actually waiting; total time is 5us, on cpu was 5us too (!). The total time spent in the wait event in sequence #4 is 58us, of which 41us was spent on cpu, and 17us off cpu. The off cpu time is composited of 5us run queue time (q) and 9us ‘TASK_INTERRUPTIBLE’ time, of which the total is 14us, which leaves 3us off cpu/unaccounted for. This is time taken by the state transitions and context switches. The actual sequence of events of the CPU state is: TASK_RUNNING (on cpu), then TASK_INTERRUPTIBLE is entered, which is actually waiting for IOs in this case (wait event ‘direct path read’, remember?). The ‘TASK_INTERRUPTIBLE’ state means the process is stopped from processing by the kernel (taken off cpu), which is logical, because it means the process is deliberately waiting for something before it can continue. Once the condition is met (IO(s) ready in this case), the process can continue. To continue, the process state is set to ‘TASK_RUNNING’, and put on a runqueue. This means there is no explicit process state ‘in run queue’. This state (state set to ‘TASK_RUNNING’ but not running on CPU yet) is shown with ‘q’. Once the process has enough priorities, the scheduler switches the process running on the CPU again.

Okay, so at this point we have a (systemtap) script that can very precisely count the time spend of a process. Wouldn’t it be great if we can see a flame graph per sequence number? I spent a great deal of time trying to figure out a way to combine the two. Until I learned about the ‘-T’ switch of perf record:

    -T, --timestamp       Sample timestamps

Great!! The way this works, is that perf includes ‘timestamps’ during recording (perf record), which are printed when the perf recording is externalised with the ‘perf script’ command:

oracle_92213_fv 92213 34075.900988: cycles:
        ffffffff810483da native_write_msr_safe ([kernel.kallsyms])
        ffffffff8102bf91 intel_pmu_enable_all ([kernel.kallsyms])
        ffffffff810263cc x86_pmu_enable ([kernel.kallsyms])
        ffffffff811221db perf_pmu_enable ([kernel.kallsyms])
        ffffffff81124d09 perf_event_context_sched_in ([kernel.kallsyms])
        ffffffff811252c5 __perf_event_task_sched_in ([kernel.kallsyms])
        ffffffff810962ce finish_task_switch ([kernel.kallsyms])
        ffffffff8159f81d __schedule ([kernel.kallsyms])
        ffffffff8159fec9 schedule ([kernel.kallsyms])
        ffffffff8119e56b pipe_wait ([kernel.kallsyms])
        ffffffff8119f030 pipe_read ([kernel.kallsyms])
        ffffffff81195c37 do_aio_read ([kernel.kallsyms])
        ...etc...

‘34075.900988’ is the timestamp. However, what is this number?? I searched for quite some time, and there is no clear description to be found. It clearly is not epoch time.

Some experimentation learned that -apparently- the number is seconds since startup with microsecond granularity. Further experimentation using systemtap learned that exactly the same number can be fetched with the systemtap local_clock_us() function. This makes it possible to link perf stacktraces with systemtap output!! I created a script (run_stapflame.sh) that runs perf record -g and systemtap at the same time, then combines the information from both tools (meaning the systemtap runtime data is pushed into the stack trace information), after which flame graphs are created.

When a process is not running, there will be no perf stack traces, because there is no process for perf to take the stack trace from. So only when running on CPU (TASK_RUNNING state excluding run queue time), there should be perf data. Also, the systemtap times are accurately measured, but the stack traces of perf are sampled. This means it is missing data (by definition: sampling means you are going to lookup something at a certain interval, which means you are not looking between the intervals). What I have done, is extrapolate the perf samples found for an Oracle CPU or wait interval relative to the time in the measured interval. This means that if the time in the interval is 100us, and I get two collapsed stack traces with 1 and 3 samples, the extrapolated time will be; 1: 1/(1+3)*100us=25us, and 3: 3/(1+3)*100us=75us. This is not scientific, but it is meant to give an idea. Also, in order to show something useful in flame graphs, all the data needs to be based on the same data type, so I need to transform the number of stack traces to time.

I created a github project stapflame for my files.

First, you need to install the debuginfo kernel packages, as has been mentioned in this blogpost.
Then, you need to execute eventsname.sql in order to generate eventsname.sed, which is used to translate wait event numbers to wait event names. Wait event numbers change between Oracle database versions, and could potentially change after PSU apply. eventsname.sed must be in the same directory as the run_stapflame.sh script.
Then, you need to fetch stackcollapse-perf.pl and flamegraph.pl from Brendan Gregg’s github flamegraph repository. These need to be in the same directory as the run_stapflame.sh script too, and have the execute bit set.

Once the requirements are met, you can use the run_stapflame.sh script:

# ./run_stapflame.sh 123

The first argument must the PID of an existing Oracle foreground process.
This will compile and run the systemtap script. Once both systemtap and perf are running, the following message is displayed:

profiling started, press enter to stop

Now execute what you want to be profiled in the Oracle session. Once you are done, return to the session where you ran run_stapflame.sh, and press enter to stop the profiling.
Depending on how much systemtap and perf information this generated, the script will process for some time (it is coded in bash script, it’s not highly optimised, in fact it’s quite inefficient with larger sets(!)).
There will be some messages from perf indicating how many samples it captured:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.213 MB perf.data (~9286 samples) ]

And systemtap complaining about missing symbols for the kernel module it just created (for itself :-)):

No kallsyms or vmlinux with build-id 532f482ae316500802639d35de6c302fdb685afa was found
[stap_b6a486219fd483305d4991b483feb397_16565] with build id 532f482ae316500802639d35de6c302fdb685afa not found, continuing without symbols

The stapflames in the example mentioned here are of a simple ‘select count(*) from table’ in oracle, resulting in a direct path read.

This is the resulting flame graph of my original idea. In short, this does not look very useful in this case, and might never be useful as soon as you got more than approximately 20-100 cpu-wait combinations.

However, other flame graphs are more useful; look at this flame graph, it’s about process CPU state (stack traces grouped by process CPU state, which means that Oracle events (ON CPU and the wait events) can be found for every CPU state). It’s nice to see that in the kernel state TASK_RUNNING (which is on CPU, RUNQUEUE is specified independently) is 46.86% of the total time. Of this time, almost all time (40.35% of the total time) is in the Oracle state ‘ON CPU’. After TASK_RUNNING, TASK_INTERRUPTIBLE time is 49.56% of the total time. Almost all time in TASK_INTERRUPTIBLE has gone to the wait event direct path read. There is not a great deal of time spend in the run queue (1.74% of the total time). If you look up through the stacktraces in TASK_RUNNING and ON CPU, you see that of the 40.35% of running on CPU, 15.33% is spend on executing the io_submit function. This means 15.33/40.35*100=37.99% of the time on CPU is spend on submitting IO requests. Also in TASK_RUNNING and ON CPU, 7.57% is spend in the function sxorchk, which is block checksumming (db_block_checksum; set to ‘typical’ by default).

Another interesting view is this flamegraph; this one is only the wait event ‘direct path read’. As you can see, most of the wait event is not spent running on CPU. That is quite logical for a wait event :). 96.03% of the time spent in the wait event ‘direct path read’ is in the ‘TASK_INTERRUPTIBLE’ state. 2.68% of the time in the wait event ‘direct path read’ is spend in TASK_RUNNING on CPU state. Despite being inside wait event time, it’s very logical to have some time spend on running on CPU, because you need to run on the CPU to reap IO requests.

I love to hear comments, additions, corrections or any other feedback!

Credits and other references:
# Brendan Gregg and his work on FlameGraphs.
# Luca Canali and his work on stack profiling, which inspired me to create this tool, and proofreading.
# Tanel Poder and his work on Oracle internals in general.
# Klaas-Jan Jongsma for proofreading.

Footnotes:
– The systemtap script contains the full path to the Oracle database executable in the userspace function probes. This obviously must be changed to reflect the path to the Oracle executable of the process the systemtap script is profiling. Alternatively, it can just list the executable name (“oracle”), but then it the executable must be in the $PATH.

%d bloggers like this: