There’s been a lot of work in the area of profiling. One of the things I have recently fallen in love with is Brendan Gregg’s flame graphs. I work mainly on Linux, which means I use perf for generating stack traces. Luca Canali put a lot of effort into generating extended stack profiling methods, including kernel-only stack traces and CPU state, reading the wait interface via direct SGA access combined with kernel stack traces, and userspace stack traces using libunwind and ptrace plus kernel stack and CPU state. I was inspired by the last method, but wanted more information, like process CPU state including run queue time.

I started playing around with systemtap, and was able to read a process’ CPU state, including run queue time. This involves using the kernel scheduler tapset, which unfortunately needs the kernel debuginfo packages (kernel-uek-debuginfo and kernel-uek-debuginfo-common, available via oss.oracle.com). It is not hard to include wait interface information; this is work Luca and I collaborated on in the past. I created a systemtap script called cpu_and_wait_profile.stap, which shows the Oracle database state transitions between being on CPU and being in a wait, as well as kernel CPU state information. This is how it is executed and what it outputs:

# stap -x 6641 cpu_and_wait_profile.stap
Begin.
w     - 388 (no begin)
c     1    	tot:         334	on:         333	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     2 384	tot:           5	on:           5	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
c     3    	tot:         644	on:         644	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     4 212	tot:          58	on:          41	off:          17	q:           5	ti:           9	tu:           0	#slices:    1
c     5    	tot:         371	on:         371	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     6 212	tot:         146	on:          58	off:          88	q:          14	ti:          69	tu:           0	#slices:    1
c     7    	tot:        1787	on:        1745	off:          42	q:          37	ti:           0	tu:           0	#slices:    2
w     8 212	tot:         265	on:          30	off:         234	q:          12	ti:         218	tu:           0	#slices:    1

The first column indicates if the process is inside an Oracle wait event (w), or is considered running on cpu (c) by the database.
The second column is a serial number. The third column is the wait event number if the process is considered inside a wait, or empty if on CPU.
The column ‘tot’ is the total time (in microseconds) spent on cpu or inside a wait event.
The column ‘on’ is the time spent truly running on CPU. Obviously, ‘off’ is all the time not spent running on the CPU.
The columns ‘q’, ‘ti’ and ‘tu’ are off CPU states. ‘q’ is time spent in the CPU run queue. This is not a kernel state; a process gets the state ‘TASK_RUNNING’ to indicate it is willing to run, after which it is the scheduler’s task to manage the runnable processes and get them onto a CPU. ‘ti’ is a kernel state, which means ‘TASK_INTERRUPTIBLE’. In this state the process is taken off the CPU, because it is waiting for something to complete. ‘Something’ means a disk IO, a timer to expire, etc. ‘tu’ means ‘TASK_UNINTERRUPTIBLE’, which is used if a process should only continue when a specific condition is met, and reacting to signals would be problematic.
The last column ‘#slices’ is the number of times the process has gotten on cpu.
If you look at the example output above, you see that the process started running, and remained running until sequence number 4 (the sequence number is the second column). Sequence number 4 is an Oracle wait event, number 212 (direct path read). The earlier wait event number 384 was passed without actually waiting; the total time is 5us, and the on cpu time was 5us too (!). The total time spent in the wait event in sequence #4 is 58us, of which 41us was spent on cpu, and 17us off cpu. The off cpu time is composed of 5us run queue time (q) and 9us ‘TASK_INTERRUPTIBLE’ time, which totals 14us, leaving 3us off cpu/unaccounted for. This is time taken by the state transitions and context switches. The actual sequence of events of the CPU state is: TASK_RUNNING (on cpu), then TASK_INTERRUPTIBLE is entered, which is actually waiting for IOs in this case (wait event ‘direct path read’, remember?). The ‘TASK_INTERRUPTIBLE’ state means the process is stopped from processing by the kernel (taken off cpu), which is logical, because it means the process is deliberately waiting for something before it can continue. Once the condition is met (IO(s) ready in this case), the process can continue. To continue, the process state is set to ‘TASK_RUNNING’, and it is put on a run queue. This means there is no explicit process state ‘in run queue’. This state (state set to ‘TASK_RUNNING’ but not running on a CPU yet) is shown with ‘q’. Once the process has a high enough priority, the scheduler switches the process to running on the CPU again.
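To make the run queue (‘q’) measurement more concrete, below is a minimal sketch of the idea using the scheduler tapset. This is only an illustration, not the actual cpu_and_wait_profile.stap implementation; it assumes the scheduler.wakeup and scheduler.cpu_on probes (which require the kernel debuginfo packages mentioned above), and is run with ‘stap -x <pid>’:

global wakeup_time

# the task is made runnable (state set to TASK_RUNNING) and put on a run queue
probe scheduler.wakeup {
	if (task_pid == target())
		wakeup_time = local_clock_us()
}

# the task starts executing on a CPU; the difference is the run queue time
probe scheduler.cpu_on {
	if (pid() == target() && wakeup_time > 0) {
		printf("run queue time: %d us\n", local_clock_us() - wakeup_time)
		wakeup_time = 0
	}
}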

Okay, so at this point we have a (systemtap) script that can very precisely account the time spent by a process. Wouldn’t it be great if we could see a flame graph per sequence number? I spent a great deal of time trying to figure out a way to combine the two. Until I learned about the ‘-T’ switch of perf record:

    -T, --timestamp       Sample timestamps

Great!! The way this works is that perf includes timestamps during recording (perf record), which are printed when the recording is externalised with the ‘perf script’ command:

oracle_92213_fv 92213 34075.900988: cycles:
        ffffffff810483da native_write_msr_safe ([kernel.kallsyms])
        ffffffff8102bf91 intel_pmu_enable_all ([kernel.kallsyms])
        ffffffff810263cc x86_pmu_enable ([kernel.kallsyms])
        ffffffff811221db perf_pmu_enable ([kernel.kallsyms])
        ffffffff81124d09 perf_event_context_sched_in ([kernel.kallsyms])
        ffffffff811252c5 __perf_event_task_sched_in ([kernel.kallsyms])
        ffffffff810962ce finish_task_switch ([kernel.kallsyms])
        ffffffff8159f81d __schedule ([kernel.kallsyms])
        ffffffff8159fec9 schedule ([kernel.kallsyms])
        ffffffff8119e56b pipe_wait ([kernel.kallsyms])
        ffffffff8119f030 pipe_read ([kernel.kallsyms])
        ffffffff81195c37 do_aio_read ([kernel.kallsyms])
        ...etc...

‘34075.900988’ is the timestamp. But what is this number? I searched for quite some time, and there is no clear description to be found. It clearly is not epoch time.

Some experimentation showed that, apparently, the number is seconds since startup with microsecond granularity. Further experimentation using systemtap showed that exactly the same number can be fetched with the systemtap local_clock_us() function. This makes it possible to link perf stack traces with systemtap output!! I created a script (run_stapflame.sh) that runs perf record -g and systemtap at the same time, then combines the information from both tools (meaning the systemtap runtime data is pushed into the stack trace information), after which flame graphs are created.
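If you want to verify this correspondence yourself, here is a quick sketch (the pid is hypothetical; perf’s ‘-p’ switch attaches to an existing process). Record stacks with timestamps in one session:

# perf record -g -T -p 6641

And print the matching clock value with systemtap in another:

# stap -e 'probe begin { printf("%d\n", local_clock_us()); exit() }'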

When a process is not running, there will be no perf stack traces, because there is no running process for perf to take a stack trace from. So only when the process is on CPU (the TASK_RUNNING state, excluding run queue time) should there be perf data. Also, the systemtap times are accurately measured, but the perf stack traces are sampled. This means perf is missing data (by definition: sampling means you look something up at a certain interval, which means you are not looking between the intervals). What I have done is extrapolate the perf samples found for an Oracle CPU or wait interval relative to the time in the measured interval. This means that if the time in the interval is 100us, and I get two collapsed stack traces with 1 and 3 samples, the extrapolated times will be 1/(1+3)*100us = 25us and 3/(1+3)*100us = 75us. This is not scientific, but it is meant to give an idea. Also, in order to show something useful in flame graphs, all the data needs to be based on the same data type, so I need to transform the number of stack traces to time.
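A minimal sketch of that scaling calculation (assuming the collapsed stack format produced by stackcollapse-perf.pl: semicolon-separated frames, a space, and the sample count; the input file name and the 100us interval time are hypothetical):

# scale the sample counts of one interval to its systemtap-measured time (in us)
awk -v interval_us=100 '
	{ stacks[$1] += $2 ; total += $2 }
	END { for (s in stacks) printf "%s %d\n", s, stacks[s] / total * interval_us }
' one_interval_collapsed.txt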

I created a github project stapflame for my files.

First, you need to install the kernel debuginfo packages, as mentioned earlier in this blogpost.
Then, you need to execute eventsname.sql in order to generate eventsname.sed, which is used to translate wait event numbers to wait event names. Wait event numbers change between Oracle database versions, and could potentially change after applying a PSU. eventsname.sed must be in the same directory as the run_stapflame.sh script.
Then, you need to fetch stackcollapse-perf.pl and flamegraph.pl from Brendan Gregg’s github flamegraph repository. These need to be in the same directory as the run_stapflame.sh script too, and have the execute bit set.
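For example, fetching the flame graph scripts can be done like this (the location is Brendan Gregg’s FlameGraph repository on github):

# git clone https://github.com/brendangregg/FlameGraph.git
# cp FlameGraph/stackcollapse-perf.pl FlameGraph/flamegraph.pl .
# chmod +x stackcollapse-perf.pl flamegraph.pl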

Once the requirements are met, you can use the run_stapflame.sh script:

# ./run_stapflame.sh 123

The first argument must be the PID of an existing Oracle foreground process.
This will compile and run the systemtap script. Once both systemtap and perf are running, the following message is displayed:

profiling started, press enter to stop

Now execute what you want to be profiled in the Oracle session. Once you are done, return to the session where you ran run_stapflame.sh, and press enter to stop the profiling.
Depending on how much systemtap and perf information was generated, the script will process for some time (it is coded in bash and not highly optimised; in fact, it is quite inefficient with larger data sets(!)).
There will be some messages from perf indicating how many samples it captured:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.213 MB perf.data (~9286 samples) ]

And systemtap complaining about missing symbols for the kernel module it just created (for itself :-)):

No kallsyms or vmlinux with build-id 532f482ae316500802639d35de6c302fdb685afa was found
[stap_b6a486219fd483305d4991b483feb397_16565] with build id 532f482ae316500802639d35de6c302fdb685afa not found, continuing without symbols

The stapflames in the example mentioned here are of a simple ‘select count(*) from table’ in oracle, resulting in a direct path read.

This is the resulting flame graph of my original idea. In short, this does not look very useful in this case, and might never be useful once you get more than approximately 20-100 cpu/wait combinations.

However, other flame graphs are more useful; look at this flame graph, which is about process CPU state (stack traces grouped by process CPU state, which means that the Oracle events (ON CPU and the wait events) can be found for every CPU state). It’s nice to see that the kernel state TASK_RUNNING (which is on CPU; run queue time is specified independently) accounts for 46.86% of the total time. Of this time, almost all time (40.35% of the total time) is in the Oracle state ‘ON CPU’. After TASK_RUNNING, TASK_INTERRUPTIBLE time is 49.56% of the total time. Almost all time in TASK_INTERRUPTIBLE has gone to the wait event direct path read. There is not a great deal of time spent in the run queue (1.74% of the total time). If you look up through the stack traces in TASK_RUNNING and ON CPU, you see that of the 40.35% of running on CPU, 15.33% is spent executing the io_submit function. This means 15.33/40.35*100 = 37.99% of the time on CPU is spent on submitting IO requests. Also in TASK_RUNNING and ON CPU, 7.57% is spent in the function sxorchk, which does block checksumming (the db_block_checksum parameter, set to ‘typical’ by default).

Another interesting view is this flame graph; this one shows only the wait event ‘direct path read’. As you can see, most of the wait event is not spent running on CPU. That is quite logical for a wait event :). 96.03% of the time spent in the wait event ‘direct path read’ is in the ‘TASK_INTERRUPTIBLE’ state. 2.68% of the time in the wait event ‘direct path read’ is spent on CPU in the TASK_RUNNING state. Despite being inside wait event time, it is very logical to have some time spent running on CPU, because you need to run on the CPU to reap IO requests.

I love to hear comments, additions, corrections or any other feedback!

Credits and other references:
# Brendan Gregg and his work on FlameGraphs.
# Luca Canali and his work on stack profiling, which inspired me to create this tool, and for proofreading.
# Tanel Poder and his work on Oracle internals in general.
# Klaas-Jan Jongsma for proofreading.

Footnotes:
– The systemtap script contains the full path to the Oracle database executable in the userspace function probes. This obviously must be changed to reflect the path to the Oracle executable of the process the systemtap script is profiling. Alternatively, it can just list the executable name (“oracle”), but then the executable must be findable via $PATH.
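To illustrate the difference (the function name used here is just an example to show the probe syntax, not necessarily what the script probes):

# probe pinned to a full path; must match the executable of the profiled process:
probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/oracle").function("kskthewt") { }
# probe by executable name, which requires oracle to be findable via $PATH:
probe process("oracle").function("kskthewt") { }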

Some time back, I investigated the options for profiling processes in Linux. One of the things I investigated was systemtap. After careful investigation I came to the conclusion that systemtap was not really useful for my investigations, because it only worked in kernel space, and only in a very limited way in userspace. The userspace limitation was that you had to define your own markers in the source code of the program you wanted to profile with systemtap, and compile those in. Since my investigations are mostly around Oracle products, which are closed source, this did not help me at all.

Some time ago, Frank Eigler responded to an article I posted on my blog about using gdb (GNU debugger) for doing userspace profiling, indicating that systemtap could do userspace function profiling too. I was quite shocked, because I had carefully investigated that option, and come to the conclusion that exactly this did not work. After some communication on this, the conclusion was that this indeed did NOT work with the version of systemtap which is included with current versions of RHEL (and therefore Oracle Linux), but that in the current source version of systemtap userspace probing is included.

But that is not all: in order to give systemtap the opportunity to do userspace probing, the kernel needs userspace ‘trace hooks’. These are only available in the current stock kernels if the kernel source is patched with the ‘utrace patch’ and compiled with it enabled. That means a custom compiled kernel. In itself a custom compiled kernel is fine, but in many environments where you work with closed source products, products are certified against stock kernels, and supported only on stock kernels. From a support point of view I very much understand this, and from my viewpoint as a consultant too. To put it differently: encountering an environment where people compile their own Linux kernel raises an enormous red flag for me.

But there is good news. Since Linux kernel version 3.5, userspace probing support is included in the kernel, which means no patch against the kernel source is needed to be able to profile in userspace. If you take a look at the kernels Oracle provides (for Red Hat: I am sorry, there is no way that I know of to obtain RHEL online for free for testing, which for me rules out using it; I know about the merger with CentOS, but haven’t looked into whether that makes it attractive for me again), we can see that Oracle provides UEK (2.6.32), UEK2 (2.6.39) and UEK3 (3.8.13). Yes! That means that I can hook up a yum repo and install a kernel that allows userspace probing!
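A quick way to check whether a given kernel was built with this support is to look for the uprobes option in the kernel build configuration (a sketch; the config file location can differ per distribution, and the exact output will vary):

# grep UPROBES /boot/config-$(uname -r)
CONFIG_UPROBES=y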

I installed a test machine with Oracle Linux 6.5, installed the UEK3 kernel, and installed systemtap. When testing the primary desired functionality (profiling userland functions without debug symbols), I encountered this problem:

[root@ol6-uekbeta ~]# /usr/bin/stap -e 'probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }'
WARNING: cannot find module /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv debuginfo: No DWARF information found [man warning::debuginfo]
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }
                      ^

semantic error: no match
Pass 2: analysis failed.  [man error::pass2]

This strongly suggests that systemtap does not understand the ‘process’ probe, which is what Frank warned about. So. Is this the end of the journey? No!

Userland function probing is documented on the systemtap website. This means it should be available. Let’s clone the systemtap source, and build systemtap ourselves. This has a few implications. For starters, it eliminates the usage of systemtap for userland functions on “real” systems. With “real” I mean systems that serve a function, and need to be supported and stable. Because on these kinds of systems no beta or preview software can or should be installed, no matter how much we want or need it. But for an investigation system where we can mimic one of the most desired functions of dtrace, this is fine!

So. I have got an x86_64 Oracle Linux 6.5 installation (default install, with the meta-rpm oracle-rdbms-server-11gR2-preinstall.x86_64 installed), installed the UEK3 kernel on it (using the UEKR3 repo on Oracle Linux public yum), added the git version control system using ‘yum install git’, and next cloned the systemtap git repository using ‘git clone git://sourceware.org/git/systemtap.git’. What needed to be done next is to compile and install it. This can be done in a quite standardised way:

./configure
make
make install

If all goes well, you end up with the latest version of systemtap (version 2.5/0.152), which should be able to do userspace probing, and a kernel capable of providing the information for userspace probing.
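A quick sanity check that the freshly built binary is the one being picked up (assuming the default /usr/local install prefix):

# which stap
/usr/local/bin/stap
# stap --version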

Now let’s test this, and create a systemtap script to profile the time dbv (db verify) takes just by running it:
(please mind this is a proof of concept script, any additions or remarks are welcome!)

global time, function_times, prev_func, function_count

probe begin {
	printf("Begin.\n");
	time=0
	prev_func="begin"
}

# fires on entry of every function in the dbv executable; the time since the
# previous function entry is attributed to that previous function
probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") {
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	time=gettimeofday_us()
	prev_func=probefunc()
}

probe end {
	printf("End.\n")
	# account the time spent in the last function seen
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	delete function_times["__do_global_dtors_aux"]
	printf("Function\t\ttime (us)\tcount\tavg (us)\n")
	# print total time, call count and average time per function, sorted by total time
	foreach( tm = [ fn ] in function_times+ ) {
		printf("%s: \t\t%d\t\t%d\t%d\n", fn, tm, function_count[fn],tm/function_count[fn])
		tot_time += tm
	}
	printf("Total time: %d\n", tot_time)
}

This systemtap script can be run from one (root) session, with dbv run in another session. Please wait with running dbv until the systemtap session notifies you it is ready by printing “Begin.”. This is the result:

Function		time (us)	count	avg (us)
frame_dummy: 		3		1	3
lxplget: 		3		1	3
lxpsset: 		3		1	3
call_gmon_start: 		4		1	4
lxplset: 		4		1	4
lxpcset: 		4		1	4
lxptget: 		4		1	4
lxptset: 		4		1	4
lxhLaToId: 		5		1	5
kudbvcCreate: 		5		1	5
_fini: 		6		1	6
__do_global_ctors_aux: 		7		1	7
lxldini: 		7		1	7
lxhenvquery: 		7		1	7
kudbvhlp: 		7		1	7
lxldlbb: 		8		2	4
lxldLoadBoot: 		8		2	4
lxpname: 		12		3	4
kudbvcCreateMsg: 		12		1	12
lxlfOpen: 		13		4	3
lmsapop: 		13		2	6
lxldLoadObject: 		14		4	3
lxpdload: 		14		2	7
lxldlod: 		15		4	3
lxladjobj: 		15		4	3
lxlchkobj: 		15		4	3
__libc_csu_init: 		16		1	16
lxlgsz: 		16		4	4
lxfgnb: 		20		2	10
lxoCnvCase: 		22		2	11
lxhLangEnv: 		24		3	8
_init: 		27		1	27
lxpe2i: 		31		9	3
slmsbfn: 		31		2	15
lxdlobj: 		34		4	8
lxmopen: 		36		5	7
lxlfrd: 		40		4	10
_start: 		41		1	41
lmsagb1: 		46		14	3
lxhchtoid: 		47		6	7
lmsapts: 		47		14	3
lxpcget: 		48		7	6
lxgratio: 		48		14	3
slxldgnv: 		49		11	4
lmsapsb: 		49		14	3
lmsagbcmt: 		50		14	3
lmsapsc: 		50		14	3
lmsapnm: 		51		14	3
lxldalc: 		54		6	9
main: 		63		1	63
kudbvmal: 		63		1	63
lmsaprb: 		67		7	9
kudbvexit: 		68		1	68
lmsapfc: 		71		7	10
slxcfct: 		72		5	14
lxpmclo: 		81		13	6
slmscl: 		88		1	88
slxdfsync: 		91		1	91
lmsapic: 		91		7	13
lxhci2h: 		97		28	3
lxpendian: 		107		13	8
kudbvcml: 		116		1	116
lxgu2t: 		119		16	7
lmsagbf: 		120		14	8
kudbvmai: 		151		1	151
lxdgetobj: 		225		44	5
lxinitc: 		247		6	41
kudbvcpf: 		254		27	9
slmsrd: 		256		9	28
lxhh2ci: 		350		34	10
slxcfot: 		514		5	102
lxlinit: 		688		6	114
kudbvini: 		798		1	798
slmsop: 		1005		2	502
kudbvvpf: 		4102		27	151
Total time: 10993

Of course the result itself is not very useful. The time spent in dbv is measured at 10,993 microseconds (us). The function the most time was spent in was kudbvvpf(), at 4,102 us, but that function was executed 27 times, which makes the time per execution 151 us. The function taking the longest per execution was kudbvini(), at 798 us.
