Systemtap revisited

Some time back, I investigated the options for profiling processes on Linux. One of the things I looked at was systemtap. I came to the conclusion that systemtap was not really useful for my investigations, because it worked well only in kernelspace and offered very limited support in userspace. The userspace limitation was that you had to define your own markers in the source code of the program you wanted to profile with systemtap, and then compile it. Since my investigations are mostly around Oracle products, which are closed source, this doesn’t help me at all.

Some time ago, Frank Eigler responded to an article on my blog about using gdb (the GNU debugger) for userspace profiling, indicating that systemtap could do userspace function profiling too. I was quite shocked, because I had carefully investigated that option and concluded that exactly this did not work. After some communication, the conclusion was that this indeed did NOT work with the version of systemtap included with current versions of RHEL (and therefore Oracle Linux), but that userspace ‘probing’ is included in the current source version of systemtap.

But that is not all… In order to do userspace probing, systemtap needs userspace ‘trace hooks’ in the kernel. With the current stock kernels these are only available if the kernel source is patched with the ‘utrace patch’, and the feature is enabled and compiled in. That means a custom compiled kernel. In itself a custom compiled kernel is fine, but in many environments where you work with closed source products, the products are certified against stock kernels and supported only on stock kernels. From a support point of view I very much understand this, and from my viewpoint as a consultant too. To put it differently: it raises an enormous red flag when I encounter an environment where people compile their own Linux kernel.

But there is good news. Since Linux kernel version 3.5, userspace probing support is included in the Linux kernel, which means no patch against the kernel source is needed in order to profile in userspace. If you take a look at the kernels Oracle provides (for Red Hat: I am sorry, there is no way that I know of to obtain RHEL online for free for testing, which for me rules out using it. I know about the merger with CentOS, but haven’t looked at whether that makes it attractive for me again), you can see that Oracle provides UEK (2.6.32), UEK2 (2.6.39) and UEK3 (3.8.13). Yes! That means that I can hook up a yum repo and install a kernel that allows userspace probing!
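
A quick way to check whether a running kernel meets the 3.5 threshold is sketched below. The function name kernel_has_uprobes is made up for this example, and relying on the version number alone is an assumption: a distribution kernel could still have the feature switched off in its build configuration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel release string is 3.5 or newer.
# The 3.5 threshold comes from the paragraph above; everything else is illustrative.
kernel_has_uprobes() {
    release=$1
    major=${release%%.*}          # text before the first dot, e.g. "3"
    rest=${release#*.}            # text after the first dot, e.g. "8.13-16..."
    minor=${rest%%[!0-9]*}        # leading digits of that, e.g. "8"
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 5 ]; }
}

if kernel_has_uprobes "$(uname -r)"; then
    echo "kernel $(uname -r): new enough for built-in userspace probing"
else
    echo "kernel $(uname -r): older than 3.5, userspace probing needs a patched kernel"
fi
```

On the Oracle kernels listed above, only UEK3 (3.8.13) would pass this check.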

I installed a test machine with Oracle Linux 6.5, installed the UEK3 kernel, and installed systemtap. When testing the primary desired functionality (profiling userland functions without debug symbols), I encountered this problem:

[root@ol6-uekbeta ~]# /usr/bin/stap -e 'probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }'
WARNING: cannot find module /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv debuginfo: No DWARF information found [man warning::debuginfo]
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }
                      ^

semantic error: no match
Pass 2: analysis failed.  [man error::pass2]

This strongly suggests that systemtap does not understand the ‘process’ probe, which is what Frank warned about. So, is this the end of the journey? No!

The userland function probing is documented in the documentation on the systemtap website. This means it should be available. Let’s clone the systemtap source, and build systemtap ourselves. This has a few implications. For starters, this eliminates the usage of systemtap for userland functions on “real” systems. With “real” I mean systems that have a function, and need to be supported and need to be stable. On this kind of system no beta or preview software can or should be installed, no matter how much we want or need it. But for an investigation system where we can mimic one of the most desired functions of dtrace, this is fine!

So. I have an x86_64 Oracle Linux 6.5 installation (default install, with the meta-rpm oracle-rdbms-server-11gR2-preinstall.x86_64 installed), installed the UEK3 kernel on it (using the UEKR3 repo on Oracle Linux public yum), and added the git version control executables using ‘yum install git’. Next I cloned the systemtap git repository using ‘git clone git://sourceware.org/git/systemtap.git’. What needs to be done next is to compile and install it. This can be done in a quite standardised way:

./configure
make
make install

If all goes well, you end up with the latest version of systemtap (version 2.5/0.152), which should be able to do userspace probing, and a kernel capable of providing the information for userspace probing.

Now let’s test this, and create a systemtap script to profile the time dbv (db verify) takes just by running it:
(please mind this is a proof of concept script, any additions or remarks are welcome!)

# time: timestamp (us) of the most recent function entry
# function_times / function_count: per-function totals, keyed by function name
global time, function_times, prev_func, function_count

probe begin {
	printf("Begin.\n");
	time=0
	prev_func="begin"
}

# Fires on entry to every function in the dbv executable
probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") {
	# Charge the time since the previous function entry to that previous function
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	time=gettimeofday_us()
	prev_func=probefunc()
}

probe end {
	printf("End.\n")
	# Account for the last function that was entered
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	delete function_times["__do_global_dtors_aux"]
	printf("Function\t\ttime (us)\tcount\tavg (us)\n")
	# Report in ascending order of accumulated time
	foreach( tm = [ fn ] in function_times+ ) {
		printf("%s: \t\t%d\t\t%d\t%d\n", fn, tm, function_count[fn],tm/function_count[fn])
		tot_time += tm
	}
	printf("Total time: %d\n", tot_time)
}
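
To make the accounting in the script explicit: the time between two consecutive function entries is charged to the function that was entered first, so a function that calls others is only charged until its first callee is entered. The sketch below replays that logic with awk on a small, entirely hypothetical entry trace (the timestamps and function names are invented for illustration).

```shell
#!/bin/sh
# Hypothetical entry trace: "<timestamp_us> <function>", one line per function entry.
# The final "END" line only marks the moment tracing stopped.
trace() {
cat <<'EOF'
100 main
110 parse
125 helper
130 parse
150 END
EOF
}

# Same attribution as the systemtap script: on every entry, charge the time
# elapsed since the previous entry to the previously entered function.
attribute_time() {
    awk '
        prev != "" { total[prev] += $1 - t; count[prev]++ }
        { t = $1; prev = $2 }
        END { for (f in total) print f, total[f], count[f] }
    ' | sort
}

trace | attribute_time
```

Each output line is "function total_us count". Here main is charged only 10 us, although it is on the stack for the whole run.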

This systemtap script can be run from one (root) session, and dbv from another session. Wait to start dbv until the systemtap session reports that it is ready by printing “Begin.”. This is the result:

Function		time (us)	count	avg (us)
frame_dummy: 		3		1	3
lxplget: 		3		1	3
lxpsset: 		3		1	3
call_gmon_start: 		4		1	4
lxplset: 		4		1	4
lxpcset: 		4		1	4
lxptget: 		4		1	4
lxptset: 		4		1	4
lxhLaToId: 		5		1	5
kudbvcCreate: 		5		1	5
_fini: 		6		1	6
__do_global_ctors_aux: 		7		1	7
lxldini: 		7		1	7
lxhenvquery: 		7		1	7
kudbvhlp: 		7		1	7
lxldlbb: 		8		2	4
lxldLoadBoot: 		8		2	4
lxpname: 		12		3	4
kudbvcCreateMsg: 		12		1	12
lxlfOpen: 		13		4	3
lmsapop: 		13		2	6
lxldLoadObject: 		14		4	3
lxpdload: 		14		2	7
lxldlod: 		15		4	3
lxladjobj: 		15		4	3
lxlchkobj: 		15		4	3
__libc_csu_init: 		16		1	16
lxlgsz: 		16		4	4
lxfgnb: 		20		2	10
lxoCnvCase: 		22		2	11
lxhLangEnv: 		24		3	8
_init: 		27		1	27
lxpe2i: 		31		9	3
slmsbfn: 		31		2	15
lxdlobj: 		34		4	8
lxmopen: 		36		5	7
lxlfrd: 		40		4	10
_start: 		41		1	41
lmsagb1: 		46		14	3
lxhchtoid: 		47		6	7
lmsapts: 		47		14	3
lxpcget: 		48		7	6
lxgratio: 		48		14	3
slxldgnv: 		49		11	4
lmsapsb: 		49		14	3
lmsagbcmt: 		50		14	3
lmsapsc: 		50		14	3
lmsapnm: 		51		14	3
lxldalc: 		54		6	9
main: 		63		1	63
kudbvmal: 		63		1	63
lmsaprb: 		67		7	9
kudbvexit: 		68		1	68
lmsapfc: 		71		7	10
slxcfct: 		72		5	14
lxpmclo: 		81		13	6
slmscl: 		88		1	88
slxdfsync: 		91		1	91
lmsapic: 		91		7	13
lxhci2h: 		97		28	3
lxpendian: 		107		13	8
kudbvcml: 		116		1	116
lxgu2t: 		119		16	7
lmsagbf: 		120		14	8
kudbvmai: 		151		1	151
lxdgetobj: 		225		44	5
lxinitc: 		247		6	41
kudbvcpf: 		254		27	9
slmsrd: 		256		9	28
lxhh2ci: 		350		34	10
slxcfot: 		514		5	102
lxlinit: 		688		6	114
kudbvini: 		798		1	798
slmsop: 		1005		2	502
kudbvvpf: 		4102		27	151
Total time: 10993

Of course the result itself is not very useful. The time spent in dbv is measured at 10,993 microseconds (us). The function in which the most time was spent was kudbvvpf() with 4102 us, but that function was executed 27 times, which makes the time per execution 151 us. The function with the longest time per execution was kudbvini(), at 798 us.
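
Because the script sorts by total time, finding the highest average means scanning the last column by eye. A small post-processing sketch with awk and sort can do the ranking instead. The function names sample_output and rank_by_avg are made up for this example, and the embedded rows are just the last five lines of the result above, standing in for a saved output file.

```shell
#!/bin/sh
# A few rows copied from the result above, standing in for saved stap output.
sample_output() {
cat <<'EOF'
Function		time (us)	count	avg (us)
slxcfot: 		514		5	102
lxlinit: 		688		6	114
kudbvini: 		798		1	798
slmsop: 		1005		2	502
kudbvvpf: 		4102		27	151
Total time: 10993
EOF
}

# Keep only data rows (four fields, name ending in ":"), then print
# "function avg_us" with the highest average first.
rank_by_avg() {
    awk 'NF == 4 && $1 ~ /:$/ { sub(/:$/, "", $1); print $1, $4 }' | sort -k2,2nr
}

sample_output | rank_by_avg
```

The header and the “Total time” line are skipped because they do not have exactly four whitespace-separated fields.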

7 comments
  1. Frank Ch. Eigler said:

    Hi, Frits.

    Thanks for working through the problems & reporting! Note that the need for the freshest version of systemtap comes from your applying it to a debuginfo-less binary (man warning::debuginfo), so symbol-table-only probing is required. The next release of stap includes the fixes for that functionality.

    By the way, everyone’s welcome to test RHEL7 RC for free at ftp://ftp.redhat.com/redhat/rhel/rc/7/Server/x86_64/iso/, and indeed CentOS rebuilds of RHEL5 & 6.

  2. Hi Fritz,

    Thank you for your post!
    Probably you know what is internal systemtap implementation for @var function, especially for userland symbols?

    • Thank you for reading! I haven’t used systemtap in a long time, you might want to head over to https://mahmoudhatem.wordpress.com and see if his articles do answer your question. You could also download the systemtap source, and start reading the code. yes, that will take time, but it’ll be a great learning journey!

      • Thank you, Frits,

        I’ve already visited Mahmoud’s site and I have downloaded systemtap source. I was just going to save a bit time 🙂 as @var function seems need to be “compiled”/translated somehow on the fly to some API calls.
        The main driver was attempt to understand why some reverse look up can resolve a symbol name by an address:
        like
        name = _stp_kallsyms_lookup(STAP_ARG_addr, &size, &offset, &modname, NULL);
        strlcpy (STAP_RETVALUE, name, MAXSTRINGLEN);

        but direct lookup can’t detect the same symbol,
        like @var("somesymbol@/u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle")

    • Frank Ch. Eigler said:

      The @var("symbol") machinery is implemented by the atvar_op class. dwarf_var_expanding_visitor::visit_atvar_op may be a good place to start if you really want to see how it works. Be warned though, this is the most complex part of systemtap, dealing with DWARF debuginfo analysis / translation.

      • Thank you, Frank,
        That is really very useful !
