Introduction to pinatrace annotate version 2: a look into latches again

This post is an introduction to pinatrace annotate version 2, which is a tool to annotate the output of the Intel Pin tools ‘pinatrace’ tool.

The pinatrace tool generates a file with every single memory access of a process. Please realise what this means: this is every single read from main memory or write to main memory from the CPU. This allows you to get an understanding what happens within a C function. This means you can determine what information or data is accessed in what function. Needless to say this is a tool for internals investigations and research, not something for normal daily database maintenance and support. Also, the performance of the process that you attached to is severely impacted, and it can only be turned off by stopping the process. Do not use this on a production database, use this at your own risk for research and investigational purposes only.

The general way to use the pin tool is to run something together with pin, for example ‘ls’:

$ ./pin -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -- ls

This generates an output file that looks like this:

#
# Memory Access Trace Generated By Pin
#
0x00007f49de60f173: W 0x00007ffd160c46f8  8     0x7f49de60f178
0x00007f49de60f840: W 0x00007ffd160c46f0  8                  0
0x00007f49de60f844: W 0x00007ffd160c46e8  8                  0
0x00007f49de60f846: W 0x00007ffd160c46e0  8                  0
0x00007f49de60f848: W 0x00007ffd160c46d8  8                  0
0x00007f49de60f84a: W 0x00007ffd160c46d0  8                  0
0x00007f49de60f84f: W 0x00007ffd160c46c8  8                  0
0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe
0x00007f49de60f866: W 0x00007f49de82fbd8  8      0x5f25f53b3dc
0x00007f49de60f877: R 0x00007f49de82ff90  8           0x221e00
0x00007f49de60f881: W 0x00007f49de8309a8  8     0x7f49de82fe00
0x00007f49de60f888: W 0x00007f49de830998  8     0x7f49de60e000

This on itself is not very helpful. Let me explain what you are seeing: the first column is the instruction pointer address, in other words: the place in the executable in which the memory transfer was initiated, ‘W’ and ‘R’ are self explanatory, the third column is the memory address to which the read or write is addressed, the fourth column is the size of the memory IO, and the fifth is the value.

Luckily, there is another way to run pin, which is attaching to an already running process. This is what is helpful in using the memory access trace with an Oracle process. The way this is done, is quite simple; first get the linux process number, then run pin specifying the process number:

$ ps -ef | grep [L]OC
oracle    7250  7249  0 09:42 ?        00:00:00 oracletest (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ./pin -pid 7250 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

There is an important caveat here: because pin inserts itself into the running process, the pin library (source/tools/SimpleExamples/obj-intel64/pinatrace.so in this case) must be available to this process. So if you downloaded, extracted and linked pin tools as root, and then as root try to attach to an oracle process, it will not work because the oracle process can not reach the library. So: extract and link the pin tools as oracle.

The output of the pinatrace (a file called pinatrace.out by default) is in $ORACLE_HOME/dbs when you attach to an Oracle database process.

Back to the pinatrace output: what if we could see the C function that was at the instruction pointer address, the name of the Oracle memory area for the memory address and for the fifth column the name of the Oracle memory area again in case the value is actually a pointer? That is exactly what my pinatrace annotate oracle tool does!

How to use it: first clone the pinatrace_annotate repository:

git clone https://gitlab.com/FritsHoogland/pinatrace_annotate.git

Then set the O_BINARY to your oracle binary at line 6 of pinatrace_annotate_oracle.sh.

The next thing is to generate the oracle memory area CSV files. I added a script to generate the memory ranges for the PGA too, however you should be aware this is very fluent, and it requires an execution of the oracle PID you point it to before the v$process_memory_detail gets populated. Sometimes I skip the PGA detail step because it doesn’t add too much value.
To run the scripts for generating the memory area CSV files:

$ sqlplus / as sysdba
...
SQL> @get_memory_ranges
...
	76 14,20,@1	   8645 		    SYS 			   sqlplus@memory-presentation.local (TNS V1-V3)
...
enter pid number to get pga details: 76
old   1: alter session set events 'immediate trace name pga_detail_get level &pid'
new   1: alter session set events 'immediate trace name pga_detail_get level 76'

Session altered.

execute something (eg. select * from dual) in the target process, then press enter

Now head over to the session you want the PGA memory details from, and execute something like ‘select * from dual’, and press enter in the sqlplus session in which you ran get_memory_ranges.sql. This will then run for a while, it generates pipe-delimited files that sqlite can parse.

After the CSV files have been generated, create the sqlite database for use by the tool:

$ ./pinatrace_annotate_oracle.sh -c
generate symbols from /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle.
create and populate table symbol_ranges with symbols.
create index.
create table memory_ranges.
import memory_ranges.csv
import memory_ranges_xtables.csv
import memory_ranges_pga.csv
create index.

Now the pinatrace_annotate_oracle.sh tool is all set for usage!

Example usage
Let’s use an actual example to see where the memory trace provides benefit: latches. Nowadays, latches are not the issue these were in the oracle 8 days. In these days, the latches protecting the database buffers double linked lists, the ‘cache buffers chains’ latches, were not gotten in shared mode. That meant especially index root blocks could get latch contention if there was much concurrency.

Some great work has been done on the internals of latches by Andrey Nikolaev; see his latch internals presentation. His work is done on Solaris with Dtrace, I did some of that on linux (part 1, part 2 and part 3).

Let’s replay a willing to wait latch get and see how that looks like in the annotated pinatrace output! Setup the host with the pin tools (link it to get the pinatrace.so library), and clone pinatrace_annotate and configure it to point to the oracle binary you are using.

1. session 1:

$ sqlplus / as sysdba
SQL> select distinct sid from v$mystat;

2. session 2:

$ sqlplus / as sysdba
SQL> @get_memory_ranges

identify session 1, and fill out the pid.
take note of the O/S pid.
it will ask you to execute something in the session for which the pid was entered.

3. session 1:

SQL> select * from dual;

4. session 2:
press enter, and wait for @get_memory_ranges to finish.

SQL> exit
$ ./pinatrace_annotate_oracle.sh -c
$ sqlplus / as sysdba
SQL> select addr from v$latch where name = 'cache table scan latch';

note the latch address, it’s 0x6004A400 in my case.

SQL> oradebug setmypid
SQL> oradebug call kslgetl 0x6004A400 0 0 2442

result: ‘Function returned 1’ indicating a successful latch get.

5. session 1:

SQL> oradebug setmypid

6. session 3:
go to the directory where pin is extracted and linked, and go to the directory where the pin executable is located. Execute the following and change the pid for the O/S pid obtained in step 2.

$ ./pin -pid 8645 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -o pina_latch.txt

now the memory trace is enabled for pid 8645

7. session 1:

SQL> oradebug call kslgetl 0x6004A400 1 0 2442

this will hang, it is waiting for the latch which is taken by session 2.

8. session 2:

SQL> oradebug call kslfre 0x6004A400

this will return ‘Function returned 0’ indicating it freed the latch.

9. session 1:
the oradebug call will now return ‘Function returned 1’ indicating it has gotten the latch.

10. session 3:

$ cp /u01/app/oracle/product/12.2.0.1/dbhome_1/dbs/pina_latch.txt ~/pinatrace_annotate/pina_latch.txt

as indicated, the pinatrace output file will be in $ORACLE_HOME/dbs, and with this I copy the current state of the trace to the pinatrace_annotate directory where my pinatrace annotate tool is cloned.

$ cd ~/pinatrace_annotate
$ ./pinatrace_annotate_oracle.sh pina_latch.txt > pina_latch_annotated.txt

This will run for a while…

Once pinatrace_annotate_oracle.sh is done, open the pina_latch_annotated.txt file. Short guide on how this looks like; the first two lines in my file are:

function+offset:memory address(annotation+offset):Read/Write:size:value hex/dec(pointer annotation+offset)
??:0x00007fff224e6498():R:8:0x11060785/285607813()

The first line is the header, which is displayed once, and essentially explains it all. The second line starts with ‘??’. As indicated by the header, the first field is the function. Any call to a function (‘symbol’) that is not in the oracle executable, like functions in shared libraries, do not have an address in the executable. In fact, the place at which the shared libraries are loaded, and thus their functions are located is randomised by default, which explains why the executable does not have the address.

The second column is the address memory is read from or written into. In the second line, there is nothing in between the parentheses, meaning this address is not in a known memory range.

The ‘R’ or ‘W’ is obvious, and the number is the amount of memory read or written too.

The fifth field shows the content of the read or write in hexadecimal and in decimal. Here again there are parenthesis which are empty. I look up the value in the memory area’s, and put the memory area, if the value is in a known memory area, as annotation including offset. What you should realise is that I don’t know if a value is in fact a memory area and thus a pointer, or that the value coincidentally the same as a known memory area. I do think understanding if the value is a known memory area is handy and useful, in the C language pointers are heavily used.

Another example:

nsbasic_brc+548:0x00007f33d3026690(pga|Other, Fixed Uga, Fixed UGA heap+88712 ):R:8:0x7f33d3019600/139860560156160(pga|Other, Fixed Uga, Fixed UGA heap+3532
0 )

Here we see an example where the function is a function in the oracle executable, nsbasic_brc, which at offset 548 of the function read 8 bytes from memory address 0x00007f33d3026690, which was a known memory area at the time of running 0_get_pga_detail.sql when ‘select * from dual’ was run, which was in the PGA, in the Fixed UGA heap at offset 88712, and the value it read was 0x7f33d3019600, which is a memory address in the same chunk of memory at offset 3532, and therefore likely to be a pointer.

Now let’s look further into the annotated pinatracefile. If you search for ‘kslgetl’, you will find the actual latch get executed with oradebug. This is how that looks like on my system:

kslgetl+2:0x00007fff224e4ec0():W:8:0x7fff224e4ff0/140733768945648()
kslgetl+13:0x00007fff224e4ea0():W:8:0x7fff224e5870/140733768947824()
kslgetl+17:0x00007f33d8095f40(pga|Other, free memory, top call heap+243776 ):R:8:0x7f33d8095f40/139860644552512(pga|Other, free memory, top call heap+243776 )
kslgetl+26:0x00007fff224e4e98():W:8:0x1/1()
kslgetl+33:0x00007fff224e4e90():W:8:0/0()
kslgetl+37:0x00007fff224e4ea8():W:8:0x4/4()
kslgetl+44:0x00007f33d8087e78(pga|Other, free memory, top call heap+186232 ):R:8:0x7a63d268/2053362280(shared pool|permanent memor,duration 1,cls perm+5702248 )
kslgetl+58:0x00007fff224e4eb0():W:8:0x7fff224e57d0/140733768947664()
kslgetl+65:0x00007fff224e4e60():W:8:0/0()
kslgetl+69:0x000000006004a40f(fixed sga|(parent)latch:cache table scan latch+15 fixed sga|var:kcb_table_scan_latch_+15 ):R:1:0x37/55()
kslgetl+80:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):R:8:0/0()
kslgetl+96:0x000000006004a410(fixed sga|(parent)latch:cache table scan latch+16 fixed sga|var:kcb_table_scan_latch_+16 ):R:4:0x1/1()
kslgetl+103:0x000000007a63d5f0(shared pool|X$KSUPR.KSLLALOW+0 shared pool|permanent memor,duration 1,cls perm+5703152 ):R:4:0/0()
kslgetl+109:0x0000000011141524():R:4:0x1cffe/118782()
kslgetl+122:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0x6004a400/1610916864(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 )
kslgetl+131:0x00007f33d8087ea8(pga|Other, free memory, top call heap+186280 ):R:8:0x4b/75()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):W:8:0x22/34()
kslgetl+1898:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0/0()
kslgetl+1909:0x00000000600119b4(fixed sga|var:kcbccc_+0 ):R:1:0/0()
kslgetl+1938:0x00007fff224e4e60():R:8:0/0()
kslgetl+1942:0x00007fff224e4e38():W:8:0x10a6084b/279316555()

The two highlighted lines show that the function read the beginning of the latch memory and found a number, indicating the oracle process id that has the latch taken already. In other words, this means the process found a latch that has already been taken by another process, oracle pid 34.

If you scroll down further you see kslgetl called kslges, which is how a willing to wait latch works if the latch it wants to take is already taken, and then further down, you will see this pattern:

kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()

This is what is the spinning for the latch looks like from a memory access perspective. So that means that spinning on a latch does not mean the entire latching function (kslgetl or kslges) is called to try to take the latch, but rather when kslgetl finds the latch to be taken, it calls kslges, which does some housekeeping, and then in a really tight loop reads the latch’ memory at offset 0 to detect when it is freed. We can see it is a loop by looking at the offset numbers of the function, these are 930, 937 and 944, which are the offsets we see over and over.

In fact, if I grep for that line and count the number of lines, we can see this version of oracle (12.2.0.1) still does spin 20’000 times for a latch in willing to wait mode:

grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | wc -l
20001

Yes, this shows 20001 executions, remember we freed the latch in the other session?

$ grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | tail -3
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0/0()

The last line shows kslges finding the latch in a free state, which is after it has been woken from sleeping on a semaphore.

A modern version of Oracle doesn’t sleep for an arbitrary time and then spin again, it goes to sleep on a semaphore after it spun 20’000 times and then waits until it gets posted. That can be seen in the pinatrace output too, but if you want to look at the function flow, it might be better to generate a debugtrace trace, the pinatrace trace shows what information is used within a function, and might be too verbose. In fact, the function flow towards the semaphore call is kslges > skgpwwait > sskgpwwait.

Conclusion
pinatrace_annotate_oracle.sh is a tool to annotate a tracefile generated by the pin tool ‘pinatrace’ when running the oracle binary. It shows information about any memory access done by a process, which allows you to understand what information is obtained in what function of Oracle. This is a tool only for research and debug purposes.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.