Archive

Linux

To me, ‘cloud computing’ is renting a compute resource to perform a task. In order to use that compute resource, you need to instruct it to do something, which is typically done via the network. If the task the compute resource needs to fulfil is being an application server or being a client or both in the case of an application server that uses an Oracle database, the network latency between the client of the database and the database server is a critical property.

I think so far everybody is with me. If we zoom in to the network, it becomes more difficult, and *very* easy to make wrong assumptions. Let me explain. A network, but really any connection between processing and a resource, has two DIFFERENT properties that I see getting mixed up consistently. These are:
* Latency: the time it takes for a signal or (network) packet to travel from the client to the server, or the time it takes to travel from the client to the server and back.
* Bandwidth: the amount of data that can be transported from the client to the server in a certain time.

How do you determine the latency of a network? Probably the most people respond with ‘use ping’. This is how that looks like:

[user@oid1 ~]$ ping -c 3 lsh1
PING lsh1 (x.x.x.x) 56(84) bytes of data.
64 bytes from lsh1: icmp_seq=1 ttl=62 time=680 ms
64 bytes from lsh1: icmp_seq=2 ttl=62 time=0.304 ms
64 bytes from lsh1: icmp_seq=3 ttl=62 time=0.286 ms

The question I often ask myself is: what is that we see actually? How does this work?
In order to answer that question, the tcpdump tool can answer that question. Using tcpdump, you can capture the network packets on which the ping utility based the above outcome. The ‘-ttt’ option calculates the time between each arrived packet:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host lsh1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1 > lsh1: ICMP echo request, id 35879, seq 1, length 64
00:00:00.680289 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 1, length 64
00:00:00.319614 IP oid1 > lsh1: ICMP echo request, id 35879, seq 2, length 64
00:00:00.000287 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 2, length 64
00:00:01.000180 IP oid1 > lsh1: ICMP echo request, id 35879, seq 3, length 64
00:00:00.000269 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 3, length 64

So, ping works by sending a packet (ICMP echo request) requesting a reply (ICMP echo reply) from the remote server, and measure the time it takes to get that reply. Great, quite simple, isn’t it? However, the biggest issue I see this is using a protocol that is not used for sending regular data (!). Most application servers I encounter send data using TCP (transmission control protocol), the traffic ping sends are sent using a protocol called ICMP (internet control message protocol). Especially in the cloud, which means (probably) a lot of the infrastructure is shared, ICMP might be given different priority than TCP traffic, which you quite probably are using when the application on your cloud virtual machine is running. For those of you who haven’t looked into the network side of the IT landscape, you can priorise protocols and even specific ports, throttle traffic and you can even terminate it. In fact, a sensible protected (virtual) machine in the cloud will not respond to ICMP echo requests in order to protected it from attacks.

So, what would be a more sensible approach then? A better way would be to use the same protocol and port number that your application is going to use. This can be done using a tool called hping. Using that tool, you can craft your own packet with the protocol and flags you want. In the case of Oracle database traffic that would be the TCP protocol, port 1521 (it can be any port number, 1521 is the default port). This is how you can do that. In order to mimic starting a connection, the S (SYN) flag is set (-S), one packet is send (-c 1) to port 1521 (-p 1521).

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip

What this does is best investigated with tcpdump once again. The server this is executed against can respond in two ways (three actually). When you send this to TCP port 1521 where a listener (or any other daemon that listens on that port) is listening, this is the response:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [S], seq 1436552830, win 512, length 0
00:00:00.001229 IP db01-vip.ncube-lm > oid1.kjtsiteserver: Flags [S.], seq 2397022511, ack 1436552831, win 14600, options [mss 1460], length 0
00:00:00.000023 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [R], seq 1436552831, win 0, length 0

This is a variation of the classic TCP three way handshake:
1. A TCP packet is sent with the SYN flag set to indicate starting a (client to server) connection.
2. A TCP packet is sent back with SYN flag set to indicate starting a (server to client) connection, and the first packet is acknowledged.
3. This is where the variation is, normally an acknowledgement would be sent of the second packet to establish a two way connection, but in order to stop the communication a packet is sent with the RST (reset) flag set.

However, this is if a process is listening on the port. This is how that looks like when there is no process listening on port 1521:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.vsamredirector > db01.ncube-lm: Flags [S], seq 1975471906, win 512, length 0
00:00:00.001118 IP db01.ncube-lm > oid1.vsamredirector: Flags [R.], seq 0, ack 1975471907, win 0, length 0

This means that if a connection is initiated to a port on which no process is listening (port status ‘closed’), there is communication between the client and the server. This is why firewalls were invented!
1. A TCP packet is sent with the SYN flag set to indicate starting a connection.
2. A TCP packet is sent back to with the RST (reset) flag set to indicate no connection is possible.

The third option, when port 1521 is firewalled on the server, simply means only the first packet (from client to server with the SYN flag set) is sent and no response is coming back.

Okay, let’s pick up the performance aspect again. This hping command:

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip
HPING db01-vip (eth0 x.x.x.x): S set, 40 headers + 0 data bytes
len=44 ip=db01-vip ttl=57 DF id=0 sport=1521 flags=SA seq=0 win=14600 rtt=1.2 ms

Says the roundtrip time is 1.2ms. If we look at the network packets and timing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.mmcal > db01-vip.ncube-lm: Flags [S], seq 1289836562, win 512, length 0
00:00:00.001113 IP db01-vip.ncube-lm > oid1.mmcal: Flags [S.], seq 2504750542, ack 1289836563, win 14600, options [mss 1460], length 0
00:00:00.000016 IP oid1.mmcal > db01-vip.ncube-lm: Flags [R], seq 1289836563, win 0, length 0

It becomes apparent that the 1.2ms time hping reports is the time it takes for the remote server to send back the SYN+ACK package in the TCP three way handshake.

So does that mean that if we take a number of measurements (let’s say 100, or 1000) to have a statistically significant number of measurements we can establish my TCP roundtrip time and then know how fast my connection will be (outside of all the other variables inherent to the internet and potential noisy neighbours to name a few)?

Oracle provides a way to generate and measure SQL-Net traffic in My Oracle Support note: Measuring Network Capacity using oratcptest (Doc ID 2064368.1). This note provides a jar file which contains server and client software, and is aimed at dataguard, but is useful to measure SQL-Net network latency. I have looked at the packets oratcptest generates, and they mimic SQL-Net quite well.

Let’s see if we can redo the test above to measure pure network latency. First on the database server side, setup the server:

[user@db01m ~]$ java -jar oratcptest.jar -server db01 -port=1521

And then on the client side run the client using the same oratcptest jar file:

java -jar oratcptest.jar db01 -mode=sync -length=0 -duration=1s -interval=1s -port=1521

The important bits are -mode=sync (client packet must be acknowledged before sending another packet) and -length=0 (network traffic contains no payload). This is the result:

[Requesting a test]
	Message payload        = 0 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(07:34:42) The server is ready.
                        Throughput                 Latency
(07:34:43)          0.017 Mbytes/s                0.670 ms
(07:34:43) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.017 Mbytes/s
	             Avg. latency = 0.670 ms

If you look at the hping roundtrip time (1.2ms) and the oratcptest roundtrip time (0.7ms) clearly this is different! If you just look at the numbers (1.2 versus 0.7) it might seem like the oratcptest time is only measuring client to server traffic instead of the whole roundtrip? For this too it’s good to use tcpdump once again and look what oratcptest actually is doing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.63602 > db01.ncube-lm: Flags [S], seq 2408800085, win 17920, options [mss 8960,sackOK,TS val 3861246405 ecr 0,nop,wscale 7], length 0
00:00:00.001160 IP db01.ncube-lm > oid1.63602: Flags [S.], seq 2178995555, ack 2408800086, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:00:00.000015 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 1, win 140, length 0
00:00:00.023175 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 1:145, ack 1, win 140, length 144
00:00:00.000520 IP db01.ncube-lm > oid1.63602: Flags [.], ack 145, win 123, length 0
00:00:00.000951 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 1:145, ack 145, win 123, length 144
00:00:00.000008 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 145, win 149, length 0
00:00:00.018839 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 145:157, ack 145, win 149, length 12
00:00:00.000563 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 145:149, ack 157, win 123, length 4
00:00:00.000358 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 157:169, ack 149, win 149, length 12
00:00:00.000486 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 149:153, ack 169, win 123, length 4
00:00:00.000100 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 169:181, ack 153, win 149, length 12
00:00:00.000494 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 153:157, ack 181, win 123, length 4
...
00:00:00.000192 IP oid1.63586 > db01.ncube-lm: Flags [P.], seq 18181:18193, ack 6157, win 149, length 12
00:00:00.000447 IP db01.ncube-lm > oid1.63586: Flags [P.], seq 6157:6161, ack 18193, win 123, length 4
00:00:00.006696 IP oid1.63586 > db01.ncube-lm: Flags [F.], seq 18193, ack 6161, win 149, length 0
00:00:00.000995 IP db01.ncube-lm > oid1.63586: Flags [F.], seq 6161, ack 18194, win 123, length 0
00:00:00.000012 IP oid1.63586 > db01.ncube-lm: Flags [.], ack 6162, win 149, length 0

If you look at rows 4, 5 and 6 you see the typical TCP three-way handshake. What is nice to see, is that the actual response or roundtrip time for the packet from the server on line 5 actually took 1.1ms, which is what we have measured with hping! At lines 7-10 we see there is a packet send from the client to the server which is ACK’ed and a packet send from the server to the client which is ACK’ed. If you add the ‘-A’ flag to tcpdump you can get the values in the packet printed as characters, which shows the client telling the server how it wants to perform the test and the server responding with the requested settings. This is all a preparation for the test.

Starting from line 11, there is a strict repeating sequence of the client sending a packet of length 12, ACK’ing the previous received packet, and then the server responding with a packet of length 4 ACK’ing its previous received packet. This is the actual performance test! This means that the setting ‘-duration=1s -interval=1s’ does not mean it sends one packet, it actually means it’s continuously sending packets for the duration of 1 second. Also another flag is showing: the P or PSH (push) flag. This flag means the kernel/tcpip-stack understands all data to transmit is provided from ‘userland’, and now must be sent immediately, and instructs the receiving side to process it immediately in order to bring it to the receiving userland application as soon as possible too.

Lines 20-22 show how the connection is closed by sending a packet with a FIN flag, which is done for both the client to the server and the server to the client, and because it’s TCP, these need to be ACK’ed, which is why you see a trailing packet without a flag set, only ACK’ing the FIN packet.

The conclusion so far is that for real usable latency calculations you should not use a different protocol (so whilst ICMP (ping) does give an latency indication it should really only be used as an indicator), and that you should measure doing the actual work, not meta-transactions like the TCP three way handshake. Probably because of the PSH flag, the actual minimal latency for SQL-Net traffic is lower than ping and hping showed.

Wait a minute…did you notice the ‘actual minimal latency’? So far we only have been sending empty packets, which means we measured how fast a packet can travel from client to server and back. In reality, you probably want to send actual data back and forth, don’t you? That is something that we actually have not measured yet!

Let’s do actual Oracle transactions. For the sake of testing network latency, we can use Swingbench to execute SQL. This is how that is done:

[user@oid1 bin]$ cd ~/sw/swingbench/bin
[user@oid1 bin]$ ./charbench -c ../configs/stresstest.xml -u soe -p soe -uc 1 -rt 00:01
Author  :	 Dominic Giles
Version :	 2.5.0.971

Results will be written to results.xml.
Hit Return to Terminate Run...

Time		Users	TPM	TPS

8:22:56 AM      1       14450   775

Please mind I am using 1 user (-uc 1) and a testing time of 1 minute (-rt 00:01), which should be longer when you are doing real testing. As a reminder, I am using 1 session because I want to understand the latency, not the bandwidth! In order to understand if the network traffic looks the same as oratcptest.jar, I can use tcpdump once again. Here is a snippet of the traffic:

...
00:00:00.000106 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5839:5852, ack 5986, win 272, length 13
00:00:00.000491 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 5986:6001, ack 5852, win 330, length 15
00:00:00.000234 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5852:6003, ack 6001, win 272, length 151
00:00:00.000562 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6001:6077, ack 6003, win 330, length 76
00:00:00.000098 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6003:6016, ack 6077, win 272, length 13
00:00:00.000484 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6077:6092, ack 6016, win 330, length 15
00:00:00.000238 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6016:6159, ack 6092, win 272, length 143
00:00:00.000591 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6092:6425, ack 6159, win 330, length 333
...

The important bit is this shows the same single packet traffic client to server and back as we saw oratcptest generated, however now with varying packet size (which is logical, different SQL statements are sent to the database), the PSH bit is set, which also is the same as oratcptest generated.

Let’s assume this is a real-life workload. In order to measure and calculate differences in performance between different networks, we need the average packet length. This can be done with a tool called tcpstat (this link provides the EL6 version). In my case I have only one application using a database on this server, so I can just filter on port 1521 to measure my SQL-Net traffic:

[user@oid1 ~]$ sudo tcpstat -i eth0 -o "Packet/s=%p\tmin size: %m\tavg size: %a\tmax size: %M\tstddev: %d\n" -f 'port 1521'
Packet/s=2526.40	min size: 53	avg size: 227.76	max size: 1436	stddev: 289.21
Packet/s=2531.40	min size: 53	avg size: 229.79	max size: 1432	stddev: 291.22
Packet/s=2634.20	min size: 53	avg size: 229.59	max size: 1432	stddev: 293.38
Packet/s=2550.00	min size: 53	avg size: 234.11	max size: 1435	stddev: 296.77
Packet/s=2486.80	min size: 53	avg size: 232.24	max size: 1436	stddev: 293.16

In case you wondered why tcpstat reports a minimum length of 53 and tcpdump (a little up in the article) of 13; tcpstat reports full packet length including packet, protocol and frame headers, tcpdump in this case reports the payload length.

Now we can execute oratcptest.jar again, but with a payload size set that matches the average size that we measured, I have taken 250 as payload size:

[user@oid1 ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(09:39:47) The server is ready.
                        Throughput                 Latency
(09:39:48)          0.365 Mbytes/s                0.685 ms
(09:39:48) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.365 Mbytes/s
	             Avg. latency = 0.685 ms

As you can see, there is a real modest increase in average latency going from 0.670ms to 0.685ms.

In order to test the impact of network latency let’s move the oratcptest client to the server, to get the lowest possible latency. Actually, this is very easy, because the oratcptest.jar file contains both the client and the server, so all I need to do is logon to the server where I started the oratcptest.jar file in server mode, and run it in client mode:

[user@db01m ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(14:49:29) The server is ready.
                        Throughput                 Latency
(14:49:30)         12.221 Mbytes/s                0.020 ms
(14:49:30) Test finished.
	       Socket send buffer = 26010 bytes
	          Avg. throughput = 11.970 Mbytes/s
	             Avg. latency = 0.021 ms

Wow! The roundtrip latency dropped from 0.685ms to 0.021ms! Another test using oratcptest.jar using a true local network connection (with Linux being virtualised using Xen/OVM) shows a latency of 0.161ms.

These are the different network latency figures measured with oratcptest using a payload size that equals my average network payload size:
– Local only RTT: 0.021
– Local network RTT: 0.161
– Different networks RTT: 0.685

If I take swingbench and execute the ‘stresstest’ run local, on a machine directly connected via the local network and across different networks (think cloud), and now measure TPS (transactions per second), I get the following figures:
– Local only TPS: 2356
– Local network TPS: 1567
– Different networks TPS: 854

Do these figures make sense?
– Local only: Time not in network transit per second: 1000-(0.021*2356)=950.524; approximate average time spend on query: 950.523/2356=0.40ms
– Local network: 1000-(0.161*1567)=747.713/1567=0.48ms
– Different networks: 1000-(0.685*854)=415.010/854=0.49ms
It seems that this swingbench test spends roughly 0.40-0.50ms on processing, the difference in transactions per second seem to be mainly caused by the difference in network latency.

This second blogpost on Performance Co Pilot or PCP in short is about visualisation. In case you haven’t read the first part, here it is, which describes how it works, why you should use it, and how you can install it.

Pmchart.
One way of visualising PCP is using the pmchart utility. The pmchart utility is installed via the pcp-gui package (yum install pcp-gui). The pmchart utility uses X to display a window and draw lines, bar graphs, area graphs, etc. in a sense alike the performance manager on Microsoft Windows. You can select the individual performance statistics PCP measures, which can be viewed and investigated with pminfo, for example kernel.all.cpu.user:

# pminfo -dT kernel.all.cpu.user

kernel.all.cpu.user
    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: millisec
Help:
total user CPU time from /proc/stat for all CPUs, including guest CPU time

Use ‘pminfo’ without any argument to get a list of available statistics.

When I tried the pmchart utility on OL7.2, I needed to install the dejavu-sans fonts in order not to get black squares as letters in the pmchart window (yum install dejavu-sans-fonts). Also, there are pre-created ‘views’ for filesystem, load, memory, etc. (file, open view). This is very nice, it requires you to log on to a host to investigate with X tunnelled to your system (or have VNC setup), start pmchart and select metrics or open a view, and then record them. It doesn’t seem to be possible to use the archives pmlogger creates for historical analysis, nor it doesn’t seem to be able to save the recorded data, only the graphics that are created based on the selected statistics or using a view as a picture.
Note: upon reading the documentation, it turns out it is possible to use archives. This means you can pick up an archive from a machine and load it using pmchart locally.

However, there are different ways of investigating a host via PCP statistics… The pcp-webapi package provides exactly what the name suggests: a deamon that externalises PCP data via a web based (REST) API. Here’s how that installed:
There is a caveat: the pcp-webapi package is in the ‘optional latest’ repository, which might not be enabled. You can use ‘yum repolist all’ to see all configured repositories.

# yum install --enablerepo="ol7_optional_latest" pcp-webapi

After installing, you need to enable automatic startup (using systemctl in OL7) and start it:

# systemctl enable pmwebd
# systemctl start pmwebd

At this point I need to point out that starting pmwebd means a daemon starts at port 44323 for HTTP requests, which does not provide any authentication. If the host you are using is directly connected to the internet, you should protect this port and do not allow access. The best to use this port is using a tunnel.

This, however, just enables the API, it doesn’t provide anything that displays data useful for a person. That is done when you add the pcp-webjs package:

# yum install --enablerepo="ol7_optional_latest" pcp-webjs

This installs 3 webapps: vector, graphite and grafana.

Vector
Vector (http://vectoross.io) is a web based application that is written in javascript that fetches current data from the pcp webapi and creates graphs from it. You can use the vector app that is installed as part of the pcp-webjs package together with PCP, or start it up from somewhere else, as an example, the documentation mentions putting the app in an Amazon S3 bucket. If it’s started from somewhere else, you need to point vector to the PCP webapi of the machine you want to investigate because your browser, which runs vector, needs to have access to port 44323 to fetch data from the webapi. To use vector that is installed with PCP using the pcp-webjs package, point your browser to http://hostname:44323/vector and specify the hostname or ip address at “Hostname”. This is how that looks like:
vector
Vector displays PCP measured statistics about CPU, memory, network, disk and even about containers, with an high resolution (2 seconds by default).

Grafana
Grafana (http://grafana.net) installed with the webjs package is a web based application that takes an approach opposite to vector, it fetches the data from the archives created by pmlogger and displays the statistics stored in these. The default dashboard displays the 1 minute load average, network I/O, disk reads and writes, memory statistics and filesystem fullness. The different colours in the graphs over time indicate different PCP archives being used, because by default PCP archives store data per day. To use grafana, point your browser to http://hostname:44323/grafana. This is how that looks like:
grafana

This blog post is about two things: one how you can monitor who is bringing you database up and down (there is a twist at the end!) and two how you can very conveniently do that with aggregated logs in a browser with a tool called ‘Kibana’, which is the K in ELK.

What is the ‘ELK stack’?
The ELK stack gets it’s name from Elasticsearch, Logstash and Kibana.
– Elasticsearch is an open source search engine based on Apache Lucene, which provides a distributed, multitenant-capable full-text search engine with a http web interface and schema-free JSON documents.
– Logstash is a fully configurable open source data processing pipeline that can receive data from a multiple sources simultaneously, transform it and output it based on the output plugin, which is the elastic search plugin in this blogpost but could be anything from STDOUT, an unix pipe, a file, a file in CSV, HTTP, email, IRC, Jira, graphite, kafka, mongodb, nagios, S3, SolR, … really whatever you want.
– Kibana is an open source data visualisation plugin for Elasticsearch.
When looking at Kibana, it quite much looks like the splunk interface.

Installing the ELK stack.
Installing the ELK stack in a basic way is easy. In this blogpost I will install everything on the same host, everything being the ELK stack and an Oracle database installation. In reality you should have a log gatherer on every host (called ‘filebeat’) and a dedicated host which runs the rest of the stack (logstash, elasticsearch and kibana). The below install actions were executed on a Linux 64 bit host running Oracle Linux 6.8.
In order to make the installation really easy, I use the yum repository of the elastic company, this is how to set that up (all done as root, ‘#’ indicates root):

# rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
# vi /etc/yum.repos.d/elastic.repo
[elastic-5.x]
name=Elastic repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

Install elasticsearch:

# yum install java-1.8.0-openjdk
# yum install elasticsearch
# chkconfig --add elasticsearch
# service elasticsearch start

Install logstash:

# yum install logstash

Configure logstash input and output:

# vi /etc/logstash/conf.d/input.conf
input {
  beats {
    port => 5044
  }
}
# vi /etc/logstash/conf.d/output.conf
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

Verify the logstash config files:

# sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t
Sending Logstash's logs to /var/log/logstash which is now configured via log4j2.properties
Configuration OK

If you see the ‘Configuration OK’ message, it means logstash could interprent the configuration files. It does not mean it will all work as desired, there could be runtime issues.
Now let’s start logstash. Logstash uses upstart (meaning a startup script in /etc/init) instead of the legacy startup mechanism using the chkconfig and service utilities.

# initctl start logstash

The last part of the data pipeline is ‘filebeat’. There are and could be multiple input products, in this blogpost I use ‘filebeat’, which keeps track of logfiles.

# yum install filebeat
# chkconfig --add filebeat

We are going to look into linux and oracle auditing. So we need to keep track of a couple of files:
– /var/log/secure: this is the default linux logfile which contains all kinds of authentication messages, as defined in /etc/rsyslog.conf (authpriv.* /var/log/secure).
– /u01/app/oracle/admin/*/adump/*.aud: this is the default place where the oracle database stores it’s audit files. These audit files provide what is called ‘mandatory auditing’, and includes at least connections to the instance with administrator privilege, database startup and database shutdown. The default is a normal text based logfile, it could be set to XML.
– /var/log/audit/audit.log: this is the logfile of the linux kernel based audit facility. This is actually a lesser known hidden gem in Linux, and provides audit information from the Linux kernel.

These files need to be configured in filebeat, in the file: /etc/filebeat/filebeat.yml. As the extension of the file indicates, this is a file organised in YAML syntax. The best way to configure the file is to move the file, and create your own file with your desired configuration. First of all we add the output, which is logstash in our case. Please mind the default configuration of filebeat is direct output to elasticsearch, which means we don’t have an option to enrich the data!

# mv /etc/filebeat/filebeat.yml /etc/filebeat/filebeat.yml.orig
# vi /etc/filebeat/filebeat.yml
output.logstash:
  hosts: ["localhost:5044"]

Please mind the two spaces in front of ‘hosts’, which is mandatory for a YAML document!
Next up we add the files to monitor in the configuration file. The linux based logfiles are easy:

filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/secure
  document_type: secure

- input_type: log
  paths:
    - /var/log/audit/audit.log
  document_type: audit

One thing to notice is that a type is set for each file (which is really just a name for the file filebeat monitors), which makes it able to find data from these specific files later on. Now the Oracle audit file:

- input_type: log
  paths:
    - /u01/app/oracle/admin/*/adump/*.aud
  document_type: oracle_audit
  multiline:
    pattern: '^[A-Za-z]{3} [A-Za-z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}'
    negate: true
    match: after

This looks a bit more complicated. The reason for the complication is the multiline specification. An Oracle database audit file contains a timestamp, after which the audit data is written; it looks like this:

Thu Jan 19 13:44:12 2017 +00:00
LENGTH : '198'
ACTION :[49] 'ALTER DATABASE OPEN /* db agent *//* {0:0:476} */'
DATABASE USER:[1] '/'
PRIVILEGE :[6] 'SYSDBA'
CLIENT USER:[6] 'oracle'
CLIENT TERMINAL:[0] ''
STATUS:[1] '0'
DBID:[10] '2622783786'

The important things at this time: the ‘pattern’ keyword specifies the timestamp, you can see you can match it with the timestamp, and all the following data needs to be processed together, this is a single record, written over multiple lines. ‘negate: true’ means that anything that does not fit the pattern needs to be added to this piece of data, ‘match: after’ means that this is added after the pattern is matched.

Now that filebeat is setup, we can start the filebeat daemon:

# service filebeat start

The last component is kibana:

# yum install kibana
# chkconfig --add kibana
# service kibana start

Now that we’ve set the entire pipeline up, a next thing to do is to configure logstash to enrich the data. Here’s the how it’s done for the Oracle database audit file:

# vi /etc/logstash/conf.d/oracle-audit.conf
filter {
  if [type] == "oracle_audit" {
    grok {
      match => { "message" => "^%{DAY} %{MONTH:M} %{MONTHDAY:d} %{HOUR:h}:%{MINUTE:m}:%{SECOND:s} %{YEAR:y}" }
      add_tag => [ "grok", "oracle_audit" ]
    }
    grok {
      match => { "message" => "ACTION :\[[0-9]*\] '(?<ora_audit_action>.*)'.*DATABASE USER:\[[0-9]*\] '(?<ora_audit_dbuser>.*)'.*PRIVILEGE :\[[0-9]*\] '(?<ora_audit_priv>.*)'.*CLIENT USER:\[[0-9]*\] '(?<ora_audit_osuser>.*)'.*CLIENT TERMINAL:\[[0-9]*\] '(?<ora_audit_term>.*)'.*STATUS:\[[0-9]*\] '(?<ora_audit_status>.*)'.*DBID:\[[0-9]*\] '(?<ora_audit_dbid>.*)'" }
    }
    grok {
      match => { "source" => [ ".*/[a-zA-Z0-9_#$]*_[a-z0-9]*_(?<ora_audit_derived_pid>[0-9]*)_[0-9]*\.aud" ] }
    }
    mutate {
      add_field => { "ts" => "%{y}-%{M}-%{d} %{h}:%{m}:%{s}" }
    }
    date {
      locale => "en"
      match => [ "ts", "YYYY-MMM-dd HH:mm:ss" ]
    }
    mutate {
      remove_field => [ "ts", "y", "M", "d", "h", "m", "s" ]
    }
  }
}

It’s beyond the scope of this article to go through every detail, but as you can see we apply a filter. Everything in this filter takes place for the type “oracle_audit”, which is set by filebeat. The next thing we encounter a couple of times is ‘grok’s’. The term grok comes from the Robert Heinlein science-fiction novel ‘Stranger in a Strange land’. Essentially, a grok with logstash means you specify a pattern, for which the actions are applied if the specified pattern matches. The first grok looks for the date pattern for which extra fields are created (M,d,h,m,s, after the colon) in the field ‘message’, and adds a tag (a word in the tags field for the record that is created). The second grok also looks in the ‘message’ field, and specifies text (ACTION for example), some other characters and then (?.*) is visible. This is a custom pattern, for which the field name to be created is in between < and > and is followed by a pattern. This grok line (including all the patterns) creates fields for all the Oracle audit fields in the audit file! The next grok picks up the PID from the filename of the logfile (the filename is in a field ‘source’), and the two mutates create and destroy a new field ts which is used for the date, and date specifies the date/time with the data flowing through logstash is filled with the date and time in the ts field, instead of the time filebeat picked up the data and sent it through logstash. Please mind that if you add (or change) configuration in a logstash configuration file, you need to restart logstash.

We are all set now! Last words on this configuration: kibana and elasticsearch by default do not require authentication. Do not expose the ports of these products to the internet! I am using a tunnel to the kibana website, which runs on port 5601. It’s very easy to ssh into the machine running the ELK stack using ssh user@machine -L 5601:localhost:5601, which creates a port on localhost:5601 on my machine at home (-L = local), for which communication is tunnelled to localhost:5601 on the remote machine (the localhost in the ssh line example is an address on the machine you ssh in to, this could also be another server which is only visible from the machine you ssh into.

First let’s login to the machine, and see what information is revealed with /var/log/secure:
kibana-secure-login
You get this screen when you goto kibana at port 5601, enter: ‘type: secure’ in the search bar to display data of the type secure (which is what is set with document_type: secure in filebeat.yml), and login to the machine where filebeat is watching the /var/log/secure file. As you can see, you get two lines from the ssh deamon, one indicating something’s done with pam (pam_unix), and one line which indicates it authenticated via a public key for user ops from an ip address (which is anonymised) at port 39282 via ssh2.

With a lot of cloud providers you get a user which has public key authentication setup (which you saw above), after which you need to sudo to for example the oracle user. In a lot of companies, you get a personalised account to log on to servers, after which you need to sudo to oracle. In both cases you need to use sudo to become the user that you need to administer, for example oracle. This is what sudo generates in the /var/log/secure file:
kibana-secure-sudo
The secure log displays sudo was invoked by the user opc, on TTY pts/1 and the command executed via sudo was ‘/bin/su – oracle’.

Now that I have became oracle using sudo, I set the environment of my database using oraenv and started up a database. Now go over to kibana, and issued a search for ‘type: oracle_audit’. This is how that looks like:
kibana-oracle_audit
Now if you look at what the audit record provides, the only things that provide something useful for the purpose of investigating who did stop or start a database are ACTION and CLIENT TERMINAL (I assume the database is stopped and started by the ‘oracle’ user). Now change the ‘selected fields’ in kibana and add the (dynamically created!) fields: ora_audit_action, ora_audit_term and ora_audit_derived_pid, and remove message. This is how that looks like:
kibana-oracle-audit-startup
The important thing to look for here is the ora_audit_action ‘startup’, then look at the ora_audit_derived_pid, and two rows down we see terminal ‘pts/1’ was the terminal on which this was entered.

Now that we know the terminal, we can add in searching in the message field for the secure type. Enter ‘type: oracle_audit OR (type: secure AND message: “*pts/1*”)’ in the search bar.
kibana-secure-oracle_audit
Okay, this works. But it’s far from perfect. In fact, it only works if the username of the session doing the sudo is the only session with that username, otherwise if there is more than one session it can be any of these sessions doing the sudo, since there is nothing more than the username. This also means that if there is a direct logon to the oracle user, there is no way to identify a session with a TTY, and thus database startup and shutdown are completely anonymous, there’s no way to link a specific session to that action outside of probably the oracle user and a TTY which can not be linked to anything like for example an ip address.

Is there a better way? Yes! We can also use the linux, kernel based, auditing service, which is on by default. This service keeps a log file at /var/log/secure/secure.log, and gives way more granular auditing events than the /var/log/secure log. Linux audit generates a lot of diverse types of rows, so it’s actually not easy to grok them, but in order to understand which session executed a startup or shutdown, the only audit row that is important for this specific use case is an audit type called ‘CRED_ACQ’. The grok for this type looks like this:

# vi /etc/logstash/conf.d/linux-audit.conf
filter {
  if [type] == "audit" {
    grok {
        match => { "message" => ""type=%{WORD:audit_type} msg=audit\(%{NUMBER:audit_epoch}:%{NUMBER:audit_counter}\): pid=%{NUMBER:audit_pid} uid=%{NUMBER:audit_uid} auid=%{NUMBER:audit_auid} ses=%{NUMBER:audit_ses} msg='op=%{NOTSPACE:audit_op} ((acct=\"%{GREEDYDATA:audit_acct}\")|(id=%{NUMBER:audit_id})|acct=%{BASE16NUM:audit_acct}) exe=\"%{GREEDYDATA:audit_exe}\" hostname=%{NOTSPACE:audit_hostname} addr=%{NOTSPACE:audit_addr} terminal=%{NOTSPACE:audit_terminal} res=%{NOTSPACE:audit_res}'" }
        add_tag => [ "grok", "audit" ]
    }
    date {
      locale => en
      match => [ "audit_epoch", "UNIX" ]
    }
  }
}

This grok matches the CREDIT_ACQ audit type which we will use to trace back the session via the audit log. Another nicety of this logstash configuration is the audit records time using an epoch timestamp, which logstash can translate back to a human readable timestamp. Once this is in place, log in again and use sudo to switch to oracle (or log in directly as oracle, it doesn’t really matter that much now!), and search in kibana for: ‘type: oracle_audit OR (type: audit AND audit_type: CRED_ACQ)’. Now get the relevant fields; remove ‘message’, and add: audit_hostname, audit_acct, audit_ses, audit_terminal, ora_audit_term, ora_audit_derived_pid, ora_audit_action. This probably returns a log of rows, now scroll (“back in time”) and search for the startup or shutdown command, and then follow the trail:
kibana-oracle_audit-audit-raw
Startup points to (oracle server process) PID 17748, which was instantiated by a session using by pts/1 (two rows down), one row further down we see the audit information which shows pts/1, which is connected to audit_ses 4230. The audit_ses number is a number that sticks with a session, regardless of using sudo. If you follow down number 4230, you see multiple rows of audit_ses 4230, some of them with root, which is typical for sudo switching from one user to another. The final row shows the user logging in with it’s ip address. In other words: using the linux kernel audit facility, you can get all available information!

Okay, all happy now? Are you sure? Now for the twist!

Whenever you use RAC, or use ASM, or use both, or you are using the grid infra single instance as a framework to track your your listener(s) and database(s) and start and stop them automatically, you can still stop and start an instance directly using sqlplus, but in most cases you will be using the grid infrastructure crsctl or srvctl commands. When the grid infrastructure crsctl and srvctl commands are used, this is how the Oracle database audit information looks like:
kibana-oracle_audit-crs-shutdown
As you can see, because the cluster ware brought the database down, there is no terminal associated with the shutdown. So the above mentioned way of first searching for startup and shutdown in the oracle audit information, finding the associated terminal, and then tracing it through the audit records can NOT be used whenever the Oracle cluster ware is used, because a grid infrastructure deamon is actually stopping and starting the database, and the grid infrastructure does not keep any information (that I am aware of) about which client invoked a grid infrastructure command. I guess a lot of auditors will be very unhappy about this.

Now the good news: you can solve this issue very easy. The downside is it requires additional configuration of the linux auditing. The solution is to put an ‘execution watch’ on srvctl and crsctl; this is how this is done:

# auditctl -w /u01/app/12.1.0.2/grid/bin/srvctl -p x -k oracrs
# auditctl -w /u01/app/12.1.0.2/grid/bin/crsctl -p x -k oracrs

In order to validate the working, I started a database using srvctl, and searched for: ‘(type: oracle_audit AND ora_audit_action: STARTUP) OR (type: audit AND message: key=\”oracrs\”)’. This is how that looks like:
kibana-oracle_audit-audit-watch
As you can see, there’s the Oracle database record indicating the startup of the database, and a little while back in time there’s the linux audit row indicating the execution of the srvctl executable. Once you are at that point, you can using the earlier mentioned way of using the audit_ses number to trace the session execution, including sudo and ip address at logon time.

I was investigating gathering performance data on (oracle) linux servers recently and came across Performance Co-Pilot (PCP). I have come across this product regularly in the past, but it seemed somewhat abstract to me, and I never ran into any actual usage. And we got sar for linux performance data and for the Oracle database we got oswatcher (and it’s exadata cousin exawatcher) and TFA right? How wrong I was.

First let me explain a few things.

There is no magic involved in performance measurement and tuning. What this aims to say is that most performance gathering frameworks use the same performance data sources as regular tools use. The are tools that can go beyond the regular data sources, like systemtap, which actually can be used as a source for PCP (but is beyond the scope of this article) and perf to name a few. This means that whatever tool you use, like: mpstat, iostat, top or sar, they all display the same cpu information. It might disguised in a different way, but really it’s all the same information.

This means that’s it’s not about the information itself, but about how you can interact with historical data. The majority of the performance tools simply do not have a facility to store measurements. That’s perfectly fine, and that wasn’t ever a design criterion when they were written. This is true for tools like mpstat, iostat, top to name a few. However in a lot of cases having volatile data is a real pain. Of course you can save the output to a file, but this very quickly becomes a challenge once you have a few files of sar, iostat, etc.

This is where sar comes in. Sar is installed on linux using the sysstat rpm package, and actually does store historical data! Sar stores it’s data in /var/log/sa, and stores sar archives historically based on the HISTORY setting in /etc/sysconfig/sysstat. In the past I seen this been set to 8, a check on a current oracle linux 6 machine shows HISTORY been set to 28. Also in the past the maximal retention time was 31, because the day of the month number would overwrite the same day of the previous month. This issue too seems to be solved, the comments in the /etc/sysconfig/sysstat file says that if the value is set higher than 28, files are kept in month-based directories.

The historical data sar stores are measurements with a ten minute interval. The reason for this is how sa1 (a helper utility for storing sar data) is invoked. This is done via a cron setting in /etc/cron.d/sysstat:

# Run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

If you speak cron, you see that the second row instructs cron to run sa1 taking one sample every ten minutes (which is put in /var/log/saNN, and on line 5 to run sa2, which creates a (text)report called /var/log/sarNN, and does maintenance based on the HISTORY setting.

This is brilliant when coming from having no data at all. In fact, I have used this in the past and am using this with success to look into measurements looking for anomalies. However, it’s not very granular, it’s per ten minutes. Also, getting data is by extracting data from the /var/log/sa/saNN file (sar -f /var/log/sa/saNN -s hh:mm:ss -e hh:mm:ss – (switches for data sources, like cpu% (u), swapping in/out (W), etc)). More frequently I just ask for the /var/log/sa/sarNN (text) report, which contains all sar measurements.

So far we are all happy. Performance data can be queried ad-hoc, and there’s a history with per 10 minute samples. But how about when you do test things on your system? In most of the cases I am involved with benchmarking and proof of concepts, we run things and actually want data which is more granular than 10 minutes. In fact, a single test might not even take 10 minutes and not be visible in the sar history at all! There is a solution for that too: setup sar, iostat, mpstat, etc. each with output into a file in the background, and then run whatever you want to run, and kill the performance collectors afterwards. Sounds great, but if you’ve ever tried that, it becomes a bit complex quite soon.

This is where performance co-pilot (PCP) comes in. PCP is an open-source tool that is available in the ‘latest’ repository of oracle linux 6 and 7, which means it can simply be installed using yum. Once installed, you can extract the very same data earlier mentioned tools get from a linux system using the pcp tools:

CPU information:

[opc@deploy1 ~]$ pmrep :sar-u
               %user       %nice        %sys     %iowait      %steal       %idle
13:00:41         N/A         N/A         N/A         N/A         N/A         N/A
13:00:42        0.47        0.00        0.11        0.00        0.00       99.41
13:00:43        0.96        0.00        0.12        0.00        0.00       98.89
13:00:44        0.21        0.00        0.06        0.00        0.00       99.72
13:00:45        1.15        0.00        0.17        0.00        0.00       98.65
13:00:46        0.03        0.00        0.01        0.00        0.00       99.96
13:00:47        1.22        0.00        0.14        0.00        0.00       98.58
13:00:48        0.58        0.00        0.18        0.00        0.00       99.23
13:00:49        0.98        0.00        0.12        0.00        0.00       98.85
13:00:50        0.18        0.00        0.04        0.00        0.00       99.77
13:00:51        1.15        0.00        0.18        0.00        0.00       98.65
13:00:52        0.07        0.00        0.01        0.00        0.00       99.91
13:00:53        1.19        0.00        0.15        0.00        0.00       98.63

Memory information:

[opc@deploy1 ~]$ pmrep :mem
              free(mb)   bufhead(mb)  pagecache(mb  kernelstack(      slab(mb)  pagetables(m    mapped(mb)      anon(mb)       shm(mb)   swapuse(mb)  hugefree(mb)  hugeused(mb)
13:08:33           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A
13:08:34        237199           320         17056            16          1113            23            62          1048            27             0             0             0
13:08:35        237258           320         17056            16          1113            24            62           991            27             0             0             0
13:08:36        237200           320         17056            16          1113            24            62          1046            27             0             0             0
13:08:37        237255           320         17056            16          1113            24            62           991            27             0             0             0
13:08:38        237212           320         17056            16          1113            24            62          1034            27             0             0             0

Network:

[opc@deploy1 ~]$ pmrep :nw-pkt
          pck/in  pck/in  pck/in  err/in  err/in  err/in  drp/in  drp/in  drp/in  pck/out  pck/out  pck/out  err/out  err/out  err/out  drp/out  drp/out  drp/out
            eth0    eth1      lo    eth0    eth1      lo    eth0    eth1      lo     eth0     eth1       lo     eth0     eth1       lo     eth0     eth1       lo
13:08:58     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A
13:08:59  10.972   0.000  29.924   0.000   0.000   0.000   0.000   0.000   0.000   16.957    0.000   29.924    0.000    0.000    0.000    0.000    0.000    0.000
13:09:00   0.996   0.000  103.63   0.000   0.000   0.000   0.000   0.000   0.000    0.996    0.000  103.627    0.000    0.000    0.000    0.000    0.000    0.000
13:09:01   0.996   0.000  29.892   0.000   0.000   0.000   0.000   0.000   0.000    0.996    0.000   29.892    0.000    0.000    0.000    0.000    0.000    0.000
13:09:02   0.996   0.000  94.653   0.000   0.000   0.000   0.000   0.000   0.000    0.996    0.000   94.653    0.000    0.000    0.000    0.000    0.000    0.000

Disk:

[opc@deploy1 ~]$ pmrep :disk-iops
          riops  wiops  iops
            sda    sda   sda
13:09:23    N/A    N/A   N/A
13:09:24      0      0     0
13:09:25      0      0     0
13:09:26      0      0     0
13:09:27      0      3     3
13:09:28      0      0     0

As you can see, all these statistics where provided using a utility called ‘pmrep’ (reporter utility). They were fetched on runtime. I can go back in time when I use the archive that the pcp logger creates and use -S (starttime):

[opc@deploy1 ~]$ pmrep -a /var/log/pcp/pmlogger/deploy1/20170113.00.10 -S @-1min :sar-u
               %user       %nice        %sys     %iowait      %steal       %idle
13:22:36         N/A         N/A         N/A         N/A         N/A         N/A
13:22:37        0.65        0.00        0.11        0.00        0.00       99.22
13:22:38        0.65        0.00        0.11        0.00        0.00       99.22
13:22:39        0.65        0.00        0.11        0.00        0.00       99.22
13:22:40        0.65        0.00        0.11        0.00        0.00       99.22
13:22:41        0.65        0.00        0.11        0.00        0.00       99.22
13:22:42        0.65        0.00        0.11        0.00        0.00       99.22
13:22:43        0.65        0.00        0.11        0.00        0.00       99.22
13:22:44        0.65        0.00        0.11        0.00        0.00       99.22
13:22:45        0.65        0.00        0.11        0.00        0.00       99.22
13:22:46        0.65        0.00        0.11        0.00        0.00       99.22
13:22:47        0.65        0.00        0.11        0.00        0.00       99.22
13:22:48        0.65        0.00        0.11        0.00        0.00       99.22

The archive is stored in /var/log/pcp/pmlogger/hostname/YYYYMMDD.00.10 for the current day, the older archives are also stored as YYYYMMDD, but without ‘.00.10’ added to it. Above the syntax for using an archive and specifying a relative time is shown (-S for start, then a number and a time specification like sec, min, hour). Of course you can specify absolute times too, like ‘-S @13:22 -T @13:25’. -T is the until time.

Did you notice the measurements read from the archive above are per second?? The metrics are actually measured per minute (this specified in /var/lib/pcp/config/pmlogger/config.default), the pmrep tool extrapolates the values per minute. I am not really happy about that, but you can use ‘-u’ to disable per second extrapolation.

When doing investigation, you often want all data from a certain point in time because quite often you don’t know upfront what exact you are going to encounter. This is also true for troubleshooting, proof of concept testing, performance analysis, benchmarking, etc. You want to be able to do multidimensional analysis after the investigated event, probably even offline. This is a point where PCP truly shines. It easy to create an archive of only the relevant time using the ‘pmlogextract’ utility:

[opc@deploy1 ~]$ pmlogextract -S 12:30 -T 12:40 /var/log/pcp/pmlogger/deploy1/20170113.00.10 this

The above pmlogextract command reads the (current, there’s .00.10 attached) archive /var/log/pcp/pmlogger/deploy1/20170113.00.10 and creates the archive ‘this’ with all measurements between 12:30 and 12:40. A PCP archive consists of three files:

[opc@deploy1 ~]$ ls -ls this*
236 -rw-rw-r--. 1 opc opc 238212 Jan 13 14:05 this.0
  4 -rw-rw-r--. 1 opc opc    192 Jan 13 14:05 this.index
 20 -rw-rw-r--. 1 opc opc  18456 Jan 13 14:05 this.meta

This archive is endian-independent, and can be transferred to your laptop for analysis (provided you got the PCP executables on it of course). Please mind an archive contains a subset of all available performance metrics, in order to save space. My default installation has 1620 metrics available, there are 280 stored in the archive.

But it doesn’t stop here. PCP not only installs a logger daemon (pmlogger) and a couple of agents to extract data (pmdaroot, pmdaproc, pmdaxfs, pmdalinux), but also the ‘pmcd’ daemon (pcp collector deamon) listening on port 44321. This means you can fetch metrics over the network:

[opc@deploy1 ~]$ pmrep -h oracleserver.local :mem
              free(mb)   bufhead(mb)  pagecache(mb  kernelstack(      slab(mb)  pagetables(m    mapped(mb)      anon(mb)       shm(mb)   swapuse(mb)  hugefree(mb)  hugeused(mb)
15:25:18           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A
15:25:19        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:20        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:21        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:22        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:23        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:24        507244           165          5436            19           499           165          1267           817          3015             0             0             0

This is running my memory overview report fetching current data from the server ‘oracleserver.local’! Sadly, currently is fetching historical data using pmrep via the ‘pmcd’ daemon not supported.

The last point I wanted to focus the attention on is that gathering PCP collected performance data on a central host is built-in in PCP using the logger functionality. Provided the collection host can access the remote host on tcp/44321, it’s really simple (example for above mentioned oracleserver.local host). This is all that needs to be done on the collection host. Nothing needs to be done on the host that needs to provide the data:

[root@deploy1 ~]# echo 'oracleserver.local 	n   n	PCP_LOG_DIR/pmlogger/oracleserver.local	-r -T24h10m -c config.oracleserver.local' >> /etc/pcp/pmlogger/control
[root@deploy1 ~]# service pmlogger restart

The installation of PCP via the oracle linux 6 or 7 repository is really simple (provided you enabled the ‘latest’ repository):

[root@deploy1 ~]# yum install pcp pcp-system-tools
[root@deploy1 ~]# chkconfig pmcd on
[root@deploy1 ~]# chkconfig pmlogger on
[root@deploy1 ~]# service pmcd start
[root@deploy1 ~]# service pmlogger start

ps. I gotten a message ‘Error: Failed to create avahi client: Daemon not running’ when executing ‘service pmcd start’, but it does not seem to impact using PCP.

PS. the above mentioned way appears to still be working on OL7, but OL7 in reality is using systemd (actually, it is doing the systemctl commands under the hood). The correct way to enable and start the pmcd and pmlogger daemons is:

# systemctl enable pmcd
# systemctl enable pmlogger
# systemctl start pmcd pmlogger

A final remark: the pmrep reports (:sar-u, :mem, :nw-pkt, disk-iops) mentioned are NOT reports that are in the default installation of PCP. Create a file in the current working directory of the process executing pmrep named pmrep.conf, and use the following gist to fill the file with my report definitions. Actually, it’s really easy to create a report, and there plenty of examples with the ones I created. If the pmrep.conf file is not in the current working directory, it needs to be specific using -c /path/to/pmrep.conf.

In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin, is profile memory access. Profiling memory access using the pin tool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is a pid of an oracle database foreground process. Now execute something in the session you attached pin to and you find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is how the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the function location, the second field is R or W (reading or writing obviously), the third field is the memory location read or written the fourth field is the amount of bits read and the fifth field is prefetched memory.

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of that is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we can change the functions addresses to function names, and the memory addresses to named memory locations whenever possible?

That’s where I created the pinatrace annotate oracle tool for. This is a little scriptset that contains scripts to generate memory information from the instance, after which the instruction pointer addresses and memory locations of a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (‘0x00007fff6fd2ac60’) are probably PGA related, and the numbers between roughly 0x000000006000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, now you can see the reason I picked a seemingly arbitrary range of lines actually was because that range is the memory accesses of the ksl_get_shared_latch function. This annotated version show a shared latch get for the ‘session idle bit’ latch. It’s also visible the function uses PGA memory, some of it annotated, some of it not, and that most of the shared pool access is for the latch (a latch essentially is a memory range with the function of serialising access to a resource), which is in the shared pool because it’s a child latch. It’s also visible memory belonging to X$KSUPR is read and written (X$KSUPR is the table responsible for V$PROCESS, the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient of nature. Because a PGA memory snapshot is made at a certain point in time, this snapshot represents the memory layout of that moment, which has a high probability of having memory deallocated and freed to the operating system. A lot of the SGA/shared pool allocations on the other hand have the intention of re-usability, and thus are not freed immediately after usage, which gives the SGA memory snapshot a good chance of capturing a lot of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell, it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script outputs to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt) or if you want to look and redirect to a file: | tee file.txt.

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manualis to think of Pin as a JIT (just in time) compiler, where the compiler does not take byte code (as JIT compilation does with Java), but the executable of the process pin is executed against. This means pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process an oracle foreground (rows 1-3), after which we see a lot of stuff that pin pushed into the process’ address space.

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful is the tool ‘DebugTrace’. First obtain the process id (pid) from an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look what this produces! This is how my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it look rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, which is because these functions have been called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called and first two arguments of the function.
Returning: the address and function where the return is executed, and the returncode.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind currently I am not aware of a way to remove the pin framework and tool from a process address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the process significantly. Use this tool and the above described techniques at your own risk.

This article is about the internals of how the Oracle database handles transactions. In this case the communication mechanism of foreground sessions to the logwriter process is examined. The tests in this article have been executed using the following versions:
– Oracle database 12.1.0.2.161018
– Oracle linux 7.2, kernel 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

In my previous article, ‘Transactions and SCNs’, I talked about redo generation, among other things. When no changes have been made to the way a commit works (parameters commit_logging, commit_wait and commit_write), a commit waits for its change vectors to be written to disk by either the logwriter or the log writer worker processes. In order for the logwriter or its worker processes to write these change vectors as quickly as possible, a foreground process needs to signal the logwriter. This is independent of the two modes of the logwriter, post/wait and polling, because these describe how the logwriter (or the helper processes) signal the foreground process after writing.

Whenever there is a need for a foreground process, it can signal the logwriter to write out the change vectors. This can not only happen during commit, but also during doing DML. Most people are aware that the technical implementation to trigger the logwriter is using the ‘semctl()’ call. Typically, an idle logwriter process is waiting for the wait event ‘rdbms ipc message’ on the Oracle level, which is the instrumentation of the process waiting for the system call ‘semtimedop()’. This way, the logwriter process is off cpu until the semaphore is triggered or semtimedop() times out.

By executing semctl() against the semaphore identifier and semaphore number the logwriter process is sleeping on, the logwriter is made runnable and starts working.

In order to prove this, this small gdb script can be used:

break kcrfw_copy_cv
commands
silent
print $rip
c
end
break kcrf_commit_force_int
commands
silent
print $rip
c
end
break semctl
commands
silent
print $rip
c
end

When this macro is run in a gdb session attached to an Oracle foreground process, and the following commands are executed:

insert into t values ('a');
commit;

The following output is generated:

$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0x7fc43b2189d0 <semctl>
$5 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

Row 1: this is the redo generated by the insert statement copied into the public logbuffer.
Row 2: this is the redo generated by the commit statement copied into the public logbuffer.
Row 3/4: entering the first ‘kcrf_commit_force_int’ phase, where the logwriter is semctl’ed.
Row 5: entering the second ‘kcrf_commit_force_int’ phase, where is waited for the logwriter to finish writing.

Sounds logical, right?

However, when testing this thesis with two sessions doing the insert and commit at exactly the same time (using the broadcast input option in iTerm2), I gotten the following output:

(first session)
$6 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$7 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$8 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$9 = (void (*)()) 0x7fc43b2189d0 <semctl>
$10 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

(second session)
$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

I guess it’s easy to spot the call to semctl being missing for the second session. What is happening here?

First thing to do is investigate how semctl is called. The way to do that, is by changing the commands to execute when gdb breaks on semctl and get a backtrace using the ‘bt’ command:

(gdb) info break 3
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x00007fc43b2189d0 <semctl>
	breakpoint already hit 2 times
        silent
        print $rip
        c
(gdb) commands 3
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>print $rip
>bt
>c
>end
(gdb) c
Continuing.

Now execute the insert and the commit again, look at the backtrace:

$11 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$12 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$13 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$14 = (void (*)()) 0x7fc43b2189d0 <semctl>
#0  0x00007fc43b2189d0 in semctl () from /lib64/libc.so.6
#1  0x000000000cdf00a5 in sskgpwpost ()
#2  0x000000000cdedcff in skgpwpost ()
#3  0x000000000c94466f in kslpsprns ()
#4  0x000000000c936553 in kskpthr ()
#5  0x000000000c9442af in ksl_post_proc ()
#6  0x000000000ccc0b28 in ksasnd ()
#7  0x000000000ccc17ff in ksbasend ()
#8  0x000000000ccbc13f in kcrf_commit_force_int ()
#9  0x000000000ccb730a in kcrfw_redo_gen_ext ()
#10 0x000000000c9df1c8 in kcbchg1_main ()
#11 0x000000000c9ddc05 in kcbchg1 ()
#12 0x000000000c99421f in ktucmt ()
#13 0x000000000c975dce in ktcCommitTxn_new ()
#14 0x000000000c97c147 in ktcCommitTxn ()
#15 0x000000000ce504f9 in ktdcmt ()
#16 0x000000000cc44917 in k2lcom ()
#17 0x00000000029cacda in k2send ()
#18 0x000000000c97f87c in xctCommitTxn ()
#19 0x000000000270191f in kksExecuteCommand ()
#20 0x000000000cb6d678 in opiexe ()
#21 0x00000000021cf90e in kpoal8 ()
#22 0x000000000cb60d1d in opiodr ()
#23 0x000000000cd8f4db in ttcpip ()
#24 0x0000000001b96d5c in opitsk ()
#25 0x0000000001b9b721 in opiino ()
#26 0x000000000cb60d1d in opiodr ()
#27 0x0000000001b92eab in opidrv ()
#28 0x000000000269e4d1 in sou2o ()
#29 0x0000000000ba8d9a in opimai_real ()
#30 0x00000000026a8c1c in ssthrdmain ()
#31 0x0000000000ba8cdc in main ()
$15 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

This shows that inside kcrf_commit_force_int, these functions are called which lead up to semctl:

#0  0x00007fc43b2189d0 in semctl () from /lib64/libc.so.6
#1  0x000000000cdf00a5 in sskgpwpost ()
#2  0x000000000cdedcff in skgpwpost ()
#3  0x000000000c94466f in kslpsprns ()
#4  0x000000000c936553 in kskpthr ()
#5  0x000000000c9442af in ksl_post_proc ()
#6  0x000000000ccc0b28 in ksasnd ()
#7  0x000000000ccc17ff in ksbasend ()

This is starting to make sense, ksb is kernel service background processes, ksa is kernel service asynchronous messages. The next thing to do is execute the simultaneous commit again, break on these functions, and see how many of these are called:

(gdb) break sskgpwpost
Breakpoint 4 at 0xcdf0050
(gdb) break skgpwpost
Breakpoint 5 at 0xcdedce0
(gdb) break kslpsprns
Breakpoint 6 at 0xc944310
(gdb) break kskpthr
Breakpoint 7 at 0xc9364d0
(gdb) break ksl_post_proc
Breakpoint 8 at 0xc944270
(gdb) break ksasnd
Breakpoint 9 at 0xccc0a40
(gdb) break ksbasend
Breakpoint 10 at 0xccc1760
(gdb) commands 4-10
Type commands for breakpoint(s) 4-10, one per line.
End with a line saying just "end".
>silent
>print $rip
>c
>end
(gdb) c
Continuing.

This is how the output looks like:

(first session)
$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0xccc1760 <ksbasend>
$5 = (void (*)()) 0xccc0a40 <ksasnd>
$6 = (void (*)()) 0xc944270 <ksl_post_proc>
$7 = (void (*)()) 0xc9364d0 <kskpthr>
$8 = (void (*)()) 0xc944310 <kslpsprns>
$9 = (void (*)()) 0xcdedce0 <skgpwpost>
$10 = (void (*)()) 0xcdf0050 <sskgpwpost>
$11 = (void (*)()) 0x7f06f6f4a9d0 <semctl>
$12 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
(second session)
$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0xccc1760 <ksbasend>
$5 = (void (*)()) 0xccc0a40 <ksasnd>
$6 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$7 = (void (*)()) 0xccc1760 <ksbasend>
$8 = (void (*)()) 0xccc0a40 <ksasnd>

What this shows, is that the first session starts the ‘dive’ from the kcrf_commit_force_int function towards the semctl system call, and the second session is starting the same dive, but stops in the ksasnd function. After that the second session starts the second phase of kcrf_commit_force_int, and starts the same sequence of calls for semctl’ing the logwriter, and ends in ksasnd once again. The first session did do the entire dive to the semctl call, and remembers that when kcrf_commit_force_int is called the second time.

So how does this actually work in the functions ksbasend and ksasnd? The next step is to look how the function ksasnd is called. For that, change the breakpoint for the ksasnd function to print the arguments for the functions. Please mind I do not know the number of arguments, I normally print the first four arguments (which are in the cpu registers RDI, RSI, RDX and RCX in that order) as a starting point. The reason for picking the ksasnd function is there is a high probability (although it truly is a guess) that ksasnd means kernel service asynchronous messages send, and thus gives some indication why it did continue calling the functions up to semctl. This is how that is done:

(gdb) info break 9
Num     Type           Disp Enb Address            What
9       breakpoint     keep y   0x000000000ccc0a40 <ksasnd>
	breakpoint already hit 1 time
        silent
        print $rip
        c
(gdb) commands 9
Type commands for breakpoint(s) 9, one per line.
End with a line saying just "end".
>printf "ksasnd %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
>silent
>c
>end

When another insert/commit is executed, this gives the following output:

$29 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$30 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$31 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$32 = (void (*)()) 0xccc1760 <ksbasend>
ksasnd 95dafd00, 2, 85, 18
$33 = (void (*)()) 0xc944270 <ksl_post_proc>
$34 = (void (*)()) 0xc9364d0 <kskpthr>
$35 = (void (*)()) 0xc944310 <kslpsprns>
$36 = (void (*)()) 0xcdedce0 <skgpwpost>
$37 = (void (*)()) 0xcdf0050 <sskgpwpost>
$38 = (void (*)()) 0x7f06f6f4a9d0 <semctl>
$39 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

Since KSA probably is kernel service asynchronous messages, one of the numbers could be related to the messages listed in x$messages. Upon some investigation it turns out 0x85 (decimal 133) is quite likely a x$messages number:

SYS@testdb AS SYSDBA> select indx, description, dest from x$messages where indx = 133;

      INDX DESCRIPTION               DEST
---------- ------------------------- ----------
       133 Redo writer post action   LGWR

That is what the session is doing, posting the logwriter for action! Just to settle what this means: activating the logwriter using the semctl systemcall is not the only thing that happens. Actually, the main thing happening is that the foreground process sends a specific request (message) to the logwriter, which is asking it to take action. The semctl call is just the way to make the process active.

At this point I got stuck on the function calls. However, there are more ways to see what is going on. I started to do some sampling using ‘perf’. Investigating the perf sampling resulted in finding an additional function that is called during insert and commit: ksaamb (which probably means kernel service asynchronous (messaging) allocate message buffer). Also memory profiling resulted in a memory location that is read during dive towards semctl: 0x60027e20.

The address 0x60027e20 lies within the fixed SGA variable kcrfsg_:

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
  2 where to_number('60027e20','XXXXXXXX')
  3 between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')
  4 and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM	     KSMFSADR	      KSMFSSIZ
-------------------- ---------------- --------
kcrfsg_ 	     0000000060027C30	  1608

However, it is not externalised in x$kcrfws; running an access watch in gdb and selecting the entire x$kcrfws table does not trigger the watchpoint.

With these two new found items, let’s continue the investigation:
– revert the breakpoint at ksasnd to just print it was triggered (print $rip).
– add a breakpoint for ksaamb in the same fashion as the other breakpoints.
– add an access watchpoint on address 0x60027e20 that continues when triggered:

(gdb) awatch *0x60027e20
Hardware access (read/write) watchpoint 1: *0x60027e20
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end

When this is run with two sessions at the same time (doing insert into and commit), the pieces fall into place:

(session 1)
$33 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$34 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$35 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$36 = (void (*)()) 0xccc1760 <ksbasend>
$37 = (void (*)()) 0xccc0a40 <ksasnd>
$38 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()
$39 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$40 = (void (*)()) 0xccc1760 <ksbasend>
$41 = (void (*)()) 0xccc0a40 <ksasnd>
$42 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()

(session 2)
$9 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$10 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$11 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$12 = (void (*)()) 0xccc1760 <ksbasend>
$13 = (void (*)()) 0xccc0a40 <ksasnd>
$14 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 1
New value = 0
0x000000000ccc0c50 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0ca1 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0d68 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 0
New value = 1
0x000000000ccc0d7c in ksaamb ()
$15 = (void (*)()) 0xc944270 <ksl_post_proc>
$16 = (void (*)()) 0xc9364d0 <kskpthr>
$17 = (void (*)()) 0xc944310 <kslpsprns>
$18 = (void (*)()) 0xcdedce0 <skgpwpost>
$19 = (void (*)()) 0xcdf0050 <sskgpwpost>
$20 = (void (*)()) 0x7f7320c8c9d0 <semctl>
$21 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

In this case, both session act completely the same, until the ksaamb function. In the ksaamb function, both sessions read memory address 0x60027e20. The first session reads the value 1, the second session reads 0 (which has been a 1 previously, which is the reason for printing old/new value).

It seems that 0 indicates there are no messages pending for the logwriter, which is why the second session continues to run in the ksaamb function, and the first session actually cancelling the action to execute semctl. In the second session the address 0x60027e20 is read a few more times, until it actually writes the value 1 into it.

To expand a bit on how this works, you can add latch gets and frees:

(gdb) break kslgetl
Breakpoint 13 at 0xc938af0
(gdb) commands
Type commands for breakpoint(s) 13, one per line.
End with a line saying just "end".
>silent
>printf "kslgetl %x\n", $rdi
>c
>end
(gdb) break kslfre
Breakpoint 14 at 0xc9397a0
(gdb) commands
Type commands for breakpoint(s) 14, one per line.
End with a line saying just "end".
>silent
>printf "kslfre %x\n", $rdi
>c
>end

This will show all the latch gets during both insert into and commit. However the interesting bit is:

(session 1)
$61 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()

(session 2)
$6 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0c50 in ksaamb ()
kslgetl 6000e660
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0ca1 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0d68 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 0
New value = 1
0x000000000ccc0d7c in ksaamb ()
kslfre 6000e660

The first session reads the value 1 and bails out. The second session reads the value 0 and continues. It then takes a latch at address 0x6000e660, and eventually sets the value to 1, and frees the latch at address 0x6000e660.

Probably most readers of this blog can already guess the latch name, but for completeness sake:

SYS@testdb AS SYSDBA> select addr, name from v$latch where to_number(addr,'XXXXXXXXXXXXXXXX') = to_number('6000e660','XXXXXXXX');

ADDR		 NAME
---------------- ----------------------------------------------------------------
000000006000E660 messages

The messages latch!

Conclusion
Whenever a session needs its redo to be written by the log writer, it makes sure the redo is available for the logwriter using the function kcrfw_copy_cv, and then triggers the logwriter. When profiling system calls, it looks like the trigger simply is waking up the logwriter using the system call semctl. Upon deeper investigation, the way it works is that actually a message is send to the logwriter explicitly asking the logwriter to take action on the Oracle level. The construction of the logwriter sleeping in semtimedop and a process triggering it using semctl is an implementation detail.

A session that needs to trigger the logwriter uses the Oracle ksa (kernel service asynchronous messages) layer of functions. The function ksaamb (kernel service asynchronous messages allocate message buffer) reads a variable (0x60027e20) in the fixed SGA to make sure this work is not already currently being done. 0 means no other process is doing that, a value of 1 means another session is already working on signalling the logwriter. If 1 is encountered the session stops trying to signal the logwriter, and “hopes” the other session triggers the logwriter and that the logwriter will write it’s redo. However, to be absolutely sure, it will try to signal the logwriter once again when the second “phase” of the kcrf_commit_force_int execution (which is waiting for the logwriter to finish writing the session’s redo).

Setting the fixed SGA variable is protected by a messages latch.

The essence of the article is you might not always see a semctl call where you expect it. This is actually an optimisation which makes sense once you understand the implementation details.

%d bloggers like this: