Archive

Tag Archives: linux

This post shows you how to install Rundeck with the Ansible plugin on Centos 7. The installation is done with nginx as the web server and using SSL with a self signed certificate. Please read the Ansible installation script, and modify anything that should be different for your situation. You will be amazed how well readable the installation script is!

Rundeck is a web based user interface that allows you to run commands against a group of hosts. Rundeck has an ansible plugin. Using that plugin, it could perform the similar functionality as Ansible Tower (commercial product) or Semaphore (open source).

After a fresh installation of Centos 7, do the following as root:

yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible git
git clone https://gitlab.com/FritsHoogland/install_rundeck.git

This adds the EPEL (extra packages for Enterprise Linux) repository to Centos, which contains ansible and git. The next yum command installs ansible (the orchestration software) and git (distributed version control). The git clone command pulls the ansible orchestration script from my install_rundeck git repository at gitlab.

It is important to open the install_rundeck/install_rundeck.yml script and modify anything that is different in your situation. The public hostname will most likely be different than the 192.168.66.16 in the script. You might want to change the locality information with the certificate generation (unless you live in Lutjebroek like me :-). If you do a corporate installation, you might want to skip this step altogether and get a certificate pair signed by your company’s CA.

Please mind I ran into issues with ansible inventories:
– The hosts in the inventory need to have ansible run against them to pick up their properties and become visible in rundeck in the nodes tab. For being able to have ansible run against the hosts in the inventory, the host need to exist in the ssh known hosts file of the rundeck user, otherwise ansible can’t run and the host or hosts will not be visible in rundeck. The solution is to log in as the rundeck user and logon to the hosts in the inventory once manually and accept the host or hosts. From a security perspective it’s understandable that you careful need to assess the hosts to trust, but from an automation standpoint this is annoying. Outside of essentially filling out the known hosts file as I described, there are several other workarounds.
– I created an ansible inventory file in the rundeck project directory and entered the hosts in it. Rundeck picked up the hosts (after solving the above point they became visible in the nodes tab), however when executing something using ansible via rundeck it would say ‘[WARNING]: provided hosts list is empty, only localhost is available’. This means ansible was not pointed explicitly to an inventory, so it used the default one. In order to solve this, I symlinked my (rundeck) project inventory to the /etc/ansible/hosts to make it centrally available. Apparently, using a central inventory for ansible using the plugin is by design. I would rather have rundeck generate an inventory per execution, and pointing to it when the plugin executes ansible.

Now install rundeck:

ansible-playbook install_rundeck/install_rundeck.yml

Done!

Prometheus is an open source systems monitoring and alerting toolkit originally build at Soundcloud. This blogpost shows how to install the needed components to do visualisation of linux system statistics via Grafana.

The setup consists of 3 components:
node_exporter, an exporter of system and hardware metrics.
prometheus, a metric collection and persistence layer.
grafana, the visualisation layer.

1. Preparation
The needed components are installed in the home directory of the user ‘prometheus’. In order for that user exist, it must obviously first be created:

# useradd prometheus
# su - prometheus
$

This installation guide uses Oracle Linux 7.3, but should work for RHEL or Centos too.

2. Node exporter
The next thing to do is install the node exporter. Please mind new version do come out, so you might want to verify the latest release on

$ curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz"
$ mkdir -p Prometheus/node_exporter
$ cd $_
$ tar xzf ../../node_exporter-0.14.0.linux-amd64.tar.gz

Now become root and create a unit file to automatically startup the node exporter using systemd:

# echo "[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

[Install]
WantedBy=default.target" > /etc/systemd/system/node_exporter.service

And make systemd start the node exporter:

# systemctl daemon-reload
# systemctl enable node_exporter.service
# systemctl start node_exporter.service

Next you can verify if the node exporter is running by using ‘systemctl status node_exporter.service:

# systemctl status node_exporter.service
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:20:54 UTC; 7s ago
 Main PID: 3017 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─3017 /home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - hwmon" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - infiniband" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - textfile" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - conntrack" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - diskstats" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - entropy" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - loadavg" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - sockstat" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - wifi" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg="Listening on :9100" source="node_exporter.go:186"

Additionally, you can go to hostname:9100, and look if that page says ‘node exporter’, and has a link called ‘metric’, which has all the metrics.

3. Prometheus
After we installed node_exporter to provide measurements, we must install the software that can fetch that information and store it. That is what prometheus does. First, become the prometheus user again, and install prometheus. Here too is important to realise that newer versions will come out after this article has been written:

# su - prometheus
$ curl -LO "https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz"
$ cd Prometheus
$ tar xzf ../prometheus-1.7.1.linux-amd64.tar.gz
$ cd prometheus-1.7.1.linux-amd64
$ echo "scrape_configs:

  - job_name: 'prometheus'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9100']"> prometheus.yml

This downloaded and unzipped prometheus, and created prometheus scrape config to fetch data from prometheus itself and the node exporter. Now become root, and install the systemd unit file for prometheus:

# echo "[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus.yml -storage.local.path=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/data

[Install]
WantedBy=multi-user.target" > /etc/systemd/system/prometheus.service

And make systemd start prometheus:

# systemctl daemon-reload
# systemctl enable prometheus.service
# systemctl start prometheus.service

And verify prometheus is running:

# systemctl status prometheus.service
● prometheus.service - Prometheus Server
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:36:55 UTC; 9s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 22656 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─22656 /home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1....

Jul 31 15:36:55 test.local systemd[1]: Started Prometheus Server.
Jul 31 15:36:55 test.local systemd[1]: Starting Prometheus Server...
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting prometheus (version=1.7.1, branch=mast...n.go:88"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Build context (go=go1.8.3, user=root@0aa1b7fc43...n.go:89"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Host details (Linux 3.10.0-514.26.2.el7.x86_64 ...n.go:90"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading configuration file /home/prometheus/Pro....go:252"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading series map and head chunks..." source="....go:428"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="0 series loaded." source="storage.go:439"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Listening on :9090" source="web.go:259"
Hint: Some lines were ellipsized, use -l to show in full.

Additionally you can go to hostname:9090/targets and verify both node_exporter and prometheus report state=UP.

At this point, system metrics are fetched and stored. All we need to do, is visualise it. An excellent tool for doing so is grafana. This is how grafana is installed:

4. Grafana
This webpage shows installation instructions and a link to the latest version. During the time of writing of this blogpost, the latest version was 4.1.1. This is how grafana is installed: (please mind installation and systemd require root privileges)

# yum install https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.4.1-1.x86_64.rpm

Next up make systemd handle grafana and start it:

# systemctl daemon-reload
# systemctl enable grafana-server.service
# systemctl start grafana-server.service

And check if grafana is running:

# systemctl status grafana-server.service
● grafana-server.service - Grafana instance
   Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:43:11 UTC; 1min 58s ago
     Docs: http://docs.grafana.org
 Main PID: 22788 (grafana-server)
   CGroup: /system.slice/grafana-server.service
           └─22788 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile= cfg:default.paths.logs=/var/log/grafana cfg:default.path...

Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Starting plugin search" logger=plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=warn msg="Plugin dir does not exist" logger=plugins dir=/...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Plugin dir created" logger=plugins dir=/var/lib...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Alerting" logger=alerting.engine
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing CleanUpService" logger=cleanup
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Stream Manager"
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing HTTP Server" logger=http.server ad...socket=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Hint: Some lines were ellipsized, use -l to show in full.

5. Grafana configuration
Next, we need to hook up grafana with prometheus. First, go to hostname:3000.
– Login with admin/admin
– Click ‘add datasource’
– Name: prometheus, Type: Prometheus
– Http settings: http://localhost:9090, select Access: ‘proxy’.
– Click ‘save and test’. This should result in ‘success’ and ‘datasource updated.’

Now click on the grafana symbol in the left upper corner, dashboards, import. Enter ‘2747’ at ‘grafana.com dashboard’. This will say ‘Linux memory’, select the prometheus datasource which you just defined, and click import.

This should result in a dashboard the shows you the linux memory area’s (click on the picture to get a better view!):

When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:

WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957

The current documentation has no description for it. Let’s see what V$EVENT_NAME says:

SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 
  2  from v$event_name where name = 'PGA memory operation';

EVENT# NAME                                  PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
------ ------------------------------------- ---------- ---------- ---------- ---------------
   524 PGA memory operation                                                   Other

Well, that doesn’t help…

Let’s look a bit deeper then, if Oracle provides no clue. Let’s start with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffercache, table contents are scanned to the PGA and processed from there).

TS@fv122b2 > alter session set events 'sql_trace level 8';

Session altered.

Now use strace to look at the system calls in another session:

# strace -e write=all -e all -p 9426
Process 9426 attached
read(9,

Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:

io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
lseek(7, 0, SEEK_CUR)                   = 164639
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 37 38  30 20 70 31 3d 32 30 39   ela= 780 p1=209 |
 | 00040  37 31 35 32 20 70 32 3d  31 31 31 34 31 31 32 20  7152 p2=1114112  |
 | 00050  70 33 3d 30 20 6f 62 6a  23 3d 32 32 38 33 33 20  p3=0 obj#=22833  |
 | 00060  74 69 6d 3d 31 39 35 31  37 30 32 30 35 36 36 0a  tim=19517020566. |
...
munmap(0x7f5ba8fbd000, 2097152)         = 0
munmap(0x7f5ba91bd000, 2097152)         = 0
mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000
lseek(7, 0, SEEK_CUR)                   = 183409
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 35 39  32 20 70 31 3d 30 20 70   ela= 592 p1=0 p |
 | 00040  32 3d 30 20 70 33 3d 30  20 6f 62 6a 23 3d 32 32  2=0 p3=0 obj#=22 |
 | 00050  38 33 33 20 74 69 6d 3d  31 39 35 32 30 36 33 33  833 tim=19520633 |
 | 00060  36 37 34 0a                                       674.             |

Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look a the name of the wait event. Let’s look a bit more specific using a systemtap script:

global wait_event_nr=524
probe begin {
	printf("begin.\n")
}

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthbwt") {
	if ( pid() == target() && register("rdx") == wait_event_nr )
		printf("kskthbwt - %d\n", register("rdx"))
}
probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr )
		printf("kskthewt - %d\n", register("rsi"))
}
probe syscall.mmap2 {
	if ( pid() == target() )
		printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6))
}
probe syscall.mmap2.return {
	if ( pid() == target() )
		printf("return value: %x\n", $return)
}
probe syscall.munmap {
	if ( pid() == target() )
		printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2))
}

Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if function on line 7 checks if the process specified with -x with the systemtap executable is the process calling this function, and if the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to the wait event number 524, which is ‘PGA memory operation’, the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when the ending of a wait event is triggered.
Line 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call. Any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Line 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Line 22-25: This is a probe on munmap() system call, which frees mmap’ed memory to the operating system.
Please mind there are no accolades following the if statements, which means the code executed when the if is true is one line following the if. Systemtap and C are not indention sensitive (like python), I indented for the sake of clarity.

I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:

# stap -x 9426 mmap.stp
begin.
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000
 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000
kskthewt - 524
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000
 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000
kskthewt - 524
kskthbwt - 524
 munmap, addr 7f5ba8fbd000, size 2097152
 munmap, addr 7f5ba91bd000, size 2097152
kskthewt - 524

This makes it quite clear! The event ‘PGA memory operation’ is called when mmap() and munmap() are called. Which are calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped, but anonymous memory.

Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?

To answer that, we need to look at the flags of the two calls. Here is an example:

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000

The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.

The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy to make it easier for the Oracle database to allocate the memory allocations inside a contiguous chunk of (virtual) memory.

The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the allocated range. There is no MAP_NORESERVE which means this allocation is catered for for swapping in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any memory mapping done to the address range is silently unmapped from the first (“throw away”) mapping.

Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr ) {
		ksuse = register("r13")-4672
		ksuseopc = user_uint16(ksuse + 2098)
		ksusep1 = user_uint64(ksuse + 2104)
		ksusep2 = user_uint64(ksuse + 2112)
		ksusep3 = user_uint64(ksuse + 2120)
		ksusetim = user_uint32(ksuse + 2128)
		printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3)
	}
}

When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is how the output looks like:

kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000
 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0

This looks like the size of memory allocated with the first mmap call for the PGA memory reservation is put in p1, and the size of the allocation of the second “real” memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird, is the memory is not unmapped/deallocated (this is a full execution of a SQL, allocated buffers must be deallocated?

Let’s look what happens when I execute the same SQL again:

kskthbwt - 524
 munmap, addr 7f1562130000, size 2097152
 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0
kskthbwt - 524
 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0

Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution, the execution starts with munmap(), followed by a mmap() call. The first munmap() call deallocates 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far; we have seen a “throw away”/reservation mmap() call with the memory address set to 0 to let the operating system pick an address for the requested memory chunk, and a mmap() call to truly allocate the reserved memory for usage, which had a memory address set. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory can not be read and written. Also the flags number is different, flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. As part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called. As we just have seen, memory could be reserved. Also, when p1/2/3 are all zero there is no way to tell how much memory is freed, nor which memory allocation.

The next wait is the timing of a single mmap() call. Actually, the mmap() call allocates the previous mmaped memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmaped.

The last wait is a familiar one, it is the mmap() call with memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previous “reserved” memory. However, the p1/2/4 values are now NOT set in the same way as we saw earlier: only p1 is non zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.

Conclusion:
With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. Until now I only saw the system calls mmap() and munmap() inside the ‘PGA memory operation’.

To me, ‘cloud computing’ is renting a compute resource to perform a task. In order to use that compute resource, you need to instruct it to do something, which is typically done via the network. If the task the compute resource needs to fulfil is being an application server or being a client or both in the case of an application server that uses an Oracle database, the network latency between the client of the database and the database server is a critical property.

I think so far everybody is with me. If we zoom in to the network, it becomes more difficult, and *very* easy to make wrong assumptions. Let me explain. A network, but really any connection between processing and a resource, has two DIFFERENT properties that I see getting mixed up consistently. These are:
* Latency: the time it takes for a signal or (network) packet to travel from the client to the server, or the time it takes to travel from the client to the server and back.
* Bandwidth: the amount of data that can be transported from the client to the server in a certain time.

How do you determine the latency of a network? Probably the most people respond with ‘use ping’. This is how that looks like:

[user@oid1 ~]$ ping -c 3 lsh1
PING lsh1 (x.x.x.x) 56(84) bytes of data.
64 bytes from lsh1: icmp_seq=1 ttl=62 time=680 ms
64 bytes from lsh1: icmp_seq=2 ttl=62 time=0.304 ms
64 bytes from lsh1: icmp_seq=3 ttl=62 time=0.286 ms

The question I often ask myself is: what is that we see actually? How does this work?
In order to answer that question, the tcpdump tool can answer that question. Using tcpdump, you can capture the network packets on which the ping utility based the above outcome. The ‘-ttt’ option calculates the time between each arrived packet:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host lsh1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1 > lsh1: ICMP echo request, id 35879, seq 1, length 64
00:00:00.680289 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 1, length 64
00:00:00.319614 IP oid1 > lsh1: ICMP echo request, id 35879, seq 2, length 64
00:00:00.000287 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 2, length 64
00:00:01.000180 IP oid1 > lsh1: ICMP echo request, id 35879, seq 3, length 64
00:00:00.000269 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 3, length 64

So, ping works by sending a packet (ICMP echo request) requesting a reply (ICMP echo reply) from the remote server, and measure the time it takes to get that reply. Great, quite simple, isn’t it? However, the biggest issue I see this is using a protocol that is not used for sending regular data (!). Most application servers I encounter send data using TCP (transmission control protocol), the traffic ping sends are sent using a protocol called ICMP (internet control message protocol). Especially in the cloud, which means (probably) a lot of the infrastructure is shared, ICMP might be given different priority than TCP traffic, which you quite probably are using when the application on your cloud virtual machine is running. For those of you who haven’t looked into the network side of the IT landscape, you can priorise protocols and even specific ports, throttle traffic and you can even terminate it. In fact, a sensible protected (virtual) machine in the cloud will not respond to ICMP echo requests in order to protected it from attacks.

So, what would be a more sensible approach then? A better way would be to use the same protocol and port number that your application is going to use. This can be done using a tool called hping. Using that tool, you can craft your own packet with the protocol and flags you want. In the case of Oracle database traffic that would be the TCP protocol, port 1521 (it can be any port number, 1521 is the default port). This is how you can do that. In order to mimic starting a connection, the S (SYN) flag is set (-S), one packet is send (-c 1) to port 1521 (-p 1521).

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip

What this does is best investigated with tcpdump once again. The server this is executed against can respond in two ways (three actually). When you send this to TCP port 1521 where a listener (or any other daemon that listens on that port) is listening, this is the response:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [S], seq 1436552830, win 512, length 0
00:00:00.001229 IP db01-vip.ncube-lm > oid1.kjtsiteserver: Flags [S.], seq 2397022511, ack 1436552831, win 14600, options [mss 1460], length 0
00:00:00.000023 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [R], seq 1436552831, win 0, length 0

This is a variation of the classic TCP three way handshake:
1. A TCP packet is sent with the SYN flag set to indicate starting a (client to server) connection.
2. A TCP packet is sent back with SYN flag set to indicate starting a (server to client) connection, and the first packet is acknowledged.
3. This is where the variation is, normally an acknowledgement would be sent of the second packet to establish a two way connection, but in order to stop the communication a packet is sent with the RST (reset) flag set.

However, this is if a process is listening on the port. This is how that looks like when there is no process listening on port 1521:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.vsamredirector > db01.ncube-lm: Flags [S], seq 1975471906, win 512, length 0
00:00:00.001118 IP db01.ncube-lm > oid1.vsamredirector: Flags [R.], seq 0, ack 1975471907, win 0, length 0

This means that if a connection is initiated to a port on which no process is listening (port status ‘closed’), there is communication between the client and the server. This is why firewalls were invented!
1. A TCP packet is sent with the SYN flag set to indicate starting a connection.
2. A TCP packet is sent back to with the RST (reset) flag set to indicate no connection is possible.

The third option, when port 1521 is firewalled on the server, simply means only the first packet (from client to server with the SYN flag set) is sent and no response is coming back.

Okay, let’s pick up the performance aspect again. This hping command:

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip
HPING db01-vip (eth0 x.x.x.x): S set, 40 headers + 0 data bytes
len=44 ip=db01-vip ttl=57 DF id=0 sport=1521 flags=SA seq=0 win=14600 rtt=1.2 ms

Says the roundtrip time is 1.2ms. If we look at the network packets and timing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.mmcal > db01-vip.ncube-lm: Flags [S], seq 1289836562, win 512, length 0
00:00:00.001113 IP db01-vip.ncube-lm > oid1.mmcal: Flags [S.], seq 2504750542, ack 1289836563, win 14600, options [mss 1460], length 0
00:00:00.000016 IP oid1.mmcal > db01-vip.ncube-lm: Flags [R], seq 1289836563, win 0, length 0

It becomes apparent that the 1.2ms time hping reports is the time it takes for the remote server to send back the SYN+ACK package in the TCP three way handshake.

So does that mean that if we take a number of measurements (let’s say 100, or 1000) to have a statistically significant number of measurements we can establish my TCP roundtrip time and then know how fast my connection will be (outside of all the other variables inherent to the internet and potential noisy neighbours to name a few)?

Oracle provides a way to generate and measure SQL-Net traffic in My Oracle Support note: Measuring Network Capacity using oratcptest (Doc ID 2064368.1). This note provides a jar file which contains server and client software, and is aimed at dataguard, but is useful to measure SQL-Net network latency. I have looked at the packets oratcptest generates, and they mimic SQL-Net quite well.

Let’s see if we can redo the test above to measure pure network latency. First on the database server side, setup the server:

[user@db01m ~]$ java -jar oratcptest.jar -server db01 -port=1521

And then on the client side run the client using the same oratcptest jar file:

java -jar oratcptest.jar db01 -mode=sync -length=0 -duration=1s -interval=1s -port=1521

The important bits are -mode=sync (client packet must be acknowledged before sending another packet) and -length=0 (network traffic contains no payload). This is the result:

[Requesting a test]
	Message payload        = 0 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(07:34:42) The server is ready.
                        Throughput                 Latency
(07:34:43)          0.017 Mbytes/s                0.670 ms
(07:34:43) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.017 Mbytes/s
	             Avg. latency = 0.670 ms

If you look at the hping roundtrip time (1.2ms) and the oratcptest roundtrip time (0.7ms) clearly this is different! If you just look at the numbers (1.2 versus 0.7) it might seem like the oratcptest time is only measuring client to server traffic instead of the whole roundtrip? For this too it’s good to use tcpdump once again and look what oratcptest actually is doing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.63602 > db01.ncube-lm: Flags [S], seq 2408800085, win 17920, options [mss 8960,sackOK,TS val 3861246405 ecr 0,nop,wscale 7], length 0
00:00:00.001160 IP db01.ncube-lm > oid1.63602: Flags [S.], seq 2178995555, ack 2408800086, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:00:00.000015 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 1, win 140, length 0
00:00:00.023175 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 1:145, ack 1, win 140, length 144
00:00:00.000520 IP db01.ncube-lm > oid1.63602: Flags [.], ack 145, win 123, length 0
00:00:00.000951 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 1:145, ack 145, win 123, length 144
00:00:00.000008 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 145, win 149, length 0
00:00:00.018839 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 145:157, ack 145, win 149, length 12
00:00:00.000563 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 145:149, ack 157, win 123, length 4
00:00:00.000358 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 157:169, ack 149, win 149, length 12
00:00:00.000486 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 149:153, ack 169, win 123, length 4
00:00:00.000100 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 169:181, ack 153, win 149, length 12
00:00:00.000494 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 153:157, ack 181, win 123, length 4
...
00:00:00.000192 IP oid1.63586 > db01.ncube-lm: Flags [P.], seq 18181:18193, ack 6157, win 149, length 12
00:00:00.000447 IP db01.ncube-lm > oid1.63586: Flags [P.], seq 6157:6161, ack 18193, win 123, length 4
00:00:00.006696 IP oid1.63586 > db01.ncube-lm: Flags [F.], seq 18193, ack 6161, win 149, length 0
00:00:00.000995 IP db01.ncube-lm > oid1.63586: Flags [F.], seq 6161, ack 18194, win 123, length 0
00:00:00.000012 IP oid1.63586 > db01.ncube-lm: Flags [.], ack 6162, win 149, length 0

If you look at rows 4, 5 and 6 you see the typical TCP three-way handshake. What is nice to see, is that the actual response or roundtrip time for the packet from the server on line 5 actually took 1.1ms, which is what we have measured with hping! At lines 7-10 we see there is a packet send from the client to the server which is ACK’ed and a packet send from the server to the client which is ACK’ed. If you add the ‘-A’ flag to tcpdump you can get the values in the packet printed as characters, which shows the client telling the server how it wants to perform the test and the server responding with the requested settings. This is all a preparation for the test.

Starting from line 11, there is a strict repeating sequence of the client sending a packet of length 12, ACK’ing the previous received packet, and then the server responding with a packet of length 4 ACK’ing its previous received packet. This is the actual performance test! This means that the setting ‘-duration=1s -interval=1s’ does not mean it sends one packet, it actually means it’s continuously sending packets for the duration of 1 second. Also another flag is showing: the P or PSH (push) flag. This flag means the kernel/tcpip-stack understands all data to transmit is provided from ‘userland’, and now must be sent immediately, and instructs the receiving side to process it immediately in order to bring it to the receiving userland application as soon as possible too.

Lines 20-22 show how the connection is closed by sending a packet with a FIN flag, which is done for both the client to the server and the server to the client, and because it’s TCP, these need to be ACK’ed, which is why you see a trailing packet without a flag set, only ACK’ing the FIN packet.

The conclusion so far is that for real usable latency calculations you should not use a different protocol (so whilst ICMP (ping) does give an latency indication it should really only be used as an indicator), and that you should measure doing the actual work, not meta-transactions like the TCP three way handshake. Probably because of the PSH flag, the actual minimal latency for SQL-Net traffic is lower than ping and hping showed.

Wait a minute…did you notice the ‘actual minimal latency’? So far we only have been sending empty packets, which means we measured how fast a packet can travel from client to server and back. In reality, you probably want to send actual data back and forth, don’t you? That is something that we actually have not measured yet!

Let’s do actual Oracle transactions. For the sake of testing network latency, we can use Swingbench to execute SQL. This is how that is done:

[user@oid1 bin]$ cd ~/sw/swingbench/bin
[user@oid1 bin]$ ./charbench -c ../configs/stresstest.xml -u soe -p soe -uc 1 -rt 00:01
Author  :	 Dominic Giles
Version :	 2.5.0.971

Results will be written to results.xml.
Hit Return to Terminate Run...

Time		Users	TPM	TPS

8:22:56 AM      1       14450   775

Please mind I am using 1 user (-uc 1) and a testing time of 1 minute (-rt 00:01), which should be longer when you are doing real testing. As a reminder, I am using 1 session because I want to understand the latency, not the bandwidth! In order to understand if the network traffic looks the same as oratcptest.jar, I can use tcpdump once again. Here is a snippet of the traffic:

...
00:00:00.000106 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5839:5852, ack 5986, win 272, length 13
00:00:00.000491 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 5986:6001, ack 5852, win 330, length 15
00:00:00.000234 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5852:6003, ack 6001, win 272, length 151
00:00:00.000562 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6001:6077, ack 6003, win 330, length 76
00:00:00.000098 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6003:6016, ack 6077, win 272, length 13
00:00:00.000484 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6077:6092, ack 6016, win 330, length 15
00:00:00.000238 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6016:6159, ack 6092, win 272, length 143
00:00:00.000591 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6092:6425, ack 6159, win 330, length 333
...

The important bit is this shows the same single packet traffic client to server and back as we saw oratcptest generated, however now with varying packet size (which is logical, different SQL statements are sent to the database), the PSH bit is set, which also is the same as oratcptest generated.

Let’s assume this is a real-life workload. In order to measure and calculate differences in performance between different networks, we need the average packet length. This can be done with a tool called tcpstat (this link provides the EL6 version). In my case I have only one application using a database on this server, so I can just filter on port 1521 to measure my SQL-Net traffic:

[user@oid1 ~]$ sudo tcpstat -i eth0 -o "Packet/s=%p\tmin size: %m\tavg size: %a\tmax size: %M\tstddev: %d\n" -f 'port 1521'
Packet/s=2526.40	min size: 53	avg size: 227.76	max size: 1436	stddev: 289.21
Packet/s=2531.40	min size: 53	avg size: 229.79	max size: 1432	stddev: 291.22
Packet/s=2634.20	min size: 53	avg size: 229.59	max size: 1432	stddev: 293.38
Packet/s=2550.00	min size: 53	avg size: 234.11	max size: 1435	stddev: 296.77
Packet/s=2486.80	min size: 53	avg size: 232.24	max size: 1436	stddev: 293.16

In case you wondered why tcpstat reports a minimum length of 53 and tcpdump (a little up in the article) of 13; tcpstat reports full packet length including packet, protocol and frame headers, tcpdump in this case reports the payload length.

Now we can execute oratcptest.jar again, but with a payload size set that matches the average size that we measured, I have taken 250 as payload size:

[user@oid1 ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(09:39:47) The server is ready.
                        Throughput                 Latency
(09:39:48)          0.365 Mbytes/s                0.685 ms
(09:39:48) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.365 Mbytes/s
	             Avg. latency = 0.685 ms

As you can see, there is a real modest increase in average latency going from 0.670ms to 0.685ms.

In order to test the impact of network latency let’s move the oratcptest client to the server, to get the lowest possible latency. Actually, this is very easy, because the oratcptest.jar file contains both the client and the server, so all I need to do is logon to the server where I started the oratcptest.jar file in server mode, and run it in client mode:

[user@db01m ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(14:49:29) The server is ready.
                        Throughput                 Latency
(14:49:30)         12.221 Mbytes/s                0.020 ms
(14:49:30) Test finished.
	       Socket send buffer = 26010 bytes
	          Avg. throughput = 11.970 Mbytes/s
	             Avg. latency = 0.021 ms

Wow! The roundtrip latency dropped from 0.685ms to 0.021ms! Another test using oratcptest.jar using a true local network connection (with Linux being virtualised using Xen/OVM) shows a latency of 0.161ms.

These are the different network latency figures measured with oratcptest using a payload size that equals my average network payload size:
– Local only RTT: 0.021
– Local network RTT: 0.161
– Different networks RTT: 0.685

If I take swingbench and execute the ‘stresstest’ run local, on a machine directly connected via the local network and across different networks (think cloud), and now measure TPS (transactions per second), I get the following figures:
– Local only TPS: 2356
– Local network TPS: 1567
– Different networks TPS: 854

Do these figures make sense?
– Local only: Time not in network transit per second: 1000-(0.021*2356)=950.524; approximate average time spend on query: 950.523/2356=0.40ms
– Local network: 1000-(0.161*1567)=747.713/1567=0.48ms
– Different networks: 1000-(0.685*854)=415.010/854=0.49ms
It seems that this swingbench test spends roughly 0.40-0.50ms on processing, the difference in transactions per second seem to be mainly caused by the difference in network latency.

This blog post is about two things: one how you can monitor who is bringing you database up and down (there is a twist at the end!) and two how you can very conveniently do that with aggregated logs in a browser with a tool called ‘Kibana’, which is the K in ELK.

What is the ‘ELK stack’?
The ELK stack gets it’s name from Elasticsearch, Logstash and Kibana.
– Elasticsearch is an open source search engine based on Apache Lucene, which provides a distributed, multitenant-capable full-text search engine with a http web interface and schema-free JSON documents.
– Logstash is a fully configurable open source data processing pipeline that can receive data from a multiple sources simultaneously, transform it and output it based on the output plugin, which is the elastic search plugin in this blogpost but could be anything from STDOUT, an unix pipe, a file, a file in CSV, HTTP, email, IRC, Jira, graphite, kafka, mongodb, nagios, S3, SolR, … really whatever you want.
– Kibana is an open source data visualisation plugin for Elasticsearch.
When looking at Kibana, it quite much looks like the splunk interface.

Installing the ELK stack.
Installing the ELK stack in a basic way is easy. In this blogpost I will install everything on the same host, everything being the ELK stack and an Oracle database installation. In reality you should have a log gatherer on every host (called ‘filebeat’) and a dedicated host which runs the rest of the stack (logstash, elasticsearch and kibana). The below install actions were executed on a Linux 64 bit host running Oracle Linux 6.8.
In order to make the installation really easy, I use the yum repository of the elastic company, this is how to set that up (all done as root, ‘#’ indicates root):

# rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
# vi /etc/yum.repos.d/elastic.repo
[elastic-5.x]
name=Elastic repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

Install elasticsearch:

# yum install java-1.8.0-openjdk
# yum install elasticsearch
# chkconfig --add elasticsearch
# service elasticsearch start

Install logstash:

# yum install logstash

Configure logstash input and output:

# vi /etc/logstash/conf.d/input.conf
input {
  beats {
    port => 5044
  }
}
# vi /etc/logstash/conf.d/output.conf
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

Verify the logstash config files:

# sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t
Sending Logstash's logs to /var/log/logstash which is now configured via log4j2.properties
Configuration OK

If you see the ‘Configuration OK’ message, it means logstash could interprent the configuration files. It does not mean it will all work as desired, there could be runtime issues.
Now let’s start logstash. Logstash uses upstart (meaning a startup script in /etc/init) instead of the legacy startup mechanism using the chkconfig and service utilities.

# initctl start logstash

The last part of the data pipeline is ‘filebeat’. There are and could be multiple input products, in this blogpost I use ‘filebeat’, which keeps track of logfiles.

# yum install filebeat
# chkconfig --add filebeat

We are going to look into linux and oracle auditing. So we need to keep track of a couple of files:
– /var/log/secure: this is the default linux logfile which contains all kinds of authentication messages, as defined in /etc/rsyslog.conf (authpriv.* /var/log/secure).
– /u01/app/oracle/admin/*/adump/*.aud: this is the default place where the oracle database stores it’s audit files. These audit files provide what is called ‘mandatory auditing’, and includes at least connections to the instance with administrator privilege, database startup and database shutdown. The default is a normal text based logfile, it could be set to XML.
– /var/log/audit/audit.log: this is the logfile of the linux kernel based audit facility. This is actually a lesser known hidden gem in Linux, and provides audit information from the Linux kernel.

These files need to be configured in filebeat, in the file: /etc/filebeat/filebeat.yml. As the extension of the file indicates, this is a file organised in YAML syntax. The best way to configure the file is to move the file, and create your own file with your desired configuration. First of all we add the output, which is logstash in our case. Please mind the default configuration of filebeat is direct output to elasticsearch, which means we don’t have an option to enrich the data!

# mv /etc/filebeat/filebeat.yml /etc/filebeat/filebeat.yml.orig
# vi /etc/filebeat/filebeat.yml
output.logstash:
  hosts: ["localhost:5044"]

Please mind the two spaces in front of ‘hosts’, which is mandatory for a YAML document!
Next up we add the files to monitor in the configuration file. The linux based logfiles are easy:

filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/secure
  document_type: secure

- input_type: log
  paths:
    - /var/log/audit/audit.log
  document_type: audit

One thing to notice is that a type is set for each file (which is really just a name for the file filebeat monitors), which makes it able to find data from these specific files later on. Now the Oracle audit file:

- input_type: log
  paths:
    - /u01/app/oracle/admin/*/adump/*.aud
  document_type: oracle_audit
  multiline:
    pattern: '^[A-Za-z]{3} [A-Za-z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}'
    negate: true
    match: after

This looks a bit more complicated. The reason for the complication is the multiline specification. An Oracle database audit file contains a timestamp, after which the audit data is written; it looks like this:

Thu Jan 19 13:44:12 2017 +00:00
LENGTH : '198'
ACTION :[49] 'ALTER DATABASE OPEN /* db agent *//* {0:0:476} */'
DATABASE USER:[1] '/'
PRIVILEGE :[6] 'SYSDBA'
CLIENT USER:[6] 'oracle'
CLIENT TERMINAL:[0] ''
STATUS:[1] '0'
DBID:[10] '2622783786'

The important things at this time: the ‘pattern’ keyword specifies the timestamp, you can see you can match it with the timestamp, and all the following data needs to be processed together, this is a single record, written over multiple lines. ‘negate: true’ means that anything that does not fit the pattern needs to be added to this piece of data, ‘match: after’ means that this is added after the pattern is matched.

Now that filebeat is setup, we can start the filebeat daemon:

# service filebeat start

The last component is kibana:

# yum install kibana
# chkconfig --add kibana
# service kibana start

Now that we’ve set the entire pipeline up, a next thing to do is to configure logstash to enrich the data. Here’s the how it’s done for the Oracle database audit file:

# vi /etc/logstash/conf.d/oracle-audit.conf
filter {
  if [type] == "oracle_audit" {
    grok {
      match => { "message" => "^%{DAY} %{MONTH:M} %{MONTHDAY:d} %{HOUR:h}:%{MINUTE:m}:%{SECOND:s} %{YEAR:y}" }
      add_tag => [ "grok", "oracle_audit" ]
    }
    grok {
      match => { "message" => "ACTION :\[[0-9]*\] '(?<ora_audit_action>.*)'.*DATABASE USER:\[[0-9]*\] '(?<ora_audit_dbuser>.*)'.*PRIVILEGE :\[[0-9]*\] '(?<ora_audit_priv>.*)'.*CLIENT USER:\[[0-9]*\] '(?<ora_audit_osuser>.*)'.*CLIENT TERMINAL:\[[0-9]*\] '(?<ora_audit_term>.*)'.*STATUS:\[[0-9]*\] '(?<ora_audit_status>.*)'.*DBID:\[[0-9]*\] '(?<ora_audit_dbid>.*)'" }
    }
    grok {
      match => { "source" => [ ".*/[a-zA-Z0-9_#$]*_[a-z0-9]*_(?<ora_audit_derived_pid>[0-9]*)_[0-9]*\.aud" ] }
    }
    mutate {
      add_field => { "ts" => "%{y}-%{M}-%{d} %{h}:%{m}:%{s}" }
    }
    date {
      locale => "en"
      match => [ "ts", "YYYY-MMM-dd HH:mm:ss" ]
    }
    mutate {
      remove_field => [ "ts", "y", "M", "d", "h", "m", "s" ]
    }
  }
}

It’s beyond the scope of this article to go through every detail, but as you can see we apply a filter. Everything in this filter takes place for the type “oracle_audit”, which is set by filebeat. The next thing we encounter a couple of times is ‘grok’s’. The term grok comes from the Robert Heinlein science-fiction novel ‘Stranger in a Strange land’. Essentially, a grok with logstash means you specify a pattern, for which the actions are applied if the specified pattern matches. The first grok looks for the date pattern for which extra fields are created (M,d,h,m,s, after the colon) in the field ‘message’, and adds a tag (a word in the tags field for the record that is created). The second grok also looks in the ‘message’ field, and specifies text (ACTION for example), some other characters and then (?.*) is visible. This is a custom pattern, for which the field name to be created is in between < and > and is followed by a pattern. This grok line (including all the patterns) creates fields for all the Oracle audit fields in the audit file! The next grok picks up the PID from the filename of the logfile (the filename is in a field ‘source’), and the two mutates create and destroy a new field ts which is used for the date, and date specifies the date/time with the data flowing through logstash is filled with the date and time in the ts field, instead of the time filebeat picked up the data and sent it through logstash. Please mind that if you add (or change) configuration in a logstash configuration file, you need to restart logstash.

We are all set now! Last words on this configuration: kibana and elasticsearch by default do not require authentication. Do not expose the ports of these products to the internet! I am using a tunnel to the kibana website, which runs on port 5601. It’s very easy to ssh into the machine running the ELK stack using ssh user@machine -L 5601:localhost:5601, which creates a port on localhost:5601 on my machine at home (-L = local), for which communication is tunnelled to localhost:5601 on the remote machine (the localhost in the ssh line example is an address on the machine you ssh in to, this could also be another server which is only visible from the machine you ssh into.

First let’s login to the machine, and see what information is revealed with /var/log/secure:
kibana-secure-login
You get this screen when you goto kibana at port 5601, enter: ‘type: secure’ in the search bar to display data of the type secure (which is what is set with document_type: secure in filebeat.yml), and login to the machine where filebeat is watching the /var/log/secure file. As you can see, you get two lines from the ssh deamon, one indicating something’s done with pam (pam_unix), and one line which indicates it authenticated via a public key for user ops from an ip address (which is anonymised) at port 39282 via ssh2.

With a lot of cloud providers you get a user which has public key authentication setup (which you saw above), after which you need to sudo to for example the oracle user. In a lot of companies, you get a personalised account to log on to servers, after which you need to sudo to oracle. In both cases you need to use sudo to become the user that you need to administer, for example oracle. This is what sudo generates in the /var/log/secure file:
kibana-secure-sudo
The secure log displays sudo was invoked by the user opc, on TTY pts/1 and the command executed via sudo was ‘/bin/su – oracle’.

Now that I have became oracle using sudo, I set the environment of my database using oraenv and started up a database. Now go over to kibana, and issued a search for ‘type: oracle_audit’. This is how that looks like:
kibana-oracle_audit
Now if you look at what the audit record provides, the only things that provide something useful for the purpose of investigating who did stop or start a database are ACTION and CLIENT TERMINAL (I assume the database is stopped and started by the ‘oracle’ user). Now change the ‘selected fields’ in kibana and add the (dynamically created!) fields: ora_audit_action, ora_audit_term and ora_audit_derived_pid, and remove message. This is how that looks like:
kibana-oracle-audit-startup
The important thing to look for here is the ora_audit_action ‘startup’, then look at the ora_audit_derived_pid, and two rows down we see terminal ‘pts/1’ was the terminal on which this was entered.

Now that we know the terminal, we can add in searching in the message field for the secure type. Enter ‘type: oracle_audit OR (type: secure AND message: “*pts/1*”)’ in the search bar.
kibana-secure-oracle_audit
Okay, this works. But it’s far from perfect. In fact, it only works if the username of the session doing the sudo is the only session with that username, otherwise if there is more than one session it can be any of these sessions doing the sudo, since there is nothing more than the username. This also means that if there is a direct logon to the oracle user, there is no way to identify a session with a TTY, and thus database startup and shutdown are completely anonymous, there’s no way to link a specific session to that action outside of probably the oracle user and a TTY which can not be linked to anything like for example an ip address.

Is there a better way? Yes! We can also use the linux, kernel based, auditing service, which is on by default. This service keeps a log file at /var/log/secure/secure.log, and gives way more granular auditing events than the /var/log/secure log. Linux audit generates a lot of diverse types of rows, so it’s actually not easy to grok them, but in order to understand which session executed a startup or shutdown, the only audit row that is important for this specific use case is an audit type called ‘CRED_ACQ’. The grok for this type looks like this:

# vi /etc/logstash/conf.d/linux-audit.conf
filter {
  if [type] == "audit" {
    grok {
        match => { "message" => ""type=%{WORD:audit_type} msg=audit\(%{NUMBER:audit_epoch}:%{NUMBER:audit_counter}\): pid=%{NUMBER:audit_pid} uid=%{NUMBER:audit_uid} auid=%{NUMBER:audit_auid} ses=%{NUMBER:audit_ses} msg='op=%{NOTSPACE:audit_op} ((acct=\"%{GREEDYDATA:audit_acct}\")|(id=%{NUMBER:audit_id})|acct=%{BASE16NUM:audit_acct}) exe=\"%{GREEDYDATA:audit_exe}\" hostname=%{NOTSPACE:audit_hostname} addr=%{NOTSPACE:audit_addr} terminal=%{NOTSPACE:audit_terminal} res=%{NOTSPACE:audit_res}'" }
        add_tag => [ "grok", "audit" ]
    }
    date {
      locale => en
      match => [ "audit_epoch", "UNIX" ]
    }
  }
}

This grok matches the CREDIT_ACQ audit type which we will use to trace back the session via the audit log. Another nicety of this logstash configuration is the audit records time using an epoch timestamp, which logstash can translate back to a human readable timestamp. Once this is in place, log in again and use sudo to switch to oracle (or log in directly as oracle, it doesn’t really matter that much now!), and search in kibana for: ‘type: oracle_audit OR (type: audit AND audit_type: CRED_ACQ)’. Now get the relevant fields; remove ‘message’, and add: audit_hostname, audit_acct, audit_ses, audit_terminal, ora_audit_term, ora_audit_derived_pid, ora_audit_action. This probably returns a log of rows, now scroll (“back in time”) and search for the startup or shutdown command, and then follow the trail:
kibana-oracle_audit-audit-raw
Startup points to (oracle server process) PID 17748, which was instantiated by a session using by pts/1 (two rows down), one row further down we see the audit information which shows pts/1, which is connected to audit_ses 4230. The audit_ses number is a number that sticks with a session, regardless of using sudo. If you follow down number 4230, you see multiple rows of audit_ses 4230, some of them with root, which is typical for sudo switching from one user to another. The final row shows the user logging in with it’s ip address. In other words: using the linux kernel audit facility, you can get all available information!

Okay, all happy now? Are you sure? Now for the twist!

Whenever you use RAC, or use ASM, or use both, or you are using the grid infra single instance as a framework to track your your listener(s) and database(s) and start and stop them automatically, you can still stop and start an instance directly using sqlplus, but in most cases you will be using the grid infrastructure crsctl or srvctl commands. When the grid infrastructure crsctl and srvctl commands are used, this is how the Oracle database audit information looks like:
kibana-oracle_audit-crs-shutdown
As you can see, because the cluster ware brought the database down, there is no terminal associated with the shutdown. So the above mentioned way of first searching for startup and shutdown in the oracle audit information, finding the associated terminal, and then tracing it through the audit records can NOT be used whenever the Oracle cluster ware is used, because a grid infrastructure deamon is actually stopping and starting the database, and the grid infrastructure does not keep any information (that I am aware of) about which client invoked a grid infrastructure command. I guess a lot of auditors will be very unhappy about this.

Now the good news: you can solve this issue very easy. The downside is it requires additional configuration of the linux auditing. The solution is to put an ‘execution watch’ on srvctl and crsctl; this is how this is done:

# auditctl -w /u01/app/12.1.0.2/grid/bin/srvctl -p x -k oracrs
# auditctl -w /u01/app/12.1.0.2/grid/bin/crsctl -p x -k oracrs

In order to validate the working, I started a database using srvctl, and searched for: ‘(type: oracle_audit AND ora_audit_action: STARTUP) OR (type: audit AND message: key=\”oracrs\”)’. This is how that looks like:
kibana-oracle_audit-audit-watch
As you can see, there’s the Oracle database record indicating the startup of the database, and a little while back in time there’s the linux audit row indicating the execution of the srvctl executable. Once you are at that point, you can using the earlier mentioned way of using the audit_ses number to trace the session execution, including sudo and ip address at logon time.

There are many posts about the amount of memory that is taken by the Oracle database executables and the database SGA and PGA. The reason for adding yet another one on this topic is a question I recently gotten, and the complexities which surrounds memory usage on modern systems. The intention for this blogpost is to show a tiny bit about page sharing of linux for private pages, then move on to shared pages, and discuss how page allocation looks like with Oracle ASMM (sga_target or manual memory).

The version of linux in this blogpost is Oracle Linux 7.2, using kernel: 4.1.12-37.6.3.el7uek.x86_64 (UEK4)
The version of the Oracle database software is 12.1.0.2.160719 (july 2016).

Memory usage of virtual memory systems is complicated. For that reason I see a lot of people getting very confused about this topic. Let me state a very simple rule: the memory actively being used on a system should fit in physical memory. Swap (a file or partition), increases total virtual memory, but really only is a safety net for saving your system from an out of memory situation at the cost of moving pages from and to disk. Because modern linux kernels have swappiness (willingness to swap) to a non-zero value, it’s not uncommon to have some swap being used, despite physical memory not being oversubscribed. A system stops performing as soon as paging in and out starts to occur, and for that reason should not happen.

1. Private pages for linux executables
When an executable is executed on linux from the shell, the shell executes a fork() call to create a new process, which is implemented as a clone() system call on linux. Using the clone() system call, the virtual memory space of the newly created process is shared (readonly) with it’s parent. This includes the private allocations! Once the child process needs to write in it’s memory space, it will page fault and create it’s own version, abandoning the version of its parent.

Can we actually prove this is happening? Yes, the /proc/ filesystem gives an insight to a process’ virtual memory space.
Let’s start off with a very simple example: we execute ‘cat /proc/self/maps’ to see our own address space:

[oracle@oracle-linux ~]$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 fb:00 201666243                          /usr/bin/cat
00e41000-00e62000 rw-p 00000000 00:00 0                                  [heap]
7f69729be000-7f6978ee5000 r--p 00000000 fb:00 576065                     /usr/lib/locale/locale-archive
7f6978ee5000-7f6979099000 r-xp 00000000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979099000-7f6979298000 ---p 001b4000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979298000-7f697929c000 r--p 001b3000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929c000-7f697929e000 rw-p 001b7000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929e000-7f69792a3000 rw-p 00000000 00:00 0
7f69792a3000-7f69792c4000 r-xp 00000000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794b9000-7f69794bc000 rw-p 00000000 00:00 0
7f69794c3000-7f69794c4000 rw-p 00000000 00:00 0
7f69794c4000-7f69794c5000 r--p 00021000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c5000-7f69794c6000 rw-p 00022000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0
7ffdab1c7000-7ffdab1e8000 rw-p 00000000 00:00 0                          [stack]
7ffdab1ea000-7ffdab1ec000 r--p 00000000 00:00 0                          [vvar]
7ffdab1ec000-7ffdab1ee000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Here’s a lot to see, but we see the cat executable at 0x00400000. The reason for three memory allocations are (linux/ELF) executables uses different sections with specific functions. A full overview of these can be obtained using the readelf executable. A simpler overview of an executable, which matches the above three memory allocations for the cat executable can be obtained using ‘size -B’ (the size executable, -B means ‘berkeley style’, which is default):

[oracle@oracle-linux ~]$ size -B /usr/bin/cat
   text	   data	    bss	    dec	    hex	filename
  43905	   1712	   2440	  48057	   bbb9	/usr/bin/cat

This describes the three memory sections an linux executable can have: text (the machine instructions, alias ‘the program’), data (all initialised variables declared in the program) and BSS (uninitialised data).
The first section always is the text allocation (not sure if it’s impossible to have the text section not being the first allocation, I have never seen it different). If you look at the memory flags, ‘r-xp’, this totally makes sense: ‘r-‘ meaning: read(only, followed by a’-‘ instead of a ‘w’), ‘x’: executable and ‘p’: this is a private allocation. The next allocation is the data section. We don’t execute variables, we read them, which is reflected in the flags: ‘r–p’. But what if we change the value of a variable? That is where the third section is for: changed values of initialised variables. This can be seen from the flag of this section: ‘rw-p’, read, write and private. The fourth allocation lists [heap], this is a mandatory allocation in every process’ memory space, which holds (small) memory allocations, this is NOT the BSS section. In this case, the BSS section does not seem to be allocated.

By having memory allocations for /usr/lib64/ld-2.17.so we can see this is a dynamically linked executable. You can also see this by executing ‘file’ on the executable:

[oracle@oracle-linux ~]$ file /usr/bin/cat
/usr/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=3207edc47638918ceaeede21947a20a4a496cf63, stripped

If a linux executable is dynamically linked, you can see the libraries that are loaded by the dynamic linker/loader using the ldd utility:

[oracle@oracle-linux ~]$ ldd /usr/bin/cat
       	linux-vdso.so.1 =>  (0x00007ffceb3e4000)
       	libc.so.6 => /lib64/libc.so.6 (0x00007fd46fb7e000)
       	/lib64/ld-linux-x86-64.so.2 (0x000055d5253c9000)

This output shows the dynamic loader (/lib64/ld-linux-x86-64.so.2), and two libraries the dynamic loader loads: libc.so.6 and linux-vdso.so.1. The first one, libc, is the standard C library. The second one, linux-vdso is for virtual dynamic shared object, which is an optimisation for certain system calls to be executed in user space (notably gettimeofday()).
The other allocations that exist in our example are anonymous mappings (usually done by programs using the mmap() call):

7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0

And some allocations for system purposes, like stack, var, vdso and vsyscall.

Now that you have become familiar with some basic linux memory address space specifics, let’s take it a little further. It’s possible to see more about the memory segments using the proc filesystem smaps file:

[oracle@oracle-linux ~]$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
Size:                 44 kB
Rss:                  44 kB
Pss:                  44 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           44 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
Size:                  4 kB
Rss:                   4 kB
...etc...

Per allocation there are a lot of properties to be seen. ‘Size’ is the full size, ‘Rss’ is the resident set size, alias the amount of data of this segment that is truly resident for this process in it’s address space. ‘Pss’ is fairly unknown, and is the proportional size of this segment. The way it is proportional is that if pages in this allocation are shared with other processes, the size of these pages are divided by the number processes it is shared with. In this case, we have loaded the text segment of the cat executable into the process’ address space, which all is resident (size and rss are the same) and it’s not shared with any process (rss equals pss). There are many more properties, but these are out of scope for this blogpost.

Now let’s move on to Oracle. If you look at the maps output of the pmon process for example, you’ll see:

[oracle@oracle-linux 14153]$ cat maps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8f000-10de8000 rw-p 1058f000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10de8000-10e19000 rw-p 00000000 00:00 0
1190f000-11930000 rw-p 00000000 00:00 0                                  [heap]
...

Here we see the Oracle executable, with a text segment, a readonly data segment and a read/write data segment, and we see an anonymous mapping directly following the data segments. That’s the BSS segment!
However, what is more interesting to see, is the properties of the distinct memory allocations in smaps:

[oracle@oracle-linux 14153]$ cat smaps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:             267704 kB
Rss:               40584 kB
Pss:                 819 kB
Shared_Clean:      40584 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:        40584 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:                136 kB
Rss:                 124 kB
...

If we look at the text segment for the oracle binary, we see the total text size is 267704 kB (size), but resident (truly available for this process in its address space) is only 40584 kB (rss), and because the oracle executable’s text segment is shared with a lot of processes, the proportional size is only 819 kB (pss).

If you want to understand how much memory is taken in the system, the size is telling the total size of the segment, but it doesn’t say anything on true memory usage. The rss size tells the amount of pages for the segment that is paged in to the address space of every process, and can (and is, for oracle) different for every process. The pss size is the proportional size for every process. Probably the only way to tell the true amount of memory taken by executables and libraries is to add up all the pss sizes. Any other value only tells something about the process’ point of view on memory usage, but not overall, true consumed space because that would lead to counting too much.

This is different for anonymous allocations. Since anonymous allocations are created when a process is run, I’ve only seen them initialised purely private. For that reason rss and pss sizes are equal, because every process initialises it strictly for itself. This too works in a lazy allocation way. When memory is allocated, the size is defined, but is only really allocated once it’s truly used, which is expressed by a difference between size and rss.

2. shared pages
The Oracle databases relies on shared caches and data structures, which are put into what is called the SGA, the system global area. The main components of the SGA are the shared pool (shared structures), log buffer (change vectors to be written to disk to persist changes) and the buffer cache, amongst others. With any memory management option (manual management, ASMM (automatic shared memory management, sga_target) and AMM (automatic memory management, memory_target)) there is a SGA. Depending on the memory option, these are visible in a different way.

When manual memory or ASMM is used, shared memory is allocated as system V shared memory. The ‘classic’ way of looking at system V shared memory is using ipcs -m (m is for shared memory, you can also use s for semaphores and q for message queues):

[oracle@oracle-linux ~]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 655360     oracle     600        2932736    124
0x00000000 688129     oracle     600        905969664  62
0x00000000 720898     oracle     600        139673600  62
0x5f921964 753667     oracle     600        20480      62

Please mind that if you have more than one instance active, or an ASM instance active, you will see more shared memory allocations.
Apparently, the oracle database allocates a couple of shared memory segments. If you want to understand what these memory allocations are for, you can use the oradebug ipc command to see what their functions are:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
IPC information written to the trace file

This generates a trace file in the ‘trace’ directory in the diagnostics destination. Here is how this looks like (partial output with content of interest to this blogpost):

 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002cbe70 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2   655360 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cc000 00000000002cc000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000036000000 Minimum Subarea size 00400000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0   688129 0x00000060400000 0x00000060400000 0x00000060400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000036000000 0000000036000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1   720898 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008534000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000005000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3   753667 0x0000009ec00000 0x0000009ec00000 0x0000009ec00000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000005000 0000000000005000 default       readwrite

The first allocation is ‘fixed size’, alias the fixed SGA, the second allocation is the ‘variable size’, which contains the shared pool and the buffercache, the third allocation is the ‘redo buffers’ and the fourth is the ‘skgm overhead’ alias the index into the shared memory structures for this instance.

Because any memory allocation is visible in maps and smaps, this method can be used for shared memory too, to see how the shared memory segments are mapped into the process address space. All oracle database server processes have the shared memory segments for the instance mapped into their address space. The usage is different per process, so the amount of shared memory paged into the address space will be different:

...
12bcd000-12bee000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:05 688129                             /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:05 720898                             /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:05 753667                             /SYSV5f921964 (deleted)
7f473004e000-7f47301d4000 r-xp 00000000 fb:02 212635773                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
...

Shared memory is easily identified by the ‘s’, at which “normal” private memory mappings have ‘p’. If you want to know more about the process’ perspective of the shared memory, we can use smaps, just like with private memory mappings (virtual memory space of pmon):

60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd sh mr mw me ms sd
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
Size:               2860 kB
Rss:                 392 kB
Pss:                  36 kB
Shared_Clean:          0 kB
Shared_Dirty:        372 kB
Private_Clean:         0 kB
Private_Dirty:        20 kB
Referenced:          392 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms sd

These two shared memory segments are belonging to the fixed sga. The reason for two segments is the first page (0x1000 equals 4096, alias a single linux page) is readonly (r–s). The other fixed SGA segment is read write (rw-s). Here we see that from the process’ perspective it really doesn’t matter much if a piece of mapped memory is shared or private; it’s exactly handled the same way, which means the full segment is mapped into the process’ virtual memory space, but only once pages are touched (alias truly used), the process registers the address in its pagetable, and the pages become resident (as can be seen in the difference between the total size and the rss). The sole purpose of shared memory is it is shared between process. That the pages are shared is very well visible with the difference between rss and pss size. Its also easy to spot this shared memory segment is created from small pages; MMUPageSize and KernelPageSize is 4kB.

However, this yields an interesting question: shared memory does not belong to any single process. Does that mean that if a shared memory segment is created, it is truly allocated, or can shared memory be lazy allocated as well? Please mind that above statistics are the process’ perspective, not the kernel’s perspective.

One way to see the state of shared memory system wide, is using the ‘-u’ flag with the ipcs command:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  255684
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This is a really useful view! What we can see from the output from this command, is that nearly all pages allocated as shared memory are resident. By having statistics for shared memory pages allocated and resident we can conclude that shared memory too could be allocated in a lazy, alias on demand. Also, there is a difference between resident and allocated, which indicates lazy allocation too.

Inside the database I am aware of two parameters that could influence shared pages usage; pre_page_sga and _touch_sga_pages_during_allocation, see my article on these. However, what is interesting, is that these parameters are different for the instance I am testing with for this blogpost, which is running on a VM:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: page
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%page%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME   						   VALUE       								  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
olap_page_pool_size    				   0   									  TRUE 	   FALSE      FALSE
pre_page_sga   					   TRUE        								  TRUE 	   FALSE      FALSE
use_large_pages        				   TRUE        								  TRUE 	   FALSE      FALSE
_max_largepage_alloc_time_secs 			   10  									  TRUE 	   FALSE      FALSE
_olap_page_pool_expand_rate    			   20  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hi     				   50  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hit_target     			   100 									  TRUE 	   FALSE      FALSE
_olap_page_pool_low    				   262144      								  TRUE 	   FALSE      FALSE
_olap_page_pool_pressure       			   90  									  TRUE 	   FALSE      FALSE
_olap_page_pool_shrink_rate    			   50  									  TRUE 	   FALSE      FALSE
_realfree_heap_pagesize        			   65536       								  TRUE 	   FALSE      FALSE
_realfree_pq_heap_pagesize     			   65536       								  TRUE 	   FALSE      FALSE
_session_page_extent   				   2048        								  TRUE 	   FALSE      FALSE
_touch_sga_pages_during_allocation     		   FALSE       								  TRUE 	   FALSE      FALSE

14 rows selected.

In the database I created on my VM, pre_page_sga equals to TRUE and _touch_sga_pages_during_allocation to FALSE, which is the exact inverse of the settings of a database (PSU 160419) on a huge machine. Perhaps these parameters are dynamically set based on size of the SGA and logic (if _touch_sga_pages_during_allocation is TRUE, it makes sense to set pre_page_sga to FALSE, as it’s function has been performed by the bequeathing session.

However, having pre_page_sga set to TRUE it makes sense almost all SGA (shared) pages are allocated, because pre_page_sga (at least in Oracle 12, not sure about earlier versions, because the Oracle description of this parameter is different from what happens in Oracle 12) spawns a background process (sa00) that scans SGA pages, which means it pages them, resulting in the actual allocation. Let’s test this by setting pre_page_sga to false, it should lead to way lesser shared memory pages allocated, which will eventually be allocated as database processes are paging them in:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force;

And then look at ipcs -mu again:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  92696
pages swapped   0
Swap performance: 0 attempts   	 0 successes

As expected, only the bare necessary pages are resident after startup force, all the other shared pages will be slowly paged in as foreground and background processes touching SGA pages during execution.

How would that work when we set sga_max_size to a different value than sga_target? If the pages beyond the sga_target are never allocated, you could control the amount of SGA pages used by setting sga_target, but ‘reserve’ extra memory to use by setting sga_max_size higher, which is never allocated, so it is not wasted. Let’s setup the instance:

SQL> alter system set pre_page_sga=true scope=spfile;
SQL> show spparameter sga_target

SID    	 NAME  			       TYPE    	   VALUE
-------- ----------------------------- ----------- ----------------------------
*      	 sga_target    		       big integer 1000M
SQL> ! ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  102512
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This sets the pre_page_sga parameter from the spfile, which means the instance will spawn a process to touch SGA pages on next startup.
Currently, the sga_target for sizing the SGA is set to 1000M in the spfile.
ipcs tells us we got 256005 pages are allocated, which makes sense: 256005*4=1024020k, which is slightly more than the set 1000M, which means essentially sga_target equals pages allocated.

SQL> alter system set sga_max_size=2g scope=spfile;
SQL> startup force;
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

This sets sga_max_size to double the amount of sga_target, and ‘startup force’ bounces the instance.

SQL> show parameter sga_target

NAME   				     TYPE      	 VALUE
------------------------------------ ----------- ------------------------------
sga_target     			     big integer 1008M

Here we see the actual parameter in the database is set to 1008M. Now let’s look at the ipcs -mu values again:

> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  521923
pages swapped   0
Swap performance: 0 attempts   	 0 successes

521923*4=2087692. So (almost) all the memory set for sga_max_size is allocated. In fact, if you look at the values at instance startup values reported above, you see ‘Total System Global Area’ showing the 2G, it’s all SGA, so it’s all touched because of pre_page_sga being set to TRUE. So the next test would be to have pre_page_sga being set to FALSE:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

All memory is still declared SGA, as we can see. However, by having _touch_sga_pages_during_allocation set to FALSE and pre_page_sga set to FALSE, we should see only the actual used SGA pages being allocated:

SQL> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  91692
pages swapped   0
Swap performance: 0 attempts   	 0 successes

The above output shows the shared memory status directly after I restart my instance, so this is not only less than sga_max_size, it is even less than sga_target (91692*4=336768, ~ 336M). This will grow up to sga_target, because these pages will get paged in by the database processes.

How does this look like when we add in huge pages? In Oracle 12.1.0.2.160719 in my instance the parameter to tell oracle to allocate huge pages if there are any (‘use_large_pages’) is set to TRUE. This will make Oracle use large pages if any are available. This is true, even if there are not enough huge pages to satisfy the entire SGA; Oracle will just allocate all that can be allocated, and create a new shared memory segment using small pages for the remainder of the needed shared memory.

Sadly, it seems per memory segment statistics like rss, pss, shared and private clean and dirty, etc. are not implemented for huge pages:

[oracle@oracle-linux [testdb] ~]$ cat /proc/$(pgrep pmon)/smaps
...
61000000-d8000000 rw-s 00000000 00:0e 688129                             /SYSV00000000 (deleted)
Size:            1949696 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms de ht sd
...

This is the main shared memory segment, allocated from huge pages (as can be seen with KernelPageSize and MMUPageSize), which means it’s the segment holding the shared pool and buffercache. This can also be seen by the size: 1949696 kB, which is nearly the 2G of sga_max_size.

However, we can just use the global information on system V shared memory (ipcs -mu) and we can use the huge page information in /proc/meminfo:

[oracle@oracle-linux [testdb] ~]$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:    1100
HugePages_Free:      880
HugePages_Rsvd:      805
HugePages_Surp:        0
Hugepagesize:       2048 kB

The statistics of interest are:
hugepages_total: the total number of huge pages allocated. warning: huge pages memory allocated by the kernel is NOT available for allocation of regular sized pages (which means you can starve your processes and the kernel for normal pages by setting the number of huge pages too high).
hugepages_free: the number of huge pages which are not used currently. warning: this includes allocated but not yet initialised pages, which hugepages_rsvd shows.
hugepages_rsvd: the number of huge pages allocated but not yet initialised.
hugepages_surp: the number of huge pages allocated (truly allocated and not yet initialised) greater than the total number of huge pages set. this value can be greater than zero if the kernel setting vm.nr_overcommit_hugepages is greater than zero. The value of this setting is zero by default, and at least for usage with the Oracle database, this value should remain zero.

The same information can be obtained using ipcs -mu, but with a twist:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524803
pages resident  122881
pages swapped   0
Swap performance: 0 attempts   	 0 successes

Some of you might get the twist on this by looking at the number.
It turns out ipcs has no facility for huge pages, it just reports the number of pages as if these were 4 kB.
524803*4 (kB) / 1024 (to make it MB) = 2050.

Now going back to the goal of looking into this: I told shared memory is allocated and paged at startup time when _touch_sga_pages_during_allocation is set to TRUE (set to false as default value in my current database), and it could be explicitly paged by the background process sa00 after startup of the instance when pre_page_sga is set to TRUE. When both are set to false, shared memory allocated from default sized 4kB pages is allocated only when it’s used. In the above examples with huge pages, the tests were done with pre_page_sga set to false. This shows exactly the same ‘lazy allocation’ behaviour as 4kB pages.

When ‘extra’ memory is reserved from the operating system by setting sga_max_size to a higher value than sga_target, this will all be allocated and paged if either _touch_sga_pages_during_allocation or pre_page_sga is set to TRUE, which doesn’t make sense; if the memory is taken, you might as well use it. However, this is different if both _touch_sga_pages_during_allocation and pre_page_sga are set to false. All memory beyond sga_target up to sga_max_size is allocated, but never touched, and thus never paged in, so never truly allocated. Please mind linux itself understands this perfectly (aiming at huge pages and ‘reserved’ pages), however the system V ipc kernel settings do not; you need to set the shared memory values high enough to facilitate the total sum of sga_max_size values, not the truly used sizes as indicated by the sum of sga_target values.

The inspiration for this investigation came from a question on my blog. However, the question was about memory_target and memory_max_target and AIX. I do not have an AIX system at hand. I did not investigate the implementation of memory_target and memory_max_target on AIX. So I can’t comment on that. What I can say, is that on Linux, you really, really should use automatic shared memory management (ASMM) alias setting sga_target or setting it manually (and set huge pages!). If you are used to these memory management settings on databases not on AIX, it probably makes sense to use that on AIX too, even if the automatic memory management (AMM) alias setting memory_target is implemented brilliantly on AIX, for the sake of predictability and standardisation.

This blogpost is about using the linux ftrace kernel facility. If you are familiar with ftrace and specifically the function_graph tracer, you might already be aware of this functionality. This is Linux specific, and this facility is at least available in kernel 2.6.39 (Oracle’s UEK2 kernel).

What is a ‘kernel dive’? Whenever a process is running, it should mostly be in ‘user mode’, executing the program it is supposed to run. However, during running there could be situations (a lot of situations, depending on what the program is doing!) that the program needs something “from” the system. Such a thing could be allocating memory, or using a device that is shared like a block device, or a network device. These things are controlled by the kernel, and require a process to issue a system call. A user process executes a system call to request actions to such things. Starting from the system call, the execution ‘dives’ in the kernel, and executes in kernel or system mode. However, kernel dives are not limited to system calls; for example a page fault (paging in backing memory for allocated virtual memory) switches to system mode, as well as handling an interrupt.

The Linux kernel has a facility that is called ‘ftrace’. The name ftrace originally was named because of function tracing, but it has evolved into a tracing framework. It’s important to point out that ftrace currently ONLY works in KERNELSPACE. This means you miss the userspace code.

Ftrace uses both explicit tracepoints (defined in the linux kernel source), as well as dynamic tracepoints, for which the gcc -pg (profiling data) flag is used to capture function entry. For function exit a ‘trampoline’ is used. A trampoline here is an extra function executed (mcount) at function entry that stores the return address, and replaces the return address with that of the trampoline, so an exit can be detected.

the linux debugfs filesystem must be mounted for ftrace to work. You can check if the debugfs filesystem is mounted using:

[root@bigmachine ~]# mount -t debugfs

It does not return any rows if debugfs is not mounted. You can mount debugfs the following way:

mount -t debugfs none /sys/kernel/debug

Let’s do some basic steps first, just tracing an Oracle session!
First go to the tracing directory:

[root@bigmachine ~]# cd /sys/kernel/debug/tracing/
[root@bigmachine tracing]#

Obtain the PID of an Oracle foreground process, and enable tracing for this PID:

[root@bigmachine tracing]# echo 6431 > set_ftrace_pid

We also need to choose what we want to trace. Something very beneficial for understanding what is going on in the kernel is the tracer ‘function_graph’. Here’s how you enable that trace:

[root@bigmachine tracing]# echo function_graph > current_tracer

The trace output is in ‘trace’. Assuming that the PID to trace is idle, this is how the trace output looks like:

[root@bigmachine tracing]# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |

Now execute something silly (something which does little!).

TS@fv12102 > select * from dual;

D
-
X

Now let’s first show how much information is gathered:

[root@bigmachine tracing]# cat trace | wc -l
1926

That’s correct: approximately 1900 rows of trace data are created during only ‘select * from dual’. Now think about this: most of the things ‘select * from dual’ does are done in userspace. What does ‘select * from dual’ actually do in kernel space? Here’s a little grep to see what the Oracle process did:

[root@bigmachine tracing]# grep \|\ \ [a-zA-Z_0-9]*\(\)\ \{ trace
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
...much more...

Probably you are aware Oracle executes a lot of times() and getrusage() calls. As you can see, the system calls are also audited by the Linux system. Now just open the trace file with your favourite file viewer (I use ‘less’, you can use the vi commands to search for pieces of text), and peek in the file:

   0)               |  sys_getrusage() {
   0)               |    getrusage() {
   0)               |      k_getrusage() {
   0)               |        task_cputime_adjusted() {
   0)               |          cputime_adjust() {
   0)   0.045 us    |            nsecs_to_jiffies();
   0)   0.503 us    |          }
   0)   0.876 us    |        }
   0)   0.041 us    |        jiffies_to_timeval();
   0)   0.042 us    |        jiffies_to_timeval();
   0)               |        get_task_mm() {
   0)   0.047 us    |          _raw_spin_lock();
   0)   0.380 us    |        }
   0)               |        mmput() {
   0)   0.041 us    |          _cond_resched();
   0)   0.351 us    |        }
   0)   3.836 us    |      }
   0)   0.043 us    |      _cond_resched();
   0)   4.596 us    |    }
   0)   5.004 us    |  }

I skipped the beginning of the trace, which is actually the ending of the kernel code of the waiting on a next command of the Oracle process. If you scroll down to the end, you will see how the waiting on a next command (instrumented by the wait event ‘SQL*Net message from client’) is actually implemented on the kernel side using a read function on a pipe. You find the above shown systemcall, getrusage, just after the ending of the read function in the beginning of the trace (or search for ‘sys_getrusage’).

What is interesting is that this trace is showing the different functions in the kernel and which function is calling what function, made visible by accolades and indention, made to look like a c program. In other words: this allows you to see in what specific function of the kernel the time is spend, and how the total time of a kernel function is build up!

Now that basic usage is known, let’s step up to something interesting; the getrusage timing in interesting, but just an example. One of such really interesting things is IO.

First clear the trace file:

[root@bigmachine tracing]# echo > trace

Verify that the current tracer still is function_graph:

[root@bigmachine tracing]# cat current_tracer
function_graph

Verify the process id to trace (it should list the process id of the oracle foreground process you want to trace):

[root@bigmachine tracing]# cat set_ftrace_pid
6319

Now an additional next step to only trace the pread systemcall using the ‘set_graph_function’ facility:

[root@bigmachine tracing]# echo sys_pread64 > set_graph_function

Now make the Oracle session that is traced do a pread call. I used ‘select * from t1 where rownum=1’. Because I flushed the buffer cache prior to doing this (to make sure physical IO is needed), I did get 2 physical IOs, one for the segment header and one for the data block.

To make sure nothing else will get into the trace buffer (the file really is a buffer in memory), do the following to stop further tracing:

[root@bigmachine tracing]# echo 0 > tracing_on

This is how my output looks like:
https://gist.githubusercontent.com/FritsHoogland/758d106f8576ee61df76a0b6ca8e30b1/raw/f9ca7c01a46c760e2142baf48512fc731956b1fa/gistfile1.txt
I added line numbering to it (you can do that yourself too with the nl linux utility), so there are a few things I can point you to.

The first pread call ends at line# 568. That is a lot of information. It also shows how much stuff is done during an IO.
– The filesystem type matters! At line 7 you see vfs_read (the linux filesystem abstraction layer), but at line 19 you see filesystem specific code!
– At line 27 you see a XFS specific function indicating DIO (direct IO) is used.
– At line 33 you see an interaction with the block layer using the blk_start_plug() function. Linux uses a method to group IOs that is logically equal to filling up and later draining a bathtub using a plug.
– At line 328 you see the IO request has been built by the filesystem code, and the the request queue is unplugged: blk_flush_plug_list. You see some IO scheduler functions (starting with ‘elv’) and you see the deadline scheduler is used (deadline_add_request). Not very far after that, we enter the scsi layer (indicated by functions starting with ‘scsi_’).
– At line 431 you see the request being submitted to the device using the mpt_put_msg_frame function. We are now in the low level driver layer. We see there is time involved (82ms, you can see this is a virtual machine). You see all kinds of loops ending here. At this point the IO request has been sent to the IO device. However, there is more interesting stuff coming!
– At line 429 we see the kernel is preparing for waiting on the IO request to return. This is done in using the dio_await_completion function.
– At line 436 we are totally done submitting the IO request, and entering the Linux process scheduler. In the scheduler code we see housekeeping (update_blocked_averages, line 456), and rebalancing the process, which means trying to find the best cpu thread to execute on (load_balance, line 462), before the process finally goes to sleep.
– At line 487 the process is woken by an interrupt (not visible in the trace) from it’s uninterruptible sleep (state ‘D’). The IO then really finishes up; at function dio_bio_complete, line 500 the data of the IO request is put in a BIO structure which can be passed on to userspace, some further housekeeping is done (dio_complete, line 539), the access time is updated (touch_atime, line 552) and some xfs housekeeping, unlocking the inode (xfs_iunlock, line 559).

Let me show you something which shows how valuable this tracing is for finding (linux operating system) issues. This is another trace on the very same system with slightly different kernel settings: https://gist.githubusercontent.com/FritsHoogland/38ab62532988d3672b9d4d0414339506/raw/554d58eb2ec58b85c56c7c858ec6def340b2db6e/gistfile1.txt.

Can you spot the difference? If not, take a look at this diff: https://gist.githubusercontent.com/FritsHoogland/1535eea3469096afd3a6ff83c7fce8c4/raw/3e9aa481163ff3e944e5c1fcd7c7a79b58b68948/gistfile1.txt, at line number 115. The function called the second time is gup_huge_pmd. That’s a function for handling user pages, and this function is using huge pages! The change I made was setting vm.nr_hugepages from zero to a number higher than the database instance needed. So not only the Oracle database can benefit from huge pages, but also the Linux kernel IO subsystem!

If this wetted your appetite, read on! Here’s a trace, once again of pread: https://gist.githubusercontent.com/FritsHoogland/ed45e5bbd2423a5d51e0a74d8315e06a/raw/9d74c0ed297cb07b5ee868e51a2776cc6712db9a/gistfile1.txt. Can you see what is the difference here? If not, I created a diff of this trace and the previous one: https://gist.githubusercontent.com/FritsHoogland/6f0c83ecf39405e9f734c02113e28093/raw/d70dadef86670ec4657938041886e63e10c67c05/gistfile1.txt. A nice example of the difference is at line 42. The first trace file contains all kinds of functions for doing extent management for the XFS filesystem, until line 89. This is all replaced with a single function blkdev_get_block.

The similarity is it’s all the system call pread. The difference between the second and the third is that with the third pread trace I used Oracle’s ASM facility, alias database IO to a block device directly without a filesystem. This trace shows pread is not executing all the filesystem functions, because there is no filesystem. However, please do realise it’s all about spend time, not how many rows there are in a trace.

Now before jumping to conclusions about the greatness of ASM because it can skip a lot of the code path, please do realise that ASM adds code path inside the Oracle database, because some kind of disk space management must be done. Also there is an entire ASM instance for managing the diskspace (which is NOT part of database sessions’ code path). These are facts, not opinion. There are use cases for both filesystem usage and for ASM, although I have a personal preference for ASM.

Conclusion.
Ftrace does not get the attention it deserves. It’s a great tool for investigating time spend in the kernel, and it’s available by mounting the debugfs filesystem. This blogpost describes the function_graph tracer, there are other tracers too, it’s absolutely not limited to the description in this blogpost.

To show how ftrace with the function_graph tracer works I took the pread system call as an example. It turns out the pread system call comes in many shapes. Unless you intimately know a system, there can be different layers in the kernel in play when executing pread. However, on a normal system the main time component of the pread system call should be off CPU in uninterruptible state. And that is exactly what the function_graph tracer can tell.

Again, as a reminder, ftrace only works for kernel level (“system”) execution of a process.

%d bloggers like this: