Archive

Tag Archives: linux

This post is about how to manage grub2 in an easy way.

grub1

In the past, which is before linux EL7, the boot loader was grub, the grand unified bootloader (version 1). Things were very simple; if you installed another kernel (using rpm) it would add an entry to grub’s configuration in /boot/grub/menu.lst. If you wanted to change grub to boot that newly installed kernel by default you edit /boot/grub/menu.lst and set ‘default’ to the number, counting from zero, of the newly installed kernel, in the order of the kernels listed. If you wanted a certain option set for booting the kernel, you added it to the kernel line.

grub2

Then came grub2. Reading up on the grub2 homepage it turns out that grub 1 apparently was old code, and it was clunky, hard to maintain and extend with new features. Other reasons that probably contributed to grub2 acceptance where UEFI support as well as limited support for other architectures than x86 and x86_64.

grub2: the mechanics

The heart of the grub2 configuration is by default in /boot/grub2/grub/grub.cfg. The start of the file looks like bash shell script. A little further in the file, the installed kernels and arguments are found:

### BEGIN /etc/grub.d/10_linux ###
menuentry 'Oracle Linux Server (4.1.12-124.18.6.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $me
nuentry_id_option 'gnulinux-4.1.12-124.18.6.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod xfs
        set root='hd0,msdos1'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  4fc4063c-861b-4b07-8c28-e847a53
5c6cb
        else
          search --no-floppy --fs-uuid --set=root 4fc4063c-861b-4b07-8c28-e847a535c6cb
        fi
        linux16 /vmlinuz-4.1.12-124.18.6.el7uek.x86_64 root=/dev/mapper/ol-root ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet nu
ma=off transparent_hugepage=never
        initrd16 /initramfs-4.1.12-124.18.6.el7uek.x86_64.img
}
menuentry 'Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.1.12-112.16.4.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {

A knee-jerk reaction would be to see if this file can be edited, but the heading of the file quite clearly says to not do that and that any modification could be overwritten:

#
# DO NOT EDIT THIS FILE
#
# It is automatically generated by grub2-mkconfig using templates
# from /etc/grub.d and settings from /etc/default/grub
#

So, this points us /etc/default/grub. To be sure what actually to do, the documentation/manual makes it clear how to control what kernel and options are chosen in the chapter 6.1 simple configuration handling:

grub-mkconfig does have some limitations. While adding extra custom menu entries to the end of the list can be done by editing /etc/grub.d/40_custom or creating /boot/grub/custom.cfg, changing the order of menu entries or changing their titles may require making complex changes to shell scripts stored in /etc/grub.d/. This may be improved in the future. In the meantime, those who feel that it would be easier to write grub.cfg directly are encouraged to do so (see Booting, and Shell-like scripting), and to disable any system provided by their distribution to automatically run grub-mkconfig.

The file /etc/default/grub controls the operation of grub-mkconfig. It is sourced by a shell script, and so must be valid POSIX shell input; normally, it will just be a sequence of ‘KEY=value’ lines, but if the value contains spaces or other special characters then it must be quoted.

What this means to say is that for basic operation and selection, the /etc/default/grub file must be edited, after which grub-mkconfig must be run to “build” the actual grub settings. That’s not extremely intuitive from my point of view.

Let’s look at the /etc/default/grub on my test system, which is not touched as far as I know:

[root@o184 ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never"
GRUB_DISABLE_RECOVERY="true"

The first thing to look at is ‘GRUB_DEFAULT’, which is the default grub entry/kernel that is booted. It is set to ‘saved’, which is not really helpful. The utility ‘grub2-editenv’ can be used to list what the saved entry is:

[root@o184 ~]# grub2-editenv list
saved_entry=0

Which means the first (this is counting from zero) entry in /boot/grub2/grub.cfg. I have not come across a utility that is installed by default to list the menu entries with their number in grub.cfg, it’s easy to do with some shell commands, although that feels clumsy to me:

[root@o184 ~]# cat /boot/grub2/grub.cfg | grep ^menuentry | nl -v0
     0	menuentry 'Oracle Linux Server (4.1.12-124.18.6.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.1.12-124.18.6.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
     1	menuentry 'Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.1.12-112.16.4.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
     2	menuentry 'Oracle Linux Server (3.10.0-862.11.6.el7.x86_64 with Linux) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-862.11.6.el7.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
     3	menuentry 'Oracle Linux Server (0-rescue-069a6c1ff25b409fa87b8e587a2f8b4d with Linux) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-0-rescue-069a6c1ff25b409fa87b8e587a2f8b4d-advanced-64c9f287-84a8-48ec-b048-364295362114' {

So in this case kernel 4.1.12-124.18.6.el7uek.x86_64 is started by default.

In fact, the value of ‘saved’ is stored in /boot/grub2/grubenv:

[root@o184 ~]# cat /boot/grub2/grubenv
# GRUB Environment Block
saved_entry=0
################...

There is more to tell about the GRUB_DEFAULT and especially GRUB_SAVEDEFAULT; if GRUB_SAVEDEFAULT is set and set to ‘true’, and GRUB_DEFAULT is set to ‘saved’, then any kernel that is chosen in grub at boot will be saved as default. However, as far as I can see, this option is not set in a default /etc/default/grub file.

Another caveat is that by setting the number of the menuentry, the order could be changed when installing a new kernel, and the numbering may be different. For that reason, it’s also possible to set either the entry description after menuentry, or a name (“identifier”) set with –id at the menuentry line.

grubby

This is all nice, but it feels really indirect to me, multiple files must be combined to understand the current settings, and multiple commands must be entered to make a change. This also means it’s a little less easy to automate.

Now on to the actual reason for this blogpost. There actually is a utility that can do most of the manipulation using a single command, which is installed by default. However, it is not very well known. Let me introduce ‘grubby’ to you!

List grub entries:

[root@o184 ~]# grubby --info=ALL
index=0
kernel=/boot/vmlinuz-4.1.12-124.18.6.el7uek.x86_64
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never "
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-124.18.6.el7uek.x86_64.img
title=Oracle Linux Server (4.1.12-124.18.6.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5
index=1
kernel=/boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never "
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-112.16.4.el7uek.x86_64.img
title=Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5
...

List current default kernel:

[root@o184 ~]# grubby --default-kernel
/boot/vmlinuz-4.1.12-124.18.6.el7uek.x86_64

Set another kernel to boot:

[root@o184 ~]# grubby --set-default /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
[root@o184 ~]# grubby --default-kernel
/boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64

List current settings of the grub2 menuentry:

[root@o184 ~]# grubby --info /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
index=1
kernel=/boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never"
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-112.16.4.el7uek.x86_64.img
title=Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5

Grubby also facilitates making changes to the arguments of the kernel.
For example, if you want to change the setting ‘numa=off’ to ‘numa=on’, you can remove the argument:

[root@o184 ~]# grubby --update-kernel /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 --remove-args="numa=off"
[root@o184 ~]# grubby --info /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 | grep ^args
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet transparent_hugepage=never"

And then add it with the correct argument:

[root@o184 ~]# grubby --update-kernel /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 --args="numa=on"
[root@o184 ~]# grubby --info /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 | grep ^args
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet transparent_hugepage=never numa=on"

Grubby makes it easy to list and manage the grub2 boot loader settings, and also easily be used in scripts.

Advertisements

This is a writeup on some scenario’s for disk issues that you could encounter when running linux systems. The linux systems I am talking about are centos/redhat/oracle (EL) version 7 systems. This is just a writeup for myself to know how to deal with different scenario’s, hopefully other people find this interesting too. I don’t believe there is any difference for the below scenario’s and resolutions between running physical/bare metal, virtualised or in the cloud, provided you can get access the you need (like the BIOS POST console). The test configuration uses (legacy) MBR (meaning non UEFI), grub2 as boot loader, LVM for meta devices (you don’t want to run without LVM!!) and XFS for the filesystems.

I included a way to introduce each corruption yourself, so you can actually rehearse these kind of corruptions. These self introduced corruptions are simply parts overwritten and can be recovered from. In reality this might be different: if a disk is truly physically broken, there is no way you can make it come to live again with software, and the data on it is simply gone. Also, if the physical disk is not harmed, but the filesystem is damaged, I think most of you understand that if the amount of damage is big enough, at a certain point it is beyond repair.

REALLY IMPORTANT NOTICE

Modern filesystems like XFS, EXT3/4, NTFS are ‘logging filesystems’. This means changes to metadata are written into a log on disk called the ‘intent log’, after which the true change is written and then the corresponding log in the intent log is flagged as done. This means that unclean mounts or crashes can be recovered from by replaying the intent log to make sure all transactions are fully written.

However…by default these filesystems do not log actual data changes. The protected writes are filesystem structure changes. This is especially important to realise if writes are not done via O_DIRECT, and thus end up in the page cache.

This means that data corruption can still occur, despite having a modern, logging filesystem. Things like databases therefore keep consistency mechanisms in play to validate data consistency.

I do not know enough about BTRFS and ZFS to know if these do things differently from the logging perspective. Not logging all IO for “normal” filesystems is a performance choice.

Let’s start: know your system!

It really helps if you understand how your system looks like. This is how my system looks like:

# parted /dev/sda print
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 68.7GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  1075MB  1074MB  primary  xfs          boot
 2      1075MB  68.7GB  67.6GB  primary               lvm

The first partition of /dev/sda is set to be the boot partition (flags: boot), and is an XFS filesystem partition. The second partition is an LVM physical volume. ‘lsblk’ can look a bit more into the functions of the partitions, and reveals any other blockdevices.

# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    0   64G  0 disk
├─sda1        8:1    0    1G  0 part /boot
└─sda2        8:2    0   63G  0 part
  ├─ol-root 249:0    0   41G  0 lvm  /
  ├─ol-swap 249:1    0    2G  0 lvm  [SWAP]
  └─ol-home 249:2    0   20G  0 lvm  /home
sr0          11:0    1  540M  0 rom

This shows that /dev/sda1 is mounted as /boot on the linux installation, and the sda2 LVM volume contains three logical volumes named: root, swap and home, which are mounted on /, as swap and on /home. There also is a ‘rom’ block device, which is the (oracle) linux boot image. This image can be downloaded from Oracle’s software delivery cloud, and is 566MB, and contains the rescue boot environment that is used in this blogpost. Any serious sysadmin should have such an image on hand to perform emergency recovery if needed.

Normally, the local disks are the first ones to boot from before other bootable devices such as a CD/DVD disk image in the BIOS. This means that I can leave the linux boot image connected, the system will normally boot from its hard disks and essentially not touch the DVD image. If for some reason the BIOS can’t find the MBR on the local disks, it will scan for MBR records on other disks in the order as set in the BIOS, and find the linux boot image and boot from that. On a real production system you don’t leave such an image connected, normally it’s attached when there is a need.

If you want or must use the linux boot image and not the local installation, you have to interfere the BIOS POST, and change the boot order. I am using virtualbox for most of my tests, the way to do that is: press F12 (Mac: fn-F12) and select CD-ROM as boot device and press enter.

To use the rescue mode on the linux boot image, boot it, then select “troubleshooting” (down arrow, enter), and “rescue a Oracle linux system” (down arrow, enter). This will boot linux from the boot image, and run anaconda. Anaconda gives you the following options:

1. Continue

This will scan your available disks, and mount the filesystems it could find under /mnt/sysimage in read-write mode. Not all filesystem issues will allow mounting read-write.

2. Read-only mount

This will scan your available disks, and mount the filesystems it could find under /mnt/sysimage in read-only mode. Obviously, not all issues allow read only mounting too.

3. Skip to shell

This will do no scanning of disks, so if you want to prevent any access to the local disks, this is the options to choose.

4. Quit (reboot)

This is a safe reboot (stopping all the started units). This is the same that would happen if the shell is exited from using option 1, 2 and 3.

Let’s look at a couple of different scenario’s:

Corrupted MBR

A MBR linux system scans the disks it can see in the order that is set in the BIOS, and boots from the first MBR found. This will normally be your linux installation on the local disks. If you have a corrupted or otherwise unbootable MBR on your local disks, and the linux boot image attached, it will boot from the linux boot image, unless it is set to explicitly not boot that device.

If no usable MBR is found, it will tell you on the console. On virtualbox, it shows:

FATAL: Could not read from the boot medium! System halted.

If just the MBR is damaged/corrupted, it can be recovered. If you want to rehearse recovering from a corrupted MBR, overwrite the MBR in the following way. Please mind this is machine dependent, look for specifics about the system to the overview above with parted and lsblk:

# dd if=/dev/zero of=/dev/sda bs=448 count=1

If you detach any other bootable devices, and reboot your system after writing over the MBR, it should not find anything to boot from, and your system will give an error like seen above, indicating it’s unable to boot.

In order to restore the MBR for grub2, perform the following tasks:
– Boot from the linux boot image.
– Start rescue mode.
– Choose ‘1’/continue, and enter to go to the shell.
– Change the root to your local disks: ‘chroot /mnt/sysimage’.
– Next thing is to understand your system. Based on the parted executable information above, we know the boot flag is set on device /dev/sda. This means we have to enter the following in the console:

grub2-install /dev/sda

This will report, if your disk is well enough to hold the information, and not really broken:

Installing for i386-pc platform.
Installation finished. No error reported.

– Now the MBR is written, and you can reboot your system by exiting the chroot jail and shell: ‘exit’ and ‘exit’ to boot your system from the local disks again.

Corrupted /boot filesystem

If the /boot filesystem, which holds the linux kernel and grub settings, is corrupted, the system will boot, because that is handled by the MBR, but then very quickly tell you it doesn’t recognise the filesystem:

.
error: unknown filesystem.
Entering rescue mode...
grub rescue>

A corrupted /boot filesystem can be recovered from in most cases, obviously depending on what and how much is corrupted in the filesystem, and the underlying device must be a able to function after the corruption. This can be rehearsed by overwriting part of the filesystem. In order to be absolutely sure the overwritten data on the disk is not changed by anything else, boot the linux boot image in rescue mode, and choose option ‘3’, and then overwrite part of the filesystem in /dev/sda1:

# dd if=/dev/zero of=/dev/sda1 bs=1k skip=2 count=1

In order to recover /boot, perform the following tasks:
– Boot from the linux boot image.
– Start rescue mode.
– Choose ‘1’/continue, and enter to go to the shell. If you are rehearsing: the dd command overwritten part of the filesystem header, it will not be recognised by lsblk, and if you take a look in the systems fstab (/mnt/sysimage/etc/fstab), it is mounted by UUID, but the UUID can’t be taken from the corrupted filesystem. The root and home filesystems are found in LVM, and mounted. This means you have to know /dev/sda1 is /boot.
– Repair the filesystem:

# xfs_repair /dev/sda1

Depending on the corruption, this might or might not succeed. If you are rehearsing this using the above dd command, you will see a lot of dots being printed searching for a superblock, which it will find and then recover from.
– Exit the shell to reboot the system: ‘exit’.

Corrupted root filesystem in LVM

When the root filesystem is corrupted, the system will boot as usual, show the different kernels in the grub dialogue, and then when the choice times out, boot the pre-selected kernel and try to boot linux. After the kernel is loaded, the root filesystem will be mounted, and it will fail. This generates the following message:

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt to a USB stick or /boot after mounting them and attach it to a bug report.

In order to rehearse recovering a corrupted root filesystem, boot the linux boot image in rescue mode, and choose option ‘3’. You will find that the volume group is not active, it needs to be activated using:

# vgchange -ay
 3 logical volume(s) in volume group "ol" now active

Then overwrite the filesystem header for the logical volume ‘root’ in the volume group ‘ol’:

# dd if=/dev/zero of=/dev/ol/root bs=1k skip=2 count=1

A corrupted root filesystem can be recovered from in most cases, obviously depending on what and how much is corrupted in the filesystem, and the underlying device must be able to function after the corruption. In order to recover perform the following tasks:
– Boot from the linux boot image.
– Start rescue mode.
– Choose ‘3’.
– Activate LVM:

# vgchange -ay

– Repair the filesystem in the logical volume

# xfs_repair /dev/ol/root

– Exit the shell to reboot the system: ‘exit’.

Corrupted non root filesystem in LVM

On my system, I only got /home as a non root filesystem in LVM. When I corrupted this filesystem, the linux system will came up without any complaining on the console. To be honest, this was kind of surprising to me.

I needed to dig around in the /var/log/messages file to find an error message:

Oct  2 10:57:11 localhost lvm: 3 logical volume(s) in volume group "ol" now active
Oct  2 10:57:11 localhost systemd: Started LVM2 PV scan on device 8:2.
Oct  2 10:57:11 localhost mount: mount: wrong fs type, bad option, bad superblock on /dev/mapper/ol-home,
Oct  2 10:57:11 localhost mount: missing codepage or helper program, or other error
Oct  2 10:57:11 localhost mount: In some cases useful info is found in syslog - try
Oct  2 10:57:11 localhost mount: dmesg | tail or so.
Oct  2 10:57:11 localhost systemd: Found device /dev/mapper/ol-home.
Oct  2 10:57:11 localhost systemd: Mounting /home...
Oct  2 10:57:11 localhost systemd: Starting LVM2 vgchange on device 8:2...

Based on what I see, I interpret this as systemd activating LVM, and then mounting the filesystem, which gives an error. What I don’t understand is what systemd is doing when it says it found the logical volume device and mounts it.

This can be rehearsed by booting into rescue mode, choose option ‘3’, activate LVM ‘vgchange -ay’, and then overwrite the filesystem header in the logical volume: ‘dd if=/dev/zero of=/dev/ol/home bs=1k skip=2 count=1’.

A corrupted non root filesystem in LVM can recovered from in most cases. There is no need to use the linux boot image for rescue mode. A linux system will come up when the root filesystem is valid and it can load enough from it to boot linux. The following needs to be done:
– Start a shell as root.
– Repair the filesystem in the logical volume: ‘xfs_repair /dev/ol/home’.
– Mount all filesystems: ‘mount -a’. This should mount the just repaired filesystem.

Corrupted swap in LVM

Just like a non root filesystem corruption mentioned above, the inability to activate a swap device is logged in the /var/log/messages file, but there is no explicit message on the console. Outside of logging the inability to active the swap device the system will come up happily, without swap if the only swap device is corrupted. These are the messages I could find in the /var/log/messages file:

Oct  2 11:32:39 localhost systemd: Found device /dev/mapper/ol-swap.
Oct  2 11:32:39 localhost systemd: Activating swap /dev/mapper/ol-swap...
Oct  2 11:32:39 localhost kernel: alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
Oct  2 11:32:39 localhost kernel: XFS (sda1): Mounting V4 Filesystem
Oct  2 11:32:39 localhost swapon: swapon: /dev/mapper/ol-swap: swapon failed: Invalid argument
Oct  2 11:32:39 localhost systemd: dev-mapper-ol\x2dswap.swap swap process exited, code=exited status=255
Oct  2 11:32:39 localhost systemd: Failed to activate swap /dev/mapper/ol-swap.
Oct  2 11:32:39 localhost systemd: Dependency failed for Swap.
Oct  2 11:32:39 localhost systemd: Job swap.target/start failed with result 'dependency'.
Oct  2 11:32:39 localhost systemd: Unit dev-mapper-ol\x2dswap.swap entered failed state.
Oct  2 11:32:39 localhost kernel: Unable to handle swap header version 0

It seems systemd understands the swap device is not usable, previously we saw systemd didn’t detect filesystem issues on an LVM device.

This can be rehearsed by booting into rescue mode, choose option ‘3’, activate LVM ‘vgchange -ay’, and then overwrite the swap header: ‘dd if=/dev/zero of=/dev/ol/swap bs=1k count=2’.

A corrupted swap device in LVM can be recovered from if the underlying device is able to function.
– Start a shell as root.
– Format the swap device again, in my case the swap device is /dev/ol/swap: ‘mkswap /dev/ol/swap’.
– Activate swap: ‘swapon -a’.
– Validate swap has been activated: ‘swapon -s’.

This blogpost is about using ansible vault. Vault is a way to encrypt sensitive information in ansible scripts by encrypting it. The motivation for this blogpost is the lack of a description that makes sense to me of what the possibilities are for using vault, and how to use the vault options in playbooks.

The basic way ansible vault works, is that when ansible-playbook reads a yaml file, it encounters $ANSIBLE_VAULT;1.1;AES256 indicating ansible vault is used to encrypt the directly following lines, it will use a password to decrypt it, and then uses the decrypted version in memory only. This way secrets can be hidden from being visible. Obviously, the password will allow decrypting it, and the password must be used in order for ansible-playbook to decrypt it.

The original use of vault is to encrypt an entire yaml file. As of Ansible version 2.3, ansible allows the encryption of single values in a yaml file.

1. File encryption
A very silly example of ansible-vault, which shouldn’t be used (!) is this:

$ echo "---
- hosts: localhost
  vars:
    a: aaa
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls" > test1.yml

When this runs, it will execute on localhost, set to ansible variables (a and b) and then use the shell module to execute “ls”. Again, this is just an example, and is silly on itself. This is how it’s run:

$ ansible-playbook test1.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

This can be encrypted using ansible vault in the following way:

$ ansible-vault encrypt test1.yml
New Vault password:
Confirm new Vault password:
Encryption successful

Now the yaml file is entirely encrypted:

$ cat test1.yml
$ANSIBLE_VAULT;1.1;AES256
32663536343532643864353135643431373830333236636230613264383531336562323530386632
3933353561316530333538363032633731336135396434630a626631303839646161363432396230
32336130633264343766393562656138363132393835646137633065393739336234336463336138
6564613730333436650a336537326233343566633761383436356136666533326561346530613566
64316336393036346538656338303734623161386435373138656532663737666662633765656438
37303666393939356161393134666130303435333463636630386434626335383164663761383035
35613437356361366665633931346664386535393936396637363561386164613337373562313264
63333530353263363437386536393765393930653234353733313433643637333932353638353339
31623366376566613766366236356163303837666664646465363439313265383562656637623266
6463663065383030626232643365623437666332313631376262

It still can be run, but you need to specify the password. In this case it asks to enter a password:

$ ansible-playbook test1.yml --ask-vault-pass
Vault password:

PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Of course this is not practical, the most common and probable way you want to use this is to have variables with secret information, like passwords, tokens, private keys, personal data, etc. encrypted, but the playbook still be readable. You should use a version control tool like git, and use functionality like diff!

The first thing is to move the variables out of the playbook into a dedicated section. In this case I chosen to assign them based on host, in a directory called ‘host_vars’, and then for localhost ‘host_vars/localhost’:

$ cat host_vars/localhost/vars.yml
---
a: aaa
b: bbb

And then of course remove them from the playbook itself:

$ cat test2.yml
---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

If this is run, it will pick up the variables (for which I put in a debug task):

$ ansible-playbook test2.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

Okay, now let’s say I need to encrypt the variable ‘a’. I can bluntly encrypt vars.yml, and that will work, however, then you don’t know where the variable comes from. The solution to that is ‘indirect referencing’. This is done by adding a second file to host_vars/localhost called ‘vault.yml’, which will be the encrypted file. In the case of encrypting variable a, do the following:
1. Change the vars.yml file to assign {{ vault_a }} to a:

$ cat host_vars/localhost/vars.yml
---
a: "{{ vault_a }}"
b: bbb

2. Define the value of a to vault_a in vault.yml:

$ cat host_vars/localhost/vault.yml
---
vault_a: aaa

3. Encrypt vault.yml:

$ ansible-vault encrypt host_vars/localhost/vault.yml
New Vault password:
Confirm New Vault password:
Encryption successful

Now it can be used by specifying the password again:

$ ansible-playbook test2.yml --ask-vault-pass
Vault password:
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

This should not be a surprise, as I said, ansible will recognise an encrypted file, and decrypt it automatically. However, if you looked carefully you saw something else that is important to realise: I encrypted the variable a containing ‘aaa’, and it was revealed to me using the debug statement. I think this is reasonable, because it’s been explicitly programmed to do so. You should write your playbooks not to reveal any information that you encrypted previously.

2. Variable encryption
If you have larger sets of variables to be encrypted, the above method makes sense. However, if you want to encrypt only a few variables it’s probably simpler to encrypt only the variable you want to be encrypted. The ansible-vault executable provides a facility for that called ‘encrypt_string’. The way this works is that you simply specify the variable and the variable value with ansible-vault, which then outputs the encrypted value, which you need to put in a playbook yourself. For example your ‘a’ variable with the value ‘aaa’:

$ ansible-vault encrypt_string --name 'a' 'aaa'
New Vault password:
Confirm New Vault password:
a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
Encryption successful

Now use this to input to a playbook:

---
- hosts: localhost
  vars:
    a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

3. Vault passwords
The next thing to look at is how to specify the decryption password. One way or another, you need a password specified to decrypt. The first option is to specify the password at the CLI, which is done in the previous examples by specifying ‘–ask-vault-pass’ as an argument to ansible-playbook. This is a good option for running playbooks ad-hoc.

If you want playbooks to run from a facility like ansible tower, rundeck or semaphore, you need to specify the password in non-interactive, in a file. This is often implemented as a hidden file, like ‘.vault_pass.txt’. Of course this then means the password is readable if you got command line access. The main idea here is this gives you an option to encrypt sensitive data in a playbook and store it in a version control system, and do NOT store the password (file) in version control.

$ echo "mypassword" > .vault_pass.txt
$ ansible-playbook --vault-password-file .vault_pass.txt test2.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

There is a caveat here too, into which I ran because my removable HDD enforces 777 on my files, which means the execute bit is set:

$ ansible-playbook --vault-password-file .ttt test2.yml
 [WARNING]: Error in vault password file loading (default): Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this
is not a script, remove the executable bit from the file.

ERROR! Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this is not a script, remove the executable bit from the file.

The message is self-explanatory: if ansible finds the passwordfile specified to be executable, it will execute it. In my case, I created a little shell script to do that:

#!/bin/bash
echo "mypassword"

3.a. Vault-id for passwords
Since ansible version 2.4 the official way to use encryption is by using vault-id. Vault-id is more or less the same as the above usage, however, it adds a new feature that could be important: the option to use multiple vault passwords in the same playbook. The way this works is quite simple: ansible tries all the passwords it has been given until it finds one that works. Please mind vault-id only works with encrypted files, not with encrypted variables! Below is an example of the use of vault-id with multiple passwords:

$ echo "---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}" > test3.yml
$ echo '---
a: "{{ vault_a }}"
b: "{{ vault_b }}"' > host_vars/localhost/vars.yml
$ echo "password_a" > .a.vault_pass.txt
$ echo "password_b" > .b.vault_pass.txt
$ echo "---
vault_a: aaa" > host_vars/localhost/vault_a.yml
$ echo "---
vault_b: bbb" > host_vars/localhost/vault_b.yml
$ ansible-vault --vault-id a@.a.vault_pass.txt encrypt host_vars/localhost/vault_a.yml
Encryption successful
$ ansible-vault --vault-id b@.b.vault_pass.txt encrypt host_vars/localhost/vault_b.yml
Encryption successful
$ ansible-playbook --vault-id a@.a.vault_pass.txt --vault-id b@.b.vault_pass.txt test3.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

During investigating how Oracle works with regards to waiting, I discovered an oddity. I was researching for my redo blog posts, and found that in a certain case, Oracle uses the ‘nanosleep’ system call. As the name of this system call suggests, this call allows you to let a process sleep with nanosecond precision.

The oddity that I found, was the following:

$ strace -Tp 42
...
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
...

I executed strace with the ‘-T’ argument (show time spend in system calls), and ‘-p’ to attach it to the process I want to measure. The nanosleep function was called with struct time spec set to {0, 1000}, which means 0 seconds and 1000 nanoseconds. However, the actual time spend was between 0.000191 and 0.000253 seconds, which is 1910000 and 253000 in nanoseconds. That’s approximately 200 times longer than it’s set to wait!

Since then I have spend a fair amount of time to understand why this is happening. So far I found a number of facts that I think explain this, but if anyone else has something to throw in here, please leave a comment.

The first thing I did was create a C program that just runs nanosleep in a loop, so I can run the nanosleep systemcall in isolation. Here’s the nanosleep PoC code (measure_time.c):

#include 
#include <time>
int main(int argc, char **argv)
{
  int       loop_counter, loop_total;
  struct    timespec sleep;

  sleep.tv_sec=0;
  sleep.tv_nsec=1000;

  loop_total=10;
  loop_counter=0;

  while ( loop_counter &lt; loop_total ) {
    nanosleep(&amp;sleep,NULL);
    loop_counter++;
  }
}

Then, when researching I found that there is a systemcall that shows the actual clock resolution time! I created another very small C program to run just that (measure_resulution.c):

#include 
#include <time>
int main(int argc, char **argv)
{
  int       result;
  struct    timespec resolution;
  clockid_t clk_id;
  clk_id = CLOCK_MONOTONIC;

  clock_getres(clk_id, &amp;resolution);

  printf("Resolution: %ld s, %ld ns\n", resolution.tv_sec, resolution.tv_nsec);
}

This c program can be compiled using ‘gcc measure_resolution.c -o measure_resolution’. This is what it showed:

$ ./measure_resolution
Resolution: 0 s, 1 ns

So, my system has a 1 ns resolution, despite my nanosleep systemcalls taking way longer. A little later I found out this information can be obtained directly in /proc/timer_list:

$ grep resolution /proc/timer_list | uniq
  .resolution: 1 nsecs

The first thing I found while researching, is that when I change the priority of the process (not the 1000ns), I can get lower sleeping times:

# strace -T chrt --rr 99 ./measure_time
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

The lowest time here is 0.036ms (36000ns). In order to look further, I found that newer kernel versions have an addition to perf that can measure wait time, scheduler delay (‘run queue’) and actual run time; perf sched timehist. So at this point it seemed that was a way to understand more about the timing of nanosleep. In order to look at that I created a virtual machine with fedora 27 (kernel version 4.14.13), and compiled my measure_time.c program on it.

The next thing to do is run the measure_time executable with perf sched record:

# perf sched record chrt --rr 99 ./measure_time
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.183 MB perf.data (54 samples) ]

Then run perf sched timehist to look at the data and find the pid of the measure_time executable:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.760817 [0000]                                0.000      0.000      0.000
   37745.760823 [0000]  rcu_sched[8]                        0.000      0.000      0.005
   37745.762367 [0002]  perf[2777]                          0.000      0.000      0.000
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.762431 [0002]  migration/2[21]                     0.000      0.001      0.011
   37745.762479 [0003]                                0.000      0.000      0.000
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764108 [0003]                                1.583      0.000      0.045
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005

So it’s pid 2778, now add that to perf sched timehist:

# perf sched timehist -p 2778
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005
   37745.764153 [0003]  measure_time[2778]                  0.034      0.002      0.004
   37745.764195 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764236 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764291 [0003]  measure_time[2778]                  0.050      0.002      0.004
   37745.764347 [0003]  measure_time[2778]                  0.051      0.002      0.004
   37745.764405 [0003]  measure_time[2778]                  0.052      0.002      0.004
   37745.764478 [0003]  measure_time[2778]                  0.067      0.006      0.005
   37745.764506 [0003]  measure_time[2778]                  0.022      0.002      0.005
   37745.764581 [0003]  measure_time[2778]                  0.022      0.002      0.052

The way to look at this: first perf is run. Because we just started running, there is no wait time, the task gets the state running, then it needs to be scheduled to run. That time is in ‘sch delay’, obviously scheduling delay. Then it gets truly is running and runs for some time, which time is visible in the ‘run time’ column. Next, the measure_time executable is run.

The first time measure_time runs, it needs to do some initialisation, like the dynamic loader loading libc, allocate memory (mmap calls), etcetera. This is visible in the run time of 1.583 ms. However, then it runs into the nanosleep call. The nanosleep call puts the process into an interruptible sleep. The sleep is visible in the wait time column. Here we see it’s waiting/sleeping for between 0.022 and 0.067 ms. After waiting, the process gets the status running, after which the scheduler must schedule the task and make it running. That time is in scheduling time again. Then it runs, but all it needs to do is end the nanosleep call, add one to the loop_counter variable, jump back and enter the nanosleep call again, as we can see, that only takes 0.004 ms.

Okay, so far it was quite a geeking out session, but not really an answer has been found to why a 1000ns sleep turns out to be way higher. I sought out help via social media, and gotten some sane advise from Kevin Closson. Essentially, the advise was to run it on physical hardware instead of my test virtual machine, and make the load as low as possible in order for the scheduler to be able to schedule my process as quickly as it can. Yes, very obvious actually.

So, the next target is a physical machine in our Enkitec lab. This is on a X3 generation machine (Sandy Bridge EP). I compiled measure_time.c, and ran ‘strace -T chrt –rr 99 ./measure_time’ again:

nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

That’s way better! But still the nanosleep timing is approximately 13000ns, instead of 1000ns that the call is supposed to take.

At this point I had no clue, and obviously needed to rethink it again; the time the call is supposed to sleep and thus take and the wallclock time I measured are still quite far apart in my measurements. On one hand you could argue that the time it is set to wait is so small that this could be simply caused by the code in the nanosleep function itself, on the other hand it’s not even close.

Then I gotten a eureka moment: strace is using ptrace facilities to attach to the process, and ptrace is known to quite severely slow processes down. Hat tip to Martin Bach who contacted me with this at the very same moment I realised this. So then the question becomes: can I measure nanosleep in another way?

Then I thought about the linux kernel ftrace facilities. In fact, the physical machines at my disposal in the Enkitec lab all run an Exadata images, and the Exadata images do not contain bleeding edge kernels like fedora 27 has, so I couldn’t repeat the perf sched timehist tracing. I wrote about ftrace in the past.

So, let’s redo the timing investigation again using ftrace:
1. Make sure the debug filesystem is mounted:

# mount /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug

2. Enter the trace directory in /sys/kernel/debug:

# cd /sys/kernel/debug/tracing

3. Create another session on the same machine and obtain the process id:

# echo $$
62733

4. In the debug/tracing session, setup nanosleep system call tracing:
Enter the process id as the pid to trace:

echo 62733 > set_ftrace_pid

Set the type of trace. The default ‘tracer’ is nop (no tracing). Set it to ‘function_graph’:

echo function_graph > current_tracer

If there is an option ‘(no)function-fork’ in trace_options, set it to ‘function-fork’ to make ftrace trace child processes too (function-fork is an addition in recent kernels):

# egrep ^nofunction-fork trace_options && echo function-fork > trace_options

Set the function to trace. For the nanosleep system call, this is SyS_nanosleep:

# echo SyS_nanosleep > set_ftrace_filter

Now enable tracing:

# echo 1 > tracing_on

5. In the session to trace, execute chrt –rr 99 ./measure_time.
6. In the debug/tracing session, stop the trace and look at the results:

# echo 0 > tracing_on
# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 20)   8.958 us    |  SyS_nanosleep();
 20)   5.077 us    |  SyS_nanosleep();
 20)   4.260 us    |  SyS_nanosleep();
 20)   9.384 us    |  SyS_nanosleep();
 20)   5.495 us    |  SyS_nanosleep();
 20)   5.526 us    |  SyS_nanosleep();
 20)   5.039 us    |  SyS_nanosleep();
 20)   4.936 us    |  SyS_nanosleep();
 20)   4.898 us    |  SyS_nanosleep();
 20)   4.889 us    |  SyS_nanosleep();

This shows the waiting time still fluctuating, but going down to as low as 4.260 us, alias 4260 nanoseconds. This still is roughly four times the sleeping time set (1000ns), but it gotten way lower than earlier! Probably, the tracing increased the latency a bit, so my guess would be the true sleeping time when no trace is applied is around 2000ns. Please mind this is with changed (increased) process priorities (chrt –rr 99); when measure_time is ran without any priorities set, it looks like this:

# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 21) + 60.583 us   |  SyS_nanosleep();
 21) + 56.504 us   |  SyS_nanosleep();
 21) + 55.940 us   |  SyS_nanosleep();
 21) + 56.118 us   |  SyS_nanosleep();
 21) + 56.076 us   |  SyS_nanosleep();
 21) + 56.078 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 56.100 us   |  SyS_nanosleep();
 21) + 56.868 us   |  SyS_nanosleep();

But: there is Oracle grid infrastructure running, and multiple databases are running.

What would be interesting to see is how nanosleep is actually executed. Ftrace provides a way to do exactly that! The important thing to keep in mind is that ftrace is kernel only. In order to see how SyS_nanosleep is executed, do the following:

# > set_ftrace_filter
# echo SyS_nanosleep > set_graph_function
# > trace

Then execute ‘chrt –rr 99 ./measure_time’ again, and look in trace again:
(please mind I picked the second occurrence of SyS_nanosleep, of course the trace shows them all)

 14)               |  SyS_nanosleep() {
 14)               |    hrtimer_nanosleep() {
 14)   0.045 us    |      hrtimer_init();
 14)               |      do_nanosleep() {
 14)               |        hrtimer_start_range_ns() {
 14)               |          __hrtimer_start_range_ns() {
 14)               |            lock_hrtimer_base() {
 14)   0.039 us    |              _raw_spin_lock_irqsave();
 14)   0.527 us    |            } /* lock_hrtimer_base */
 14)   0.048 us    |            ktime_get();
 14)   0.043 us    |            get_nohz_timer_target();
 14)   0.079 us    |            enqueue_hrtimer();
 14)               |            tick_program_event() {
 14)               |              clockevents_program_event() {
 14)   0.048 us    |                ktime_get();
 14)   0.446 us    |              } /* clockevents_program_event */
 14)   0.876 us    |            } /* tick_program_event */
 14)   0.040 us    |            _raw_spin_unlock();
 14)   0.040 us    |            __raise_softirq_irqoff();
 14)   4.318 us    |          } /* __hrtimer_start_range_ns */
 14)   4.792 us    |        } /* hrtimer_start_range_ns */
 14)               |        schedule() {
 14)               |          __schedule() {
 14)   0.041 us    |            rcu_note_context_switch();
 14)   0.039 us    |            _raw_spin_lock_irq();
 14)               |            deactivate_task() {
 14)               |              dequeue_task() {
 14)   0.056 us    |                update_rq_clock();
 14)               |                dequeue_task_rt() {
 14)               |                  update_curr_rt() {
 14)   0.057 us    |                    cpuacct_charge();
 14)   0.053 us    |                    sched_avg_update();
 14)   0.046 us    |                    _raw_spin_lock();
 14)               |                    sched_rt_runtime_exceeded() {
 14)   0.041 us    |                      balance_runtime();
 14)   0.402 us    |                    } /* sched_rt_runtime_exceeded */
 14)   0.046 us    |                    _raw_spin_unlock();
 14)   2.701 us    |                  } /* update_curr_rt */
 14)               |                  dequeue_rt_entity() {
 14)               |                    dequeue_rt_stack() {
 14)               |                      __dequeue_rt_entity() {
 14)   0.062 us    |                        cpupri_set();
 14)   0.057 us    |                        update_rt_migration();
 14)   0.861 us    |                      } /* __dequeue_rt_entity */
 14)   1.258 us    |                    } /* dequeue_rt_stack */
 14)   0.047 us    |                    enqueue_top_rt_rq();
 14)   1.908 us    |                  } /* dequeue_rt_entity */
 14)   0.048 us    |                  dequeue_pushable_task();
 14)   5.564 us    |                } /* dequeue_task_rt */
 14)   6.445 us    |              } /* dequeue_task */
 14)   6.789 us    |            } /* deactivate_task */
 14)   0.044 us    |            pick_next_task_stop();
 14)   0.041 us    |            pick_next_task_dl();
 14)               |            pick_next_task_rt() {
 14)   0.042 us    |              pull_rt_task();
 14)   0.050 us    |              update_curr_rt();
 14)   0.823 us    |            } /* pick_next_task_rt */
 14)               |            pick_next_task_fair() {
 14)               |              put_prev_task_rt() {
 14)   0.042 us    |                update_curr_rt();
 14)   0.377 us    |              } /* put_prev_task_rt */
 14)               |              pick_next_entity() {
 14)   0.039 us    |                clear_buddies();
 14)   0.340 us    |              } /* pick_next_entity */
 14)               |              set_next_entity() {
 14)   0.065 us    |                __dequeue_entity();
 14)   0.426 us    |              } /* set_next_entity */
 14)   2.159 us    |            } /* pick_next_task_fair */
 14)   0.052 us    |            finish_task_switch();
 14) + 15.656 us   |          } /* __schedule */
 14) + 16.003 us   |        } /* schedule */
 14)   0.039 us    |        _cond_resched();
 14)               |        hrtimer_cancel() {
 14)               |          hrtimer_try_to_cancel() {
 14)               |            lock_hrtimer_base() {
 14)   0.040 us    |              _raw_spin_lock_irqsave();
 14)   0.365 us    |            } /* lock_hrtimer_base */
 14)   0.046 us    |            _raw_spin_unlock_irqrestore();
 14)   1.006 us    |          } /* hrtimer_try_to_cancel */
 14)   1.403 us    |        } /* hrtimer_cancel */
 14) + 23.559 us   |      } /* do_nanosleep */
 14) + 24.444 us   |    } /* hrtimer_nanosleep */
 14) + 24.842 us   |  } /* SyS_nanosleep */

This gives a full overview of the functions that SyS_nanosleep is executing in the kernel, including timing.
Essentially, everything happens in ‘do_nanosleep’. Inside do_nanosleep, the following functions are directly called:
– hrtimer_start_range_ns -> 4.792 us
– schedule -> 16.003 us
– _cond_resched -> 0.039 us
– hrtimer_cancel -> 1.403 us
This is more than the 4260 ns we saw earlier, it’s clear the tracing influences the latency of the execution.

In order to get a breakdown on where the time goes for the 4260 ns, remove SyS_nanosleep from set_graph_function, empty the trace and add the four functions mentioned above to set_ftrace_filter:

# > set_graph_function
# > trace
# echo 'hrtimer_start_range_ns schedule _cond_resched hrtimer_cancel' > set_ftrace_filter

Now execute ‘./chrt –rr 99 ./measure_time’ again, and look in trace; this is a snippet that shows a single nanosleep invocation:

...
  2)   0.938 us    |  hrtimer_start_range_ns();
  2)   3.406 us    |  schedule();
  2)   0.077 us    |  _cond_resched();
  2)   0.123 us    |  hrtimer_cancel();
...

What this shows is that apparently the time it takes for the schedule function to finish is taking the majority of the time. There will be influence of ftrace being active, but taking that into account on this system the time to get rescheduled after waiting for a short time is around 3 to 4 us (microseconds).

Conclusion
There is a certain point at which timing is so fine grained, that anything you do, even running a (kernel) function itself distorts time. It looks like this is 1us (microsecond) granularity for the systems I’ve looked at.

Obviously, when you run on a virtualised platform, especially like I did with a ‘desktop’ virtualisation product like Virtualbox, but it applies to any virtualisation, scheduling times increase because the virtualisation needs to abstract running on a CPU. With modern virtualisation running user-mode this overhead is minimal or non-existent. However, kernel mode access needs to be abstracted in order to keep virtual machines from invading each others resources. This can become significant if resources are oversubscribed (we gotten a glimpse of that in this article with the first nanosleep measurements at roughly 220 us).

The nanosleep times gotten lower when the process priority was set as high as possible. Then when moving onto physical hardware, the latency times gotten even lower. The priority of the process strongly influences the scheduling time, which is kind of logical, because that’s the entire purpose of the priority.

The linux strace utility influences process execution latency quite significantly when looking at a microsecond granularity. Ftrace provides a way to trace in a very flexible way with way lesser overhead than strace, although it still does impose an overhead on the process it is tracing.

Using ftrace, the minimal time that I could measure for a process to get scheduled again after sleeping was between 3 to 4 microseconds. This latency is influenced by the tracing, so it probably comes down to between 2 to 3 microseconds without it. Also, this is strongly influenced by the priority, I needed to set it to the highest priority in order to get the low latency. Another big influencer will be the hardware that I am running, and there are a fair amount of processes running, some at high priority (cluster ware ocssd, Oracle database lms processes, linux daemons).

This post shows you how to install Rundeck with the Ansible plugin on Centos 7. The installation is done with nginx as the web server and using SSL with a self signed certificate. Please read the Ansible installation script, and modify anything that should be different for your situation. You will be amazed how well readable the installation script is!

Rundeck is a web based user interface that allows you to run commands against a group of hosts. Rundeck has an ansible plugin. Using that plugin, it could perform the similar functionality as Ansible Tower (commercial product) or Semaphore (open source).

After a fresh installation of Centos 7, do the following as root:

yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible git
git clone https://gitlab.com/FritsHoogland/install_rundeck.git

This adds the EPEL (extra packages for Enterprise Linux) repository to Centos, which contains ansible and git. The next yum command installs ansible (the orchestration software) and git (distributed version control). The git clone command pulls the ansible orchestration script from my install_rundeck git repository at gitlab.

It is important to open the install_rundeck/install_rundeck.yml script and modify anything that is different in your situation. The public hostname will most likely be different than the 192.168.66.16 in the script. You might want to change the locality information with the certificate generation (unless you live in Lutjebroek like me :-). If you do a corporate installation, you might want to skip this step altogether and get a certificate pair signed by your company’s CA.

Please mind I ran into issues with ansible inventories:
– The hosts in the inventory need to have ansible run against them to pick up their properties and become visible in rundeck in the nodes tab. For being able to have ansible run against the hosts in the inventory, the host need to exist in the ssh known hosts file of the rundeck user, otherwise ansible can’t run and the host or hosts will not be visible in rundeck. The solution is to log in as the rundeck user and logon to the hosts in the inventory once manually and accept the host or hosts. From a security perspective it’s understandable that you careful need to assess the hosts to trust, but from an automation standpoint this is annoying. Outside of essentially filling out the known hosts file as I described, there are several other workarounds.
– I created an ansible inventory file in the rundeck project directory and entered the hosts in it. Rundeck picked up the hosts (after solving the above point they became visible in the nodes tab), however when executing something using ansible via rundeck it would say ‘[WARNING]: provided hosts list is empty, only localhost is available’. This means ansible was not pointed explicitly to an inventory, so it used the default one. In order to solve this, I symlinked my (rundeck) project inventory to the /etc/ansible/hosts to make it centrally available. Apparently, using a central inventory for ansible using the plugin is by design. I would rather have rundeck generate an inventory per execution, and pointing to it when the plugin executes ansible.

Now install rundeck:

ansible-playbook install_rundeck/install_rundeck.yml

Done!

Prometheus is an open source systems monitoring and alerting toolkit originally build at Soundcloud. This blogpost shows how to install the needed components to do visualisation of linux system statistics via Grafana.

Addition June 29, 2018: a really quick simple install is provided in this blogpost: very quick install of prometheus, node exporter and grafana This uses an ansible script that does most of the installation and configuration for you.

The setup consists of 3 components:
node_exporter, an exporter of system and hardware metrics.
prometheus, a metric collection and persistence layer.
grafana, the visualisation layer.

1. Preparation
The needed components are installed in the home directory of the user ‘prometheus’. In order for that user exist, it must obviously first be created:

# useradd prometheus
# su - prometheus
$

This installation guide uses Oracle Linux 7.3, but should work for RHEL or Centos too.

2. Node exporter
The next thing to do is install the node exporter. Please mind new version do come out, so you might want to verify the latest release on

$ curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz"
$ mkdir -p Prometheus/node_exporter
$ cd $_
$ tar xzf ../../node_exporter-0.14.0.linux-amd64.tar.gz

Now become root and create a unit file to automatically startup the node exporter using systemd:

# echo "[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

[Install]
WantedBy=default.target" > /etc/systemd/system/node_exporter.service

And make systemd start the node exporter:

# systemctl daemon-reload
# systemctl enable node_exporter.service
# systemctl start node_exporter.service

Next you can verify if the node exporter is running by using ‘systemctl status node_exporter.service:

# systemctl status node_exporter.service
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:20:54 UTC; 7s ago
 Main PID: 3017 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─3017 /home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - hwmon" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - infiniband" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - textfile" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - conntrack" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - diskstats" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - entropy" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - loadavg" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - sockstat" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - wifi" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg="Listening on :9100" source="node_exporter.go:186"

Additionally, you can go to hostname:9100, and look if that page says ‘node exporter’, and has a link called ‘metric’, which has all the metrics.

3. Prometheus
After we installed node_exporter to provide measurements, we must install the software that can fetch that information and store it. That is what prometheus does. First, become the prometheus user again, and install prometheus. Here too is important to realise that newer versions will come out after this article has been written:

# su - prometheus
$ curl -LO "https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz"
$ cd Prometheus
$ tar xzf ../prometheus-1.7.1.linux-amd64.tar.gz
$ cd prometheus-1.7.1.linux-amd64
$ echo "scrape_configs:

  - job_name: 'prometheus'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9100']"> prometheus.yml

This downloaded and unzipped prometheus, and created prometheus scrape config to fetch data from prometheus itself and the node exporter. Now become root, and install the systemd unit file for prometheus:

# echo "[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus.yml -storage.local.path=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/data

[Install]
WantedBy=multi-user.target" > /etc/systemd/system/prometheus.service

And make systemd start prometheus:

# systemctl daemon-reload
# systemctl enable prometheus.service
# systemctl start prometheus.service

And verify prometheus is running:

# systemctl status prometheus.service
● prometheus.service - Prometheus Server
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:36:55 UTC; 9s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 22656 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─22656 /home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1....

Jul 31 15:36:55 test.local systemd[1]: Started Prometheus Server.
Jul 31 15:36:55 test.local systemd[1]: Starting Prometheus Server...
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting prometheus (version=1.7.1, branch=mast...n.go:88"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Build context (go=go1.8.3, user=root@0aa1b7fc43...n.go:89"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Host details (Linux 3.10.0-514.26.2.el7.x86_64 ...n.go:90"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading configuration file /home/prometheus/Pro....go:252"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading series map and head chunks..." source="....go:428"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="0 series loaded." source="storage.go:439"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Listening on :9090" source="web.go:259"
Hint: Some lines were ellipsized, use -l to show in full.

Additionally you can go to hostname:9090/targets and verify both node_exporter and prometheus report state=UP.

At this point, system metrics are fetched and stored. All we need to do, is visualise it. An excellent tool for doing so is grafana. This is how grafana is installed:

4. Grafana
This webpage shows installation instructions and a link to the latest version. During the time of writing of this blogpost, the latest version was 4.1.1. This is how grafana is installed: (please mind installation and systemd require root privileges)

# yum install https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.4.1-1.x86_64.rpm

Next up make systemd handle grafana and start it:

# systemctl daemon-reload
# systemctl enable grafana-server.service
# systemctl start grafana-server.service

And check if grafana is running:

# systemctl status grafana-server.service
● grafana-server.service - Grafana instance
   Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:43:11 UTC; 1min 58s ago
     Docs: http://docs.grafana.org
 Main PID: 22788 (grafana-server)
   CGroup: /system.slice/grafana-server.service
           └─22788 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile= cfg:default.paths.logs=/var/log/grafana cfg:default.path...

Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Starting plugin search" logger=plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=warn msg="Plugin dir does not exist" logger=plugins dir=/...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Plugin dir created" logger=plugins dir=/var/lib...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Alerting" logger=alerting.engine
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing CleanUpService" logger=cleanup
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Stream Manager"
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing HTTP Server" logger=http.server ad...socket=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Hint: Some lines were ellipsized, use -l to show in full.

5. Grafana configuration
Next, we need to hook up grafana with prometheus. First, go to hostname:3000.
– Login with admin/admin
– Click ‘add datasource’
– Name: prometheus, Type: Prometheus
– Http settings: http://localhost:9090, select Access: ‘proxy’.
– Click ‘save and test’. This should result in ‘success’ and ‘datasource updated.’

Now click on the grafana symbol in the left upper corner, dashboards, import. Enter ‘2747’ at ‘grafana.com dashboard’. This will say ‘Linux memory’, select the prometheus datasource which you just defined, and click import.

This should result in a dashboard the shows you the linux memory area’s (click on the picture to get a better view!):

When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:

WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957

The current documentation has no description for it. Let’s see what V$EVENT_NAME says:

SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 
  2  from v$event_name where name = 'PGA memory operation';

EVENT# NAME                                  PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
------ ------------------------------------- ---------- ---------- ---------- ---------------
   524 PGA memory operation                                                   Other

Well, that doesn’t help…

Let’s look a bit deeper then, if Oracle provides no clue. Let’s start with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffercache, table contents are scanned to the PGA and processed from there).

TS@fv122b2 > alter session set events 'sql_trace level 8';

Session altered.

Now use strace to look at the system calls in another session:

# strace -e write=all -e all -p 9426
Process 9426 attached
read(9,

Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:

io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
lseek(7, 0, SEEK_CUR)                   = 164639
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 37 38  30 20 70 31 3d 32 30 39   ela= 780 p1=209 |
 | 00040  37 31 35 32 20 70 32 3d  31 31 31 34 31 31 32 20  7152 p2=1114112  |
 | 00050  70 33 3d 30 20 6f 62 6a  23 3d 32 32 38 33 33 20  p3=0 obj#=22833  |
 | 00060  74 69 6d 3d 31 39 35 31  37 30 32 30 35 36 36 0a  tim=19517020566. |
...
munmap(0x7f5ba8fbd000, 2097152)         = 0
munmap(0x7f5ba91bd000, 2097152)         = 0
mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000
lseek(7, 0, SEEK_CUR)                   = 183409
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 35 39  32 20 70 31 3d 30 20 70   ela= 592 p1=0 p |
 | 00040  32 3d 30 20 70 33 3d 30  20 6f 62 6a 23 3d 32 32  2=0 p3=0 obj#=22 |
 | 00050  38 33 33 20 74 69 6d 3d  31 39 35 32 30 36 33 33  833 tim=19520633 |
 | 00060  36 37 34 0a                                       674.             |

Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look a the name of the wait event. Let’s look a bit more specific using a systemtap script:

global wait_event_nr=524
probe begin {
	printf("begin.\n")
}

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthbwt") {
	if ( pid() == target() && register("rdx") == wait_event_nr )
		printf("kskthbwt - %d\n", register("rdx"))
}
probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr )
		printf("kskthewt - %d\n", register("rsi"))
}
probe syscall.mmap2 {
	if ( pid() == target() )
		printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6))
}
probe syscall.mmap2.return {
	if ( pid() == target() )
		printf("return value: %x\n", $return)
}
probe syscall.munmap {
	if ( pid() == target() )
		printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2))
}

Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if function on line 7 checks if the process specified with -x with the systemtap executable is the process calling this function, and if the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to the wait event number 524, which is ‘PGA memory operation’, the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when the ending of a wait event is triggered.
Line 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call. Any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Line 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Line 22-25: This is a probe on munmap() system call, which frees mmap’ed memory to the operating system.
Please mind there are no accolades following the if statements, which means the code executed when the if is true is one line following the if. Systemtap and C are not indention sensitive (like python), I indented for the sake of clarity.

I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:

# stap -x 9426 mmap.stp
begin.
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000
 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000
kskthewt - 524
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000
 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000
kskthewt - 524
kskthbwt - 524
 munmap, addr 7f5ba8fbd000, size 2097152
 munmap, addr 7f5ba91bd000, size 2097152
kskthewt - 524

This makes it quite clear! The event ‘PGA memory operation’ is called when mmap() and munmap() are called. Which are calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped, but anonymous memory.

Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?

To answer that, we need to look at the flags of the two calls. Here is an example:

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000

The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.

The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy to make it easier for the Oracle database to allocate the memory allocations inside a contiguous chunk of (virtual) memory.

The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the allocated range. There is no MAP_NORESERVE which means this allocation is catered for for swapping in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any memory mapping done to the address range is silently unmapped from the first (“throw away”) mapping.

Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr ) {
		ksuse = register("r13")-4672
		ksuseopc = user_uint16(ksuse + 2098)
		ksusep1 = user_uint64(ksuse + 2104)
		ksusep2 = user_uint64(ksuse + 2112)
		ksusep3 = user_uint64(ksuse + 2120)
		ksusetim = user_uint32(ksuse + 2128)
		printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3)
	}
}

When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is how the output looks like:

kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000
 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0

This looks like the size of memory allocated with the first mmap call for the PGA memory reservation is put in p1, and the size of the allocation of the second “real” memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird, is the memory is not unmapped/deallocated (this is a full execution of a SQL, allocated buffers must be deallocated?

Let’s look what happens when I execute the same SQL again:

kskthbwt - 524
 munmap, addr 7f1562130000, size 2097152
 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0
kskthbwt - 524
 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0

Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution, the execution starts with munmap(), followed by a mmap() call. The first munmap() call deallocates 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far; we have seen a “throw away”/reservation mmap() call with the memory address set to 0 to let the operating system pick an address for the requested memory chunk, and a mmap() call to truly allocate the reserved memory for usage, which had a memory address set. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory can not be read and written. Also the flags number is different, flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. As part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called. As we just have seen, memory could be reserved. Also, when p1/2/3 are all zero there is no way to tell how much memory is freed, nor which memory allocation.

The next wait is the timing of a single mmap() call. Actually, the mmap() call allocates the previous mmaped memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmaped.

The last wait is a familiar one, it is the mmap() call with memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previous “reserved” memory. However, the p1/2/4 values are now NOT set in the same way as we saw earlier: only p1 is non zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.

Conclusion:
With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. Until now I only saw the system calls mmap() and munmap() inside the ‘PGA memory operation’.

%d bloggers like this: