Archive

Tag Archives: linux

The primary source of information regarding any change or issue on a linux system is the /var/log/messages file. I am often annoyed when a linux system is setup in such a way that certain messages are written to syslog with a high frequency swamping the messages file with information that is not important. The reason for my annoyance is that this makes it very hard to actually spot important information because you have to skip through a lot of lines before you find the important information, especially if you do not know for sure if there a message in the first place.

Please mind this blogpost is created on a Centos 7 server which uses rsyslog.

There are a couple of ways to manage this. The standard syslog way of managing this is the following, which can be found in /etc/rsyslog.conf:

#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog


# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log

What this shows is that certain principal important tasks have their own ‘facility’, which in this example are ‘mail’, ‘cron’, ‘authpriv’, etc. shown above, which are put in their own file.

But how about processes that do not have their own facility, and just write to syslog? One example of these this is dhclient, which produces messages like these:

Jun 30 03:28:32 ip-172-31-12-40 dhclient[19604]: DHCPREQUEST on eth0 to 172.31.0.1 port 67 (xid=0x5dfd88c4)
Jun 30 03:28:32 ip-172-31-12-40 dhclient[19604]: DHCPACK from 172.31.0.1 (xid=0x5dfd88c4)
Jun 30 03:28:35 ip-172-31-12-40 dhclient[19604]: bound to 172.31.12.40 -- renewal in 1726 seconds.
Jun 30 03:57:21 ip-172-31-12-40 dhclient[19604]: DHCPREQUEST on eth0 to 172.31.0.1 port 67 (xid=0x5dfd88c4)
Jun 30 03:57:21 ip-172-31-12-40 dhclient[19604]: DHCPACK from 172.31.0.1 (xid=0x5dfd88c4)
Jun 30 03:57:23 ip-172-31-12-40 dhclient[19604]: bound to 172.31.12.40 -- renewal in 1798 seconds.

Which it does every 5 minutes. This is actually truly annoying…

Luckily, there is a solution: rsyslog has the option to filter based on more properties than the logging facility a process is using. This is done using a script in /etc/rsyslog.d/.

On my server, the majority of the messages are from the dhclient and systemd daemons, and both the messages seem to be informal. In order not to miss anything, I still want that information to be logged, but not in the /var/log/messages file.

This can be actually quite simply be accomplished using the following two scripts in /etc/rsyslog.d/:

dhclient.conf:

if $programname == 'dhclient' then /var/log/dhclient.log
&stop

systemd.conf

if $programname = 'systemd' then /var/log/systemd.log
&stop

Once you created these scripts, you need to make rsyslogd read this new configuration. I thought killall -HUP rsyslogd would accomplish this, but outside of a message in the /var/log/messages file saying it got the HUP signal, it doesn’t execute a new task.

However, executing:

systemctl stop rsyslog.service
systemctl start rsyslog.service

Does make rsyslog read the new configuration, and then both dhclient and systemd log to their own files and do not write to the messages file anymore!

There is one last thing that needs to be done: make sure the newly defined logfiles are cleaned just like all the other files in /var/log. Otherwise these files will endlessly grow, eventually occupying all the space in the filesystem where /var/log is part of.

This too is really easy, the newly defined logfiles can be added to the list of syslog files for logrotate, which is defined in /etc/logrotate.d/syslog:

/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
/var/log/dhclient.log
/var/log/systemd.log
{
    missingok
    sharedscripts
    postrotate
	/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}

As you can see, I simply added /var/log/dhclient.log and /var/log/systemd.log to the list.

Please mind that the filter for dhclient and systemd is the executable name, so even if the severity of the logging message is high, a message from these daemons will still go to the file it is configured to log to.

Advertisements

This blogpost is about the reason and solving getting the following message, or messages alike these when logging i to a linux box using ssh:

-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory

However, this is a warning. Please mind such an issue might be come up in another way, which can be more disrupting; at least in the past I had issues running perl for the same issue:

[root@dmcel01 ~]# /usr/local/bin/ipconf -verify
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "en_US.UTF-8",
LC_ALL = "UTF-8",
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

This is a utility that comes with exadata, and I needed it during patching, but it didn’t execute because of it. This was more than 4 years ago, and I actually created a SR for it. But because I was patching during a downtime window, I had to quickly solve it. I managed to find it, and reported it back to Oracle, which was turned into a MOS note (however, as Oracle seems to have a habit of) without proper attribution to me, and a bit wrong. You can see the note here (MOS account required) Running /Usr/Local/Bin/Ipconf -Verify Results In ‘Perl: Warning: Setting Locale Failed.’ (Doc ID 1393517.1)

Anyway, today I ran into this again. The issue is the CLIENT (the machine from which an ssh session is created) explicitly requested locale settings on the machine it connected to. This is done in the client’s global ssh settings in a file called ssh_config.

I should add I use MacOS/OSX, which is derived from BSD, which is unix/linux like.

The file ssh_config was found in /etc/ssh (/etc/ssh/ssh_config), but at least on 10.14.3 (Mojave) this has changed to /private/etc/ssh/ssh_config. The MOS note describes /etc/ssh_config, but I don’t think ssh_config is present in /etc directly in any OS by default, and probably mostly found in /etc/ssh.

In my case, after updating to OSX Mojave, the ssh_config file is overwritten, giving me back the locale issue. This is how the original default host settings look like:

Host *
SendEnv LANG LC_*

To solve the locale issues (again, for me), change it very simply to:

Host *
SendEnv LANG

This post is about how to manage grub2 in an easy way.

grub1

In the past, which is before linux EL7, the boot loader was grub, the grand unified bootloader (version 1). Things were very simple; if you installed another kernel (using rpm) it would add an entry to grub’s configuration in /boot/grub/menu.lst. If you wanted to change grub to boot that newly installed kernel by default you edit /boot/grub/menu.lst and set ‘default’ to the number, counting from zero, of the newly installed kernel, in the order of the kernels listed. If you wanted a certain option set for booting the kernel, you added it to the kernel line.

grub2

Then came grub2. Reading up on the grub2 homepage it turns out that grub 1 apparently was old code, and it was clunky, hard to maintain and extend with new features. Other reasons that probably contributed to grub2 acceptance where UEFI support as well as limited support for other architectures than x86 and x86_64.

grub2: the mechanics

The heart of the grub2 configuration is by default in /boot/grub2/grub/grub.cfg. The start of the file looks like bash shell script. A little further in the file, the installed kernels and arguments are found:

### BEGIN /etc/grub.d/10_linux ###
menuentry 'Oracle Linux Server (4.1.12-124.18.6.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $me
nuentry_id_option 'gnulinux-4.1.12-124.18.6.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod xfs
        set root='hd0,msdos1'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  4fc4063c-861b-4b07-8c28-e847a53
5c6cb
        else
          search --no-floppy --fs-uuid --set=root 4fc4063c-861b-4b07-8c28-e847a535c6cb
        fi
        linux16 /vmlinuz-4.1.12-124.18.6.el7uek.x86_64 root=/dev/mapper/ol-root ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet nu
ma=off transparent_hugepage=never
        initrd16 /initramfs-4.1.12-124.18.6.el7uek.x86_64.img
}
menuentry 'Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.1.12-112.16.4.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {

A knee-jerk reaction would be to see if this file can be edited, but the heading of the file quite clearly says to not do that and that any modification could be overwritten:

#
# DO NOT EDIT THIS FILE
#
# It is automatically generated by grub2-mkconfig using templates
# from /etc/grub.d and settings from /etc/default/grub
#

So, this points us /etc/default/grub. To be sure what actually to do, the documentation/manual makes it clear how to control what kernel and options are chosen in the chapter 6.1 simple configuration handling:

grub-mkconfig does have some limitations. While adding extra custom menu entries to the end of the list can be done by editing /etc/grub.d/40_custom or creating /boot/grub/custom.cfg, changing the order of menu entries or changing their titles may require making complex changes to shell scripts stored in /etc/grub.d/. This may be improved in the future. In the meantime, those who feel that it would be easier to write grub.cfg directly are encouraged to do so (see Booting, and Shell-like scripting), and to disable any system provided by their distribution to automatically run grub-mkconfig.

The file /etc/default/grub controls the operation of grub-mkconfig. It is sourced by a shell script, and so must be valid POSIX shell input; normally, it will just be a sequence of ‘KEY=value’ lines, but if the value contains spaces or other special characters then it must be quoted.

What this means to say is that for basic operation and selection, the /etc/default/grub file must be edited, after which grub-mkconfig must be run to “build” the actual grub settings. That’s not extremely intuitive from my point of view.

Let’s look at the /etc/default/grub on my test system, which is not touched as far as I know:

[root@o184 ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never"
GRUB_DISABLE_RECOVERY="true"

The first thing to look at is ‘GRUB_DEFAULT’, which is the default grub entry/kernel that is booted. It is set to ‘saved’, which is not really helpful. The utility ‘grub2-editenv’ can be used to list what the saved entry is:

[root@o184 ~]# grub2-editenv list
saved_entry=0

Which means the first (this is counting from zero) entry in /boot/grub2/grub.cfg. I have not come across a utility that is installed by default to list the menu entries with their number in grub.cfg, it’s easy to do with some shell commands, although that feels clumsy to me:

[root@o184 ~]# cat /boot/grub2/grub.cfg | grep ^menuentry | nl -v0
     0	menuentry 'Oracle Linux Server (4.1.12-124.18.6.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.1.12-124.18.6.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
     1	menuentry 'Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.1.12-112.16.4.el7uek.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
     2	menuentry 'Oracle Linux Server (3.10.0-862.11.6.el7.x86_64 with Linux) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-862.11.6.el7.x86_64-advanced-64c9f287-84a8-48ec-b048-364295362114' {
     3	menuentry 'Oracle Linux Server (0-rescue-069a6c1ff25b409fa87b8e587a2f8b4d with Linux) 7.5' --class oracle --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-0-rescue-069a6c1ff25b409fa87b8e587a2f8b4d-advanced-64c9f287-84a8-48ec-b048-364295362114' {

So in this case kernel 4.1.12-124.18.6.el7uek.x86_64 is started by default.

In fact, the value of ‘saved’ is stored in /boot/grub2/grubenv:

[root@o184 ~]# cat /boot/grub2/grubenv
# GRUB Environment Block
saved_entry=0
################...

There is more to tell about the GRUB_DEFAULT and especially GRUB_SAVEDEFAULT; if GRUB_SAVEDEFAULT is set and set to ‘true’, and GRUB_DEFAULT is set to ‘saved’, then any kernel that is chosen in grub at boot will be saved as default. However, as far as I can see, this option is not set in a default /etc/default/grub file.

Another caveat is that by setting the number of the menuentry, the order could be changed when installing a new kernel, and the numbering may be different. For that reason, it’s also possible to set either the entry description after menuentry, or a name (“identifier”) set with –id at the menuentry line.

grubby

This is all nice, but it feels really indirect to me, multiple files must be combined to understand the current settings, and multiple commands must be entered to make a change. This also means it’s a little less easy to automate.

Now on to the actual reason for this blogpost. There actually is a utility that can do most of the manipulation using a single command, which is installed by default. However, it is not very well known. Let me introduce ‘grubby’ to you!

List grub entries:

[root@o184 ~]# grubby --info=ALL
index=0
kernel=/boot/vmlinuz-4.1.12-124.18.6.el7uek.x86_64
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never "
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-124.18.6.el7uek.x86_64.img
title=Oracle Linux Server (4.1.12-124.18.6.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5
index=1
kernel=/boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never "
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-112.16.4.el7uek.x86_64.img
title=Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5
...

List current default kernel:

[root@o184 ~]# grubby --default-kernel
/boot/vmlinuz-4.1.12-124.18.6.el7uek.x86_64

Set another kernel to boot:

[root@o184 ~]# grubby --set-default /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
[root@o184 ~]# grubby --default-kernel
/boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64

List current settings of the grub2 menuentry:

[root@o184 ~]# grubby --info /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
index=1
kernel=/boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet numa=off transparent_hugepage=never"
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-112.16.4.el7uek.x86_64.img
title=Oracle Linux Server (4.1.12-112.16.4.el7uek.x86_64 with Unbreakable Enterprise Kernel) 7.5

Grubby also facilitates making changes to the arguments of the kernel.
For example, if you want to change the setting ‘numa=off’ to ‘numa=on’, you can remove the argument:

[root@o184 ~]# grubby --update-kernel /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 --remove-args="numa=off"
[root@o184 ~]# grubby --info /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 | grep ^args
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet transparent_hugepage=never"

And then add it with the correct argument:

[root@o184 ~]# grubby --update-kernel /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 --args="numa=on"
[root@o184 ~]# grubby --info /boot/vmlinuz-4.1.12-112.16.4.el7uek.x86_64 | grep ^args
args="ro net.ifnames=0 biosdevname=0 crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet transparent_hugepage=never numa=on"

Grubby makes it easy to list and manage the grub2 boot loader settings, and also easily be used in scripts.

This is a writeup on some scenario’s for disk issues that you could encounter when running linux systems. The linux systems I am talking about are centos/redhat/oracle (EL) version 7 systems. This is just a writeup for myself to know how to deal with different scenario’s, hopefully other people find this interesting too. I don’t believe there is any difference for the below scenario’s and resolutions between running physical/bare metal, virtualised or in the cloud, provided you can get access the you need (like the BIOS POST console). The test configuration uses (legacy) MBR (meaning non UEFI), grub2 as boot loader, LVM for meta devices (you don’t want to run without LVM!!) and XFS for the filesystems.

I included a way to introduce each corruption yourself, so you can actually rehearse these kind of corruptions. These self introduced corruptions are simply parts overwritten and can be recovered from. In reality this might be different: if a disk is truly physically broken, there is no way you can make it come to live again with software, and the data on it is simply gone. Also, if the physical disk is not harmed, but the filesystem is damaged, I think most of you understand that if the amount of damage is big enough, at a certain point it is beyond repair.

REALLY IMPORTANT NOTICE

Modern filesystems like XFS, EXT3/4, NTFS are ‘logging filesystems’. This means changes to metadata are written into a log on disk called the ‘intent log’, after which the true change is written and then the corresponding log in the intent log is flagged as done. This means that unclean mounts or crashes can be recovered from by replaying the intent log to make sure all transactions are fully written.

However…by default these filesystems do not log actual data changes. The protected writes are filesystem structure changes. This is especially important to realise if writes are not done via O_DIRECT, and thus end up in the page cache.

This means that data corruption can still occur, despite having a modern, logging filesystem. Things like databases therefore keep consistency mechanisms in play to validate data consistency.

I do not know enough about BTRFS and ZFS to know if these do things differently from the logging perspective. Not logging all IO for “normal” filesystems is a performance choice.

Let’s start: know your system!

It really helps if you understand how your system looks like. This is how my system looks like:

# parted /dev/sda print
Model: ATA VBOX HARDDISK (scsi)
Disk /dev/sda: 68.7GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  1075MB  1074MB  primary  xfs          boot
 2      1075MB  68.7GB  67.6GB  primary               lvm

The first partition of /dev/sda is set to be the boot partition (flags: boot), and is an XFS filesystem partition. The second partition is an LVM physical volume. ‘lsblk’ can look a bit more into the functions of the partitions, and reveals any other blockdevices.

# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    0   64G  0 disk
├─sda1        8:1    0    1G  0 part /boot
└─sda2        8:2    0   63G  0 part
  ├─ol-root 249:0    0   41G  0 lvm  /
  ├─ol-swap 249:1    0    2G  0 lvm  [SWAP]
  └─ol-home 249:2    0   20G  0 lvm  /home
sr0          11:0    1  540M  0 rom

This shows that /dev/sda1 is mounted as /boot on the linux installation, and the sda2 LVM volume contains three logical volumes named: root, swap and home, which are mounted on /, as swap and on /home. There also is a ‘rom’ block device, which is the (oracle) linux boot image. This image can be downloaded from Oracle’s software delivery cloud, and is 566MB, and contains the rescue boot environment that is used in this blogpost. Any serious sysadmin should have such an image on hand to perform emergency recovery if needed.

Normally, the local disks are the first ones to boot from before other bootable devices such as a CD/DVD disk image in the BIOS. This means that I can leave the linux boot image connected, the system will normally boot from its hard disks and essentially not touch the DVD image. If for some reason the BIOS can’t find the MBR on the local disks, it will scan for MBR records on other disks in the order as set in the BIOS, and find the linux boot image and boot from that. On a real production system you don’t leave such an image connected, normally it’s attached when there is a need.

If you want or must use the linux boot image and not the local installation, you have to interfere the BIOS POST, and change the boot order. I am using virtualbox for most of my tests, the way to do that is: press F12 (Mac: fn-F12) and select CD-ROM as boot device and press enter.

To use the rescue mode on the linux boot image, boot it, then select “troubleshooting” (down arrow, enter), and “rescue a Oracle linux system” (down arrow, enter). This will boot linux from the boot image, and run anaconda. Anaconda gives you the following options:

1. Continue

This will scan your available disks, and mount the filesystems it could find under /mnt/sysimage in read-write mode. Not all filesystem issues will allow mounting read-write.

2. Read-only mount

This will scan your available disks, and mount the filesystems it could find under /mnt/sysimage in read-only mode. Obviously, not all issues allow read only mounting too.

3. Skip to shell

This will do no scanning of disks, so if you want to prevent any access to the local disks, this is the options to choose.

4. Quit (reboot)

This is a safe reboot (stopping all the started units). This is the same that would happen if the shell is exited from using option 1, 2 and 3.

Let’s look at a couple of different scenario’s:

Corrupted MBR

A MBR linux system scans the disks it can see in the order that is set in the BIOS, and boots from the first MBR found. This will normally be your linux installation on the local disks. If you have a corrupted or otherwise unbootable MBR on your local disks, and the linux boot image attached, it will boot from the linux boot image, unless it is set to explicitly not boot that device.

If no usable MBR is found, it will tell you on the console. On virtualbox, it shows:

FATAL: Could not read from the boot medium! System halted.

If just the MBR is damaged/corrupted, it can be recovered. If you want to rehearse recovering from a corrupted MBR, overwrite the MBR in the following way. Please mind this is machine dependent, look for specifics about the system to the overview above with parted and lsblk:

# dd if=/dev/zero of=/dev/sda bs=448 count=1

If you detach any other bootable devices, and reboot your system after writing over the MBR, it should not find anything to boot from, and your system will give an error like seen above, indicating it’s unable to boot.

In order to restore the MBR for grub2, perform the following tasks:
– Boot from the linux boot image.
– Start rescue mode.
– Choose ‘1’/continue, and enter to go to the shell.
– Change the root to your local disks: ‘chroot /mnt/sysimage’.
– Next thing is to understand your system. Based on the parted executable information above, we know the boot flag is set on device /dev/sda. This means we have to enter the following in the console:

grub2-install /dev/sda

This will report, if your disk is well enough to hold the information, and not really broken:

Installing for i386-pc platform.
Installation finished. No error reported.

– Now the MBR is written, and you can reboot your system by exiting the chroot jail and shell: ‘exit’ and ‘exit’ to boot your system from the local disks again.

Corrupted /boot filesystem

If the /boot filesystem, which holds the linux kernel and grub settings, is corrupted, the system will boot, because that is handled by the MBR, but then very quickly tell you it doesn’t recognise the filesystem:

.
error: unknown filesystem.
Entering rescue mode...
grub rescue>

A corrupted /boot filesystem can be recovered from in most cases, obviously depending on what and how much is corrupted in the filesystem, and the underlying device must be a able to function after the corruption. This can be rehearsed by overwriting part of the filesystem. In order to be absolutely sure the overwritten data on the disk is not changed by anything else, boot the linux boot image in rescue mode, and choose option ‘3’, and then overwrite part of the filesystem in /dev/sda1:

# dd if=/dev/zero of=/dev/sda1 bs=1k skip=2 count=1

In order to recover /boot, perform the following tasks:
– Boot from the linux boot image.
– Start rescue mode.
– Choose ‘1’/continue, and enter to go to the shell. If you are rehearsing: the dd command overwritten part of the filesystem header, it will not be recognised by lsblk, and if you take a look in the systems fstab (/mnt/sysimage/etc/fstab), it is mounted by UUID, but the UUID can’t be taken from the corrupted filesystem. The root and home filesystems are found in LVM, and mounted. This means you have to know /dev/sda1 is /boot.
– Repair the filesystem:

# xfs_repair /dev/sda1

Depending on the corruption, this might or might not succeed. If you are rehearsing this using the above dd command, you will see a lot of dots being printed searching for a superblock, which it will find and then recover from.
– Exit the shell to reboot the system: ‘exit’.

Corrupted root filesystem in LVM

When the root filesystem is corrupted, the system will boot as usual, show the different kernels in the grub dialogue, and then when the choice times out, boot the pre-selected kernel and try to boot linux. After the kernel is loaded, the root filesystem will be mounted, and it will fail. This generates the following message:

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt to a USB stick or /boot after mounting them and attach it to a bug report.

In order to rehearse recovering a corrupted root filesystem, boot the linux boot image in rescue mode, and choose option ‘3’. You will find that the volume group is not active, it needs to be activated using:

# vgchange -ay
 3 logical volume(s) in volume group "ol" now active

Then overwrite the filesystem header for the logical volume ‘root’ in the volume group ‘ol’:

# dd if=/dev/zero of=/dev/ol/root bs=1k skip=2 count=1

A corrupted root filesystem can be recovered from in most cases, obviously depending on what and how much is corrupted in the filesystem, and the underlying device must be able to function after the corruption. In order to recover perform the following tasks:
– Boot from the linux boot image.
– Start rescue mode.
– Choose ‘3’.
– Activate LVM:

# vgchange -ay

– Repair the filesystem in the logical volume

# xfs_repair /dev/ol/root

– Exit the shell to reboot the system: ‘exit’.

Corrupted non root filesystem in LVM

On my system, I only got /home as a non root filesystem in LVM. When I corrupted this filesystem, the linux system will came up without any complaining on the console. To be honest, this was kind of surprising to me.

I needed to dig around in the /var/log/messages file to find an error message:

Oct  2 10:57:11 localhost lvm: 3 logical volume(s) in volume group "ol" now active
Oct  2 10:57:11 localhost systemd: Started LVM2 PV scan on device 8:2.
Oct  2 10:57:11 localhost mount: mount: wrong fs type, bad option, bad superblock on /dev/mapper/ol-home,
Oct  2 10:57:11 localhost mount: missing codepage or helper program, or other error
Oct  2 10:57:11 localhost mount: In some cases useful info is found in syslog - try
Oct  2 10:57:11 localhost mount: dmesg | tail or so.
Oct  2 10:57:11 localhost systemd: Found device /dev/mapper/ol-home.
Oct  2 10:57:11 localhost systemd: Mounting /home...
Oct  2 10:57:11 localhost systemd: Starting LVM2 vgchange on device 8:2...

Based on what I see, I interpret this as systemd activating LVM, and then mounting the filesystem, which gives an error. What I don’t understand is what systemd is doing when it says it found the logical volume device and mounts it.

This can be rehearsed by booting into rescue mode, choose option ‘3’, activate LVM ‘vgchange -ay’, and then overwrite the filesystem header in the logical volume: ‘dd if=/dev/zero of=/dev/ol/home bs=1k skip=2 count=1’.

A corrupted non root filesystem in LVM can recovered from in most cases. There is no need to use the linux boot image for rescue mode. A linux system will come up when the root filesystem is valid and it can load enough from it to boot linux. The following needs to be done:
– Start a shell as root.
– Repair the filesystem in the logical volume: ‘xfs_repair /dev/ol/home’.
– Mount all filesystems: ‘mount -a’. This should mount the just repaired filesystem.

Corrupted swap in LVM

Just like a non root filesystem corruption mentioned above, the inability to activate a swap device is logged in the /var/log/messages file, but there is no explicit message on the console. Outside of logging the inability to active the swap device the system will come up happily, without swap if the only swap device is corrupted. These are the messages I could find in the /var/log/messages file:

Oct  2 11:32:39 localhost systemd: Found device /dev/mapper/ol-swap.
Oct  2 11:32:39 localhost systemd: Activating swap /dev/mapper/ol-swap...
Oct  2 11:32:39 localhost kernel: alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
Oct  2 11:32:39 localhost kernel: XFS (sda1): Mounting V4 Filesystem
Oct  2 11:32:39 localhost swapon: swapon: /dev/mapper/ol-swap: swapon failed: Invalid argument
Oct  2 11:32:39 localhost systemd: dev-mapper-ol\x2dswap.swap swap process exited, code=exited status=255
Oct  2 11:32:39 localhost systemd: Failed to activate swap /dev/mapper/ol-swap.
Oct  2 11:32:39 localhost systemd: Dependency failed for Swap.
Oct  2 11:32:39 localhost systemd: Job swap.target/start failed with result 'dependency'.
Oct  2 11:32:39 localhost systemd: Unit dev-mapper-ol\x2dswap.swap entered failed state.
Oct  2 11:32:39 localhost kernel: Unable to handle swap header version 0

It seems systemd understands the swap device is not usable, previously we saw systemd didn’t detect filesystem issues on an LVM device.

This can be rehearsed by booting into rescue mode, choose option ‘3’, activate LVM ‘vgchange -ay’, and then overwrite the swap header: ‘dd if=/dev/zero of=/dev/ol/swap bs=1k count=2’.

A corrupted swap device in LVM can be recovered from if the underlying device is able to function.
– Start a shell as root.
– Format the swap device again, in my case the swap device is /dev/ol/swap: ‘mkswap /dev/ol/swap’.
– Activate swap: ‘swapon -a’.
– Validate swap has been activated: ‘swapon -s’.

This blogpost is about using ansible vault. Vault is a way to encrypt sensitive information in ansible scripts by encrypting it. The motivation for this blogpost is the lack of a description that makes sense to me of what the possibilities are for using vault, and how to use the vault options in playbooks.

The basic way ansible vault works, is that when ansible-playbook reads a yaml file, it encounters $ANSIBLE_VAULT;1.1;AES256 indicating ansible vault is used to encrypt the directly following lines, it will use a password to decrypt it, and then uses the decrypted version in memory only. This way secrets can be hidden from being visible. Obviously, the password will allow decrypting it, and the password must be used in order for ansible-playbook to decrypt it.

The original use of vault is to encrypt an entire yaml file. As of Ansible version 2.3, ansible allows the encryption of single values in a yaml file.

1. File encryption
A very silly example of ansible-vault, which shouldn’t be used (!) is this:

$ echo "---
- hosts: localhost
  vars:
    a: aaa
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls" > test1.yml

When this runs, it will execute on localhost, set to ansible variables (a and b) and then use the shell module to execute “ls”. Again, this is just an example, and is silly on itself. This is how it’s run:

$ ansible-playbook test1.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

This can be encrypted using ansible vault in the following way:

$ ansible-vault encrypt test1.yml
New Vault password:
Confirm new Vault password:
Encryption successful

Now the yaml file is entirely encrypted:

$ cat test1.yml
$ANSIBLE_VAULT;1.1;AES256
32663536343532643864353135643431373830333236636230613264383531336562323530386632
3933353561316530333538363032633731336135396434630a626631303839646161363432396230
32336130633264343766393562656138363132393835646137633065393739336234336463336138
6564613730333436650a336537326233343566633761383436356136666533326561346530613566
64316336393036346538656338303734623161386435373138656532663737666662633765656438
37303666393939356161393134666130303435333463636630386434626335383164663761383035
35613437356361366665633931346664386535393936396637363561386164613337373562313264
63333530353263363437386536393765393930653234353733313433643637333932353638353339
31623366376566613766366236356163303837666664646465363439313265383562656637623266
6463663065383030626232643365623437666332313631376262

It still can be run, but you need to specify the password. In this case it asks to enter a password:

$ ansible-playbook test1.yml --ask-vault-pass
Vault password:

PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Of course this is not practical, the most common and probable way you want to use this is to have variables with secret information, like passwords, tokens, private keys, personal data, etc. encrypted, but the playbook still be readable. You should use a version control tool like git, and use functionality like diff!

The first thing is to move the variables out of the playbook into a dedicated section. In this case I chosen to assign them based on host, in a directory called ‘host_vars’, and then for localhost ‘host_vars/localhost’:

$ cat host_vars/localhost/vars.yml
---
a: aaa
b: bbb

And then of course remove them from the playbook itself:

$ cat test2.yml
---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

If this is run, it will pick up the variables (for which I put in a debug task):

$ ansible-playbook test2.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

Okay, now let’s say I need to encrypt the variable ‘a’. I can bluntly encrypt vars.yml, and that will work, however, then you don’t know where the variable comes from. The solution to that is ‘indirect referencing’. This is done by adding a second file to host_vars/localhost called ‘vault.yml’, which will be the encrypted file. In the case of encrypting variable a, do the following:
1. Change the vars.yml file to assign {{ vault_a }} to a:

$ cat host_vars/localhost/vars.yml
---
a: "{{ vault_a }}"
b: bbb

2. Define the value of a to vault_a in vault.yml:

$ cat host_vars/localhost/vault.yml
---
vault_a: aaa

3. Encrypt vault.yml:

$ ansible-vault encrypt host_vars/localhost/vault.yml
New Vault password:
Confirm New Vault password:
Encryption successful

Now it can be used by specifying the password again:

$ ansible-playbook test2.yml --ask-vault-pass
Vault password:
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

This should not be a surprise, as I said, ansible will recognise an encrypted file, and decrypt it automatically. However, if you looked carefully you saw something else that is important to realise: I encrypted the variable a containing ‘aaa’, and it was revealed to me using the debug statement. I think this is reasonable, because it’s been explicitly programmed to do so. You should write your playbooks not to reveal any information that you encrypted previously.

2. Variable encryption
If you have larger sets of variables to be encrypted, the above method makes sense. However, if you want to encrypt only a few variables it’s probably simpler to encrypt only the variable you want to be encrypted. The ansible-vault executable provides a facility for that called ‘encrypt_string’. The way this works is that you simply specify the variable and the variable value with ansible-vault, which then outputs the encrypted value, which you need to put in a playbook yourself. For example your ‘a’ variable with the value ‘aaa’:

$ ansible-vault encrypt_string --name 'a' 'aaa'
New Vault password:
Confirm New Vault password:
a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
Encryption successful

Now use this to input to a playbook:

---
- hosts: localhost
  vars:
    a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

3. Vault passwords
The next thing to look at is how to specify the decryption password. One way or another, you need a password specified to decrypt. The first option is to specify the password at the CLI, which is done in the previous examples by specifying ‘–ask-vault-pass’ as an argument to ansible-playbook. This is a good option for running playbooks ad-hoc.

If you want playbooks to run from a facility like ansible tower, rundeck or semaphore, you need to specify the password in non-interactive, in a file. This is often implemented as a hidden file, like ‘.vault_pass.txt’. Of course this then means the password is readable if you got command line access. The main idea here is this gives you an option to encrypt sensitive data in a playbook and store it in a version control system, and do NOT store the password (file) in version control.

$ echo "mypassword" > .vault_pass.txt
$ ansible-playbook --vault-password-file .vault_pass.txt test2.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

There is a caveat here too, into which I ran because my removable HDD enforces 777 on my files, which means the execute bit is set:

$ ansible-playbook --vault-password-file .ttt test2.yml
 [WARNING]: Error in vault password file loading (default): Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this
is not a script, remove the executable bit from the file.

ERROR! Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this is not a script, remove the executable bit from the file.

The message is self-explanatory: if ansible finds the passwordfile specified to be executable, it will execute it. In my case, I created a little shell script to do that:

#!/bin/bash
echo "mypassword"

3.a. Vault-id for passwords
Since ansible version 2.4 the official way to use encryption is by using vault-id. Vault-id is more or less the same as the above usage, however, it adds a new feature that could be important: the option to use multiple vault passwords in the same playbook. The way this works is quite simple: ansible tries all the passwords it has been given until it finds one that works. Please mind vault-id only works with encrypted files, not with encrypted variables! Below is an example of the use of vault-id with multiple passwords:

$ echo "---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}" > test3.yml
$ echo '---
a: "{{ vault_a }}"
b: "{{ vault_b }}"' > host_vars/localhost/vars.yml
$ echo "password_a" > .a.vault_pass.txt
$ echo "password_b" > .b.vault_pass.txt
$ echo "---
vault_a: aaa" > host_vars/localhost/vault_a.yml
$ echo "---
vault_b: bbb" > host_vars/localhost/vault_b.yml
$ ansible-vault --vault-id a@.a.vault_pass.txt encrypt host_vars/localhost/vault_a.yml
Encryption successful
$ ansible-vault --vault-id b@.b.vault_pass.txt encrypt host_vars/localhost/vault_b.yml
Encryption successful
$ ansible-playbook --vault-id a@.a.vault_pass.txt --vault-id b@.b.vault_pass.txt test3.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

During investigating how Oracle works with regards to waiting, I discovered an oddity. I was researching for my redo blog posts, and found that in a certain case, Oracle uses the ‘nanosleep’ system call. As the name of this system call suggests, this call allows you to let a process sleep with nanosecond precision.

The oddity that I found, was the following:

$ strace -Tp 42
...
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
...

I executed strace with the ‘-T’ argument (show time spend in system calls), and ‘-p’ to attach it to the process I want to measure. The nanosleep function was called with struct time spec set to {0, 1000}, which means 0 seconds and 1000 nanoseconds. However, the actual time spend was between 0.000191 and 0.000253 seconds, which is 1910000 and 253000 in nanoseconds. That’s approximately 200 times longer than it’s set to wait!

Since then I have spend a fair amount of time to understand why this is happening. So far I found a number of facts that I think explain this, but if anyone else has something to throw in here, please leave a comment.

The first thing I did was create a C program that just runs nanosleep in a loop, so I can run the nanosleep systemcall in isolation. Here’s the nanosleep PoC code (measure_time.c):

#include 
#include <time>
int main(int argc, char **argv)
{
  int       loop_counter, loop_total;
  struct    timespec sleep;

  sleep.tv_sec=0;
  sleep.tv_nsec=1000;

  loop_total=10;
  loop_counter=0;

  while ( loop_counter &lt; loop_total ) {
    nanosleep(&amp;sleep,NULL);
    loop_counter++;
  }
}

Then, when researching I found that there is a systemcall that shows the actual clock resolution time! I created another very small C program to run just that (measure_resulution.c):

#include 
#include <time>
int main(int argc, char **argv)
{
  int       result;
  struct    timespec resolution;
  clockid_t clk_id;
  clk_id = CLOCK_MONOTONIC;

  clock_getres(clk_id, &amp;resolution);

  printf("Resolution: %ld s, %ld ns\n", resolution.tv_sec, resolution.tv_nsec);
}

This c program can be compiled using ‘gcc measure_resolution.c -o measure_resolution’. This is what it showed:

$ ./measure_resolution
Resolution: 0 s, 1 ns

So, my system has a 1 ns resolution, despite my nanosleep systemcalls taking way longer. A little later I found out this information can be obtained directly in /proc/timer_list:

$ grep resolution /proc/timer_list | uniq
  .resolution: 1 nsecs

The first thing I found while researching, is that when I change the priority of the process (not the 1000ns), I can get lower sleeping times:

# strace -T chrt --rr 99 ./measure_time
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

The lowest time here is 0.036ms (36000ns). In order to look further, I found that newer kernel versions have an addition to perf that can measure wait time, scheduler delay (‘run queue’) and actual run time; perf sched timehist. So at this point it seemed that was a way to understand more about the timing of nanosleep. In order to look at that I created a virtual machine with fedora 27 (kernel version 4.14.13), and compiled my measure_time.c program on it.

The next thing to do is run the measure_time executable with perf sched record:

# perf sched record chrt --rr 99 ./measure_time
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.183 MB perf.data (54 samples) ]

Then run perf sched timehist to look at the data and find the pid of the measure_time executable:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.760817 [0000]                                0.000      0.000      0.000
   37745.760823 [0000]  rcu_sched[8]                        0.000      0.000      0.005
   37745.762367 [0002]  perf[2777]                          0.000      0.000      0.000
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.762431 [0002]  migration/2[21]                     0.000      0.001      0.011
   37745.762479 [0003]                                0.000      0.000      0.000
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764108 [0003]                                1.583      0.000      0.045
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005

So it’s pid 2778, now add that to perf sched timehist:

# perf sched timehist -p 2778
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005
   37745.764153 [0003]  measure_time[2778]                  0.034      0.002      0.004
   37745.764195 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764236 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764291 [0003]  measure_time[2778]                  0.050      0.002      0.004
   37745.764347 [0003]  measure_time[2778]                  0.051      0.002      0.004
   37745.764405 [0003]  measure_time[2778]                  0.052      0.002      0.004
   37745.764478 [0003]  measure_time[2778]                  0.067      0.006      0.005
   37745.764506 [0003]  measure_time[2778]                  0.022      0.002      0.005
   37745.764581 [0003]  measure_time[2778]                  0.022      0.002      0.052

The way to look at this: first perf is run. Because we just started running, there is no wait time, the task gets the state running, then it needs to be scheduled to run. That time is in ‘sch delay’, obviously scheduling delay. Then it gets truly is running and runs for some time, which time is visible in the ‘run time’ column. Next, the measure_time executable is run.

The first time measure_time runs, it needs to do some initialisation, like the dynamic loader loading libc, allocate memory (mmap calls), etcetera. This is visible in the run time of 1.583 ms. However, then it runs into the nanosleep call. The nanosleep call puts the process into an interruptible sleep. The sleep is visible in the wait time column. Here we see it’s waiting/sleeping for between 0.022 and 0.067 ms. After waiting, the process gets the status running, after which the scheduler must schedule the task and make it running. That time is in scheduling time again. Then it runs, but all it needs to do is end the nanosleep call, add one to the loop_counter variable, jump back and enter the nanosleep call again, as we can see, that only takes 0.004 ms.

Okay, so far it was quite a geeking out session, but not really an answer has been found to why a 1000ns sleep turns out to be way higher. I sought out help via social media, and gotten some sane advise from Kevin Closson. Essentially, the advise was to run it on physical hardware instead of my test virtual machine, and make the load as low as possible in order for the scheduler to be able to schedule my process as quickly as it can. Yes, very obvious actually.

So, the next target is a physical machine in our Enkitec lab. This is on a X3 generation machine (Sandy Bridge EP). I compiled measure_time.c, and ran ‘strace -T chrt –rr 99 ./measure_time’ again:

nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

That’s way better! But still the nanosleep timing is approximately 13000ns, instead of 1000ns that the call is supposed to take.

At this point I had no clue, and obviously needed to rethink it again; the time the call is supposed to sleep and thus take and the wallclock time I measured are still quite far apart in my measurements. On one hand you could argue that the time it is set to wait is so small that this could be simply caused by the code in the nanosleep function itself, on the other hand it’s not even close.

Then I gotten a eureka moment: strace is using ptrace facilities to attach to the process, and ptrace is known to quite severely slow processes down. Hat tip to Martin Bach who contacted me with this at the very same moment I realised this. So then the question becomes: can I measure nanosleep in another way?

Then I thought about the linux kernel ftrace facilities. In fact, the physical machines at my disposal in the Enkitec lab all run an Exadata images, and the Exadata images do not contain bleeding edge kernels like fedora 27 has, so I couldn’t repeat the perf sched timehist tracing. I wrote about ftrace in the past.

So, let’s redo the timing investigation again using ftrace:
1. Make sure the debug filesystem is mounted:

# mount /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug

2. Enter the trace directory in /sys/kernel/debug:

# cd /sys/kernel/debug/tracing

3. Create another session on the same machine and obtain the process id:

# echo $$
62733

4. In the debug/tracing session, setup nanosleep system call tracing:
Enter the process id as the pid to trace:

echo 62733 > set_ftrace_pid

Set the type of trace. The default ‘tracer’ is nop (no tracing). Set it to ‘function_graph’:

echo function_graph > current_tracer

If there is an option ‘(no)function-fork’ in trace_options, set it to ‘function-fork’ to make ftrace trace child processes too (function-fork is an addition in recent kernels):

# egrep ^nofunction-fork trace_options && echo function-fork > trace_options

Set the function to trace. For the nanosleep system call, this is SyS_nanosleep:

# echo SyS_nanosleep > set_ftrace_filter

Now enable tracing:

# echo 1 > tracing_on

5. In the session to trace, execute chrt –rr 99 ./measure_time.
6. In the debug/tracing session, stop the trace and look at the results:

# echo 0 > tracing_on
# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 20)   8.958 us    |  SyS_nanosleep();
 20)   5.077 us    |  SyS_nanosleep();
 20)   4.260 us    |  SyS_nanosleep();
 20)   9.384 us    |  SyS_nanosleep();
 20)   5.495 us    |  SyS_nanosleep();
 20)   5.526 us    |  SyS_nanosleep();
 20)   5.039 us    |  SyS_nanosleep();
 20)   4.936 us    |  SyS_nanosleep();
 20)   4.898 us    |  SyS_nanosleep();
 20)   4.889 us    |  SyS_nanosleep();

This shows the waiting time still fluctuating, but going down to as low as 4.260 us, alias 4260 nanoseconds. This still is roughly four times the sleeping time set (1000ns), but it gotten way lower than earlier! Probably, the tracing increased the latency a bit, so my guess would be the true sleeping time when no trace is applied is around 2000ns. Please mind this is with changed (increased) process priorities (chrt –rr 99); when measure_time is ran without any priorities set, it looks like this:

# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 21) + 60.583 us   |  SyS_nanosleep();
 21) + 56.504 us   |  SyS_nanosleep();
 21) + 55.940 us   |  SyS_nanosleep();
 21) + 56.118 us   |  SyS_nanosleep();
 21) + 56.076 us   |  SyS_nanosleep();
 21) + 56.078 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 56.100 us   |  SyS_nanosleep();
 21) + 56.868 us   |  SyS_nanosleep();

But: there is Oracle grid infrastructure running, and multiple databases are running.

What would be interesting to see is how nanosleep is actually executed. Ftrace provides a way to do exactly that! The important thing to keep in mind is that ftrace is kernel only. In order to see how SyS_nanosleep is executed, do the following:

# > set_ftrace_filter
# echo SyS_nanosleep > set_graph_function
# > trace

Then execute ‘chrt –rr 99 ./measure_time’ again, and look in trace again:
(please mind I picked the second occurrence of SyS_nanosleep, of course the trace shows them all)

 14)               |  SyS_nanosleep() {
 14)               |    hrtimer_nanosleep() {
 14)   0.045 us    |      hrtimer_init();
 14)               |      do_nanosleep() {
 14)               |        hrtimer_start_range_ns() {
 14)               |          __hrtimer_start_range_ns() {
 14)               |            lock_hrtimer_base() {
 14)   0.039 us    |              _raw_spin_lock_irqsave();
 14)   0.527 us    |            } /* lock_hrtimer_base */
 14)   0.048 us    |            ktime_get();
 14)   0.043 us    |            get_nohz_timer_target();
 14)   0.079 us    |            enqueue_hrtimer();
 14)               |            tick_program_event() {
 14)               |              clockevents_program_event() {
 14)   0.048 us    |                ktime_get();
 14)   0.446 us    |              } /* clockevents_program_event */
 14)   0.876 us    |            } /* tick_program_event */
 14)   0.040 us    |            _raw_spin_unlock();
 14)   0.040 us    |            __raise_softirq_irqoff();
 14)   4.318 us    |          } /* __hrtimer_start_range_ns */
 14)   4.792 us    |        } /* hrtimer_start_range_ns */
 14)               |        schedule() {
 14)               |          __schedule() {
 14)   0.041 us    |            rcu_note_context_switch();
 14)   0.039 us    |            _raw_spin_lock_irq();
 14)               |            deactivate_task() {
 14)               |              dequeue_task() {
 14)   0.056 us    |                update_rq_clock();
 14)               |                dequeue_task_rt() {
 14)               |                  update_curr_rt() {
 14)   0.057 us    |                    cpuacct_charge();
 14)   0.053 us    |                    sched_avg_update();
 14)   0.046 us    |                    _raw_spin_lock();
 14)               |                    sched_rt_runtime_exceeded() {
 14)   0.041 us    |                      balance_runtime();
 14)   0.402 us    |                    } /* sched_rt_runtime_exceeded */
 14)   0.046 us    |                    _raw_spin_unlock();
 14)   2.701 us    |                  } /* update_curr_rt */
 14)               |                  dequeue_rt_entity() {
 14)               |                    dequeue_rt_stack() {
 14)               |                      __dequeue_rt_entity() {
 14)   0.062 us    |                        cpupri_set();
 14)   0.057 us    |                        update_rt_migration();
 14)   0.861 us    |                      } /* __dequeue_rt_entity */
 14)   1.258 us    |                    } /* dequeue_rt_stack */
 14)   0.047 us    |                    enqueue_top_rt_rq();
 14)   1.908 us    |                  } /* dequeue_rt_entity */
 14)   0.048 us    |                  dequeue_pushable_task();
 14)   5.564 us    |                } /* dequeue_task_rt */
 14)   6.445 us    |              } /* dequeue_task */
 14)   6.789 us    |            } /* deactivate_task */
 14)   0.044 us    |            pick_next_task_stop();
 14)   0.041 us    |            pick_next_task_dl();
 14)               |            pick_next_task_rt() {
 14)   0.042 us    |              pull_rt_task();
 14)   0.050 us    |              update_curr_rt();
 14)   0.823 us    |            } /* pick_next_task_rt */
 14)               |            pick_next_task_fair() {
 14)               |              put_prev_task_rt() {
 14)   0.042 us    |                update_curr_rt();
 14)   0.377 us    |              } /* put_prev_task_rt */
 14)               |              pick_next_entity() {
 14)   0.039 us    |                clear_buddies();
 14)   0.340 us    |              } /* pick_next_entity */
 14)               |              set_next_entity() {
 14)   0.065 us    |                __dequeue_entity();
 14)   0.426 us    |              } /* set_next_entity */
 14)   2.159 us    |            } /* pick_next_task_fair */
 14)   0.052 us    |            finish_task_switch();
 14) + 15.656 us   |          } /* __schedule */
 14) + 16.003 us   |        } /* schedule */
 14)   0.039 us    |        _cond_resched();
 14)               |        hrtimer_cancel() {
 14)               |          hrtimer_try_to_cancel() {
 14)               |            lock_hrtimer_base() {
 14)   0.040 us    |              _raw_spin_lock_irqsave();
 14)   0.365 us    |            } /* lock_hrtimer_base */
 14)   0.046 us    |            _raw_spin_unlock_irqrestore();
 14)   1.006 us    |          } /* hrtimer_try_to_cancel */
 14)   1.403 us    |        } /* hrtimer_cancel */
 14) + 23.559 us   |      } /* do_nanosleep */
 14) + 24.444 us   |    } /* hrtimer_nanosleep */
 14) + 24.842 us   |  } /* SyS_nanosleep */

This gives a full overview of the functions that SyS_nanosleep is executing in the kernel, including timing.
Essentially, everything happens in ‘do_nanosleep’. Inside do_nanosleep, the following functions are directly called:
– hrtimer_start_range_ns -> 4.792 us
– schedule -> 16.003 us
– _cond_resched -> 0.039 us
– hrtimer_cancel -> 1.403 us
This is more than the 4260 ns we saw earlier, it’s clear the tracing influences the latency of the execution.

In order to get a breakdown on where the time goes for the 4260 ns, remove SyS_nanosleep from set_graph_function, empty the trace and add the four functions mentioned above to set_ftrace_filter:

# > set_graph_function
# > trace
# echo 'hrtimer_start_range_ns schedule _cond_resched hrtimer_cancel' > set_ftrace_filter

Now execute ‘./chrt –rr 99 ./measure_time’ again, and look in trace; this is a snippet that shows a single nanosleep invocation:

...
  2)   0.938 us    |  hrtimer_start_range_ns();
  2)   3.406 us    |  schedule();
  2)   0.077 us    |  _cond_resched();
  2)   0.123 us    |  hrtimer_cancel();
...

What this shows is that apparently the time it takes for the schedule function to finish is taking the majority of the time. There will be influence of ftrace being active, but taking that into account on this system the time to get rescheduled after waiting for a short time is around 3 to 4 us (microseconds).

Conclusion
There is a certain point at which timing is so fine grained, that anything you do, even running a (kernel) function itself distorts time. It looks like this is 1us (microsecond) granularity for the systems I’ve looked at.

Obviously, when you run on a virtualised platform, especially like I did with a ‘desktop’ virtualisation product like Virtualbox, but it applies to any virtualisation, scheduling times increase because the virtualisation needs to abstract running on a CPU. With modern virtualisation running user-mode this overhead is minimal or non-existent. However, kernel mode access needs to be abstracted in order to keep virtual machines from invading each others resources. This can become significant if resources are oversubscribed (we gotten a glimpse of that in this article with the first nanosleep measurements at roughly 220 us).

The nanosleep times gotten lower when the process priority was set as high as possible. Then when moving onto physical hardware, the latency times gotten even lower. The priority of the process strongly influences the scheduling time, which is kind of logical, because that’s the entire purpose of the priority.

The linux strace utility influences process execution latency quite significantly when looking at a microsecond granularity. Ftrace provides a way to trace in a very flexible way with way lesser overhead than strace, although it still does impose an overhead on the process it is tracing.

Using ftrace, the minimal time that I could measure for a process to get scheduled again after sleeping was between 3 to 4 microseconds. This latency is influenced by the tracing, so it probably comes down to between 2 to 3 microseconds without it. Also, this is strongly influenced by the priority, I needed to set it to the highest priority in order to get the low latency. Another big influencer will be the hardware that I am running, and there are a fair amount of processes running, some at high priority (cluster ware ocssd, Oracle database lms processes, linux daemons).

This post shows you how to install Rundeck with the Ansible plugin on Centos 7. The installation is done with nginx as the web server and using SSL with a self signed certificate. Please read the Ansible installation script, and modify anything that should be different for your situation. You will be amazed how well readable the installation script is!

Rundeck is a web based user interface that allows you to run commands against a group of hosts. Rundeck has an ansible plugin. Using that plugin, it could perform the similar functionality as Ansible Tower (commercial product) or Semaphore (open source).

After a fresh installation of Centos 7, do the following as root:

yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible git
git clone https://gitlab.com/FritsHoogland/install_rundeck.git

This adds the EPEL (extra packages for Enterprise Linux) repository to Centos, which contains ansible and git. The next yum command installs ansible (the orchestration software) and git (distributed version control). The git clone command pulls the ansible orchestration script from my install_rundeck git repository at gitlab.

It is important to open the install_rundeck/install_rundeck.yml script and modify anything that is different in your situation. The public hostname will most likely be different than the 192.168.66.16 in the script. You might want to change the locality information with the certificate generation (unless you live in Lutjebroek like me :-). If you do a corporate installation, you might want to skip this step altogether and get a certificate pair signed by your company’s CA.

Please mind I ran into issues with ansible inventories:
– The hosts in the inventory need to have ansible run against them to pick up their properties and become visible in rundeck in the nodes tab. For being able to have ansible run against the hosts in the inventory, the host need to exist in the ssh known hosts file of the rundeck user, otherwise ansible can’t run and the host or hosts will not be visible in rundeck. The solution is to log in as the rundeck user and logon to the hosts in the inventory once manually and accept the host or hosts. From a security perspective it’s understandable that you careful need to assess the hosts to trust, but from an automation standpoint this is annoying. Outside of essentially filling out the known hosts file as I described, there are several other workarounds.
– I created an ansible inventory file in the rundeck project directory and entered the hosts in it. Rundeck picked up the hosts (after solving the above point they became visible in the nodes tab), however when executing something using ansible via rundeck it would say ‘[WARNING]: provided hosts list is empty, only localhost is available’. This means ansible was not pointed explicitly to an inventory, so it used the default one. In order to solve this, I symlinked my (rundeck) project inventory to the /etc/ansible/hosts to make it centrally available. Apparently, using a central inventory for ansible using the plugin is by design. I would rather have rundeck generate an inventory per execution, and pointing to it when the plugin executes ansible.

Now install rundeck:

ansible-playbook install_rundeck/install_rundeck.yml

Done!

%d bloggers like this: