Tag Archives: linux

This is a quick post on using git on a server. I use my Synology NAS as a fileserver, but also as a git repository server. The default git package for Synology enables git usage on the command line, which means via ssh, or via web-DAV. Both require a logon to do anything with the repository. That is not very handy if you want to clone and pull from the repository in an automated way. Of course there are ways around that (basically setting up password-less authentication, probably via certificates), but I wanted simple, read-only access without authentication. If you installed git on a linux or unix server you get the binaries, but no daemon, which means you can only use ssh if you want to use that server for central git repositories.

Running git via inetd
What I did is using inetd daemon to launch the git daemon. On any linux or unix server with the inetd daemon, and on Synology too, because it uses linux under the covers, it’s easy to setup git as a server.

First, check /etc/services for the following lines:

git               9418/tcp                   # git pack transfer service
git               9418/udp                   # git pack transfer service

Next, add the following line in the inetd.conf (which is /etc/inetd.conf on my synology):

git stream tcp nowait gituser /usr/bin/git git daemon --inetd --verbose --export-all --base-path=/volume1/homes/gituser

What you should look for in your setup is:
– gituser: this is the user which is used to run the daemon. I created a user ‘gituser’ for this.
– /usr/bin/git: of course your git binary should be at that fully specified path, otherwise inetd can’t find it.
– git daemon:
— –inetd: notify the git executable that it is running under inetd
— –export-all: all git repositories underneath the base directory will be available
— –base-path: this makes the git root directory be set to this directory. In my case, I wanted to have all the repositories in the home directory of the gituser, which is /volume1/homes/gituser in my case.

And make the inetd deamon reload it’s configuration with kill -HUP:

# killall -HUP inetd

Please mind this is a simple and limited setup, if you want to set it up in a way with more granular security, you should look into gitolite for example.

This post looks like I am jumping on the bandwagon of IT orchestration like a lot of people are doing. Maybe I should say ‘except for (die hard) Oracle DBA’s’. Or maybe not, it up to you to decide.

Most people who are interested in IT in general will have noticed IT orchestration has gotten attention, especially in the form of Puppet and/or Chef. I _think_ IT orchestration has gotten important with the rise of “web scale” (scaling up and down applications by adding virtual machines to horizontal scale resource intensive tasks), in order to provision/configure the newly added machines without manual intervention, and people start picking it up now to use it for more tasks than provisioning of virtual machines for web applications.

I am surprised by that. I am not surprised that people want boring tasks like making settings in configuration files and restarting daemons, installing software with all the correct options, etc. being automated. Instead, I am surprised that people are now picking this up after it has been around for so long.

A little history.
As far as I know, IT orchestration started with cfengine, which was really a configuration engine (hence the name). Despite having a little different purpose (configuration vs. orchestration), this tool is the parent of all the orchestration/configuration tools which exist nowaday. cfengine started off as a study in 1993 by Mark Burgess at the university of Oslo, with the creation of the cfengine software as a result. As far as I can see, it has been available as open source software since the beginning.

Now back to what I am surprised at: with cfengine, there has been a way to configure linux/unix systems in a structured way, and install and configure software on systems since the mid-nineties. Yet, this wasn’t picked up (of course with a few exceptions). Fast forward to today, we see it is being adopted. And that is a good thing.

I created a setup with cfengine for a client a long time ago, which had the ability to install the Oracle software, different PSU’s in different home’s, and remove it by adding or removing machines to groups in a cfengine configuration file. It wasn’t picked up by the client, it’s way more fun running X to install the software, and make the choices by hand, and redo this over and over on every machine, right?

I almost forgotten about my work with cfengine, until I spoke with Alex Gorbatchev at a conference, at which he pointed me to Ansible. At first I didn’t do a lot with it, but lately I’ve given it a go, and I am very happy with it.

Another redo of cfengine?
From what I read, most of the configuration/orchestration engines created after cfengine are created to circumvent all kinds of difficulties with cfengine. I can understand that. It took me a while to learn cfengine, and indeed it forces you to think in a different way.

The Ansible project decided to radically do it different than all the other engines. It is different in the sense that it advertises itself as simple, agentless and powerful.

Simple is a terrific goal. For those of you that have worked with configuration/orchestration engines, there is a steep learning curve. It is just hard to get the basic principles in your head. To be honest, also Ansible took me a while too, to grasp the basic principles, and get the picture correctly in my head. Yet, having worked with cfengine comparing it with Ansible’s playbooks, which are the scripts to do things on the targets, it is a breath of fresh air. Playbooks are so clean they (almost) can be read and understood as plain english.

This is where Ansible is truly different than any of the other configuration/orchestration tools. Ansible does not require any agent installation on the targets. The obvious next question then is: how can this work? Well, quite simple: Ansible uses ssh to connect to the host, and executes commands via the shell. Having that said, it requires a little more detail; Ansible uses python on the remote host for it’s normal execution. However, you can use it without python, for example to setup the host up for the Ansible normal usage mode Which requires python and the simple-json module.

This is truly important, and makes it an excellent fit for my daily work as an IT consultant.

Ansible is powerful in the way that you can do the configuration and orchestration tasks in a simple clean way.

Summary on the introduction.
Above was a brief personal history, and some of the “marketed” features of Ansible. I think being agentless is the true “killer feature” here. All the other configuration/orchestration engines require you to setup and configure a fixed client-server connection, and install a deamon and a central server process. In case you wondered, yes, authentication is important, and it’s simply brilliant that the ssh password authentication or public key infrastructure can be used.

Because there’s no daemon to install, you can run your created play books everywhere. So instead of a fixed client configuration, you can create play books to do routine tasks, and repeat it at multiple sites.

Okay, how does this work?

Installation: add EPEL and install ansible.
If you are on one of the clones of RedHat Enterprise Linux (I use Oracle Linux), you simply need to add the EPEL repository to your yum source list, and run:

# yum install ansible

First steps.
One of the first things I do, is create a directory for a typical ansible ‘project’. Project means a set of tasks you want to do to a set of hosts here. Next, I create a file called ‘hosts’ which is the list of hosts you want to use for executing tasks on. By default, Ansible looks in /etc/ansible/hosts. In this case, I put a single machine in it (a test VM), but it can be a list of machines (ip addresses or hostnames).

$ cat hosts

In fact, you can create groups in the hosts file in the “ini style”. But I just put one host in for this example.
The next thing is to check if Ansible reads the file correctly. This is done in the following way:

$ ansible all -i hosts --list-hosts

Okay, this means Ansible will operate on this one host if invoked. The next logical thing (typically done when you are in a new client environment to check if you can reach the hosts):

$ ansible all -i hosts -m ping | FAILED => FAILED: Authentication failed.

Ping might be a bit misleading for some people. What ping does here (-m means module), is trying to connect to the host over ssh, and log in. Because I didn’t specify a user, it used the username of the current user on the machine, which is ‘ansible’. A user ‘ansible’ typically doesn’t exist on a normal server (and is not necessary or should be created), and also not on my test server. So it failed, as the message said, on authentication.

My test VM is a basic installed (OL) linux 6 server. This means there’s only one user: root.

So, let’s specify the user root as user:

$ ansible all -i hosts -m ping -u root | FAILED => FAILED: Authentication failed.

The authentication failed again. And it should! What this is doing, is trying to log on as root, and we haven’t given any password, nor have I put my local user’s public key in the remote authorised_keys file. So there is no way this could work. This is typically also the state when you want to do stuff with a “fresh” client system. Let’s add the ‘-k’ option (ask ssh password), and run again:

$ ansible all -i hosts -m ping -u root -k
SSH password: | success >> {
    "changed": false,
    "ping": "pong"

To walk you through the output: It now asks for a password, which I’ve filled out, then lists the host and the state: success. During this execution, there was nothing changed on the remote host, and the ping command resulted in a pong (alike the ICMP ping response).

With what we have learned now, we can do things like this:

$ ansible all -i hosts -u root -k -a "ifconfig"
SSH password: | success | rc=0 >>
eth0      Link encap:Ethernet  HWaddr 00:0C:29:14:65:ED
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20c:29ff:fe14:65ed/64 Scope:Link
          RX packets:47 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6293 (6.1 KiB)  TX bytes:2594 (2.5 KiB)

eth1      Link encap:Ethernet  HWaddr 00:0C:29:14:65:F7
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20c:29ff:fe14:65f7/64 Scope:Link
          RX packets:188 errors:0 dropped:0 overruns:0 frame:0
          TX packets:112 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:142146 (138.8 KiB)  TX bytes:15545 (15.1 KiB)

lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Does this look familiar for you Exadata DBA’s? Yes, this replicates some of the functionality of dcli (although dcli is aimed at executing simple tasks to a group of hosts, whilst Ansible is aimed at enterprise configuration and orchestration).

One step beyond! Playbooks.
Now let’s progress to playbooks. An Ansible playbook is where the true strength lies of Ansible. It allows you to specify tasks to execute on the remote hosts, and create sequences of tasks and make decisions based on the outcome of a tasks for further execution. Let me show you a simple playbook, and guide you through it:

- hosts: all
  gather_facts: no
  remote_user: root

  - name: upgrade all packages
    yum: name=* state=latest

  - name: install python-selinux
    yum: name=libselinux-python state=installed

  - name: add public key to authorized_key file of root
    authorized_key: user=root state=present key="{{ lookup('file','/home/ansible/.ssh/') }}"

As you can see, this is a playbook with three tasks: upgrade all packages, install libselinux-python and adding my (local) public key to the authorised key file of root (to allow passwordless access).

Line 1 shows three dashes, which means the start of a YAML document.
Line 2 starts with a single dash, which indicates a list. There is one dash at this indention level, so it’s a list of one. The fields of this member are hosts, gather_facts and tasks. Tasks got his own list (mind the indention level, that is important). The fields are key/value pairs, with the separation indicated by the colon (:). The first field is ‘hosts’, with the value ‘all’. This means that all hosts in the hosts file are used for this playbook. I don’t think it’s hard to imagine how useful it can be to specify a group/kind of servers the playbook can run on. The next one is ‘gather_facts’. A normal playbook execution first gathers a lot of information from all the hosts it is going to run on before execution. These can be used during playbook execution. Next ‘remote_user’. This indicates with which user ansible is going to logon, so we don’t have to specify it on the command line. Then we see ‘tasks’ to indicate the list of tasks to be executed on the hosts.

It’s easy to spot we got three tasks. What is extremely important, is the indention of this list (it’s got a dash, so it’s a list!). Name is not mandatory, but it makes it easy to read if you give the tasks useful names and these will be shown when the playbook is executed. The first task has the name ‘upgrade all packages’. The next field shows the key is ‘yum’ indicating it is making use of the yum module. This key got two values: name=*, which means all ‘all packages’, and state=latest, which means we want all packages to be at the latest version. This means this command is the equivalent of ‘yum update’.

The second task is called ‘install python-selinux’. It makes use of the yum module again, and is self explanatory, it installs the libselinux-python package. This packages is necessary to work on a host which has selinux enabled on things that are protected by selinux.

The next task is called ‘add public key to authorised_key file of root’. It is making use of the authorized_key module. This module requires a parameter ‘key’, for which we use the lookup function to look up the local (!) public key, of the user with which I execute ansible, which is ‘ansible’. ‘state=present’ means we want this key to be present; ‘present’ is the default value, so it wasn’t necessary to put this in. Next ‘user=root': we want the public key to be added to the authorized_keys file of the user root.

Of course these tasks could be executed using the ‘ansible’ executable as single tasks. To show the importance of the installation of the libselinux-python module on a host with selinux enabled (which is the state of selinux on a fresh installed Oracle Linux machine), let’s execute the task using the authorized_key module:

$ ansible all -i hosts -k -u root -m authorized_key -a "user=root state=present key=\"{{ lookup('file','/home/ansible/.ssh/') }}\""
SSH password: | FAILED >> {
    "failed": true,
    "msg": "Aborting, target uses selinux but python bindings (libselinux-python) aren't installed!"

Clear, right? The host is selinux protected. Now, let’s execute the installation of the libselinux package as single task, and then add our public key to the authorized_key file of root:

$ ansible all -i hosts -k -u root -m yum -a "name=libselinux-python state=installed"
SSH password: | success >> {
    "changed": true,
    "msg": "",
    "rc": 0,
    "results": [
        "Loaded plugins: security\nSetting up Install Process\nResolving Dependencies\n--> Running transaction check\n---> Package libselinux-python.x86_64 0:2.0.94-5.3.el6_4.1 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package             Arch     Version                 Repository           Size\n================================================================================\nInstalling:\n libselinux-python   x86_64   2.0.94-5.3.el6_4.1      public_ol6_latest   201 k\n\nTransaction Summary\n================================================================================\nInstall       1 Package(s)\n\nTotal download size: 201 k\nInstalled size: 653 k\nDownloading Packages:\nRunning rpm_check_debug\nRunning Transaction Test\nTransaction Test Succeeded\nRunning Transaction\n\r  Installing : libselinux-python-2.0.94-5.3.el6_4.1.x86_64                  1/1 \n\r  Verifying  : libselinux-python-2.0.94-5.3.el6_4.1.x86_64                  1/1 \n\nInstalled:\n  libselinux-python.x86_64 0:2.0.94-5.3.el6_4.1                                 \n\nComplete!\n"

$ ansible all -i hosts -k -u root -m authorized_key -a "user=root state=present key=\"{{ lookup('file','/home/ansible/.ssh/') }}\""
SSH password: | success >> {
    "changed": true,
    "key": "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAliR905hxLnsOCRlOGnmN0H9dGH4NPV88ySC6GMv0KNnU7FfCXYE51Bkk97p2IWFsPhYO9qJDyAFxRm/lia1IZRDpCFcKKKMh5eXmEJC5XSrHWFdmGZRlFcS3VQ3rpCIyU3qFM6xMazh3JHKKEtE1J6nvw/hW3slY9G/6VoJ8CzpfeQMLDOdVXUIcZXqtCPuIEDBQ7yjfMzTGz+hEmz7ImbLaUyB4MDGrDnl33L8mkBEVYu8RrwgBcagDQSiQKnIca/EL45eX/74NG1e/6vxZkHZJz/W0ak4KD+o9vF4ikz0bdrGPMZ5gRYXWoSSHrVA+Rqk8A93qBXNKUUkzGoQYTQ== ansible@ansiblevm.local",
    "key_options": null,
    "keyfile": "/root/.ssh/authorized_keys",
    "manage_dir": true,
    "path": null,
    "state": "present",
    "unique": false,
    "user": "root"

Maybe your customer doesn’t want you to store your keys in their servers. It’s easy to do the reverse, and remove your key from the authorized_key file:

$ ansible all -i hosts -u root -m authorized_key -a "user=root state=absent key=\"{{ lookup('file','/home/ansible/.ssh/') }}\"" | success >> {
    "changed": true,
    "key": "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAliR905hxLnsOCRlOGnmN0H9dGH4NPV88ySC6GMv0KNnU7FfCXYE51Bkk97p2IWFsPhYO9qJDyAFxRm/lia1IZRDpCFcKKKMh5eXmEJC5XSrHWFdmGZRlFcS3VQ3rpCIyU3qFM6xMazh3JHKKEtE1J6nvw/hW3slY9G/6VoJ8CzpfeQMLDOdVXUIcZXqtCPuIEDBQ7yjfMzTGz+hEmz7ImbLaUyB4MDGrDnl33L8mkBEVYu8RrwgBcagDQSiQKnIca/EL45eX/74NG1e/6vxZkHZJz/W0ak4KD+o9vF4ikz0bdrGPMZ5gRYXWoSSHrVA+Rqk8A93qBXNKUUkzGoQYTQ== ansible@ansiblevm.local",
    "key_options": null,
    "keyfile": "/root/.ssh/authorized_keys",
    "manage_dir": true,
    "path": null,
    "state": "absent",
    "unique": false,
    "user": "root"

Please mind I didn’t specify ‘-k’ on the command line to send a password: in the previous step we added our key, so we can access our host using our public key. Another extremely important thing is ‘changed’. ‘changed’ indicates if the task did actually change something on the destination server.

I have ran single task until now, I changed the state of my test VM back to it’s state before I started changing it with ansible (by removing the libselinux package using ‘ansible all -i hosts -k -u root -m yum -a “name=libselinux-python state=absent”‘

Let’s run the above described playbook:

$ ansible-playbook -i hosts -k linux_setup_example.yml
 [WARNING]: The version of gmp you have installed has a known issue regarding
timing vulnerabilities when used with pycrypto. If possible, you should update
it (ie. yum update gmp).

SSH password:

PLAY [all] ********************************************************************

TASK: [upgrade all packages] **************************************************
changed: []

TASK: [install python-selinux] ************************************************
changed: []

TASK: [add public key to authorized_key file of root] *************************
changed: []

PLAY RECAP ********************************************************************              : ok=3    changed=3    unreachable=0    failed=0

Now at this point you might think: I get it, but these are all pretty simple tasks, it’s not special at all. Well, let me show you an actual thing which totally shows what the importance of using this is, even on a single machine, but even more when you got a large group of servers you have to administer.

The next example is a playbook created to apply PSU3 to an Oracle home. It’s still quite simple, it just applies PSU3 to the Oracle home. But totally automatic. The point I am trying to make is that this is already nice to have automated a lot of work for a single home, but it saves a lot of hours (read: a lot of money), and saves you from human error.

- hosts: all
    u01_size_gb: 1
    tmp_size_gb: 1
    oracle_base: /u01/app/oracle
    oracle_home: /u01/app/oracle/product/
    patch_dir: /u01/install
  remote_user: oracle

  - name: check u01 free disk space
    action: shell df -P /u01 | awk 'END { print $4 }'
    register: u01size
    failed_when: u01size.stdout|int < {{ u01_size_gb }} * 1024 * 1024

  - name: check tmp free disk space
    action: shell df -P /tmp | awk 'END { print $4 }'
    register: tmpsize
    failed_when: tmpsize.stdout|int < {{ tmp_size_gb }} * 1024 * 1024

  - name: create directory for installation files
    action: file dest={{ patch_dir }} state=directory owner=oracle group=oinstall

  - name: copy opatch and psu
    copy: src=files/{{ item }} dest={{ patch_dir }} owner=oracle group=oinstall mode=0644
     - ocm.rsp

  - name: install opatch in database home
    action: shell unzip -oq {{ patch_dir }}/ -d {{ oracle_home }}

  - name: unzip psu patch
    action: shell unzip -oq {{ patch_dir }}/ -d {{ patch_dir }}

  - name: patch conflict detection
    action: shell export ORACLE_HOME={{ oracle_home }}; cd {{ patch_dir }}/18522509; $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    register: conflict_detection
    failed_when: "'Prereq \"checkConflictAgainstOHWithDetail\" passed.' not in conflict_detection.stdout"

  - name: apply psu
    action: shell export ORACLE_HOME={{ oracle_home}}; cd {{ patch_dir }}/18522509; $ORACLE_HOME/OPatch/opatch apply -silent -ocmrf {{ patch_dir }}/ocm.rsp
    register: apply_psu
    failed_when: "'Composite patch 18522509 successfully applied.' not in apply_psu.stdout"

  - name: clean up install directory
    file: path={{ patch_dir }} state=absent

Let me run you through this playbook! It starts off with the indication of a YAML document: ‘—‘. Next hosts: all again. I just put all the hosts in the hosts file, I did not create all kinds of groups of hosts (which would be fitting when you use it at a fixed environment, but I use it for various customers). Then vars, with a list of variables. As you can see, I can use the variables, which are shown in the playbook as {{ variable }}. Then remote_user: oracle and tasks.

The first and second task use variables, and use the argument ‘register’ to save all response into a named variable. I also use ‘failed_when’ to make the playbook stop executing when the argument after ‘failed_when’ is true. Arguments of ‘failed_when’ is the named variable, for which the output of the standard out is used (.stdout). Then a filter is used to cast the output to integer, and is compared with a calculation of the variable.

The third task is using the files module to create a directory. The fourth task is using the copy module. The copy module means a file or files (in this case) are copied from the machine from which the playbook is run, onto the destination host or hosts. Here is also another trick used, to process the task with a list of items. As you can see, the copy line contains a variable {{ items }}, and the task is executed for all the items in the list ‘with_items’. I found this is fine for smaller files (up to a few hundred of megabytes), but too slow for bigger files. I use http (the get_url module) to speed up file transfer.

The fifth and sixth tasks execute a shell command, unzip, to extract the contents of a zip file into a specific place.

The seventh task is executing a small list of shell commands, in order to be able to run the conflict detection option of opatch. The same trick as with the first two tasks is used, register a name for the output of the conflict detection. Here I check if the stdout contains what I would manually check for when I would run it. The eighth task is the main task of the whole playbook: the actual patch. However, it uses the same technique as task seven. The last task simply removes a directory, in order to remove the files we used for this patch.

I hope this shows what a tremendous help Ansible can be for a consultant. This kind of tool is simply mandatory if you got an environment with more than approximately ten to twenty servers to administer. Ansible can be used even if the organisation does not want to spend time on the implementation of a configuration tool.

(the details are investigated and specific to Oracle’s database implementation on Linux x86_64)

Exadata IO: This event is not used with Exadata storage, ‘cell single block physical read’ is used instead.
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I haven’t seen a db file sequential read event that read more than one block ever. Of course this could change in a newer release.

One of the important things to realise here is that regardless of asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses a pread() systemcall, so synchronous IO for reading blocks which are covered with this event. If you realise what the purpose of fetching the single block is in most cases: fetching a single database block which contents are necessary in order to continue processing, it should become apparent that issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* view lists both SMALL_READ_REQS, SMALL_SYNC_READ_REQS and SMALL_READ_SERVICETIME, SMALL_SYNC_READ_LATENCY, to make a distinction between SYNC (pread()) reads and non-sync (thus asynchronous) calls, using the io_submit()-io_getevents() call combination.

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait always is recorded, in other words: there is a strict relation between the event and the physical IO. Just to be complete: if a block needed is already in the Oracle database buffer cache, no wait event is triggered and the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used (filesystemio_options is set to ‘none’ or ‘async’ when using a filesystem, ASM (alias “Oracle managed raw devices”) is inherently direct path IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!), then filesystemio_options is obeyed)), the block could very well be coming from the filesystem cache of linux. In fact, without DIO a phenomenon known as ‘double buffering’ takes place, which means the IO doesn’t happen to it’s visible disk devices directly, but it needs to take a mandatory step in between, done at the kernel level, which means the data is put in the filesystem cache of linux too. It should be obvious that this extra work comes at the cost of extra CPU cycles being used, and is in almost any case unnecessary.

If you take a step back you should realise this event should take place for a limited amount of blocks during execution. Because of the inherent single block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the IO latency penalty. Even with solid state disk devices, which have an inherently lower latency time because there are no rotating parts and disk heads to be moved, chopping up an operation in tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficient.

The time spend on ‘db file sequential read’ quite accurately times single block IO. This means a direct relationship between ‘db file sequential read’ timings and amount should exist with operating system measured IO statistics (iostat, sar and more).

Some time back, I investigated the options to do profiling of processes in Linux. One of the things I investigated was systemtap. After careful investigation I came to the conclusion that systemtap was not really useful for my investigations, because it only worked in kernelspace, only very limited in userspace. The limitation of working in userspace was that you had to define your own markers in the source code of the program you wanted to profile with systemtap and compile that. Since my investigations are mostly around Oracle products, which are closed source, this doesn’t help me at all.

Some time ago, Frank Eigler responded to a blog article I posted on my blog about using gdb (GNU debugger) for doing userspace profiling, indicating that systemtap could do userspace function profiling too. I was quite shocked, because I carefully investigated that option, and came to the conclusion that exactly this did not work. After some communication on this, the conclusion was that this indeed did NOT work with the version of systemtap which is included with current versions of RHEL (and therefore Oracle Linux). But in the current source version of systemtap userspace ‘probing’ is included.

But that is not all…in order to give systemtap the opportunity to do userspace probing, it needs userspace ‘trace hooks’. This is only available in the current stock kernels if the source is of the kernel patched with the ‘utrace patch’, enabled, and compiled. That means a custom compiled kernel. On itself a custom compiled kernel is fine, but in much environments where you work with closed source products, products are certified against stock kernels, and supported only on stock kernels. From a support point of view I very much understand this, and from the viewpoint from me as a consultant too. To put it in a different way: it is an enormous red flag which is raised if I encountered an environment where people compile their own kernel on Linux.

But there is good news. Since linux kernel version 3.5 userspace probing support is included in the linux kernel, which means there is no patch needed against the kernel source in order to be able to profile in userspace. If you take a look at the kernels Oracle provides (for red hat: I am sorry, there is no way that I know to obtain RHEL online for free for testing, which for me rules out using it. I know about the merger with CentOS, but haven’t looked if that makes it attractive for me again), we can see that Oracle provides UEK (2.6.32), UEK2 (2.6.39) and UEK3 (3.8.13). Yes! That means that I can hook up a yum repo and install a kernel that allows userspace probing!

I installed a testmachine with Oracle Linux 6.5, installed the UEK3 kernel, and installed systemtap. When doing testing of the primary desired functionality (profile userland functions without debug symbols), I encountered this problem:

[root@ol6-uekbeta ~]# /usr/bin/stap -e 'probe process("/u01/app/oracle/product/").function("*") { probefunc() }'
WARNING: cannot find module /u01/app/oracle/product/ debuginfo: No DWARF information found [man warning::debuginfo]
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("/u01/app/oracle/product/").function("*") { probefunc() }

semantic error: no match
Pass 2: analysis failed.  [man error::pass2]

This strongly looks like systemtap does not understand the ‘process’ probe, where Frank warned about. So. Is this the end of the journey? No!

The userland function probing is documented in the documentation on the systemtap website. This means it should be available. Let’s clone the systemtap source, and build systemtap ourselves. This has a few implications. For starters, this eliminates the usage of systemtap for userland functions on “real” systems. With “real” I mean systems that have a function, and need to be supported and need to be stable. Because on this kind of systems no beta or preview software can and should be installed, no matter how much we want it, need it or want it. But to have an investigation system where we can mimic one of the most desired functions of dtrace, this is fine!

So. I have got a X86_64 Oracle Linux 6.5 installation (default install, and the meta-rpm oracle-rdbms-server-11gR2-preinstall.x86_64 installed), installed the UEK3 kernel on it (using the UEKR3 repo on Oracle Linux public yum), and added the git version system executables using ‘yum install git’, and next I cloned the systemtap git repository using ‘git clone git clone git:// What needed to be done next, is compile and install the stuff. This can be done in a quite standardised way:

make install

If all goes well, you end up with the latest version of systemtap (version 2.5/0.152), which should be able to do userspace probing, and a kernel capable to provide the information for userspace probing.

Now let’s test this, and create a systemtap script to profile the time dbv (db verify) takes just by running it:
(please mind this is a proof of concept script, any additions or remarks are welcome!)

global time, function_times, prev_func, function_count

probe begin {

probe process("/u01/app/oracle/product/").function("*") {
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++

probe end {
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	delete function_times["__do_global_dtors_aux"]
	printf("Function\t\ttime (us)\tcount\tavg (us)\n")
	foreach( tm = [ fn ] in function_times+ ) {
		printf("%s: \t\t%d\t\t%d\t%d\n", fn, tm, function_count[fn],tm/function_count[fn])
		tot_time += tm
	printf("Total time: %d\n", tot_time)

This systemtap script can be run from one (root) session, and dbv run in another session. Please mind to wait with running dbv until the systemtap session notifies you it is ready by saying “Begin.”. This is the result:

Function		time (us)	count	avg (us)
frame_dummy: 		3		1	3
lxplget: 		3		1	3
lxpsset: 		3		1	3
call_gmon_start: 		4		1	4
lxplset: 		4		1	4
lxpcset: 		4		1	4
lxptget: 		4		1	4
lxptset: 		4		1	4
lxhLaToId: 		5		1	5
kudbvcCreate: 		5		1	5
_fini: 		6		1	6
__do_global_ctors_aux: 		7		1	7
lxldini: 		7		1	7
lxhenvquery: 		7		1	7
kudbvhlp: 		7		1	7
lxldlbb: 		8		2	4
lxldLoadBoot: 		8		2	4
lxpname: 		12		3	4
kudbvcCreateMsg: 		12		1	12
lxlfOpen: 		13		4	3
lmsapop: 		13		2	6
lxldLoadObject: 		14		4	3
lxpdload: 		14		2	7
lxldlod: 		15		4	3
lxladjobj: 		15		4	3
lxlchkobj: 		15		4	3
__libc_csu_init: 		16		1	16
lxlgsz: 		16		4	4
lxfgnb: 		20		2	10
lxoCnvCase: 		22		2	11
lxhLangEnv: 		24		3	8
_init: 		27		1	27
lxpe2i: 		31		9	3
slmsbfn: 		31		2	15
lxdlobj: 		34		4	8
lxmopen: 		36		5	7
lxlfrd: 		40		4	10
_start: 		41		1	41
lmsagb1: 		46		14	3
lxhchtoid: 		47		6	7
lmsapts: 		47		14	3
lxpcget: 		48		7	6
lxgratio: 		48		14	3
slxldgnv: 		49		11	4
lmsapsb: 		49		14	3
lmsagbcmt: 		50		14	3
lmsapsc: 		50		14	3
lmsapnm: 		51		14	3
lxldalc: 		54		6	9
main: 		63		1	63
kudbvmal: 		63		1	63
lmsaprb: 		67		7	9
kudbvexit: 		68		1	68
lmsapfc: 		71		7	10
slxcfct: 		72		5	14
lxpmclo: 		81		13	6
slmscl: 		88		1	88
slxdfsync: 		91		1	91
lmsapic: 		91		7	13
lxhci2h: 		97		28	3
lxpendian: 		107		13	8
kudbvcml: 		116		1	116
lxgu2t: 		119		16	7
lmsagbf: 		120		14	8
kudbvmai: 		151		1	151
lxdgetobj: 		225		44	5
lxinitc: 		247		6	41
kudbvcpf: 		254		27	9
slmsrd: 		256		9	28
lxhh2ci: 		350		34	10
slxcfot: 		514		5	102
lxlinit: 		688		6	114
kudbvini: 		798		1	798
slmsop: 		1005		2	502
kudbvvpf: 		4102		27	151
Total time: 10993

Of course the result itself is not very useful. The time spend in dbv is measured at 10,993 microseconds (us), the function the most time was spend in was kudbvvpf(), which was 4102 us, but that function was executed 27 times, which makes the time per execution 151 us. The longest taking function was kudbvini(), which was 798 us.

For some time now, I am using gdb to trace the inner working of the Oracle database. The reason for using gdb instead of systemtap or Oracle’s dtrace is the lack of user-level tracing with Linux. I am using this on Linux because most of my work is happening on Linux.

In order to see the same information with gdb on the system calls of Oracle as strace, there’s the Oracle debug info repository. This requires a bit of explanation. When strace is used on a process doing IO that Oracle executes asynchronous, the IO calls as seen with strace look something like this:

io_submit(140425370206208, 1, {{0x7fb7516c4bc0, 0, 0, 0, 257}}) = 1
io_getevents(140425370206208,1,128,{{0x7fb7516c45e8,0x7fb7516c45e8,106496,0}}, {600, 0}) = 1

This reveals exactly how Oracle used these calls. In case you wonder how to read these calls: Linux (as well as any other Unix like operating system) provides man pages (manual pages) for not only for the command line tools, but also on system calls, c library functions and device and special files, among others. So if you wonder what the io_submit line means, type ‘man io_submit’, or to be 100% sure you look in the manual pages of the system calls, type ‘man 2 io_submit’ to specify you want section 2: system calls.

When I use gdb, and break on io_submit and io_getevents, I get this information:

Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/
Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()

I think everybody can spot that I got less information now. In fact, I now know the calls have happened, and that’s all, there is no additional information. In order to get part of the information back that was visible with strace, use the debuginfo package of libaio. The debug info package must match 100% the version of the package it is meant to provide debug symbols about, because it provides debug information about the executable or library based on physical code locations.

In order to get information on these specific calls (libaio calls), the libaio-debuginfo package can be installed. Once done, we get a great deal of information which resembles strace:

Breakpoint 1, io_submit (ctx=0x7ff8b626c000, nr=1, iocbs=0x7fffa5c31a80) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 2, io_getevents_0_4 (ctx=0x7ff8b626c000, min_nr=2, nr=128, events=0x7fffa5c37b68, timeout=0x7fffa5c38b70) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

This shows all the arguments which are used by the process which is traced with gdb. Please mind that gdb breaks on entering the call, so it doesn’t give a return code. And the return code of io_getevents() is what returns the number of IO’s which are ready, so that information is still not visible, but is visible with strace, which does provides the return code.

How about the Oracle user land calls? I use breaking on kslwtbctx() and kslwtectx() a lot, which indicate the starting (kslwtbctx()) and stopping (kslwtectx()) of a wait event. When doing so, this is how it looks like:

Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/
Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 4, 0x0000000007cf47b6 in kslwtbctx ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 5, 0x0000000007cfb4f2 in kslwtectx ()

Here we see the libaio functions again, together with the Oracle wait event functions. When using these calls this way, we can safely say that there are some calls done outside of a wait, and one call is done inside of a wait. Because this measurement is done on a well known piece of Oracle code (well known to me: executing a full table scan via direct path), I just know the wait is ‘direct path read’. But what if you do not know? Wouldn’t it be nice to know which wait is called here?

The simplest way to get more information on Oracle function calls is to get the debug information for the Oracle database. However, since that makes references to the source code, that will probably never happen. So, does that mean this is all we can get? No.

In order to get more information out of a function call, we need to dive a little deeper into the internals of Linux x86_64. When a function is called, the arguments are passed on via processor registers. This is implementation specific, and differs between 32-bit and 64-bit. An overview of how that works is summarised in this table. The important line is: “The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9, while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments. For system calls, R10 is used instead of RCX.”

So. This means that if I look at the CPU registers when breaking on a function, there might be something usable. I say “something usable” deliberately, because the Oracle function calls are not publicly documented (I think/hope they are inside Oracle development). I’ve done some investigation, and it turns out that at the END of a wait event, there are a few functions which are called which have some information stored in a CPU register which is useful:
a) First the function kslwtectx() is called to mark the ending of a wait event.
b) Then a function called kslwtrk_enter_wait_int is called, which stores the time the took in the register R13.
c) Next a function called kskthewt is called, which stores the number of the wait event (V$EVENT_NAME.EVENT#) in RSI.

If we combine that information in a little gdb macro, it looks like this:

break kslwtbctx
    printf "kslwtbctx\n"
break kslwtectx
    printf "kslwtectx -- "
break kslwtrk_enter_wait_int
    set $time=$r13
break kskthewt
    printf "wait: %d, time: %d\n", $rsi, $time

Put this in a text file, and once attached to a process to trace with gdb, load it using ‘source ‘.
Here is how it looks like when you put it on a process (I’ve put it on the checkpoint process):

kslwtectx -- wait: 7, time: 2999054
kslwtectx -- wait: 81, time: 1979
kslwtectx -- wait: 81, time: 1050
kslwtectx -- wait: 81, time: 1216
kslwtectx -- wait: 81, time: 2031
kslwtectx -- wait: 83, time: 10443

If you want to learn more about this stuff, don’t forget I will be doing a hands-on session on using gdb as a pre-conference training during Collaborate 2014 in Las Vegas.

This blogpost is about how to print the system call arguments of a system call which is caught with ‘catch’ or ‘break’ in gdb. The reason for this blogpost is I spend quite some time on searching for this, and working around this, so writing it in a blogpost might help others who spend (some of) their time in the gdb debugger, and encounter the same issue.

When you break on a system call in gdb, it will show you something like this:

Breakpoint 2, semctl () at ../sysdeps/unix/syscall-template.S:82

All fine, but most of the people who break on something want to know what the arguments of the call are! We are actually breaking on what is called a “syscall wrapper” (explanation in link). This means we do not have the arguments of the system call nicely shown on screen, as we would like (will, I do!).

An example of a system call which does show it’s arguments nice and dandy on screen is “io_submit ()”:

Breakpoint 3, io_submit (ctx=0x7f42badba000, nr=1, iocbs=0x7fff0cc0f30) at io_submit.c:23
23      io_syscall(in, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

This allows me to look into the arguments, like for example:

(gdb) print *iocbs[0]

But now back to the main problem: for some system calls, when I break on them, I end up in the system call wrapper, not showing any arguments. After some time, I found this answer on stackoverflow. Which is exactly what I needed: the system call arguments are passed via registers!!

Let me show you how this works. I was investigating the system call “nanosleep()” recently. In order to understand its arguments, first issue “man nanosleep” to understand the arguments:

      nanosleep - high-resolution sleep

      #include <time.h>

      int nanosleep(const struct timespec *req, struct timespec *rem);

So, when nanosleep is called, it passes the pointers to two structs of the type “timespec”. The first one is a constant (it determines the specifics of the call), the second one is used if the nanosleep was interrupted with a signal, which means the remaining time is written in it. For the sake of understanding what is happening, I really only want to know what is in the first struct (*req).

So this is how I setup a break on nanosleep() in gdb, and print the argument I want to see:

(gdb) break nanosleep
Breakpoint 2 at 0x32e0e0ef10: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s), 2, one per line.
End with a line saying just "end".
>print (struct timespec) *$rdi
(gdb) c

Breakpoint 2, nanosleep () at ../sysdeps/unix/syscall-template.S:82
$1 = {tv_sec = 0, tv_nsec = 867779000}

There you go: we got the argument of a system call (that I was interested in) that was “hidden” by the linux system call wrapper.

In my blogpost When the oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of it’s time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the linux kernel which was called by io_submit().

This led to some very helpful comments from Tanel Poder. This blogpost is a materialisation of his comments, and tests to show the difference.

First take a look at what I gathered from ‘perf’ in the first article:

# Samples: 501  of event 'cpu-clock'
# Event count (approx.): 501
# Overhead  Command       Shared Object                               Symbol
# ........  .......  ..................  ...................................
    52.50%   oracle  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore    
             --- _raw_spin_unlock_irqrestore
                |--99.24%-- do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          skgfqio
                |          ksfd_skgfqio
                |          ksfdgo

This shows 52.5% of the time of profiling a “select count(*) from t2″ on the server process was spending it’s time on unlocking a spinlock.

This was in the previous blogpost, and tanel commented the following:

I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.

Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality, but, because of the lack of access of the kernel’s performance registers because I was running on VMWare Fusion (desktop virtualisation), I was using perf in the following way: perf record -e cpu-clock.

These are a partial list of perf’s triggering events:

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]

If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the kernel’s performance registers are used to gather information. If this is not possible (because virtualisation disables access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.

However, cpu-clock is a software event. And this event (cpu-clock) is depended on the timer interrupt. And the function we see we spent most time on (_raw_spin_unlock_irqrestore) is the re-enabling of IRQ’s for this process when this spinlock is unlocked. So this _could_ mean we did not spend our time on this function, but can not tell, because the timing source was disabled.

However, there was another helpful comment from Tanel:

VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”

Indeed, there is such a function, and let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ method (which is default).

# Samples: 669  of event 'cycles'
# Event count (approx.): 288603593
# Overhead  Command      Shared Object                                   Symbol
# ........  .......  .................  .......................................
    11.31%   oracle  oracle             [.] sxorchk                            
             --- sxorchk
                |--98.50%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch
                |          kdstf00000010000kmP
                |          kdsttgr
                |          qertbFetch
                |          qergsFetch
                |          opifch2
                |          kpoal8
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                 --1.50%-- kcbhvbo

This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!

Let’s look at the top 5 locations where time is spend with exactly the same case, but with -e cycles (the default) and -e cpu-clock (non-default/software timer):

# perf record -g -p 2527 
# perf report -n -g none
# Samples: 580  of event 'cycles'
# Event count (approx.): 256237297
# Overhead      Samples  Command       Shared Object                                   Symbol
# ........  ...........  .......  ..................  .......................................
    17.47%          100   oracle  oracle              [.] sxorchk                            
     7.99%           47   oracle  oracle              [.] kdstf00000010000kmP                
     6.01%           35   oracle  oracle              [.] kcbhvbo                            
     3.25%           19   oracle  oracle              [.] kdst_fetch                         
     3.01%           17   oracle  [kernel.kallsyms]   [k] __wake_up_bit        

And now the same execution, but with the software timer:

# perf record -g -p 2527 -e cpu-clock
# perf report -n -g none
# Samples: 422  of event 'cpu-clock'
# Event count (approx.): 422
# Overhead      Samples  Command      Shared Object                            Symbol
# ........  ...........  .......  .................  ................................
    78.67%          332   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore 
     4.03%           17   oracle  oracle             [.] sxorchk                     
     2.13%            9   oracle  oracle             [.] kcbhvbo                     
     1.90%            8   oracle  oracle             [.] kdstf00000010000kmP         
     0.95%            4   oracle  oracle             [.] qeaeCn1Serial                 

This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where they are spent:
sxorchk has 100 samples with the hardware timer, and 17 with the software timer.
kcbhvbo has 35 samples with the hardware timer, and has 9 with the software timer.
kdstf00000010000kmP has 47 samples with the hardware timer, and has 8 with the software timer.

So, general conclusion is that it’s important to understand what you are measuring, and if that method has implication on what you are measuring.
Conclusion specific to perf: do not use cpu-clock if you can use the hardware event.


Get every new post delivered to your Inbox.

Join 2,189 other followers

%d bloggers like this: