Warning! This is a post about Oracle database internals for internals lovers and researchers. For normal, functional administration, this post serves no purpose. It shows a little tool I created, which consists of a small database I compiled with Oracle database C function names and a script to query it. The reason keeping such a database makes sense in the first place is that the C functions of the Oracle database are set up in a hierarchy based on the function name. This means you can deduce what part of the execution you are in by looking at the function name; for example ‘kslgetl’ means kernel service lock layer, get latch.

To use this, clone the git repository at https://gitlab.com/FritsHoogland/ora_functions.git

Use the ‘sf.sh’ (search function) script to query the functions. The script uses sqlite3; install the sqlite rpm package to add this (on linux), or use the packaging method of your platform.
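
As an example, on an Enterprise Linux/Oracle Linux 7 system the setup could look like this (a hedged sketch; the package name ‘sqlite’ is what I expect to provide the sqlite3 binary on EL7, verify it for your platform):

# yum -y install sqlite git
$ git clone https://gitlab.com/FritsHoogland/ora_functions.git
$ cd ora_functions
$ ./sf.sh kslgetl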

This is how to use sf.sh:

$ ./sf.sh kglGetMutex
kglGetMutex -- kernel generic lock management
---

This shows that the sf.sh script could find the first 3 letters (underlined), which probably mean kernel generic lock management. It couldn’t find ‘GetMutex’, but that part is so self-explanatory that it doesn’t need annotation.

This is what a full function annotation looks like:

$ ./sf.sh qercoFetch
qercoFetch -- query execute rowsource count fetch
----------

Here the full function is underlined, which means the entire function is found.

If you found explanations for Oracle database code locations, or found an error in the annotation (I made assumptions here and there), please send them to frits.hoogland@gmail.com, or react to this post, so I can add or update it.

There are many situations where you want to use a very specific configuration of the Oracle database, for example when a client has an issue and is still on EL5, or gets disk errors on a filesystem that is ext3, or is using ASM and gets weird IO patterns. Other examples are: you want to test the newest PSU to see if it responds differently to an issue you are working on, or you want to test the combination of Oracle database version 11.2.0.3 and grid infrastructure 12.1.0.2.

Of course you can just go and install a virtual machine and install all the different bits and pieces. Doing so manually costs vast amounts of time. And by doing that, you will end up with a lot of virtual machines, for which at a certain point in time you have to decide which ones to remove.

Also, a lot of people use a (virtual) machine with a couple of database versions installed, and test on these. In that case you sometimes have to ignore details like filesystem versus ASM, or the specific PSU level; it’s hard to keep that updated, and when a client case is on a lower PSU level, in general you don’t go back in PSU level (although it’s not impossible). One thing I ran into frequently is that it’s easy to get caught in side effects of changes and settings made for earlier test cases (often underscore parameters).

This blogpost introduces my project ‘vagrant-builder’, which allows you to build a virtual machine with Oracle and optionally clusterware installed in any version you specify. The provisioning will download all software and patches (except for the 12.2.0.1 media, which needs to be provided in the ‘files’ directory) fully automatically for you. These are the options:

Linux version:
Oracle linux version 5, 6 or 7 (limited by the boxes built by the box-cutter project).
The actual versions currently available are ol5.11, ol6.6/7/8 and ol7.0/1/2/3. I am awaiting the boxcutter project to produce ol6.9 and ol7.4.

Filesystems:
Filesystem types for u01 and for oradata (when no ASM is used): xfs, ext4, ext3.

Kernel:
Oracle linux 5: latest redhat kernel, latest UEK2 kernel.
Oracle linux 6: any exadata kernel version (if made available on public-yum), latest redhat/UEK2/UEK3/UEK4 kernel.
Oracle linux 7: latest redhat kernel, latest UEK3 or UEK 4 kernel.

ASM:
No ASM install.
12.2.0.1 no patches, PSU: 170620, 170718, 170814
12.1.0.2 no patches, PSU: 1, 2, 3, 4, 5, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.4 no patches, PSU: 1, 2, 3, 4, 5, 6, 7, 8, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814

Database:
No database install.
12.2.0.1 no patches, PSU: 170620, 170718, 170814
12.1.0.2 no patches, PSU: 1, 2, 3, 4, 5, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.4 no patches, PSU: 1, 2, 3, 4, 5, 6, 7, 8, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.3 PSU 15 only.
11.2.0.2 PSU 12 only.

Database creation:
By specifying a database name, a database will be created with that name. Of course the dictionary part of the patching will be applied to the database!

How does this work? This works using the combination of the following pieces of software:
– Virtualbox
– Vagrant
– Ansible
Plus the vagrant-builder repository: https://gitlab.com/FritsHoogland/vagrant-builder

If you don’t have Virtualbox, Vagrant or Ansible installed, follow the installation procedure in this blog article; it’s a bit older, so the versions of the software components mentioned there are outdated: simply install the latest versions. There is quite an important caveat (sadly): Ansible in principle does not run on Windows. You can make it work on Windows by using Cygwin, but officially it doesn’t support Windows. If you can get the provisioning using Ansible to fully work on Windows, please share how you did that.

Once you have all the software components installed, another thing you might want to do first is move your default Virtualbox directory to a place where you have enough space to hold virtual machines.

Then, clone the vagrant-builder repository into a directory (git clone https://gitlab.com/FritsHoogland/vagrant-builder.git myvm, for example), go into that directory and edit the Vagrantfile to set:
– hostonly_network_ip_address
– mos username & password
– database_name (if you want a database)
– linux (choose one by removing the hash sign in front of it)
– kernel
– asm_version (set a version if you want clusterware “siha” and ASM, if a database_version is set and asm_version is empty, you get a filesystem based database)
– database_version (set a version if you want the database software to be installed)
– vm_cpus (number of CPUs visible/made available to the VM)
– vm_memory (amount of memory made available to the VM)
– vm_hostname (if you want multiple VMs, you need multiple vm_hostnames set!)
– perl_l4cache_workaround (if you have a newer CPU with a level 4 cache, set this to Y (yes), otherwise set this to N (no))

Save the changes, and start up the virtual machine with ‘vagrant up’. This will pull the operating system image, add a disk for the database, start linux, set up and configure linux, download the database and grid software version (except for version 12.2.0.1, for which the installation media needs to be staged in the files directory), install it, download the patches, install these, and create a database, all without manual intervention.
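
To give an idea of the complete flow, this is a minimal sketch of the commands involved (the directory name ‘myvm’ is just an example; the Vagrantfile edits are the settings listed above):

$ git clone https://gitlab.com/FritsHoogland/vagrant-builder.git myvm
$ cd myvm
$ vi Vagrantfile   # set MOS credentials, linux/kernel/asm/database versions, hostname, etc.
$ vagrant up       # build and provision the virtual machine
$ vagrant ssh      # log on once provisioning has finished
$ vagrant halt     # stop the virtual machine; 'vagrant destroy' removes it altogether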

This post is about the decision the Oracle database engine makes when it is using a full segment scan approach. The choice the engine has is to store the blocks that are physically read in the buffercache, or to read the blocks into the process’ PGA. The first choice is what I refer to as a ‘buffered read’, which places the block in the database buffercache so the process itself and other processes can bypass the physical read and use the block from the cache, until the block is evicted from the cache. The second choice is what is commonly referred to as ‘direct path read’, which places the blocks physically read into the process’ PGA, which means the read blocks are stored for only a short duration and are not shared with other processes.

There are some inherent performance differences between a buffered and a direct path read. A buffered read can only execute a single physical read request for a single range of blocks, wait for that request to finish, and fetch and process the result of the physical read request, after which it can execute the next physical read request. So there is a maximum of one outstanding IO for multiple (adjacent) Oracle blocks. A direct path read works differently: it asynchronously submits two physical IO requests, each for a distinct range of Oracle blocks, after which it waits for one or more IOs to finish. If an IO is returned, it is processed, and an IO for another range of Oracle blocks is submitted to restore the number of IOs in flight to two. If the database engine determines (based upon a non-disclosed mechanism) that enough resources are available, it can increase the number of physical IO requests in flight up to 32. Other differences include a maximum for the total size of the IO request, which is 1MB for buffered requests and 32MB for direct path requests (which is achieved by setting db_file_multiblock_read_count to 4096).
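
One way to see this difference for yourself on Linux is to attach strace to the server process performing the scan and watch the IO system calls; this is a minimal sketch, where <pid> stands for the process ID of the Oracle foreground process of the session doing the scan:

$ strace -e trace=pread64,io_submit,io_getevents -p <pid>

A buffered read then shows up as synchronous pread64 calls (one outstanding request at a time), while a direct path read shows up as asynchronous io_submit calls with io_getevents reaping the results.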

At this point it should be clear that there are differences between buffered and direct path reads, and that when full segment scans switch from direct path reads to buffered reads, it could mean a significant performance difference. On top of this, if your database is using Exadata storage, the decision between buffered reads and direct path reads is even more important: only once the decision for direct path reads has been made can an Exadata smartscan be executed. I have actually witnessed cases where a mix of partitioning and HCC led to partitions so small that a direct path read was not chosen, which meant a smartscan was not considered anymore, meaning that instead of the cells decompressing the compressed blocks all in parallel, the process now had to fetch them and do the decompression on the database layer.

There have been some posts on the circumstances of the decision. However, I have seen none that summarise the differences for the different versions. In order to investigate the differences between the different Oracle versions, I created a git repository at gitlab: https://gitlab.com/FritsHoogland/table_scan_decision. You can easily use the repository by cloning it: git clone https://gitlab.com/FritsHoogland/table_scan_decision.git, which will create a table_scan_decision directory in the current working directory.

Oracle version 11.2.0.2.12
Please mind this version is very old, and SHOULD NOT BE USED ANYMORE because it’s not an actively supported version. However, I do use this version because it behaves differently from the versions that follow.

First determine the small table threshold of the database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531
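
A hedged sketch of the kind of query that produces this output (the KSPPINM and KSPPSTVL columns come from the x$ksppi and x$ksppcv fixed tables, which require a SYSDBA connection):

$ sqlplus -S / as sysdba <<'EOF'
select i.ksppinm, v.ksppstvl
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_small_table_threshold';
EOF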

Let’s create tables just below and just over 1531 blocks/small table threshold:

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1531 1531
...
    BLOCKS
----------
      1664

So the small table threshold is 1531. This means that the medium table threshold, an internal statistic used in the decision to use the direct path mechanism, will be approximately 1531*5=7655. Let’s create tables just below and just over that number of blocks:

TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

For the other versions, the trace event ‘nsmtio’ can be used to learn how the decision is made. However, this trace event does not exist in Oracle version 11.2.0.2. The workaround is to just execute a SQL trace and interpret the wait events: for a full table scan, the wait event ‘db file scattered read’ means a buffered read is done, and the wait event ‘direct path read’ means a direct path read was done (obviously).
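
After running the traced statement, a quick way to see which mechanism dominated is to count these wait events in the generated trace file. This is a sketch; the diag path assumes a database named ‘test’, adjust the path and file name pattern to your environment and the tracefile_identifier you set:

$ cd $ORACLE_BASE/diag/rdbms/test/test/trace
$ grep -c "nam='db file scattered read'" *table_1531*.trc
$ grep -c "nam='direct path read'" *table_1531*.trc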

TS@test > alter session set events 'sql_trace level 8';
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1531';
TS@test > select count(*) from table_1531;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that in my case, with Oracle version 11.2.0.2, the switching point is at 5 times _small_table_threshold.

Oracle 11.2.0.3.15
This version too should NOT BE USED ANYMORE because it is not in active support. This too is for reference.
Small table threshold for this database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

With the small table threshold being 1531, the medium table threshold should be approximately 1531*5=7655.

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1440 1440
...
    BLOCKS
----------
      1536
TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

Flush the buffer cache, set the trace events, and test the scans. Doing that, I ran into something peculiar with the ‘nsmtio’ event in this version (11.2.0.3 with the latest PSU). The event does exist in this version (which you can validate by running ‘oradebug doc component’), however it does not yield any output. This means I have to revert to the previous method of running sql_trace at level 8 and interpreting the wait events.

TS@test > alter session set events 'trace[nsmtio]:sql_trace level 8'; -- no NSMTIO lines, only sql_trace!
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1440';
TS@test > select count(*) from table_1440;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that with Oracle version 11.2.0.3, the direct path read switching point seems to have moved from 5 times the small table threshold to the small table threshold itself.

Oracle 11.2.0.4.170718
This version is in active support!
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1538

With the small table threshold being 1538, the medium table threshold should be approximately 1538*5=7690.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1538 1538
...
    BLOCKS
----------
      1664
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7690 7690
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans.

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7693 blocks),
-- main event: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1538';
SQL> select count(*) from table_1538;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14422, objn: 14422
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14423, objn: 14423
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7690';
SQL> select count(*) from table_7690;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14424, objn: 14424
-- main event: direct path read

This shows that with Oracle version 11.2.0.4, the direct path read switching point is at the small table threshold, a behaviour that changed starting from 11.2.0.3.

Oracle version 12.1.0.2.170718
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1440

SQL>

With small table threshold being 1440, the medium table threshold is approximately 1440*5=7200.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7203 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20489, objn: 20489
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20490, objn: 20490
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20491, objn: 20491
-- main events: direct path read

This is in line with the switch in version 11.2.0.3 to small table threshold as the switching point between buffered reads and direct path reads.

Oracle 12.2.0.1.170814
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1444

SQL>

With small table threshold being 1444, the medium table threshold is approximately 1444*5=7220.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1504 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 7048 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 22502, objn: 22502
-- main events: direct path read

Hey! With 12.2.0.1 the direct path read switching point reverted to the pre-11.2.0.3 behaviour of switching at 5 times the small table threshold instead of at the small table threshold itself.

Update!
Re-running my tests shows differences in the outcome between buffered and direct path reads. My current diagnosis is that the scan type determination uses a step based approach:

– The first determination of size is done with ‘NSMTIO: kcbism’ (kcb is medium). If islarge is set to 1, it means the segment is bigger than STT. If islarge is set to 0 it means the segment is smaller than STT, and the segment will be read buffered, and the line ‘qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown in the NSMTIO output.

– The next line is 'NSMTIO: kcbimd' (kcb is medium determination?) It shows the size of the segment (nblks), STT (kcbstt), MTT (kcbpnb) and is_large, which in my tests always is set to 0. Here, there are 4 options that I could find:

1) Segment size between STT and MTT and a buffered read is executed.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, which probably is externalised in the line 'NSMTIO: kcbcmt1'. The outcome can be a buffered read, for which the line 'qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown.

2) Segment size between STT and MTT and the direct path code path is chosen.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, probably externalised in the line 'NSMTIO: kcbcmt1'. If the costing determines it would be beneficial to use a direct path mechanism, it seems it switches to the direct path with cache determination code, which is also used for any table scan that is smaller than VLOT. Because of switching to that code, it will determine if the segment is bigger than VLOT: 'NSMTIO: kcbivlo', which of course in this case isn't true. Then, it will show the line 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]'

3) Segment size bigger than MTT but smaller than VLOT.
If the segment is between MTT and VLOT, the Oracle engine does not apply the costing mechanism (which means the kcbcmt1 line is not shown). It will determine if the segment is bigger than VLOT ('NSMTIO: kcbivlo'), which in this case it is not, and then show 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]'.

4) Segment size bigger than VLOT.
If the segment is bigger than VLOT, the Oracle engine executes the functions kcbimd and kcbivlo, and the NSMTIO line for kcbivlo will show is_large 1 to indicate it’s a very large object (VLOT by default is ‘500’, which means 5 times the total number of buffers in the buffer cache). The qertbFetch line will say ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’, and there is no kcbdpc to analyse the choice between a buffered and a direct path read.

In the cases where ‘NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]' is shown, which is either a segment between STT and MTT that switched to this code path, or a segment between MTT and VLOT, the code will apply a second determination and potential switching point from buffered to direct path or vice versa, which is shown with the line 'kcbdpc' (kcb direct path check). The outcome can be:

– NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ] to indicate it will use a buffered read.
– NSMTIO: kcbdpc:DirectRead to indicate it will use a direct path read.

I have verified the above 'decision tree' in 11.2.0.2, 11.2.0.3, 11.2.0.4, 12.1.0.2 and 12.2.0.1. It all seems to work this way consistently. I derived this working by looking at the NSMTIO tracing of 12.2, and then went back in versions. You will see that going lower in versions, there is less (nsmtio) tracing output; 11.2.0.4 shows far less information, for example it does not show the kcbcmt1 line, and of course 11.2.0.3 and 11.2.0.2 do not show NSMTIO lines at all. In order to verify the working, I used gdb and quite simply set breakpoints on the kcbism, kcbimd, kcbcmt1, kcbivlo and kcbdpc functions in the versions where this information was missing in the trace.
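
For reference, this is roughly what such a gdb session looks like (a hedged sketch; <pid> is the process ID of the server process of the session executing the full segment scan, and the breakpoint commands just report the function that was hit and continue):

$ gdb -p <pid>
(gdb) break kcbism
(gdb) break kcbimd
(gdb) break kcbcmt1
(gdb) break kcbivlo
(gdb) break kcbdpc
(gdb) commands 1-5
silent
frame
continue
end
(gdb) continue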

Still, at the kcbcmt1 point:
– 11.2.0.2 seems to quite consistently take MTT as the direct path switching point.
– 11.2.0.3-12.1.0.2 seem to quite consistently take STT as the direct path switching point.
– 12.2.0.1 varies.

Conclusion.
This article first explained the differences between buffered and direct path reads, why this is important, and why it is even more important with Exadata because of smartscans.

The next part shows how to measure the switching point. The most important message from this blog article is that from 11.2.0.3 up to and including 12.1.0.2 the direct path read switching point is the small table threshold, and that with Oracle database version 12.2.0.1 the direct path switching point changed back to the pre-11.2.0.3 behaviour, which means 5 times the small table threshold of the instance.
The addition (the update above) shows that between STT and MTT there is a cost based decision to go direct path or buffered. Once the direct path is chosen, it can still go buffered if the majority of the blocks are in the cache.

If you look closely at the output of the nsmtio lines for versions 11.2.0.3-12.1.0.2 for tables with a size between the small table threshold and the medium table threshold, it seems a bit weird, because the nsmtio trace says '[MTT < OBJECT_SIZE < VLOT]', which to me reads as Oracle having detected the object size to be between the medium table threshold and the very large object threshold, which was not true. I can't tell, but it might be a bug in the reporting of the measured size that has since been solved.
The text description in the NSMTIO qertbFetch line is bogus; it simply indicates a code path: '[- STT < OBJECT_SIZE < MTT]' means a buffered read is done, and can be chosen when the size is below STT or between STT and MTT; '[MTT < OBJECT_SIZE < VLOT]' means a direct path read is done, and can be chosen when the size is between STT and MTT or between MTT and VLOT.

I added the scripts and examples of the tracing events so you can measure this yourself in your environment.

This post shows you how to install Rundeck with the Ansible plugin on Centos 7. The installation is done with nginx as the web server, using SSL with a self-signed certificate. Please read the Ansible installation script and modify anything that should be different for your situation. You will be amazed how readable the installation script is!

Rundeck is a web based user interface that allows you to run commands against a group of hosts. Rundeck has an Ansible plugin; using that plugin, it can provide functionality similar to Ansible Tower (a commercial product) or Semaphore (open source).

After a fresh installation of Centos 7, do the following as root:

yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible git
git clone https://gitlab.com/FritsHoogland/install_rundeck.git

This adds the EPEL (extra packages for Enterprise Linux) repository to Centos, which contains ansible and git. The next yum command installs ansible (the orchestration software) and git (distributed version control). The git clone command pulls the ansible orchestration script from my install_rundeck git repository at gitlab.

It is important to open the install_rundeck/install_rundeck.yml script and modify anything that is different in your situation. The public hostname will most likely be different than the 192.168.66.16 in the script. You might want to change the locality information with the certificate generation (unless you live in Lutjebroek like me :-). If you do a corporate installation, you might want to skip this step altogether and get a certificate pair signed by your company’s CA.

Please mind I ran into issues with ansible inventories:
– The hosts in the inventory need to have ansible run against them to pick up their properties and become visible in rundeck in the nodes tab. For ansible to be able to run against the hosts in the inventory, the hosts need to exist in the ssh known_hosts file of the rundeck user, otherwise ansible can’t run and the host or hosts will not be visible in rundeck. The solution is to log in as the rundeck user, log on to the hosts in the inventory once manually, and accept the host keys (a sketch of this is shown after this list). From a security perspective it’s understandable that you need to carefully assess which hosts to trust, but from an automation standpoint this is annoying. Outside of essentially filling out the known_hosts file as I described, there are several other workarounds.
– I created an ansible inventory file in the rundeck project directory and entered the hosts in it. Rundeck picked up the hosts (after solving the above point they became visible in the nodes tab), however when executing something using ansible via rundeck it would say ‘[WARNING]: provided hosts list is empty, only localhost is available’. This means ansible was not pointed explicitly to an inventory, so it used the default one. In order to solve this, I symlinked my (rundeck) project inventory to /etc/ansible/hosts to make it centrally available. Apparently, using a central inventory for ansible with the plugin is by design. I would rather have rundeck generate an inventory per execution and point the plugin to it when it executes ansible.
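
A hedged sketch of both workarounds; the hostnames and the rundeck project inventory path are assumptions, adjust them to your environment:

# su - rundeck
$ ssh host1.example.com exit      # accept the host key once; repeat for every host in the inventory
$ exit
# ln -sf /var/rundeck/projects/myproject/inventory /etc/ansible/hosts   # make the project inventory the central one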

Now install rundeck:

ansible-playbook install_rundeck/install_rundeck.yml

Done!

Prometheus is an open source systems monitoring and alerting toolkit originally built at SoundCloud. This blogpost shows how to install the components needed to visualise linux system statistics via Grafana.

The setup consists of 3 components:
node_exporter, an exporter of system and hardware metrics.
prometheus, a metric collection and persistence layer.
grafana, the visualisation layer.

1. Preparation
The needed components are installed in the home directory of the user ‘prometheus’. In order for that user to exist, it obviously must first be created:

# useradd prometheus
# su - prometheus
$

This installation guide uses Oracle Linux 7.3, but should work for RHEL or Centos too.

2. Node exporter
The next thing to do is install the node exporter. Please mind new versions do come out, so you might want to verify the latest release on the node_exporter releases page on github.

$ curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz"
$ mkdir -p Prometheus/node_exporter
$ cd $_
$ tar xzf ../../node_exporter-0.14.0.linux-amd64.tar.gz

Now become root and create a unit file to automatically start up the node exporter using systemd:

# echo "[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

[Install]
WantedBy=default.target" > /etc/systemd/system/node_exporter.service

And make systemd start the node exporter:

# systemctl daemon-reload
# systemctl enable node_exporter.service
# systemctl start node_exporter.service

Next you can verify if the node exporter is running by using ‘systemctl status node_exporter.service’:

# systemctl status node_exporter.service
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:20:54 UTC; 7s ago
 Main PID: 3017 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─3017 /home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - hwmon" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - infiniband" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - textfile" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - conntrack" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - diskstats" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - entropy" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - loadavg" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - sockstat" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - wifi" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg="Listening on :9100" source="node_exporter.go:186"

Additionally, you can go to hostname:9100 and check that the page says ‘Node Exporter’ and has a link called ‘Metrics’, which shows all the metrics.
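
If you prefer the command line over a browser, the same check can be done with curl on the host itself:

$ curl -s http://localhost:9100/metrics | head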

3. Prometheus
After we installed node_exporter to provide measurements, we must install the software that fetches that information and stores it; that is what prometheus does. First, become the prometheus user again, and install prometheus. Here too it is important to realise that newer versions will have come out after this article was written:

# su - prometheus
$ curl -LO "https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz"
$ cd Prometheus
$ tar xzf ../prometheus-1.7.1.linux-amd64.tar.gz
$ cd prometheus-1.7.1.linux-amd64
$ echo "scrape_configs:

  - job_name: 'prometheus'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9100']"> prometheus.yml

This downloaded and unzipped prometheus, and created a prometheus scrape config to fetch data from prometheus itself and from the node exporter. Now become root, and install the systemd unit file for prometheus:

# echo "[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus.yml -storage.local.path=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/data

[Install]
WantedBy=multi-user.target" > /etc/systemd/system/prometheus.service

And make systemd start prometheus:

# systemctl daemon-reload
# systemctl enable prometheus.service
# systemctl start prometheus.service

And verify prometheus is running:

# systemctl status prometheus.service
● prometheus.service - Prometheus Server
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:36:55 UTC; 9s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 22656 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─22656 /home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1....

Jul 31 15:36:55 test.local systemd[1]: Started Prometheus Server.
Jul 31 15:36:55 test.local systemd[1]: Starting Prometheus Server...
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting prometheus (version=1.7.1, branch=mast...n.go:88"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Build context (go=go1.8.3, user=root@0aa1b7fc43...n.go:89"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Host details (Linux 3.10.0-514.26.2.el7.x86_64 ...n.go:90"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading configuration file /home/prometheus/Pro....go:252"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading series map and head chunks..." source="....go:428"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="0 series loaded." source="storage.go:439"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Listening on :9090" source="web.go:259"
Hint: Some lines were ellipsized, use -l to show in full.

Additionally you can go to hostname:9090/targets and verify that both node_exporter and prometheus report State=UP.
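
The same check can be done from the command line by querying prometheus for the ‘up’ metric, which is 1 for every target that is scraped successfully:

$ curl -s 'http://localhost:9090/api/v1/query?query=up'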

At this point, system metrics are fetched and stored. All we need to do is visualise them. An excellent tool for doing so is grafana.

4. Grafana
This webpage shows installation instructions and a link to the latest version. At the time of writing of this blogpost, the latest version was 4.4.1. This is how grafana is installed (please mind installation and systemd require root privileges):

# yum install https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.4.1-1.x86_64.rpm

Next up make systemd handle grafana and start it:

# systemctl daemon-reload
# systemctl enable grafana-server.service
# systemctl start grafana-server.service

And check if grafana is running:

# systemctl status grafana-server.service
● grafana-server.service - Grafana instance
   Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:43:11 UTC; 1min 58s ago
     Docs: http://docs.grafana.org
 Main PID: 22788 (grafana-server)
   CGroup: /system.slice/grafana-server.service
           └─22788 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile= cfg:default.paths.logs=/var/log/grafana cfg:default.path...

Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Starting plugin search" logger=plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=warn msg="Plugin dir does not exist" logger=plugins dir=/...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Plugin dir created" logger=plugins dir=/var/lib...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Alerting" logger=alerting.engine
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing CleanUpService" logger=cleanup
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Stream Manager"
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing HTTP Server" logger=http.server ad...socket=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Hint: Some lines were ellipsized, use -l to show in full.

5. Grafana configuration
Next, we need to hook up grafana with prometheus. First, go to hostname:3000 (a scripted alternative using the Grafana HTTP API is sketched after this list).
– Login with admin/admin
– Click ‘add datasource’
– Name: prometheus, Type: Prometheus
– Http settings: http://localhost:9090, select Access: ‘proxy’.
– Click ‘save and test’. This should result in ‘success’ and ‘datasource updated.’
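
If you would rather script this step, the same datasource can be created via the Grafana HTTP API; this is a hedged sketch using the default admin credentials:

$ curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name":"prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy"}'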

Now click on the grafana symbol in the left upper corner, then dashboards, import. Enter ‘2747’ at ‘grafana.com dashboard’. This will show ‘Linux memory’; select the prometheus datasource you just defined and click import.

This should result in a dashboard that shows you the linux memory areas.

One of the most important configuration aspects of running an Oracle database is making appropriate use of memory. Sizing the memory regions too small leads to increased IO; sizing them too big leads to inefficient use of memory and an increase in memory latency, most notably because of swapping.

On Linux, there is a fair amount of memory information available, however it is not obvious how to use that information, which frequently leads to inefficient use of memory, especially in today’s world of consolidation.

The information about linux server memory usage is available in /proc/meminfo, and looks like this:

$ cat /proc/meminfo
MemTotal:        3781616 kB
MemFree:          441436 kB
MemAvailable:    1056584 kB
Buffers:             948 kB
Cached:           625888 kB
SwapCached:            0 kB
Active:           500096 kB
Inactive:         447384 kB
Active(anon):     320860 kB
Inactive(anon):     8964 kB
Active(file):     179236 kB
Inactive(file):   438420 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:        320644 kB
Mapped:           127900 kB
Shmem:              9180 kB
Slab:              45244 kB
SReclaimable:      26616 kB
SUnreclaim:        18628 kB
KernelStack:        3312 kB
PageTables:         6720 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1786356 kB
Committed_AS:     767908 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       13448 kB
VmallocChunk:   34359721984 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:          16384 kB
CmaFree:               4 kB
HugePages_Total:    1126
HugePages_Free:     1126
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       65472 kB
DirectMap2M:     4128768 kB

No matter how experienced you are, it’s not easy to get a good overview just by fetching this information. The point is that these figures are not individual memory areas which you can simply add up to arrive at the total memory used by linux. Not even all figures are in kB (kilobytes): the HugePages values are numbers of pages.

In fact, there is no absolute truth that I can find that gives a definite overview. Here is a description of what I think are the relevant memory statistics in /proc/meminfo:

MemFree: memory not being used, which should be low after the system has been running for a certain amount of time. Linux strives to use as much memory as possible for something useful, most notably as cache. If this number remains high, memory is being used ineffectively.
KernelStack: memory being used by the linux kernel.
Slab: memory being used by the kernel for caching data structures.
SwapCached: memory being used as a cache for memory pages being swapped in and out.
Buffers: memory being used as an IO buffer for disk blocks, not page caching, and should be relatively low.
PageTables: memory used for virtual to physical memory address translation.
Shmem: memory allocated as small pages shared memory.
Cached: memory used for caching pages.
Mapped: memory allocated for mapping a file into an address space.
AnonPages: memory allocated for mapping memory that is not backed by a file (“anonymous”).
Hugepagesize: the size for huge pages blocks. Valid choices with current modern intel Xeon CPUs are 2M or 1G. The Oracle database can only use 2M HugePages on linux.
HugePages_Total: total number of pages explicitly allocated as huge pages memory.
HugePages_Rsvd: total number of huge pages that have been reserved, but not yet actually allocated (and thus still reported as free).
HugePages_Free: total number of pages available as huge pages memory, includes HugePages_Rsvd.

Based on information in several blogposts and on experimenting with the figures, I came up with this formula to get an overview of used memory. This is not an all-conclusive formula; my tests so far get me within 5% of what Linux is reporting as MemTotal.

Warning: the text “Another thing is that in most cases when the system has swapped out, ‘Cached’ (minus ‘Shmem’ and ‘Mapped’) gets negative, which I currently can’t explain.” is not true anymore!
By dividing Cached memory into Shmem and Cached+Mapped memory, there is no negative value anymore! I can’t find a way to make a distinction between true ‘Cached’ memory, meaning pages cached without any process attached, purely for the sake of reusing them so they do not need to be physically read again, and true ‘Mapped’ pages, meaning pages that are mapped into a process address space. I know there is a value ‘Mapped’, but I can’t reliably work out how to make the distinction between cached and truly mapped. Maybe there even isn’t one.

This is that formula (a scripted sketch that applies it follows the listing):

MemFree
+ KernelStack
+ Buffers
+ PageTables
+ AnonPages
+ Slab
+ SwapCached
+ Cached – Shmem
+ Shmem
+ HugePages used (HugePages_Total-Hugepages_Free)*Hugepagesize
+ Hugepages rsvd (Hugepages_Rsvd*Hugepagesize)
+ Hugepages free (Hugepages_Free-Hugepages_Rsvd)*Hugepagesize
————————————————————-
= Approximate total memory usage
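
This is a minimal sketch of the formula as an awk one-liner against /proc/meminfo; all values except the HugePages counts are already in kB, and the result is the same approximation, not an exact accounting:

$ awk '{ v[$1]=$2 }                # store every /proc/meminfo line as v["Name:"]=value
  END {
    hp_used = (v["HugePages_Total:"] - v["HugePages_Free:"]) * v["Hugepagesize:"]
    hp_rsvd = v["HugePages_Rsvd:"] * v["Hugepagesize:"]
    hp_free = (v["HugePages_Free:"] - v["HugePages_Rsvd:"]) * v["Hugepagesize:"]
    total = v["MemFree:"] + v["KernelStack:"] + v["Buffers:"] + v["PageTables:"] + \
            v["AnonPages:"] + v["Slab:"] + v["SwapCached:"] + \
            (v["Cached:"] - v["Shmem:"]) + v["Shmem:"] + \
            hp_used + hp_rsvd + hp_free
    printf "approximate total: %d kB (MemTotal: %d kB)\n", total, v["MemTotal:"]
  }' /proc/meminfo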

In order to easily use this, I wrote a shell script to apply this formula to your Linux system, available on gitlab: https://gitlab.com/FritsHoogland/memstat.git. You can use the script to get a (quite wide) overview every 3 seconds by running ./memstat.sh, or you can get an overview of the current situation by running ./memstat.sh –oneshot.

This is what a --oneshot of my test system (quite a small VM) looks like:

$ ./memstat.sh --oneshot
Free                 773932
Shmem                  2264
Mapped+Cached        359712
Anon                 209364
Pagetables            28544
KernelStack            4256
Buffers                   0
Slab                  63716
SwpCache              21836
HP Used             2023424
HP Rsvd               75776
HP Free              206848
Unknown               11944 (  0%)
Total memory        3781616
---------------------------
Total swp           1048572
Used swp             200852 ( 19%)

There is a lot to say about linux memory management. One important thing to realise is that when a system is running low on memory, it will not show as ‘Free’ declining towards zero. Linux will keep a certain amount of memory for direct use, dictated by ‘vm.min_free_kbytes’ as the absolute minimum.

In general, the ‘Cached’ pages (not Shmem pages at first) will be made available under memory pressure, since the Linux page cache really is only caching for potential performance benefit, there is no process directly attached to ‘Cached’ pages. Please mind my experimentations show there is no reliable way I could make a distinction between true ‘Mapped’ pages, meaning pages which are in use as memory mapped files, and true ‘Cached’ pages, meaning disk pages (blocks) sized 4KB which are kept in memory for the sake of reusing them, not directly related to a process.

Once the number of page cache pages gets low and there still is a need for available pages, pages from the other categories start to be moved to swap. This excludes huge pages, even if they are not used! The pages to swap are selected based on an ageing mechanism. This works quite well for light memory pressure over a short amount of time.

In fact, this works so well that the default eagerness of the kernel to swap (vm.swappiness, 60 by default, though I have seen 30 as a default value too; 0=not eager to swap, 100=maximal swap eagerness) seems appropriate on most systems, even ones with strict performance requirements. When swappiness is set (too) low, the kernel will try to avoid swapping as long as possible, meaning that once there is no way around it, it probably needs to swap multiple pages at once, leading to noticeable delays, while paging out single pages more in advance has hardly noticeable overhead.
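
You can check (and, as root, change) these settings on a running system like this; a permanent change belongs in /etc/sysctl.conf or /etc/sysctl.d:

$ sysctl vm.min_free_kbytes vm.swappiness
# sysctl -w vm.swappiness=60       # runtime change only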

However, please mind there is no way around consistent memory pressure! If the memory in active use exceeds the physically available memory, physical memory has to be shared, at the cost of active memory pages being swapped to disk, for which processes have to wait.

To show the impact of memory pressure, and how hard it is to understand that from looking at the memory pages, let me show you an example. I ran ‘memstat.sh’ in one session, and the command ‘memhog’ (part of the numactl rpm package) in another. My virtual machine has 4G of memory, and has an Oracle database running which has the SGA allocated in huge pages.

First I started memstat.sh, then ran ‘memhog 1g’, which allocates 1 gigabyte of memory and then releases it. This is the memstat output:

$ ./memstat.sh
          Free          Shmem  Mapped+Cached           Anon     Pagetables    KernelStack        Buffers           Slab       SwpCache        HP Used        HP Rsvd        HP Free        Unknown   %
         42764         435128         495656         387872          40160           4608             96          38600             24              0              0        2306048          30660   0
         42616         435128         495656         388076          40256           4640             96          38572             24              0              0        2306048          30504   0
         42988         435128         495656         388088          40256           4640             96          38576             24              0              0        2306048          30116   0
         42428         435128         495700         388108          40264           4640             96          38600             24              0              0        2306048          30580   0
        894424         320960          99456          12704          40256           4640              0          35496          42352              0              0        2306048          25280   0
        775188         321468         141804          79160          40260           4640              0          35456          70840              0              0        2306048           6752   0
        698636         324248         201476          95044          40264           4640              0          35400          64744              0              0        2306048          11116   0
        686452         324264         202388         107056          40260           4640              0          35392          66076              0              0        2306048           9040   0
        682452         324408         204496         108504          40264           4640              0          35388          65636              0              0        2306048           9780   0

You can see memstat taking some measurements, then memhog being run, which quickly allocates 1G and releases it. This is done between rows 6 and 7. First of all the free memory: once the process has allocated all the memory, it stops running, which means the memory is freed. Any private memory allocation mapped into the (now exited) process address space that is backed by a physical page is returned to the operating system as free, because it has effectively become available. So, what might seem counter-intuitive: stopping a process that allocated a lot of non-shared (!) memory results in a lot of free memory becoming available.

As I indicated, ‘Cached’ memory is the first to be released to provide memory pages for direct use. Mapped+Cached contains this together with Mapped memory. The number of pages used by Mapped and Cached is drastically reduced by the swapping. ‘Anon’ pages are significantly reduced too, which means they are swapped to the swap device, and ‘Shmem’ is reduced too, which means it is swapped to the swap device as well, but much less so than ‘Mapped+Cached’ and ‘Anon’. ‘Kernel’ (kernel stack) and ‘Pagetables’ hardly decreased, and ‘Slab’ decreased somewhat. ‘SwpCache’ actually grew, which makes sense because it is related to the swapping that took place.

The main thing I wanted to point out is that between the time of no memory pressure (lines 2-6) and past memory pressure (lines 8-16), there is no direct memory statistic showing that a system is doing okay or has suffered. The only things that directly indicate memory pressure are active swapping in and swapping out, which can be seen with sar -W (pswpin/s and pswpout/s), or the vmstat si/so columns; these are not shown here.
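
If you want to watch this yourself while running such a test, run one of these in a separate session:

$ sar -W 1        # pswpin/s and pswpout/s: pages swapped in and out per second
$ vmstat 1        # si and so columns: memory swapped in and out per second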

Even past the memory pressure, where linux memory management had swapped out a lot of pages to facilitate the 1G being allocated (which, immediately after being allocated, was freed and returned as free memory), the majority of the pages on my system that were swapped out are still swapped out:

$ ./memstat.sh --oneshot
...
Total memory       3781616
--------------------------
Total swp          1048572
Used swp            407228 ( 38%)

This underlines an important linux memory management principle: only do something if there is an immediate, direct need. My system now has no memory pressure anymore, but still 38% of my swap is allocated. Only if these pages are needed are they paged back in. This underlines the fact that swapping cannot and should not be measured by looking at the amount of swap used; a significant amount of swap being used only indicates that memory pressure has occurred in the past. The only way to detect that swapping is taking place is by looking at the actual current number of pages being swapped in and out.

If you see (very) low numbers of pages being swapped out without pages being swapped in at the same time, it’s the swappiness setting that makes pages that have not been used for some time being moved out to the swap device. This is not a problem. If you see pages being swapped in without pages being swapped out at the same time, it means pages that were swapped out earlier, either because of past memory pressure or because of proactive paging due to swappiness, are being read back in, which is not a problem either. Again, only if pages are actively being swapped in and out at the same time, or if the rate is very high, is there a memory problem. The swapping is actually helping you not to fail because of memory not being available at all.

This is the third part in a series of blogposts about how postgresql manages data in its blocks (called ‘pages’ in postgres speak). If you found this post and did not read the previous ones, it might be beneficial to read block internals (part 1) and block internals, part 2 first. In those blogposts I’ve shown what heap and index pages look like, and how they can be investigated, including looking at the raw block information.

This blogpost is intended to show the effects on pages when DML happens. This is inherently different from my personal frame of reference for database implementations, which is the Oracle database.

If you followed the previous two articles, you have a table mytable with one thousand rows. In order to simplify and be able to grasp the essence of the implications of data changes, drop mytable and recreate the simple situation of mytable with four rows again:

test=# drop table if exists mytable;
DROP TABLE
test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

This should result in the table having four rows in the first page:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

And the second index page should contain the index tuples for these four rows:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The next thing to do is change the configuration of postgres to disable autovacuum. This should absolutely not be done on any postgres database serving something useful, but to observe the effects of data changes and of an explicit vacuum, vacuum must not kick in automatically, because then we can’t predict what we will be looking at. This is governed by the parameter ‘autovacuum’, which is turned on by default:

test=# show autovacuum;
 autovacuum
------------
 on

Turn autovacuum off and reload the configuration:

test=# alter system set autovacuum=off;
ALTER SYSTEM
test=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
test=# show autovacuum;
 autovacuum
------------
 off

Now let’s update id 1 and set f1 to ‘zzzzzzzzzz’:

test=# update mytable set f1 = 'zzzzzzzzzz' where id = 1;
UPDATE 1

Now let’s look at the page contents of the table mytable:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |   1791 |        0 | (0,5)  |       16386 |        258 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1791 |      0 |        0 | (0,5)  |       32770 |      10242 |     24 |        |       | \x01000000177a7a7a7a7a7a7a7a7a7a

This is a good example of how read consistency and row versioning works in postgres. The update inserted a new tuple at offset 7992 (lp_off) in the page, and added a pointer to it in the line pointer array (lp 5). Mind the ‘the update inserted a new tuple’, which is fundamental! In the metadata of tuple ID 1, xmax is set to indicate the maximum transaction ID which should be able to see this version of the tuple. The postgres documentation says ‘delete XID stamp’ for xmax. Tuple ID 5 is created with exactly the same XID for xmin as tuple 1 has for xmax, which is the transaction ID starting from which this tuple should be visible. Obviously, for this tuple xmax is set to 0, indicating no newer version exists. Also, the t_ctid (current tuple ID) for tuple 1 is set to (0,5) (block 0, tuple 5), indicating the next version of that tuple is tuple ID 5. To indicate what happened, infomask2 and infomask have been set:
infomask2 16386 means: heap has two attributes, heap HOT updated.
infomask 258 means: has variable attributes, xmin committed.

The new tuple has the following values for infomask2 and infomask:
infomask2 32770 means: heap has two attributes, is heap only tuple.
infomask 10242 means: has variable attributes, xmax invalid, this is an updated version of the row.
(see the blogpost postgresql block internals, part 2 for a description of the infomask2 and infomask bitfields)
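
If you don’t want to decode these numbers by hand every time, the bitmask values from the postgres source header src/include/access/htup_details.h can be applied directly in SQL. The query below is only a convenience sketch that decodes a handful of the bits used in this post (2047 = HEAP_NATTS_MASK, 16384 = HEAP_HOT_UPDATED and 32768 = HEAP_ONLY_TUPLE in t_infomask2; 256 = HEAP_XMIN_COMMITTED, 2048 = HEAP_XMAX_INVALID and 8192 = HEAP_UPDATED in t_infomask):

$ psql -d test -c "
select lp,
       t_infomask2 & 2047         as natts,
       (t_infomask2 & 16384) > 0  as hot_updated,
       (t_infomask2 & 32768) > 0  as heap_only_tuple,
       (t_infomask & 256) > 0     as xmin_committed,
       (t_infomask & 2048) > 0    as xmax_invalid,
       (t_infomask & 8192) > 0    as updated_version
  from heap_page_items(get_raw_page('mytable',0));"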

It should be clear now that the update caused the active version of the tuple with id = 1 to be tuple ID 5 instead of tuple ID 1. Please mind tuple ID 1 is still entirely present (both in the line pointer array and in the data part of the page).
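
Note that you don’t need pageinspect to see which version your session considers current: every table also exposes the hidden system columns ctid, xmin and xmax, which show the location and the visibility window of the row version a query actually returns (the exact transaction IDs will differ on your system):

$ psql -d test -c "select ctid, xmin, xmax, id, f1 from mytable where id = 1;"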

What does this look like in the index?

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

Nothing changed!

Actually, this is exactly what ‘HOT’ (heap only tuple) refers to in the above descriptions (infomask2, heap HOT updated). A new version of a tuple is created, but it is not covered by any index. This means that when the table/heap tuple with id=1 is accessed using the index, the index points to the original tuple ID (1) in the heap/table, which in turn points to the new version of the tuple (tuple ID 5). This is an optimisation that lessens the amount of work done when updating a field that is not indexed, at the cost of following an additional pointer.
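
Whether updates are actually HOT can also be verified without pageinspect: the cumulative statistics view pg_stat_user_tables keeps both a total update counter (n_tup_upd) and a HOT update counter (n_tup_hot_upd). A quick check (counter values depend on what you have run so far):

$ psql -d test -c "select relname, n_tup_upd, n_tup_hot_upd from pg_stat_user_tables where relname = 'mytable';"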

What happens if the tuple with id = 1 gets updated again?

test=# update mytable set f1 = 'yyyyyyyyyy' where id = 1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |   1791 |        0 | (0,5)  |       16386 |       1282 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1791 |   1792 |        0 | (0,6)  |       49154 |       8450 |     24 |        |       | \x01000000177a7a7a7a7a7a7a7a7a7a
  6 |   7952 |        1 |     39 |   1792 |      0 |        0 | (0,6)  |       32770 |      10242 |     24 |        |       | \x010000001779797979797979797979

infomask2 for tuple ID 5 now says: heap has two attributes, is heap only tuple, and additionally: heap was HOT updated.
infomask for tuple ID 5 now says: heap has variable attributes, xmin committed, and additionally: this is an updated version of the tuple.

Another update essentially performs the same work on tuple ID 5 as was previously done on tuple ID 1: a new tuple is created containing the result of the update (a line pointer array entry and data at the end of the page), xmax is set for tuple ID 5 and xmin is set for tuple ID 6 with the same transaction ID, and the current tuple ID (t_ctid) of tuple ID 5 is set to the new version of that tuple, which is tuple ID 6.

As can be seen by looking at the value for infomask2, tuple ID 6 now has the same value as tuple ID 5 had previously. Just like tuple ID 5 before it, the new tuple with ID 6 is a ‘hot’ updated tuple too, which means the index was once again left untouched, so the index still points to heap tuple ID 1. This means that a session that uses the index to find this row will be pointed to heap tuple ID 1, which points to tuple ID 5, which points to tuple ID 6 in this case.

It’s obvious that this tuple pointer chasing can easily lead to more work having to be done to find a row/tuple, meaning lower performance. It’s equally obvious that this construction exists to provide a way to produce an older version of a tuple for read consistency. However, after some time this old version is not useful anymore, namely when no query exists anymore with a transaction ID lower than the xmax value of the old tuples, and it needs to be purged. This is where vacuum comes in. Let’s vacuum mytable, and read the heap/table page again:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     39 |   1792 |      0 |        0 | (0,6)  |       32770 |      10498 |     24 |        |       | \x010000001779797979797979797979

There are a few things vacuum did, and some things it didn’t:
– The line pointer array entries all stayed.
– The first entry redirects to heap tuple ID 6: lp_flags=2 (LP_REDIRECT) together with lp_len=0 means the lp_off field does not contain a data offset, but the tuple ID of the active tuple. The reason this first line pointer still exists is that the index entry still points to it.
– The fifth line pointer entry is all zeroes; lp_flags=0 means ‘unused’ (LP_UNUSED).

If you look closely at the line pointer array entries and the tuple data offsets before and after vacuum, you see that the past versions of the tuple data do not exist anymore. In fact, if you look at the offset (lp_off) values of the tuples that are stored in the page, you see that ID 2 took the offset value that ID 1 had before vacuuming, indicating it moved all the way to the back of the page. In other words, vacuum compacted the data tuples, maximising free space in the page.
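
The compaction can also be observed without decoding individual tuples: in the page header, lower marks the end of the line pointer array and upper the start of the tuple data, so upper minus lower is the free space in the page. Comparing this value before and after a vacuum shows the effect (the exact numbers depend on your data):

$ psql -d test -c "select lower, upper, upper - lower as free_bytes from page_header(get_raw_page('mytable',0));"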

A logical next question is: okay, this is way better than how the page was left after the two transactions, but it is still not optimal. Can it be fully cleaned? The answer is yes: vacuum can be told to fully clean up, including removing any non-functional line pointer array entries, and as a logical consequence have every index entry point directly to the heap tuple it ought to point to. This is called ‘vacuum full’.

Why not use vacuum full all the time? The reason is that vacuum full requires a table lock for the entire duration of the vacuum, which is not desirable or acceptable in a lot of databases. If normal vacuum is run frequently enough, running vacuum full will not yield much additional benefit in most cases, because old row versions are then cleaned up in time, preventing excessive amounts of old versions in a page. This distinction is important because normal vacuum cleans up within a page, but it does not move tuples across pages, freeing up pages and thus operating system disk space, which is what vacuum full does. This is why autovacuum should be turned on: to clean up old versions in time.
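
A simple way to see this difference is to compare the on-disk size of the table before and after: pg_relation_size() reports the size of the table’s main data fork. A plain vacuum normally leaves this size unchanged (apart from truncating empty pages at the end), while vacuum full rewrites the table into the minimal number of pages. A sketch, for a test database only:

$ psql -d test -c "select pg_relation_size('mytable');"
$ psql -d test -c "vacuum full mytable;"
$ psql -d test -c "select pg_relation_size('mytable');"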

When an update changes a field that is covered by an index, the update obviously cannot be ‘HOT’, because the index must reflect the values of the fields it is created on. Let’s update the id of the row with id=1 to 5:

test=# update mytable set id=5 where id=1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1814 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1814 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1814 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1817 |      0 |        0 | (0,5)  |           2 |      10242 |     24 |        |       | \x050000001779797979797979797979
  6 |   8032 |        1 |     39 |   1816 |   1817 |        0 | (0,5)  |       40962 |       8450 |     24 |        |       | \x010000001779797979797979797979

As you can see, tuple ID / line pointer number 5, which was emptied by vacuum, is now reused. The chain of tuple IDs now is: ID 1 points to ID 6. ID 6 has xmax set, and its t_ctid points to ID 5, which has xmin set to the previous version’s xmax value. The value of 2 for t_infomask2 of ID 5 shows no bits indicating a HOT update. What does this look like in the index?

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          5 | (0,5) |      16 | f     | f    | 05 00 00 00 00 00 00 00

Because this is not a HOT update, there is a change to the index: an index tuple is added for the new value ‘5’, which points to heap page 0 tuple ID 5, the result of the update. It is interesting to see that value ‘1’ still exists in the index, while ‘5’ is the current value and ‘1’ the old value of the same heap tuple. It is logical however once you realise there is no versioning data in the index: to determine whether a tuple should be visible, the heap/table tuple has to be read. Once again, if you vacuum the table, things get sorted:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1814 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1814 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1814 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   8032 |        1 |     39 |   1817 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x050000001779797979797979797979
  6 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |

Tuple ID 1 is now empty, because it is relieved of its function of pointing to tuple ID 6: tuple ID 6 got a new version, which means tuple ID 6 could be, and is, freed by vacuum. If you look in the index, you will see vacuum also freed up the index tuple for the old value ‘1’:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          3 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          4 | (0,5) |      16 | f     | f    | 05 00 00 00 00 00 00 00

The bt_page_items function does not reveal much information about line pointers and data offsets, in other words about what vacuum actually did, so let’s look at the raw block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 78 60 b5 09 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 05 00 00 00 00 00 00 00
*
0008128 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008144 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008160 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

The interesting bit is the second asterisk, which indicates a duplicate od output line: the index tuple pointing to the value ‘5’ apparently appears twice. In order to understand what this means, let’s read the line pointer array for the index tuples. Here’s a little script to do that:

for TUPLE_ID in 1 2 3 4 5 6; do
 # a line pointer entry is 4 bytes; the array starts at offset 24, right after the page header
 LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
 # first 2 bytes of the entry: lp_off in the low 15 bits, low bit of lp_flags in the top bit
 DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
 LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
 # next 2 bytes: high bit of lp_flags in the lowest bit, lp_len in the upper 15 bits
 LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
 # mask the flag bit out of the first word, shift the flag bit out of the second word
 TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
 TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
 LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
 LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
 echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done

And this is the output:

TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000
TID: 5 data offset: 8096 length: 16 binary: 0000000000100000 1001111110100000
TID: 6 data offset: 0 length: 0 binary: 0000000000000000 0000000000000000

That’s weird! It looks like there is an additional line pointer entry and tuple pointing to the value ‘5’: both offset 8112 (the line the second asterisk stands in for) and offset 8096 hold such a tuple. If you look at the line pointer status bits (the last bit of the first bitfield ‘0000000000100000’ and the first bit of the next field), they all read ’01’, which means LP_NORMAL (a normal, live line pointer entry).

After some reading up in the source code, it turns out that the upper value in the page header can be used to determine which line pointer array entries are still valid: an entry whose data offset lies below upper points into free space. So the above code should be:

# 'upper' lives at offset 14 in the page header and marks the start of the tuple data area
PAGE_HEADER_UPPER=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t d2 -j 14 -N 2 | awk '{ print $1 }' )
for TUPLE_ID in $( seq 1 1000 ); do
 LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
 DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
 LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
 LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
 TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
 # a line pointer whose data offset lies below 'upper' points into free space and is no longer valid
 [ $PAGE_HEADER_UPPER -gt $TRUE_DATAPTR ] && break
 TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
 LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
 LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
 echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done

This will only show the entries that are actually valid according to the upper value in the page header. The output:

TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000

If you look back at the raw page, it should be clear that vacuum did clean up the index. The tuple for the value ‘1’ was deleted, meaning that the first entry in the line pointer array is now the one for the value ‘2’. The data for the index tuples at the end of the block has also been compacted: the data was shifted towards the end of the page, so the data of the new first tuple overwrote part of the old data. As a result, a leftover copy of the data for the tuple with the value ‘5’ remains in the tuple data area, and a leftover entry in the line pointer array, neither of which is used, because the ‘upper’ value in the page header tells whether a line pointer entry is active or not.
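
You don’t have to decode the raw block by hand to find these boundaries: page_header(), used earlier on the table, works on an index page just as well, and its lower and upper values delimit the valid part of the line pointer array and the start of the live index tuple data (special marks the b-tree specific area at the end of the page):

$ psql -d test -c "select lower, upper, special from page_header(get_raw_page('pk_mytable',1));"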

Another scenario is important to understand: what if you insert a lot of rows into a table, and then start updating rows? As we have seen, by default postgres fills a table page up to 100%, so this is not an unlikely scenario. Let’s investigate how that works! In order to begin with a clean slate, drop mytable, create it again and insert a thousand rows; my second blogpost contains an anonymous PL/pgSQL block that inserts rows in a loop, similar to the sketch below.
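
For convenience, this is roughly what such a setup looks like; this is a sketch and may differ in details from the block used in part 2 (it assumes f1 is filled with a fixed ten character string):

$ psql -d test <<'SQL'
drop table if exists mytable;
create table mytable ( id int not null, f1 varchar(30) );
alter table mytable add constraint pk_mytable primary key ( id );
do $$
begin
  for i in 1..1000 loop
    insert into mytable ( id, f1 ) values ( i, 'XXXXXXXXXX' );
  end loop;
end $$;
SQL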

If we look at the first page of mytable, we see that it is entirely filled:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/9BC4120 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

Actually, there is upper-lower = 792-764 = 28 bytes of free space, but look at the heap tuple length:

test=# select lp_len from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp_len
--------
     39

So the row length is 39 bytes, which does not fit in the 28 free bytes: page 0 of mytable is indeed entirely filled.
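
To get this picture for all pages of the table at once, the free space (upper minus lower) can be listed per page. The query below is just a sketch that combines generate_series with page_header and assumes the default 8kB block size:

$ psql -d test -c "
select blkno, upper - lower as free_bytes
  from generate_series(0, (pg_relation_size('mytable')/8192)::int - 1) as blkno,
       lateral page_header(get_raw_page('mytable', blkno))
 order by blkno;"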

Now let’s update f1 of the first row (id=1) to something else. As we know, this will create a new version of the row, which cannot fit in the first block, where the row with id=1 is located:

test=# update mytable set f1='ZZZZZZZZZZ' where id=1;
UPDATE 1

Let’s look at the first row to see how postgres handled that:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1835 |   1836 |        0 | (5,76) |           2 |        258 |     24 |        |       | \x010000001758585858585858585858

I think the update is largely as expected: xmax has a transaction ID set, indicating a new version exists. However, t_ctid is set to (5,76); in other words, the new version is created in page 5 as tuple ID 76, outside of the page where the original tuple version exists. If you look at the metadata of the new tuple:

test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
 76 |   5152 |        1 |     39 |   1836 |      0 |        0 | (5,76) |           2 |      10242 |     24 |        |       | \x01000000175a5a5a5a5a5a5a5a5a5a

You see a normal new version, and no bits are set in t_infomask2 indicating a heap only tuple (hot). This means this update did change the index:

test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (5,76)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00

It seems weird that two index entries exist for id=1 (item offsets 2 and 3) in a primary key index. If you think about it a little longer, it makes perfect sense: there is no version consistency information in the index, and currently two versions of the heap tuple with id=1 exist in the heap/table, so it’s logical that the index points to both of them. Let’s see what happens when the table is vacuumed:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
 76 |   5152 |        1 |     39 |   1836 |      0 |        0 | (5,76) |           2 |      10498 |     24 |        |       | \x01000000175a5a5a5a5a5a5a5a5a5a
test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (5,76)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00

After vacuuming, the old version of the row is emptied, and the index entry for the old version is deleted.

There is an additional thing inherent to how postgres handles DML on any tuple (version) that needs to be taken into account: what is commonly called the tuple ‘hint bits’. The hint bits are bits in t_infomask in the tuple header (let me emphasise: the header of every single tuple!). During DML on a tuple, the transaction obviously is still ongoing, and as such the xmax commit-status bits of a deleted or old version of a tuple are left unset, as are the xmin commit-status bits of the new version or inserted tuple. This is how a DML operation leaves the tuples it worked on.

The first process that visits a tuple after a DML operation reads t_xmin and t_xmax to understand which version of the tuple should be visible. However, while reading the metadata, if a tuple does not have the transaction status set for t_xmax (HEAP_XMAX_COMMITTED or HEAP_XMAX_INVALID) or for t_xmin (HEAP_XMIN_COMMITTED or HEAP_XMIN_INVALID), the process must find out whether the transaction is still ongoing, rolled back or committed, and set the t_infomask bits accordingly.

In order to do this, the process needs to read pg_clog and potentially pg_subtrans. Reading pg_clog/pg_subtrans means extra reads, and setting the bits means a change to the page, which needs to be recorded in the WAL. This means a select (read only) query can lead to writing to the WAL file! It should be noted that after this more expensive first read, which sets the hint bits for a tuple, subsequent readers can take advantage of that work and do not need to look up the transaction state of that tuple again, unless a transaction performs DML on it again. Running vacuum on a table (whether by autovacuum or explicitly) will set the hint bits too.
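
You can watch the hint bits appear with the tools already used in this post. The sketch below uses a throwaway table (hintdemo is just a name chosen for this example) and relies on HEAP_XMIN_COMMITTED being the 256 (0x0100) bit of t_infomask, which matches the jump from 2050 to 2306 seen in the vacuum output earlier:

$ psql -d test -c "drop table if exists hintdemo; create table hintdemo ( id int );"
$ psql -d test -c "insert into hintdemo values ( 1 );"
# directly after the insert, the xmin commit-status hint bit is not set yet
$ psql -d test -c "select t_infomask, (t_infomask & 256) > 0 as xmin_committed from heap_page_items(get_raw_page('hintdemo',0));"
# any query that visits the tuple looks up the transaction state in pg_clog and sets the hint bit
$ psql -d test -c "select count(*) from hintdemo;"
$ psql -d test -c "select t_infomask, (t_infomask & 256) > 0 as xmin_committed from heap_page_items(get_raw_page('hintdemo',0));"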

Conclusion
This blogpost looks at updates to tuples. In postgres, an update means the previous version gets xmax set, and a new version of the tuple is created with xmin set to the xmax of the old version. If the update only touches non-indexed fields and the new tuple can be allocated in the same page, the new tuple can be ‘HOT’, a heap only tuple. This way no modification needs to be made to the index, which is an optimisation. However, when the changed tuple is requested via the index, the index points to the tuple ID of the old version, which contains the tuple ID of the updated version of that tuple. Of course the updated version can itself be updated again and point to an even newer version.

When a HOT updated table is vacuumed, the index entry of a HOT updated tuple remains untouched, and the tuple ID in the table to which the index points becomes a dedicated redirect pointer to the current version (no matter how many HOT row versions existed prior to vacuuming), without space allocated in the data area. The space of the line pointer entries that are cleaned by vacuum is compacted, meaning that all active data entries are defragmented and packed towards the end of the page.

If an indexed field is updated, a new version is created and an index entry is created for the new value. Because the old version still needs to be available for read consistency, the old index entry is not deleted. If the table is vacuumed, the old version is removed from the table, and the old index entry is deleted.

Much like an indexed field update, if an update causes the new version to be created in another page than the original version of the tuple, the update is not HOT, and the index is modified: an entry for the new version is added.

Any DML on a row leaves the so-called ‘hint bits’ unset; of course only the hint bits of xmin or xmax that are relevant to that transaction. The hint bits describe the commit state of xmin and xmax, and need to be set by the next process that touches the heap tuple. Setting the hint bits requires the process to read pg_clog and potentially pg_subtrans to find out the state of the transaction; the change to the page that results from setting the bits requires WAL logging.
