Archive

Postgresql

This post is about an issue I ran into when trying to use pgbouncer in front of Yugabyte YSQL, the postgres layer yugabyte reuses for handling relational data.

pgbouncer: https://www.pgbouncer.org

yugabyte: https://www.yugabyte.com

postgres: https://www.postgresql.org

The case for pgbouncer.

Pgbouncer is a connectionpool for postgres databases. It’s a process sitting in between the client and the actual postgres server, and separates clients from a postgres server connection by keeping a pool of connections from the clients (the socalled clients pool), and keeping a pool of connections to the postgres server (the socalled servers pool).

A postgres database connection is forked from the postmaster as an independent process when a client connects, and executes the queries on the client’s behalf. Creating a process is a timetaking and “costly” action for a computer system. The most ideal way of using postgres is to logon once with all the needed connections, and use and reuse these connections.

Sadly, a lot of database clients do not take using the database in the most ideal way into account, and perform database unfriendly connection management, such as creating and dropping connections really quick or creating an high amount of connections, or both.

pgbouncer helps in such cases with the aforementioned separation of clients and servers, and linking a client with a server connection based on the set pool mode, which can be session (a client uses a database connection for the duration of the session until disconnect), transaction (a client uses a database connection for the duration of a transaction) or statement (a client uses a database connection for each query).

The default setting for pool mode is session, based on testing it seems pgbouncer functions most optimal when using transaction mode, so (varying/huge numbers of) client connections can use a lower amount of (dedicated/steady) database connections.

The encountered issue.

I used a default Centos 8.3 image I created myself using packer, using vagrant and virtualbox. This means a lot of the machine configuration is default, using packer means the image is created using scripts. In other words: there was no configuration done by myself or someone else without me knowing it.

The server was functioning correctly, and added pgbouncer (pgbouncer is a package in EPEL), configured it and ran it. I configured pgbouncer to allow 1000 client connections (max_client_conn), and create 100 database connections (default_pool_size,min_pool_size). The way pgbouncer works is that the connections are not created when pgbouncer starts, but per database when a client requests the database from pgbouncer for the first time. I manually connected through pgbouncer and it functioned correctly. So far, so good.

Then I tested it with a synthetic load:

for n in $(1 200) do; ysqlsh -h localhost -p 6432 -c "select pg_sleep(60);" & done

What should happen is that all 200 connections get connected to pgbouncer, and pgbouncer creates 100 connections, to which 100 clients get connected and execute, and 100 clients should remain waiting to be serviced.

However, I noticed that there were only a couple of connections created to the database server, something like 2 or 3, and the amount was slowly increasing. It also lead to pgbouncer eventually killing the connections, because query_wait_timeout is set to 120 (seconds) by default, and with the low number of connections, some sessions do exceed that time waiting and thus are stopped by pgbouncer.

The investigation.

The first thing I did was google the issue. No hits that I could find that actually described my issue.

To validate why I couldn’t find this issue, I setup a machine with vanilla postgres and added pgbouncer to it with the same configuration and performed the same test. In that situation pgbouncer build the server connections really quick, and everything worked as I wanted it to work, meaning that pgbouncer connected all the client connections, and executed the queries using the statically set number of connections to the postgres server.

At that point I started thinking the modifications yugabyte made to the postgres layer might have an impact on this.

But it’s clear, I need additional information to understand where I needed to look.

Looking more deeply into the configuration of pgbouncer I found the option “verbose”, and set it to 5 to get more information, and ran the test against the yugabyte ysql server again. I found the following messages in the pgbouncer log:

2021-04-08 16:59:49.839 UTC [22773] DEBUG sbuf_after_connect_check: pending error: Connection refused

2021-04-08 16:59:49.839 UTC [22773] LOG S-0x563bd3657a80: yugabyte/yugabyte@[::1]:5433 closing because: connect failed (age=0s)

2021-04-08 16:59:49.839 UTC [22773] NOISE safe_close(35) = 0

2021-04-08 16:59:50.174 UTC [22773] DEBUG launching new connection to satisfy min_pool_size

2021-04-08 16:59:50.174 UTC [22773] DEBUG launch_new_connection: last failed, not launching new connection yet, still waiting 14 s

2021-04-08 16:59:50.505 UTC [22773] DEBUG launching new connection to satisfy min_pool_size

2021-04-08 16:59:50.505 UTC [22773] DEBUG launch_new_connection: last failed, not launching new connection yet, still waiting 14 s

2021-04-08 16:59:50.841 UTC [22773] DEBUG launching new connection to satisfy min_pool_size

2021-04-08 16:59:50.841 UTC [22773] DEBUG launch_new_connection: last failed, not launching new connection yet, still waiting 13 s

So pgbouncer encountered a connection refused error, and therefore closes/stops the attempt to add a connection, and then waits for 15 seconds before trying again. Why would it encounter a connection refused, which it didn’t encounter with vanilla postgres?

To be more practical, there also is a setting that allows tweaking the back-off time for the connection refused issue: server_login_retry. When I set it to 0, I still do get the error, but pgbouncer then builds up the server connections.

But it’s not very satisfactory to have this rather blunt workaround. I would like get the issue solved!

Looking deeper.

I decided I need to get this solved. Because this is a connection refused error, it’s logical to look at networking. A really good tool for this case is tcpdump. tcpdump allows (a superuser/root) to capture network traffic and visualise it. For the investigation, I setup the server to have no other connections to the database, pgbouncer freshly started so no connection are built up yet, and then connect to pgbouncer to trigger it to build the connections.

First regular postgres:

vagrant@localhost ~]$ sudo tcpdump -ttt -i lo -n port 5432

dropped privs to tcpdump

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [S], seq 952020725, win 43690, options [mss 65495,sackOK,TS val 3901196965 ecr 0,nop,wscale 6], length 0

IP 127.0.0.1.postgres > 127.0.0.1.48366: Flags [S.], seq 3702813748, ack 952020726, win 43690, options [mss 65495,sackOK,TS val 3901196965 ecr 3901196965,nop,wscale 6], length 0

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [.], ack 1, win 683, options [nop,nop,TS val 3901196965 ecr 3901196965], length 0

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [P.], seq 1:9, ack 1, win 683, options [nop,nop,TS val 3901197012 ecr 3901196965], length 8

IP 127.0.0.1.postgres > 127.0.0.1.48366: Flags [.], ack 9, win 683, options [nop,nop,TS val 3901197012 ecr 3901197012], length 0

IP 127.0.0.1.postgres > 127.0.0.1.48366: Flags [P.], seq 1:2, ack 9, win 683, options [nop,nop,TS val 3901197012 ecr 3901197012], length 1

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [.], ack 2, win 683, options [nop,nop,TS val 3901197012 ecr 3901197012], length 0

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [P.], seq 9:98, ack 2, win 683, options [nop,nop,TS val 3901197012 ecr 3901197012], length 89

IP 127.0.0.1.postgres > 127.0.0.1.48366: Flags [P.], seq 2:15, ack 98, win 683, options [nop,nop,TS val 3901197013 ecr 3901197012], length 13

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [P.], seq 98:139, ack 15, win 683, options [nop,nop,TS val 3901197013 ecr 3901197013], length 41

IP 127.0.0.1.postgres > 127.0.0.1.48366: Flags [P.], seq 15:342, ack 139, win 683, options [nop,nop,TS val 3901197014 ecr 3901197013], length 327

IP 127.0.0.1.48366 > 127.0.0.1.postgres: Flags [.], ack 342, win 700, options [nop,nop,TS val 3901197053 ecr 3901197014], length 0

In essence there isn’t much noteworthy to see. The first 3 packets are the TCP 3-way handshake, after which we see pgbouncer and postgres getting the connection ready at the other layers.

Now yugabyte:

[vagrant@centos83-yb-1 ~]$ sudo tcpdump -ttt -i lo -n port 5433

dropped privs to tcpdump

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes

IP 127.0.0.1.37500 > 127.0.0.1.pyrrho: Flags [S], seq 2745440791, win 43690, options [mss 65495,sackOK,TS val 1151235630 ecr 0,nop,wscale 7], length 0

IP 127.0.0.1.pyrrho > 127.0.0.1.37500: Flags [S.], seq 2754397962, ack 2745440792, win 43690, options [mss 65495,sackOK,TS val 1151235630 ecr 1151235630,nop,wscale 7], length 0

IP 127.0.0.1.37500 > 127.0.0.1.pyrrho: Flags [.], ack 1, win 342, options [nop,nop,TS val 1151235630 ecr 1151235630], length 0

IP 127.0.0.1.37500 > 127.0.0.1.pyrrho: Flags [P.], seq 1:42, ack 1, win 342, options [nop,nop,TS val 1151235630 ecr 1151235630], length 41

IP 127.0.0.1.pyrrho > 127.0.0.1.37500: Flags [.], ack 42, win 342, options [nop,nop,TS val 1151235630 ecr 1151235630], length 0

IP 127.0.0.1.pyrrho > 127.0.0.1.37500: Flags [P.], seq 1:332, ack 42, win 342, options [nop,nop,TS val 1151235755 ecr 1151235630], length 331

IP 127.0.0.1.37500 > 127.0.0.1.pyrrho: Flags [.], ack 332, win 350, options [nop,nop,TS val 1151235756 ecr 1151235755], length 0

IP6 ::1.43518 > ::1.pyrrho: Flags [S], seq 2042629123, win 43690, options [mss 65476,sackOK,TS val 2150324005 ecr 0,nop,wscale 7], length 0

IP6 ::1.pyrrho > ::1.43518: Flags [R.], seq 0, ack 2042629124, win 0, length 0

Yugabyte works at port 5433, which tcpdump translates as ‘pyrrho’. Here too the 3-way handshake is visible. And then pgbouncer and yugabyte chatting, until…a packet is sent using IP6 ?!? The packet is sent to the ‘pyrrho’ port, so it means the client (pgbouncer) must be doing this. Quite correctly the server responds to the IP6 packet by telling it the port is not in use (‘R’: reset). That actually fits with what the pgbouncer logging told us.

The solution.

At his point I knew somehow IPv6 (IP6) got involved. It probably has to do with name resolving, because that fits a scenario where something might “spontaneously” resolve as IPv6. And since it’s localhost, it’s not logical this is DNS, this must be something that is local on the machine. Well, if it’s name to address resolving and it can’t be DNS, the most logical thing would be the /etc/hosts file!

Apparently, with a Centos 8 standard installation, localhost is defined in the following way in /etc/hosts:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost4.localdomain6

And because ‘::1’, the IPv6 address, is the second one, that is the one localhost will resolve to:

ping -c 1 localhost
PING localhost(localhost (::1)) 56 data bytes
64 bytes from localhost (::1): icmp_seq=1 ttl=64 time=0.024 ms

Well, then the solution to fully solve this would be really simple: remove the IPv6 / ::1 line from /etc/hosts. I did that, and indeed, there is no ‘connection refused’ anymore! Problem solved!

I am excited to announce that I have accepted a role at Yugabyte as ‘developer advocate’. It’s not easy to leave the talented group of people that is Enkitec, of which many of them I call friends. In fact, the change is much bigger since Yugabyte is a database outside of the Oracle ecosystem, in which ecosystem there also are a lot of people that I have gotten to known and call friends.

So why the change? In essence, my reasons are the same as the ones Bryn Llewellyn mentioned in his blogpost about his move, although I come from a different background being a consultant and not working for Oracle.

Open source

You could argue that having spend investigating the inner working of the Oracle database for 20 or so years means I am discarding all this knowledge I build up. I don’t believe that’s true. Yes, I have build up a lot of knowledge about the Oracle database works, and parts of that knowledge might be known (usable?) to a small group. However, I am looking forward to working with software for which I can actually see how it’s made, and I don’t have to perform extensive research to figure out how it’s “probably” implemented. This is an important reason for me personally. In general I think there are much more reasons for companies for wanting to use open source software instead of proprietary, closed source software.

Distributed architecture & distributed SQL

The area I have spend a significant amount of time in, is assessing and investigating performance of the Oracle database. Oracle, like many traditional monolithic databases, use an architecture where the database itself is actually a single set of files containing the data, which is singular, and are served by processes that perform validation of requests to use that data, and perform performance features like caching so data can be used from memory, which means it doesn’t need to be re-read from the files so it can be served at memory latency. This architecture imposes limits on the scale of use, for which common solutions to overcome these limits with monolithic architectures are:
– Sharing the (still singular) datafiles with multiple machines (RAC).
– Make a copy of the data files, and spool the changes to the copy (DG/OGG/replication).
– (Manually) partition the data over multiple independent databases (sharding).
– Storing data redundantly over the available (still local) disks (RAID).
I am not saying these solutions are wrong. In fact, many of them are very cleverly architected and optimized over the years with the different architectures of databases and work really well.
However, there are a consequences that are inherently imposed by a monolithic architecture, such as scalability, flexibility and failure resistance.

The Yugabyte database provides a cloud ready, cloud vendor agnostic, failure resistant, linear scaling database (sorry for the abundance of hype words). Failure resistance is provided by keeping multiple copies of the data, which is configurable as ‘replication factor’. By keeping copies of the data, failure of a single Yugabyte node (called a ‘tablet server’) can be overcome. By spreading these copies of data over groups of tablet servers in cloud availability zones, the outage of an entire availability zone can be survived. When such a disaster strikes, the cluster will survive and provide normal (read and write functions) as long as 2/3rd of the nodes survive (in case of a replication factor of 3), without the need for any management to accommodate failover, like instantiating a replica/standby.

Each Yugabyte tablet server also serves as an API for data retrieval and manipulation, so with the addition of tablet servers, the processing power increases. A tablet server should use local, non-protected disks to take advantage of low latency, and by adding more tablet servers the work can be spread out further to these, increasing the amount of work that can be done. Aside from increasing processing for API-side processing, Yugabyte provides functionality which pushes down the work to the storage, which Exadata specialists will recognize as ‘smart scans’, alongside things like predicate pushdown and bloom filters.

Yugabyte has not invented yet another SQL and/or no-SQL dialect. Another point that I think is really strong is that Yugabyte reuses Postgresql as the SQL API. The Postgres API is “wire compatible” with Postgres, which means that any product that can talk to a Postgres database can talk to Yugabyte. In fact, instead of writing a Postgres layer, Yugabyte has re-used the Postgres source to use the query layers of postgres, and then provide the storage for postgres using Yugabyte. This also means a lot of the server side functionality of Postgres, such as PL/pgSQL can be used. Needless to say there are limitations inherent to the postgres connection with the Yugabyte storage layer, but most common postgres database functionality is available.

But that is not all! Aside and in parallel to Yugabyte’s Postgres API, it also provides a “NoSQL” APIs, which are compatible with Apache Cassandra and with Redis. Oh, and data storage is done via LSM (log-structured merge-tree), etc.

Conclusion

I hope at this point you can see that Yugabyte provides a fresh, modern approach to a database that provides a lot of advantages over more traditional, monolithic databases, for which I think a lot of cases that I witnessed over the years could significantly benefit from (performance-wise and availability-wise). I am really thrilled to be on the basis of building that further.

Also, a lot of technical reasons that I described are really just a summary, if you have gotten interested or want to learn more, I would urge you (or challenge you :-D) to have a look at https://www.yugabyte.com or read more at https://docs.yugabyte.com. Or if you want to see the code, head over to https://github.com/yugabyte/yugabyte-db!

This post is about a fully documented but easy to miss feature of vagrant, which is the evaluation of the Vagrantfiles when a vagrant ‘box’ is started. As you can see, I wrote ‘Vagrantfiles’, which is plural. I (very naively) thought that the Vagrantfile in the directory where you want to start the ‘box’ determines the specifics of the vagrant ‘box’.

This begins with me trying to create a vagrant virtualbox VM using packer, where I specify a Vagrantfile during the packer build. I normally don’t spend a lot of thought on that, and just put in a very simple Vagrantfile, essentially just defining the box and setting memory and cpu amounts.

In this case, I decided to put in the Vagrantfile that I wanted to use for this special purpose ‘box’ as the Vagrantfile for the packer build, which has some ruby in it which adds a disk:

data1_disk = "data1.vdi"
if !File.exist?(data1_disk)
  vb.customize [ 'createhd', '--filename', data1_disk, '--size', 20480 ]
end
vb.customize [ 'storageattach', :id, '--storagectl', 'SATA Controller', '--port', 2, '--device', 0, '--type', 'hdd', '--medium', data1_disk ]

My (wrong) assumption was that this Vagrantfile was kept as a validation or template.

After the box was created, I tested running it, and therefore I copied the Vagrantfile including the above addition of checking for a file, if it doesn’t exist creating it, and then attaching the disk to the VM. However, this did fail with an error in the provisioning stage that a disk was already added, and therefore could not be added again.

I didn’t understand the error when I first encountered it, so I investigated the issue by removing the box (vagrant destroy), and commenting out the disk addition code (the above 5 lines) from the Vagrantfile, and then run the box addition again (vagrant up). Much to my surprise, it did add the disk for which I explicitly removed the code that provided that functionality.

This severely puzzled me.

After going over the vagrant documentation and searching for the error messages, I found out that the Vagrantfile embedded with the ‘box’ is actually used, and parsed, along with the user specified Vagrantfile, and the contents of both are merged before a vagrant ‘box’ is started. This Vagrantfile can be seen in the directory which holds the zipped boxes after downloading from app.vagrantup.com. On my Mac this is ‘~/.vagrant.d/boxes/{box name}/{version}/{provider}/Vagrantfile’.

This perfectly explained what I witnessed: because the disk creation steps are in the ’embedded’ Vagrantfile, it will simply be executed, even if it’s not in the normal Vagrantfile. And of course it threw an error when the disk creation steps were added to the normal Vagrantfile, because then these steps were executed twice, which would show at the vb.customize ‘storageattach’ execution, because that is not protected by a check, so the second occurrence would be tried and fails.

This was really my ignorance, and maybe a bit the documentation not being overly verbose about the existence of more than one Vagrantfile. It also gives great opportunities: lots of steps or logic that would normally be in the Vagrantfile could be put in the embedded Vagrantfile so that regular Vagrantfile can be kept really simple.

Conclusion is that if you create vagrant boxes yourself, when you want to perform provisioning steps against a vagrant box that simply need to be done, you might as well put them in the embedded Vagrantfile.

Hopefully I got your interest by the weird name of this blogpost. This blogpost is about sensible usage of an Oracle database. Probably, there are a lot of blog posts like this, but I need to get this off my chest.

A quote any starwars fan would recognise is ‘I sense a disturbance in the force’. I do, and I have felt it for a long time. This disturbance is the usage of the number of connections for a database. Because of my profession, this is the oracle database, but this really applies to the server-side of any client/server server processor running on at least (but probably not limited to) intel Xeon processors.

The disturbance is the oversubscription or sometimes even excessive oversubscription of database connections from application servers, or any other means of database processes that acts as clients. This means this does not exclude parallel query processes, in other words: this applies to parallel query processes too.

What is oversubscription of database connections? Any good consultant would be able to tell you this: it depends. In principle, oversubscription means more subscribers than a system can handle. This means ‘oversubscription’ is a multidimensional beast, which can apply to CPU, memory, disk IO, network IO. That makes it hard.

This blogpost is about CPU oversubscription. The way a modern CPU, Intel Xeon in this case, works is not simple, and thus this will and cannot be an exhaustive description. What I want to try is provide a basic, simplistic description to provide an idea and give guidance.

An Intel Xeon CPU provides a number of processing units, called ‘processor’ in /proc/cpuinfo, called ‘cpu’ in the top utility, etc, which are the execution contexts for processes. These execution contexts can be hyperthreads, which for Intel Xeon are two threads per core, or an execution context for a single core. Not all Xeon CPUs provide hyperthreading, and hyperthreading, if available in the CPU, can be disabled in a system’s BIOS. A hyperthread can not, and does not do any processing, that is what a core does.

So why do hyperthreads exist then? I think two prominent reasons are:
1. Not all processes are active all the time. By increasing the number of execution contexts, the switching for processes between execution contexts is reduced, which means a reduction of time spend on context switches.
2. Cores are incredibly powerful, and given that probably not all processes are active all the time, combining two processes on the same core will be “reasonably unnoticeable”.

The next question which then comes to mind is: is hyperthreading a good thing for a database? I cannot give an exhaustive answer for this, partially because there is massive difference between system usage for different types of database usage. I would say that with reasonable usage, hyperthreading in modern Intel Xeon CPUs does provide more benefits, reduced context switching, than it gives downsides, like variances in latency of CPU usage.

This means that for looking at the CPU processing power of a database server the ‘actual’ processing power sits logically between the number of core’s and the number of threads. But wait! Didn’t I just say that threads don’t process, only core’s do? Yes, but let me explain how I look at this: if your processes do processing requiring ACTUAL, ON CPU processing, it means they depend on the core to be able to handle this, versus processes that work by running into waiting really quickly, like doing disk IO, network IO or waiting for users to make them active, which still might appear as running all the time, but in reality are actually processing on the core occasionally.

The first type, doing the actual, on cpu processing should calculate CPU power more towards core count, and the second type, doing lots of stalls, should calculate CPU more towards thread count. Overall, this is quite simply about using a core as efficient as possible.

Now that we gone through CPUs and their cores and threads, and oversubscription in general, the next question is: so how much processes should be allocated on a database server?

The answer is simple: if you want a high performing database for your application servers, the number of processes IN TOTAL should not exceed a number sitting somewhere between CPU core count and CPU thread count.

“Isn’t that incredibly low?”

Yes, for most of the deployments that I see this would be shockingly low. But just because the number of processes is set very high somewhere doesn’t make it right. It just means it’s set that way.

“But why is it set too high everywhere?”

I don’t know. I don’t understand why lots and lots of people do allocate high, up to sometimes ASTRONOMICAL numbers of database processes, and then expect that to be the best tuned way, while there is NO LOGICAL EXPLANATION that I can see for this to make sense. In fact: the explanation why this doesn’t make sense is this blogpost.

To make the comparison with a supermarket and the number of tills: if you go shopping in a supermarket and want to pay and leave as soon as possible, there should be a ready, idle till available, or else you have to wait. For Intel Xeon hyperthreading, you could make the comparison with a till that serves two lanes with persons that want to pay at the same time, because it takes time to put all the items from the shopping basket onto the desk, and the more time that takes, the more efficient a till serving two lanes would be (an Intel Xeon CPU can actually serve two threads at the same time, optimising runtime on the single core).

“Okay, but the majority of the processes is not actually doing anything.”

Well, if the processes are actually not doing anything, why have them in the first place? If that is really true, it doesn’t make sense to have them. And don’t forget: what looks like an idle connection from both an application and a database perspective still is an actual live, running operating system process, and a running database process that has memory allocated, occupies a network socket, has breakable parse locks out, etc, and requires CPU time to maintain the connection.

In fact, by having huge numbers of database connections, you have setup the application to be able to cause “the perfect storm” on the database!

With this, I mean that what I normally see, is that indeed the majority of the database connections are not used. However… if things get worse, and the database gets active and starts lacking CPU processing power, more database connections get active. That is logical, right? The database connections that normally would be active will take longer time because of the increased activity, so with a constant amount of work, new work cannot use an existing connection because that is still active, and thus take another connection that normally would sit idle. However, serving more connections will increase the amount of CPU required even further, which was already lacking, so the waiting time increases further. Now because the waiting time gets higher, more connections are needed, etc.

And then I didn’t talk about dynamically increasing connection pools!

What I mean with that is that I until now talked about STATIC connection pools. Static means the minimal number of connections is the maximal number of connections in the pool. A dynamic connection pool will have a certain amount of connections, and when there is a need for more, which means all the connections are busy, add more connections.

Especially with Oracle, this is really a bad idea. Let me explain. Outside of too much connections in the first place, which is a bad idea already, having an expanding connection pool means not only idle connections are put to work, but instead the database is given EVEN MORE work by initialising new connections. An oracle database connection is not lightweight, it requires initialising memory, which is an expensive operation. And the whole reason the connection is created is because the connection pool established all the connections were busy, which almost certainly is because the database was busy (!!!!).

I hope a lot of people will make it to the end, and then realise that high numbers of connections does not make any sense. If you do have an explanation that makes sense, please comment. Please mind that a tuned setup requires an application server to be reasonably setup too, you cannot have one part setup for ultimate processing power, and another part be just a shipwreck.

This post is about how to make your log files being aggregated in a single place and easy searchable via a convenient web interface.

You might think: wait a minute; doesn’t this exist already? Well, yes and no. Let me explain.
a) traditional log management: traditionally, logs are/were searched for certain strings, typically error messages or parts of error messages and a shell script run by cron, and when matched, a system that was setup to be able to email would send an indicator that the aforementioned string was found.
b) log management via a monitoring tool: the next step was the individual script on each server was exchanged for a monitoring tool, which performed the same task as the shell script. In reality quite often a default monitoring set/template was enabled, instead of the specific strings that were searched for with the shell script. Sometimes this was an improvement, sometimes this meant the specific issues (and thus messages) gone invisible. This is still the way the monitoring works in 90% of the cases, including a general completely standard monitoring template. At least in my experience.
c) other log gathering and indexing: there are many more products that perform this function. The first one that comes to my mind is splunk, and all the options for doing this cloud based (many there!), and a lot of tools based on elasticsearch, like the “ELK stack”.

I think it’s clear “a)” and “b)” are not dynamic and in fact very static. My major concern is it doesn’t allow exploratory investigation, it simply is a warning that is raised, any investigation means you have to log on and start browsing the available information locally. Everybody who worked with HP Openview or Oracle Enterprise Manager will recognise this. Yes, it’s probably all possible with these tools, but it never (ever(!)) is implemented.

For the last category, the likes of splunk, the ELK stack and the cloud based log tools: for the first two I feel definitely serves a function, but it’s aimed at aggregating logs of multiple servers, and is simply too much to setup on a server alongside the processes it is meant to monitor. For the cloud based tools: it might be my conservatism, but getting a subscription and loading up logging feels awkward, especially if it’s for my own test purposes.

This is where Loki comes in. Finally, I would say, there is a tool that can function on small scale (!) and perform log aggegration and provide a searchable database without a huge setup. This is ideal for your own test or development box to able to discover what is going on, and have all the log files at your fingertips without endlessly going through the filesystem performing cd, tail, grep, wc, sort, uniq, et cetera. I think lots of people recognise travelling from log file to log file.

Loki gives you a database that orders log entries based on time, and Grafana provides a web based UI to view and query the loki database. This is how it looks like:

(click to enlarge)
This is an example, what this shows is my test machine, where I decided to see when linux started, as well when the Oracle database instance was started.
* The first query uses the label “rdbms_alert” (indicating it came from the text-based oracle alert.log file, this is a label I added to it), and within the log lines with the label job with value “rdbms_alert”, I added a filter for the string “Starting ORACLE”, which indicates an Oracle database instance start.
* The second query uses the label “messages” (indicating it came from the linux /var/log/messages file, this label is mine too), and within the log lines with the label job with the value “messages”, I added a filter for the string “kernel: Command line”, which indicates linux startup. I additionally added a negative filter for “loki”, because loki logs the queries to the messages file, which I don’t want to see.

I hope you can see the power of having all the logs in a single place, and completely searchable.

This is just a start, this is a very simple proof-of-concept setup, for example the date/time in the log lines is not used, the date/time of the log lines is when it was ingested into loki, it is possible to have loki interpret these.

If you are interested, but are uncertain if this is for you, and would like to test this: I got a couple of Ansible scripts that can setup the combination of:
* promtail (the default loki log streaming tool)
* loki (the database)
* Grafana (the web UI)
The scripts are created on Oracle Linux 7.8.

Install git, and clone the https://gitlab.com/FritsHoogland/loki-setup.git repository using a user that has sudo rights:

git clone https://gitlab.com/FritsHoogland/loki-setup.git

Install ansible (you might have to install the EPEL repository).

In the loki-setup repo you find the setup scripts for loki, promtail and grafana.
You can execute the scripts in that order (loki, promtail and Grafana) by executing the setup_loki.yml, setup_promtail,yml and setup_grafana.yml scripts.
IMPORTANT: do proofread the scripts, and validate the variables for your situation. Don’t worry: the scripts are easy to read.

After installing, you can go to http://yourmachine:3000, login with admin and the password you set in the setup_grafana.yml script, and click on the compass (explore) option, and you can perform your own log browsing.

If you decide you want to remove it, I got a remove script for each of the components, which will remove the specific component entirely. Same applies here too: validate the script.

This post is about one of the fundamentally important properties of a database: how IO is done. The test case I studied is doing a simple full table scan of a single large table. In both Oracle and postgres the table doesn’t have any indexes or constraints, which is not a realistic example, but this doesn’t change the principal topic of the study: doing a table scan.

I used a publicly available dataset from the US bureau of transportation statistics called FAF4.5.1_database.zip
The zipped file is 347MB, unzipped size 1.7GB.

In both cases Oracle Linux 7.7 (64 bit) is used, running in VirtualBox, with the storage being a USB3 SSD. Number of CPUs is 4, memory size is 6G. Filesystem type: xfs.
The Oracle version used is Oracle 19.5, the Postgresql version used is 12.1.
For Postgresql, the postgresql.conf file is not changed, except for max_parallel_workers_per_gather which is set to 0 to make postgres use a single process.
For Oracle, the parameters that I think are important: filesystemio_options=’setall’. Oracle is used filesystem based (so no ASM).

This is the table definition for Oracle:

create table faf451 (
  fr_origin varchar2(3),
  dms_orig varchar2(3),
  dms_dest varchar2(3),
  fr_dest varchar2(3),
  fr_inmode varchar2(1),
  dms_mode varchar2(1),
  fr_outmode varchar2(1),
  sctg2 varchar2(2),
  trade_type varchar2(1),
  tons number,
  value number,
  tmiles number,
  curval number,
  wgt_dist number,
  year varchar2(4)
);

This is the table definition for Postgresql:

create table faf451 (
  fr_origin varchar(3),
  dms_orig varchar(3),
  dms_dest varchar(3),
  fr_dest varchar(3),
  fr_inmode varchar(1),
  dms_mode varchar(1),
  fr_outmode varchar(1),
  sctg2 varchar(2),
  trade_type varchar(1),
  tons double precision,
  value double precision,
  tmiles double precision,
  curval double precision,
  wgt_dist double precision,
  year varchar(4)
);

In order for the data to be easy loadable into postgres using copy from, I had to remove ‘””‘ (double double quotes) for the empty numeric fields. In oracle I could say “optionally enclosed by ‘”‘”. For Oracle I used an external table definition to load the data.

Now, before doing any benchmarks, I have an idea where this is going. Oracle is using direct IO (DIO) so linux page cache management and “double buffering” are avoided. Also, oracle will be doing asynchronous IO (AIO), which means submitting is separated from waiting for the notification that the submitted IOs are ready, and on top of that oracle will submit multiple IO requests at the same time. And again on top of that, oracle does multi-block IO, which means that instead of requesting each 8K database block individually, it will group adjacent blocks and request for these in one go, up to a size of combined blocks of 1MB, which means it can requests up to 128 8K blocks in one IO. Postgres will request every block synchronous, so 1 8K block at a time, and waiting for each request to finish. That makes me have a strong idea where this is going.

It should be noted that postgres explicitly is depending on the operating system page cache for buffering as a design principle. Because of DIO, blocks that are read by oracle are not cached in the operating system page cache.

I executed my benchmark in the following way:
– A run for every size is executed 5 times.
– At the start of every run for a certain size (so before every “batch” of 5 runs), the page cache is flushed: (echo 3 > /proc/sys/vm/drop_caches).
– Before each individual run, the database cache is flushed (systemctl restart postgresql-12 for postgres, alter system flush buffer_cache for oracle).

I started off with 2G from the dataset, and then simply performed a ‘copy from’ again to load the same dataset into the table in postgres. Oracle required a bit more of work. Oracle was able to save the same data in way less blocks; the size became 1.18G. In order to have both postgres and oracle scan the same amount of data, I calculated roughly how much rows I needed to add to the table to make it 2G, and copied that table to save it as a 2G table, so I could insert that table to increase the size of the test table by 2G. This way in both oracle and postgres I could test with a 2G table and add 2G at a time until I reached 20G.

These are the results. As you can see in the legenda: oracle is orange, postgres is blue.
postgres oracle scan results(click graph to load full picture)

What we see, is that postgres is a bit slower with the first run of 5 for the smaller dataset sizes, which becomes less visible with larger datasets.
Also, postgres is way faster if the dataset fits into the page cache and it has been read into it. This is logical because postgres explicitly uses the page cache as a secondary cache, and the test is the only activity on this server, so it hasn’t been flushed by other activity.

What was totally shocking to me, is postgres is performing alike oracle and both roughly are able to perform at the maximum IO speed of my disk: 300MB/s, especially when the dataset is bigger, alias beyond the page cache size.

It wasn’t shocking that oracle could reach the total bandwidth of the disk: oracle uses all the techniques to optimise IO for bandwidth. But how can postgres do the same, NOT deploying these techniques, reading 8K at a time??

The first thing to check is whether postgres is doing something else than I suspected. This can simply be checked with strace:

poll_wait(3, [{EPOLLIN, {u32=18818136, u64=18818136}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0!select count(*) from faf451"..., 8192, 0, NULL, NULL) = 34
lseek(20, 0, SEEK_END)                  = 335740928
lseek(20, 0, SEEK_END)                  = 335740928
kill(1518, SIGUSR1)                     = 0
pread64(5, "\f\0\0\0\310ILc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846061568) = 8192
pread64(5, "\f\0\0\0HcLc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846069760) = 8192
pread64(5, "\f\0\0\0\260|Lc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846077952) = 8192
pread64(5, "\f\0\0\0000\226Lc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846086144) = 8192
…etc…

The above strace output shows only 4 rows of pread64() calls, but this goes on. So no “secret” optimisation there.

Luckily, my VM has a new enough version of Linux for it to be able to use eBPF, so I can use biosnoop. Biosnoop is a tool to look at IO on one of the lower layers of the linux kernel, the block device interface (hence ‘bio’). This is the biosnoop output:

# /usr/share/bcc/tools/biosnoop
TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
0.000000000    postmaster     4143   sdb     R  66727776  708608     5.51
0.006419000    postmaster     4143   sdb     R  66731720  77824     11.06
0.006497000    postmaster     4143   sdb     R  66734432  786432    11.03
0.011550000    postmaster     4143   sdb     R  66731872  1310720   16.17
0.013470000    postmaster     4143   sdb     R  66729160  1310720   18.86
0.016439000    postmaster     4143   sdb     R  66735968  1310720   14.61
0.019220000    postmaster     4143   sdb     R  66738528  786432    15.20

Wow…so here it’s doing IOs of up to 1MB! So somewhere between postgres itself and the block device, the IOs magically grew to sizes up to 1MB…that’s weird. The only thing that sits between postgres and the block device is the linux kernel, which includes page cache management.

To get an insight into that, I ran ‘perf record -g -p PID’ during the scan, and then perf report to look at the recorded perf data. This is what is I found:

Samples: 21K of event 'cpu-clock', Event count (approx.): 5277000000
  Children      Self  Command     Shared Object       Symbol                                                                  ◆
-   41.84%     3.63%  postmaster  libpthread-2.17.so  [.] __pread_nocancel                                                    ▒
   - 38.20% __pread_nocancel                                                                                                  ▒
      - 38.08% entry_SYSCALL_64_after_hwframe                                                                                 ▒
         - 37.95% do_syscall_64                                                                                               ▒
            - 35.87% sys_pread64                                                                                              ▒
               - 35.51% vfs_read                                                                                              ▒
                  - 35.07% __vfs_read                                                                                         ▒
                     - 34.97% xfs_file_read_iter                                                                              ▒
                        - 34.69% __dta_xfs_file_buffered_aio_read_3293                                                        ▒
                           - 34.32% generic_file_read_iter                                                                    ▒
                              - 21.10% page_cache_async_readahead                                                             ▒
                                 - 21.04% ondemand_readahead                                                                  ▒
                                    - 20.99% __do_page_cache_readahead                                                        ▒
                                       + 14.14% __dta_xfs_vm_readpages_3179                                                   ▒
                                       + 5.07% __page_cache_alloc                                                             ▒
                                       + 0.97% radix_tree_lookup                                                              ▒
                                       + 0.54% blk_finish_plug                                                                ▒

If you look at rows 13-15 you see that the kernel is performing readahead. This is an automatic function in the linux kernel which looks if the requests are sequential of nature, and when that’s true performs readahead, so that the scan is made faster.

This is the third part in a series of blogposts about how postgresql manages data in its blocks (called ‘pages’ in postgres speak). If you found this post and did not read the previous ones, it might be beneficial to read block internals (part 1) and block internals, part 2 first. In these blogposts I’ve shown how heap and index pages look like, and how these can be investigated, including looking at the raw block information.

This blogpost is intended to show the effects on pages when DML happens. This is inherently different from my personal reference of database implementation, which is the oracle database.

If you followed the previous two articles, you have a table mytable with one thousand rows. In order to simplify and be able to grasp the essence of the implications of data changes, drop mytable and recreate the simple situation of mytable with four rows again:

test=# drop table if exists mytable;
DROP TABLE
test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

This should result in the table having four rows in the first page:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

And the second index page having the index tuples of these four rows:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The next thing to do, is change the configuration of postgres to disable auto-vacuum. This should absolutely not be done on any postgres database serving something useful, but in order to see the effects of data changes, and the changes as a result of a vacuum, vacuum should not kick in automatically, because then we can’t predict what we will be looking at. This is governed by the parameter ‘autovacuum’, which is turned on by default:

test=# show autovacuum;
 autovacuum
------------
 on

Turn autovacuum off and reload the configuration:

test=# alter system set autovacuum=off;
ALTER SYSTEM
test=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
test=# show autovacuum;
 autovacuum
------------
 off

Now let’s update id 1 and set f1 to ‘zzzzzzzzzz’:

test=# update mytable set f1 = 'zzzzzzzzzz' where id = 1;
UPDATE 1

Now let’s look at the page contents of the table mytable:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |   1791 |        0 | (0,5)  |       16386 |        258 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1791 |      0 |        0 | (0,5)  |       32770 |      10242 |     24 |        |       | \x01000000177a7a7a7a7a7a7a7a7a7a

This is a good example how read consistency and row versioning works in postgres. The update inserted a new tuple at offset 7992 (lp_off) in the page, and added a pointer to it in the line pointer array (lp 5). Mind the ‘the update inserted a new tuple’ which is fundamental! In the metadata of tuple ID 1, xmax is set to indicate the maximum transaction ID which should be able to see the this version of the tuple. The postgres documentation says ‘delete XID stamp’ for xmax. Tuple ID 5 is created with exactly the same XID for xmin as tuple 1 has for xmax, which is the transaction ID starting from which this tuple should be visible. Obviously, for this tuple xmax is set to 0, indicating no newer version exists. Also, the t_ctid (current tuple ID) for tuple 1 is set to 0,5 (block 0, tuple 5), indicating the next version of that tuple is tuple ID 5. To indicate what happened, infomask2 and infomask have been set.
infomask2 16386 means: heap has two attributes, heap HOT updated.
infomask 258 means: has variable attributes, xmin committed.

The new tuple has the following values for infomask2 and infomask:
infomask2 32770 means: heap has two attributes, is heap only tuple.
infomask 10242 means: has variable attributes, xmax invalid, this is an updated version of the row.
(see the blogpost postgresql block internals, part 2 for a description of the infomask2 and infomask bitfields)

It should be clear now the update caused the active version of the tuple with id = 1 to be tuple ID 5 instead tuple ID 1. Please mind tuple ID 1 is still entirely present (in the line pointer array and in the data part of the page).

How does this looks like in the index?

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

Nothing changed!

Actually, this is exactly what ‘HOT’ (heap only tuple) means to describe in the above descriptions (infomask2, heap HOT updated). A new version of a tuple is created, but it is not covered by any index. This means when table/heap tuple with id=1 is accessed using the index, the index points to the original tuple ID (1) in the heap/table, which points to the new version of the tuple (tuple ID 5). This is an optimisation to lessen the amount of work done when updating a field that is not indexed, at the cost of following an additional pointer.

What happens if the tuple with id = 1 gets updated again?

test=# update mytable set f1 = 'yyyyyyyyyy' where id = 1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |   1791 |        0 | (0,5)  |       16386 |       1282 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1791 |   1792 |        0 | (0,6)  |       49154 |       8450 |     24 |        |       | \x01000000177a7a7a7a7a7a7a7a7a7a
  6 |   7952 |        1 |     39 |   1792 |      0 |        0 | (0,6)  |       32770 |      10242 |     24 |        |       | \x010000001779797979797979797979

infomask 2 for tuple ID 5 now says: heap has two attributes, is heap only tuple and additionally: heap was HOT updated.
infomask for tuple ID 5 now says: heap has variable attributes, xmin committed and additionally: this is an updated version of the tuple.

Another update essentially performs exactly the same transaction on tuple ID 5 as previously done on tuple ID 1: a new tuple is created containing the result of the update (linepointer array entry and data at the end of the page), xmax is set for tuple ID 5 and xmin is set for tuple ID 6 with the same transaction ID, and the current tuple id for tuple ID 5 is set to the new version for that tuple, which is tuple ID 6.

As can be seen by looking at the value for infomask2, tuple ID 6 now has the same value as tuple ID 5 had previously. Just like the tuple ID 5 previously, the new tuple with ID 6 is a ‘hot’ updated tuple too, which means the index was once again left untouched, so the index still points to heap tuple ID 1. This means that a session that uses the index to find the tuple, will be pointed to heap tuple ID 1, which points to tuple ID 5, which points to tuple ID 6 in this case.

It’s obvious that this tuple pointer chasing can easily lead to more work to be done to find a row/tuple, meaning lesser performance. It also is obvious that this construction exists to provide a way to produce an older version of a tuple for read consistency. However, this old version is not useful anymore after some time, which is when no query exists with a transaction ID lower than the value for xmax for the old tuples, and needs to be purged. This is where vacuum comes in. Let’s vacuum mytable, and read the heap/table page again:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     39 |   1792 |      0 |        0 | (0,6)  |       32770 |      10498 |     24 |        |       | \x010000001779797979797979797979

There are a few things vacuum did, and some things it didn’t:
– The line pointer array entries all stayed.
– The first entry points to heap tuple ID 6. lp_flags=2 together with lp_len=0 means the lp_off field points to the active tuple ID. The reason the first row still exists, is because the index entry still points to it.
– The fifth line pointer entry is all zero, lp_flags=0 means ‘unused’.

If you look closely at the line pointer array entries and the tuple data offsets before and after vacuum, you see that the past versions of the tuple data do not exist anymore. In fact, if you look closely to the offset (lp_off) values of the tuples that are stored in the page, you see that ID 2 took the offset value from ID 1 from before vacuuming, indicating it moved all the way to the back of the page. In fact, vacuum compacted the data tuples, maximising free space in the page.

A logical next question is: okay, this is way better than how it was left for read consistency after the two transactions, but it is still not optimal. Can it be fully cleaned? The answer is yes, vacuum can be told to fully clean up, including the removing any non-functional line pointer array entries, and as a logical consequence have all index entries point directly to the heap tuple it ought to point to. This is called ‘vacuum full’.

Why not use vacuum full all the time? The reason is vacuum full requires a table lock for the entire duration of running vacuum. This is not desirable and acceptable in a lot of databases. If normal vacuum is run sufficiently, running vacuum full will not yield much additional benefit in most cases, because when normal vacuum is run sufficiently, the old row versions are cleaned up in time, preventing excessive amounts of old versions in a page. This is important because normal vacuum cleans up within a page, but it does not move tuples across pages, freeing up pages and thus operating system disk space, which is what vacuum full does. This is why autovacuum should be turned on, to clean up old versions in time.

When an update performs an update on a field that is covered by the index, obviously the update can not be ‘HOT’, because the index must cover the fields it is created on. Let’s update the id of id=1 to 5:

test=# update mytable set id=5 where id=1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1814 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1814 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1814 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1817 |      0 |        0 | (0,5)  |           2 |      10242 |     24 |        |       | \x050000001779797979797979797979
  6 |   8032 |        1 |     39 |   1816 |   1817 |        0 | (0,5)  |       40962 |       8450 |     24 |        |       | \x010000001779797979797979797979

As you can see, tuple ID/ line pointer number 5, which was emptied by vacuum, now is reused. The chain of tuple ID’s now is: ID 1 points to ID 6. ID 6 has xmax set, and t_ctid points to ID 5, with xmin set to the previous version’s xmax value. The value of 2 for t_infomask2 for ID 5 shows no bits indicating a HOT update. How does this look like in the index?

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          5 | (0,5) |      16 | f     | f    | 05 00 00 00 00 00 00 00

Because this is not a HOT update, there is a change to the index, an index tuple is added for the new value: ‘5’, which points to heap page 0 tuple ID 5, which is the result of the update. It is interesting to see value ‘1’ still exists in the index, while value ‘5’ is the actual value and ‘1’ the old value of the same heap tuple. It is logical however if you realise there is no versioning data in the index, so in order to understand if a tuple should be visible requires the heap/table tuple to be read. Once again, if you vacuum the table, things get sorted:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1814 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1814 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1814 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   8032 |        1 |     39 |   1817 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x050000001779797979797979797979
  6 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |

Tuple ID 1 now is empty, because it is relieved from its function of pointing to tuple ID 6, because tuple ID 6 gotten a new version, which means tuple ID 6 could be and is freed by vacuum. If you look in the index, you will see vacuum now freed up the tuple of the updated value:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          3 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          4 | (0,5) |      16 | f     | f    | 05 00 00 00 00 00 00 00

The bt_page_items function does not reveal too much information about line pointer and data offsets, in other words, what vacuum did, let’s look at the raw block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 78 60 b5 09 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 05 00 00 00 00 00 00 00
*
0008128 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008144 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008160 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

The interesting bit is the second star indicating a duplicate row, which is the index tuple pointing to the value ‘5’. In order to understand what this means, let’s read the line pointer array for the index tuples. Here’s a little script to do that:

for TUPLE_ID in 1 2 3 4 5 6; do
 LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
 DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
 LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
 LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
 TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
 TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
 LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
 LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
 echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done

And this is the output:

TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000
TID: 5 data offset: 8096 length: 16 binary: 0000000000100000 1001111110100000
TID: 6 data offset: 0 length: 0 binary: 0000000000000000 0000000000000000

That’s weird! It looks like there is an additional line pointer entry with tuple that points to the value ‘5’, which are at offset 8112 (which is the row that second asterisk was put in for), and offset 8096. If you look at the line pointer status bits (the last bit of the first bitfield ‘0000000000100000’) and the first bit of the next field, it’s all ’01’ which means LP_NORMAL (normal, live line pointer entry).

After some reading up in the source code, it turns out the upper value in the page header is used to determine which line pointer array entries are valid. So the above code should be:

PAGE_HEADER_UPPER=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t d2 -j 14 -N 2 | awk '{ print $1 }' )
for TUPLE_ID in $( seq 1 1000 ); do
 LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
 DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
 LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
 LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
 TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
 [ $PAGE_HEADER_UPPER -gt $TRUE_DATAPTR ] && break
 TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
 LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
 LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
 echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done

Which will only show the actual entries based on the upper value in the page header. This shows:

TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000

If you look back to the raw page, it should be clear that vacuum did cleaning up in the index. The tuple for value ‘1’ was deleted, meaning that the first entry in the line pointer array now is the value ‘2’. Also the data for the index tuples at the end of the block has been cleaned up, the data for the new first tuple is copied over the old data, the tuple data moved to the end. As a result, there was a copy left of the data for the tuple with the value ‘5’ in the tuple data area and in the line pointer array, which is not used, because the ‘upper’ value in the page header will tell if a line pointer entry is active or not.

Another scenario is important to understand. What if you insert a lot of rows in a table, and then start updating rows? As we have seen, by default postgres fills up a table page to 100%. In other words: this is not an unlikely scenario. Let’s investigate how that works! In order to begin with a clean slate, drop mytable and create it again and insert a thousand rows. My second blogpost contains an anonymous PL/SQL block that inserts rows in a loop.

If we look at the first page of mytable, we see that it is entirely filled:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/9BC4120 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

There is upper-lower=28 bytes free actually, but the heap tuple length is:

test=# select lp_len from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp_len
--------
     39

So the row length is 39 bytes. So indeed page 0 of mytable is entirely filled.

New let’s update f1 of the first row (id=1) to something. As we know, this will create a new version of the row, which can’t fit in the first block, where the row with id=1 is located:

test=# update mytable set f1='ZZZZZZZZZZ' where id=1;
UPDATE 1

Let’s look at the first row how postgres handled that:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1835 |   1836 |        0 | (5,76) |           2 |        258 |     24 |        |       | \x010000001758585858585858585858

I think the update is entirely as expected, which means xmax has a transaction ID set indicating a new version exists. However t_ctid is set to 5,76, in other words: the new version is created in page 5 tuple ID 76, which means outside of the page where to original tuple version exist. If you look at the metadata of the new tuple:

test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
 76 |   5152 |        1 |     39 |   1836 |      0 |        0 | (5,76) |           2 |      10242 |     24 |        |       | \x01000000175a5a5a5a5a5a5a5a5a5a

You see a normal new version, and no bits are set in t_infomask2 indicating a heap only tuple (hot). This means this update did change the index:

test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (5,76)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00

It seems weird that two index entries exist for id=1 (item offset 2 and 3) in a primary key index. If you think a little longer, it makes perfect sense: there is no version consistency information in the index. Currently two versions of the heap tuple with id=1 exist in the heap/table so it’s logical the index points to them both. Let’s see what happens when the table is vacuumed:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
 76 |   5152 |        1 |     39 |   1836 |      0 |        0 | (5,76) |           2 |      10498 |     24 |        |       | \x01000000175a5a5a5a5a5a5a5a5a5a
test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (5,76)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00

After vacuuming, the old version of the row is emptied, and the index entry for the old version is deleted.

There is an additional thing inherent to how postgres handles DML to any tuple (version), which needs to be taken into account. These is what is commonly called tuple ‘hint bits’. The hint bits are bits in t_infomask in the tuple header (let me emphasise: the header with every single tuple!). During DML on a tuple, the transaction obviously is ongoing, and as such the xmax bits for a deleted or old version of a tuple are reset, and the xmin bits of the new version or inserted tuple are reset. This is how a DML operation leaves the tuples it worked on.

The first process that visits a tuple after a DML operation reads t_xmin and t_xmax to understand which version of the tuple should be active. However, during reading the metadata, if a tuple does not have the transaction status set for t_xmax (HEAP_XMAX_COMMITTED or HEAP_XMAX_INVALID) or for t_xmin (HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID), it must investigate whether the transaction is still ongoing, rolled back or committed, and set the t_infomask bits accordingly.

In order to do this, the process needs to read pg_clog and potentially pg_subtrans. Reading pg_clog/pg_subtrans involves extra reading, and setting the bits means changes to the page, which need to be recorded in the WAL file. This means a select (read only) query can lead to writing to the WAL file! It should be noted that after the just described more expensive first read, which sets the hint bits for a tuple, the next readers can take advantage of that work and do not need to read up on the transaction state of a tuple again, unless a transaction performs DML again. Running vacuum on a table (whether by autovacuum or explicitly) will set the hint bits too.

Conclusion
This blogpost looks at updates to tuples. In postgres, an update means the previous version gets xmax set, and a new version of a tuple is created with xmin set to the xmax of the old version. If the update happened to a non-indexed field and the new tuple can be allocated in the same page, the new tuple can be ‘HOT’, alias a heap only tuple. This way no modification needs to be made to the index, which is an optimisation. However, when the changed tuple is requested via the index, it points to the tuple ID of the old version, which contains the tuple ID of the updated version of that tuple. Of course the updated version can also be updated and point to a newer version.

When a HOT updated table is vacuumed, the index entry of a HOT updated tuple remains untouched, and the tuple ID in the table to which the index points to will be a dedicated pointer to the current version (no matter how many HOT row versions existed prior to vacuuming), without space allocated in the data area. All allocated space of a line pointer entry that is cleaned by vacuum is compacted, meaning that all active data entries are defragmented and allocated from the bottom.

If an indexed entry is updated, the new version is created and an index entry is created for the update. Because the old version still exists for read consistency, the index entry is not deleted, it needs to be available for read consistency. If the table is vacuumed, the old version is emptied from the table, and the index entry is deleted.

Very much alike an indexed field update, if an update causes a new version to be created in another page than the original version of the tuple, the update is not HOT, and causes the index to be modified and the new version from the update is added.

Any DML to a row resets what is called the ‘hint bits’. Of course only the hint bits of either xmin or xmax which are relevant to the transaction are reset. The hint bits describe the state of xmin and xmax, and need to be set by the next process that touches the heap/page tuple. The setting of the hint bits require the process to read pg_clog and potentially pg_subtrans to find out the state of the transaction, and set the hint bits, which is a change to the page, which requires WAL logging.

This is the second part of a blogpost about Postgresql database block internals. If you found this blogpost, and are interested in getting started with it, please read the first part, and then continue with this post.
I am doing the investigations on Oracle Linux 7u3 with postgres 9.6 (both the latest versions when this blogpost was written).

In the first part I talked about the pageinspect extension, and investigated the page header and line pointer array. This blogpost looks at the actual tuples, including the index, and how these are stored in the pages.

The block structures are explained in the main documentation at: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
We can see the tuple metadata and raw data using the heap_page_items function:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

As has been described in the first blogpost, the ‘lp’ entries fetch their data from the line pointer array in the page header, and the ‘t’ entries fetch the data from the actual tuple.

This is the raw block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

In the previous blog post I described how the offset of the rows can be found. This a description of the tuple header data in the documentation:

Field           Type            Length  Offset  Description
t_xmin          TransactionId   4 bytes 0       insert XID stamp
t_xmax          TransactionId   4 bytes 4       delete XID stamp
t_cid           CommandId       4 bytes 8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac          TransactionId   4 bytes 8       XID for VACUUM operation moving a row version
t_ctid          ItemPointerData 6 bytes 12      current TID of this or newer row version
t_infomask2     uint16          2 bytes 18      number of attributes, plus various flag bits
t_infomask      uint16          2 bytes 20      various flag bits
t_hoff          uint8           1 byte  22      offset to user data
All the details can be found in src/include/access/htup_details.h.

Let’s extract the fields from the raw block:

$ # xmin 8152+0
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8152 -N 2
   1760
$ # xmax 8152+4
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8156 -N 2
      0
$ # ctid 8152+12 (current tuple id), first 4 bytes is the block number, second 2 bytes tuple id
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x4 -j 8164 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8168 -N 2
     1
$ # t_infomask2 8152+18
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8170 -N 2
     2

What does ‘2’ mean? Once again, the source code to the rescue! A description of t_infomask2 can be found here: https://doxygen.postgresql.org/htup__details_8h_source.html At line 260 there is a description. t_infomask2 is a (binary, obviously) bit mask. Here is how you can look it up. I took the mask definition from the source code, and printed the binary value using ‘echo “ibase=16;obase=2;7FF” | bc’ in front of it:

     11111111111 #define HEAP_NATTS_MASK         0x07FF  /* 11 bits for number of attributes */
  10000000000000 #define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols
 100000000000000 #define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
1000000000000000 #define HEAP_ONLY_TUPLE         0x8000  /* this is heap-only tuple */
1110000000000000 #define HEAP2_XACT_MASK         0xE000  /* visibility-related bits */
1111111111111110 #define SpecTokenOffsetNumber       0xfffe

So, anything up to the number 2047 (0x7FF) in the infomask2 field is meant to describe the number of fields/attributes. In our case ‘2’ (binary 10) means there are two fields. Please mind additional bits can be set (for a HOT update for example) while the number of fields is still described. With a bitmask, every position functions independent from the other positions.

Now let’s look at the next field, t_infomask:

$ # t_infomask 8152+20
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8172 -N 2
  2050

This too is a bitmask. The description is in https://doxygen.postgresql.org/htup__details_8h_source.html starting from line 173:

               1 #define HEAP_HASNULL            0x0001  /* has null attribute(s) */
              10 #define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
             100 #define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
            1000 #define HEAP_HASOID             0x0008  /* has an object-id field */
           10000 #define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-shared locker */
          100000 #define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
         1000000 #define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
        10000000 #define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax, if valid, is only a locker */
                    /* xmax is a shared locker */
                 #define HEAP_XMAX_SHR_LOCK  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
                 #define HEAP_LOCK_MASK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                          HEAP_XMAX_KEYSHR_LOCK)
       100000000 #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
      1000000000 #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
                 #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
     10000000000 #define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
    100000000000 #define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */
   1000000000000 #define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
  10000000000000 #define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */
 100000000000000 #define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
1000000000000000 #define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
                 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
1111111111110000 #define HEAP_XACT_MASK          0xFFF0  /* visibility-related bits */

The value for t_infomask is 2050, this is binary (echo “obase=2;2050” | bc):

    100000000010

Which means: 10: has variable-width attributes and 100000000000: xmax is invalid (this means xmax is not set).

I created an index on this table too, called ‘pk_mytable’. The below text looks at an ordinary index. An index uses the same block size, the same block header, but different contents, quite obviously, because an index needs to maintain a balanced tree (btree) structure with ordered entries.

The first page of an index is a ‘metapage’ (a page that essentially points to the index root block). This is how the metapage can be inspected:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    1 |     0 |        1 |         0

As described earlier, a heap page, index metapage or index page all are normal postgres pages, which means they all contain a header which can be inspected using the ‘page_header’ function:

test=# select * from page_header(get_raw_page('pk_mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576A10 |        0 |     0 |    48 |  8176 |    8176 |     8192 |       4 |         0

The bt_metap function tells us the root block is in block 1. Index entries can be listed using the bt_page_items function:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The index structure (like root/leaf blocks, next and previous block) can be investigated using the bt_page_stats function:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3

This tells us that this is a leaf block (type: l), we got four indexed tuples (live_items) in this leaf block, there is no previous index leaf page/left sibling (btpo_prev: 0), there is no next index leaf page/right sibling (btpo_next: 0), btpo is a union that either contains the tree level if not a leaf block or next transaction ID if this leaf page is deleted and btpo_flags, which is a bitmap. the bitmap explanation is available in the source code, I added the actual bits in front of the defines (this is how that is done on the linux prompt: ‘echo “obase=2;$((1 << 0 ))" | bc'):

             1 #define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
            10 #define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
           100 #define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
          1000 #define BTP_META        (1 << 3)    /* meta-page */
         10000 #define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
        100000 #define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
       1000000 #define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
      10000000 #define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */
10011011111011 #define MAX_BT_CYCLE_ID     0xFF7F

The btpo_flags field contains the value 3 (binary 11), which means this is a leaf block and this is the root block. This is how the BTPageOpaqueData struct looks like, with included offsets and byte sizes by me:

typedef struct BTPageOpaqueData
Bytes   {
4         BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
4         BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
          union
          {
4             uint32      level;      /* tree level --- zero for leaf pages */
              TransactionId xact;     /* next transaction ID, if deleted */
          }           btpo;
2         uint16      btpo_flags;     /* flag bits, see below */
2         BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
       } BTPageOpaqueData;

This struct is placed at the end of the block at the offset for ‘special’ when using the page_header function on the index block.

This is how the block contents look like for an index:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 e8 6b 57 01 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look closely, you see the header is the same, which is obvious because I just told that all pages contain the same page header. For index entries, the entries are allocated from the bottom up too, to allow the header (the line pointer array actually) to grow when additional entries are inserted into this page of the index. This is exactly the same line pointer array as we saw with a heap table, so we can extract them in the same way; first entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fe0

Now flip the 16th bit to get the offset number:

$ echo $((0x9fe0 & ~$((1<<15))))
8160

Get the length of the index entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
0020

Right-shift:

$ echo $((0x0020 >> 1))
16

The offset number from the line pointer array points to the index tuple data header, which contains two fields, described in the annotated source code:

typedef struct IndexTupleData
{
   ItemPointerData t_tid;      /* reference TID to heap tuple */
   /* ---------------
    * t_info is laid out in the following fashion:
    *
    * 15th (high) bit: has nulls
    * 14th bit: has var-width attributes
    * 13th bit: unused
    * 12-0 bit: size of tuple
    * ---------------
    */
    unsigned short t_info;      /* various info about tuple */
} IndexTupleData;               /* MORE DATA FOLLOWS AT END OF STRUCT */

So the header contains the block and tuple id in ItemPointerData to the row the index entry describes. The length of ItemPointerData is 6 bytes. And the header contains a bitmap called t_info, size 2 bytes. This means an index tuple header is 8 bytes. The header of an index is not described in the postgres online manual. Did you notice there is no xmin/xmax in the index structure? An index does not do multi-versioning like a table does!

Let’s fetch the index tuple header data directly. First the tuple id of the heap (block number and tuple offset number:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x4 -j 8160 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
     1

The next field in the header is t_info:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
    16
$ echo "obase=2;16" | bc
10000

The value in t_info has the highest bit set at the 5th position, while the description says that the bits 13/14/15 are used as a bitmask, so the only information in t_info is the size of the tuple: 16 bytes.

It is widely known that an regular btree index stores its entries ordered, which is why you can range scan the index leaf blocks, which is an optimised way of searching through data. But how is that organised in the index if I enter data unordered? In the table the data will be added unordered, bottom up to the block, because the table has no requirement to store the data ordered. But the index must provide the indexed value in an ordered way. The two options I see there are is the indexed value is added bottom up unordered, and the pointer to the entry is entered in the correct position in the line pointer array, or: the indexed value is placed at the correct position in the data area, AND the pointer is added at the correct position in the line pointer array. Let’s test this! We currently have the values 1-4 stored in the index, let’s add 6, and checkpoint:

test=# insert into mytable (id, f1) values (6, 'ffffffffff');
INSERT 0 1
test=# checkpoint;
CHECKPOINT

Now let’s look at the index block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 c8 81 64 01 00 00 00 00 2c 00 a0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

Here we see the new tuple with the value 6 is added at position 8096, and the corresponding line pointer entry is added (echo $((0x9fa0 & ~$((1<<15)))) = 8096). Now let's add 5 to mytable, and checkpoint:

test=# insert into mytable (id, f1) values (5, 'eeeeeeeeee');
INSERT 0 1
test=# checkpoint;

And dump the block contents again:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 00 85 64 01 00 00 00 00 30 00 90 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008080 00 00 00 00 06 00 10 00 05 00 00 00 00 00 00 00
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look in the index field data first (offset numbers 8080-8176), you see that the indexed field is added unordered. The number (field data) is at offset 8088, value 5, and at offset 8104, value 6 for the two rows we just inserted. If you look at the last two array members of the line pointer, you see that the ordering is happening there:

$ # 5th array member: 24 + (4 * 4)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 40 -N 2
 9f90
$ echo $((0x9f90 & ~$((1<<15))))
8080
$ # 6th array member: 24 + (4 * 5)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 44 -N 2
 9fa0
echo $((0x9fa0 & ~$((1<<15))))
8096

Now let’s fill up the table to 1000 rows to see what happens when a heap and an index block fill up:

test=# do $$
test$# begin
test$# for nr in 7..1000 loop
test$# insert into mytable (id, f1) values (nr, 'XXXXXXXXXX');
test$# end loop;
test$# end $$;
DO
test=# checkpoint;
CHECKPOINT

This is the first block of mytable:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/164E6A8 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

If you subtract upper with lower (792-764), you get the amount of free space in the first block: 28 bytes. I deliberately created all the rows with the same mount of data (an int and a varchar with 10 characters in it), let’s extract the first row:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp = 1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1779 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |       | \x010000001761616161616161616161

This shows in lp_len that the length of a single row is 39 bytes. This means that by default (when no fill factor is set), a table block will be filled up to 100%.

Let’s look at the index. First scan the metapage of the index ‘pk_mytable’:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1

The root of the index now is block 3 (obviously in the ‘root’ field). Let’s examine that page:

test=# select * from bt_page_stats('pk_mytable',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2

The old page block changed from being a combined root and leaf page to being a dedicated leaf page, and a new root page was created in page number 3 of pk_mytable! This also explains why there are only 3 items in the page (live_items), this is a block to guide the process to the leaf blocks. This also is visible in the btpo_flags, which now is 2: only the bit for root page is set.

Let’s look at the contents of the root page:

test=# select * from bt_page_items('pk_mytable',3);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (1,1) |       8 | f     | f    |
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00

A root/tree (called ‘branch’ in Oracle) index block contains pointers to index pages which can be tree pages or leaf blocks. Non-root tree pages are called ‘internal’ pages. The data field contains the the highest value of the left/lower value branch or leaf. This way, with a tree level of one (bt_page_stats.btpo = 1), it contains the highest value of the lower offset leaf page ctid is pointing at for this index. A leaf page always points to heap (table) pages. If this single tree page fills up, it is splitted and a new root is created to point to the two tree pages which are just created as a result of the split of the old root page. Because a tree page points to the highest value of the left/lower value branch or leaf page, there is no value for the left-most leaf, because there is no ‘left side’ block. This is also visible with the ‘itemlen’ field above when examining the index page using bt_page_items, only the header of 8 bytes is stored.

To get an idea of the id values stored in the leaf pages, take the hexadecimal value and convert it to decimal:

$ echo "ibase=16;016F" | bc
367
$ echo "ibase=16;02DD" | bc
733

This means pk_mytable leaf page 1 contains the values 1-367, leaf page 2 contains 368-733 and leaf page 4 contains 734-1000.

Let’s descent into the first leaf page, page 1 of pk_mytable, which was our old combined root and leaf page, and read the index meta data using bt_page_stats:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1

This shows this is a leaf page (type: l), but this time it is only a leaf page, which is visible with btpo_flags, only bit 1 is set indicating a leaf page. btpo_prev is set to 0, which indicates this is the left-most side of the index, there is no left block. There is a page with a higher value, btpo_next is set to 2, indicating block 2 of pk_mytable is the right side page.

There are 808 bytes of free space available in the page. An index entry for this index is 16 bytes, which means there is space available. The way this is handled is summarised in the documentation in the source code for the function _bt_findsplitloc. Essentially, if an index page fills up, the page contents are split between two pages, for which every page gets 50% of the page’s payload. In the case of the page being a leaf page and the right-most page (btpo_next=0), it is considered a special case, and the page is split leaving fillfactor% (default 90%) in the original page, and the remaining percentage in the new right-most page, optimising space usage and still leaving room for updates.

Now let’s look at the index tuples for the leftmost page:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,6)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
          7 | (0,5)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
          8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
          9 | (0,8)   |      16 | f     | f    | 08 00 00 00 00 00 00 00
...etc...

What is visible is the data is ordered, however the highest value in the index page is put in the first row, the lowest value starts at the second row. In the readme with the nbtree source this is explained:

On a page that is not rightmost in its tree level, the "high key" is 
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.  A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

Conclusion
In this post I moved beyond the page headers of a heap and looked at how tuple metadata looks like, notably xmin and xmax, the ctid and the infomasks which are bitmap fields to indicate specific status indicators per row, and how to extract these manually.

Next I moved to the (simple primary key) index on the id column. An index page partly looks the same as a heap block because it contains a common block header. However an index contains a metapage as the first page, and if all the index tuples fit in a single page has a combined root and leaf page. Once the combined root and leaf page grows out of the single page, a non-leaf root page is created which stores pointers to index leaf pages containing the index tuple values and the max values of the leaf pages, except for the leftmost leaf page.

Heap (table) pages are filled up to 100% by default, which can be changed with the fillfactor storage attribute. An index has a default fillfactor of 90%. If an index leaf page is filled up to fillfactor, it will split. However, if the right-most leaf page fills up (which means an increasing value, like a sequence or a timestamp), the split will be done 90% (fillfactor% actually)-10% (old-new page) instead of 50%-50%.

All index leaf pages outside of the rightmost page stores the max index field value as the first index tuple, and starts of with the lowest index field value as the second tuple, to optimise inserting new values.

The next post will look at what happens when row data gets changed. This is particularly exciting for me, because this is where postgres truly is different from my base reference, which is the Oracle database.

This blogpost is the result of me looking into how postgres works, and specifically the database blocks. The inspiration and essence of this blogpost comes from two blogs from Jeremiah Peschka: https://facility9.com/2011/03/postgresql-row-storage-fundamentals/ and https://facility9.com/2011/04/postgresql-update-internals/
I am using Oracle Linux 7u3 and postgres 9.6 (current versions when this blogpost was written).

Postgres is already installed, and a database cluster is already running. Let’s create a database ‘test’ for the sake of our tests:

$ createdb test

Once the database is created, logging on is done with ‘psql’:

$ psql -d test
psql (9.6.3)
Type "help" for help.

test=#

Once logged on, we need a table and index to investigate. Let’s create a very simple table with a primary key and insert some data:

test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

In order to look at the block structures there is a postgres extension called ‘pageinspect’. The extension is part of the ‘postgresql96-contrib’ RPM package. You can check if it’s installed on your machine by looking if the file ‘/usr/pgsql-9.6/share/extension/pageinspect.control’ exists.

Once you made sure the pageinspect operating system dependencies are met, you need to install it in the database. To verify if it’s installed or not, look in ‘pg_extension’. This is how it looks like on a fresh installed database cluster:

test=# select * from pg_extension;
 extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
---------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql |       10 |           11 | f              | 1.0        |           |

This means the pageinspect extension is not installed, install it in the following way:

test=# create extension pageinspect;
CREATE EXTENSION
test=# select * from pg_extension;
   extname   | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
-------------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql     |       10 |           11 | f              | 1.0        |           |
 pageinspect |       10 |         2200 | t              | 1.5        |           |

Now with the pageinspect extension installed, let’s look at the heap (table) block. In order to understand how a block looks like, go to the documentation: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html
The essence is a heap bock contains a header, immediately followed by ItemIdData (also known as line pointers, which are pointers to rows in the same block) growing top to bottom, and then the tuples allocated bottom to top, allowing the line pointer array to grow.

The table mytable in this case consists of one block, because the table tuples (rows) fit in one. We can look at the page header by using the page_header function together with the get_raw_page function. Mind the name of the table and the block number in the function get_raw_page:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576BA8 |        0 |     0 |    40 |  8032 |    8192 |     8192 |       4 |         0

Next we want to look at the ItemIdData/line pointer array and rows, which are combined in the heap_page_items function. The ItemIdData array fields are indicated by ‘lp’, the fields in the tuple are indicated by ‘t’:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

In case you wondered how to decipher the data:

test=# select chr(x'61'::integer);
 chr
-----
 a

In order to get a better understanding, it often helps to physically see the block contents (at least, that is how my brain works). This is how that can be done:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

(if xxd is not installed on your system, it’s in the vim-common package)
If you look at the header and page items results, you see the header and line pointer items on the top, then a star after offset 48, which indicates identical lines (all zero, indicating non-touched free space), and the row data allocated from the bottom (remember 0x61 is ‘a’, indicating the first added row is at the bottom).

Let’s work a bit on the raw data, to see how it works. What I found EXTREMELY helpful (for me coming from investigating a closed-source product like the Oracle database), is the source code available and fully searchable at http://doxygen.postgresql.org (thanks Jan and Devrim!!).

Let’s try to find the first record directly using the block contents.

First we need to find the line pointer array. In order to do that, I looked at the PageHeaderData struct in the source code:

Page header in the source code: https://doxygen.postgresql.org/structPageHeaderData.html
typedef struct PageHeaderData
  148 {
  149     /* XXX LSN is member of *any* block, not only page-organized ones */		size	/ offset
  150     PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog		8 bytes / 0
  151                                  * record for last change to this page */
  152     uint16      pd_checksum;    /* checksum */					2 bytes	/ 8
  153     uint16      pd_flags;       /* flag bits, see below */			2 bytes / 10
  154     LocationIndex pd_lower;     /* offset to start of free space */		2 bytes / 12
  155     LocationIndex pd_upper;     /* offset to end of free space */			2 bytes / 14
  156     LocationIndex pd_special;   /* offset to start of special space */		2 bytes / 16
  157     uint16      pd_pagesize_version;						2 bytes / 18
  158     TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */	4 bytes / 20
  159     ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */		4 bytes array / 24
  160 } PageHeaderData;

I added the size and offset numbers myself, in order to make it easier. Above we see the first array member should be at offset 24 from the start of the header. However, we need to understand how ItemIdData looks like: https://doxygen.postgresql.org/structItemIdData.html

typedef struct ItemIdData
   25 {
   26     unsigned    lp_off:15,      /* offset to tuple (from start of page) */
   27                 lp_flags:2,     /* state of item pointer, see below */
   28                 lp_len:15;      /* byte length of tuple */
   29 } ItemIdData;

Now let’s get the offset number of the first row straight from the line pointer array in the block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fd8

However, this number is too high, because it’s decimal 40920, binary: 1001 111 1101 1000; alias: the 16th bit is set. So we need to set the 16th bit to zero:

$ echo $((0x9fd8 & ~$((1<<15))))
8152

This is how to get the tuple length from the line pointer array:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
 004e

This number is too high too: it’s decimal 78, while we know the length is 39 (see above). This is because this includes a bit from the lp_flags field.
We need to right-shift this number to get the actual number:

$ echo $((0x004e >> 1))
39

So there you got it: by reading the block directly, I fetched the line pointer values of the first block (offset number and length), which are 8152 and length 39 bytes. Using the above commands you can easily read the next line pointer value by setting the correct number at ‘-j’, and doing the bit manipulation.

So, to conclude this blogpost: I’ve touched on subject of the pageinspect extension, the postgres searchable source in http://doxygen.postgresql.org, the documentation at https://www.postgresql.org/docs/9.6/static/storage-page-layout.html, and showed the page header, the line pointer array in the page header, and the rows in the block. These can all be seen using functions page_header, heap_page_items and get_raw_page. Then I showed how to fetch the raw block contents, to truly see the block, and showed how to fetch data directly using the encode and get_raw_page functions in postgres, and the xxd and od functions on the linux command line.

NB. Updated July 1st, 17:19 CET after comments of Jan Karremans.

When performance or behavior during the development or test phase is not what you’ve expected, the normal way to go is explain or explain analyze the statements involved.

But how about production environments? And how can you tell what has happened? The most common scenario is probably when the behavior is unacceptable, so instant action is required. That is not the way you, as a DBA, want to be notified how about the dissatisfaction of the performance.

Some time ago, glenn fawcett of sun microsystems encountered probably a situation in which he wanted to know what was going on on a database level. He also had experience with the oracle database, and with a statistics package called “statspack”. What glenn did, was emulate some of the functionality of statspack, and create a version for postgresql, heavily focussed on logical IO and physical IO.

Later, this package was lost, and corresponding page also. In fact, I managed to get the package using the internet time machine (http://www.archive.org/web/web.php). Glenn recently got notified of this fact, and created a post about the package on his blog.

Well, I got into the same situation as Glenn. I have a postgresql database, and want to know what’s going on. In fact, I want to know more than just logical and physical IO, I want to know everything that is available. I also want to know postgresql version, database size, parameters, etc. Also, I want to make use of it to do analysis on daily basis of the snapshots, so I can do capacity management and do predictions about growth and performance.

That is why I’ve enhanced pgstatspack, and made it available open source: http://pgfoundry.org/projects/pgstatspack

Enjoy!