Archive

Tag Archives: Postgresql

This post is about one of the fundamentally important properties of a database: how IO is done. The test case I studied is doing a simple full table scan of a single large table. In both Oracle and postgres the table doesn’t have any indexes or constraints, which is not a realistic example, but this doesn’t change the principal topic of the study: doing a table scan.

I used a publicly available dataset from the US bureau of transportation statistics called FAF4.5.1_database.zip
The zipped file is 347MB, unzipped size 1.7GB.

In both cases Oracle Linux 7.7 (64 bit) is used, running in VirtualBox, with the storage being a USB3 SSD. Number of CPUs is 4, memory size is 6G. Filesystem type: xfs.
The Oracle version used is Oracle 19.5, the Postgresql version used is 12.1.
For Postgresql, the postgresql.conf file is not changed, except for max_parallel_workers_per_gather which is set to 0 to make postgres use a single process.
For Oracle, the parameters that I think are important: filesystemio_options=’setall’. Oracle is used filesystem based (so no ASM).

This is the table definition for Oracle:

create table faf451 (
  fr_origin varchar2(3),
  dms_orig varchar2(3),
  dms_dest varchar2(3),
  fr_dest varchar2(3),
  fr_inmode varchar2(1),
  dms_mode varchar2(1),
  fr_outmode varchar2(1),
  sctg2 varchar2(2),
  trade_type varchar2(1),
  tons number,
  value number,
  tmiles number,
  curval number,
  wgt_dist number,
  year varchar2(4)
);

This is the table definition for Postgresql:

create table faf451 (
  fr_origin varchar(3),
  dms_orig varchar(3),
  dms_dest varchar(3),
  fr_dest varchar(3),
  fr_inmode varchar(1),
  dms_mode varchar(1),
  fr_outmode varchar(1),
  sctg2 varchar(2),
  trade_type varchar(1),
  tons double precision,
  value double precision,
  tmiles double precision,
  curval double precision,
  wgt_dist double precision,
  year varchar(4)
);

In order for the data to be easy loadable into postgres using copy from, I had to remove ‘””‘ (double double quotes) for the empty numeric fields. In oracle I could say “optionally enclosed by ‘”‘”. For Oracle I used an external table definition to load the data.

Now, before doing any benchmarks, I have an idea where this is going. Oracle is using direct IO (DIO) so linux page cache management and “double buffering” are avoided. Also, oracle will be doing asynchronous IO (AIO), which means submitting is separated from waiting for the notification that the submitted IOs are ready, and on top of that oracle will submit multiple IO requests at the same time. And again on top of that, oracle does multi-block IO, which means that instead of requesting each 8K database block individually, it will group adjacent blocks and request for these in one go, up to a size of combined blocks of 1MB, which means it can requests up to 128 8K blocks in one IO. Postgres will request every block synchronous, so 1 8K block at a time, and waiting for each request to finish. That makes me have a strong idea where this is going.

It should be noted that postgres explicitly is depending on the operating system page cache for buffering as a design principle. Because of DIO, blocks that are read by oracle are not cached in the operating system page cache.

I executed my benchmark in the following way:
– A run for every size is executed 5 times.
– At the start of every run for a certain size (so before every “batch” of 5 runs), the page cache is flushed: (echo 3 > /proc/sys/vm/drop_caches).
– Before each individual run, the database cache is flushed (systemctl restart postgresql-12 for postgres, alter system flush buffer_cache for oracle).

I started off with 2G from the dataset, and then simply performed a ‘copy from’ again to load the same dataset into the table in postgres. Oracle required a bit more of work. Oracle was able to save the same data in way less blocks; the size became 1.18G. In order to have both postgres and oracle scan the same amount of data, I calculated roughly how much rows I needed to add to the table to make it 2G, and copied that table to save it as a 2G table, so I could insert that table to increase the size of the test table by 2G. This way in both oracle and postgres I could test with a 2G table and add 2G at a time until I reached 20G.

These are the results. As you can see in the legenda: oracle is orange, postgres is blue.
postgres oracle scan results(click graph to load full picture)

What we see, is that postgres is a bit slower with the first run of 5 for the smaller dataset sizes, which becomes less visible with larger datasets.
Also, postgres is way faster if the dataset fits into the page cache and it has been read into it. This is logical because postgres explicitly uses the page cache as a secondary cache, and the test is the only activity on this server, so it hasn’t been flushed by other activity.

What was totally shocking to me, is postgres is performing alike oracle and both roughly are able to perform at the maximum IO speed of my disk: 300MB/s, especially when the dataset is bigger, alias beyond the page cache size.

It wasn’t shocking that oracle could reach the total bandwidth of the disk: oracle uses all the techniques to optimise IO for bandwidth. But how can postgres do the same, NOT deploying these techniques, reading 8K at a time??

The first thing to check is whether postgres is doing something else than I suspected. This can simply be checked with strace:

poll_wait(3, [{EPOLLIN, {u32=18818136, u64=18818136}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0!select count(*) from faf451"..., 8192, 0, NULL, NULL) = 34
lseek(20, 0, SEEK_END)                  = 335740928
lseek(20, 0, SEEK_END)                  = 335740928
kill(1518, SIGUSR1)                     = 0
pread64(5, "\f\0\0\0\310ILc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846061568) = 8192
pread64(5, "\f\0\0\0HcLc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846069760) = 8192
pread64(5, "\f\0\0\0\260|Lc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846077952) = 8192
pread64(5, "\f\0\0\0000\226Lc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846086144) = 8192
…etc…

The above strace output shows only 4 rows of pread64() calls, but this goes on. So no “secret” optimisation there.

Luckily, my VM has a new enough version of Linux for it to be able to use eBPF, so I can use biosnoop. Biosnoop is a tool to look at IO on one of the lower layers of the linux kernel, the block device interface (hence ‘bio’). This is the biosnoop output:

# /usr/share/bcc/tools/biosnoop
TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
0.000000000    postmaster     4143   sdb     R  66727776  708608     5.51
0.006419000    postmaster     4143   sdb     R  66731720  77824     11.06
0.006497000    postmaster     4143   sdb     R  66734432  786432    11.03
0.011550000    postmaster     4143   sdb     R  66731872  1310720   16.17
0.013470000    postmaster     4143   sdb     R  66729160  1310720   18.86
0.016439000    postmaster     4143   sdb     R  66735968  1310720   14.61
0.019220000    postmaster     4143   sdb     R  66738528  786432    15.20

Wow…so here it’s doing IOs of up to 1MB! So somewhere between postgres itself and the block device, the IOs magically grew to sizes up to 1MB…that’s weird. The only thing that sits between postgres and the block device is the linux kernel, which includes page cache management.

To get an insight into that, I ran ‘perf record -g -p PID’ during the scan, and then perf report to look at the recorded perf data. This is what is I found:

Samples: 21K of event 'cpu-clock', Event count (approx.): 5277000000
  Children      Self  Command     Shared Object       Symbol                                                                  ◆
-   41.84%     3.63%  postmaster  libpthread-2.17.so  [.] __pread_nocancel                                                    ▒
   - 38.20% __pread_nocancel                                                                                                  ▒
      - 38.08% entry_SYSCALL_64_after_hwframe                                                                                 ▒
         - 37.95% do_syscall_64                                                                                               ▒
            - 35.87% sys_pread64                                                                                              ▒
               - 35.51% vfs_read                                                                                              ▒
                  - 35.07% __vfs_read                                                                                         ▒
                     - 34.97% xfs_file_read_iter                                                                              ▒
                        - 34.69% __dta_xfs_file_buffered_aio_read_3293                                                        ▒
                           - 34.32% generic_file_read_iter                                                                    ▒
                              - 21.10% page_cache_async_readahead                                                             ▒
                                 - 21.04% ondemand_readahead                                                                  ▒
                                    - 20.99% __do_page_cache_readahead                                                        ▒
                                       + 14.14% __dta_xfs_vm_readpages_3179                                                   ▒
                                       + 5.07% __page_cache_alloc                                                             ▒
                                       + 0.97% radix_tree_lookup                                                              ▒
                                       + 0.54% blk_finish_plug                                                                ▒

If you look at rows 13-15 you see that the kernel is performing readahead. This is an automatic function in the linux kernel which looks if the requests are sequential of nature, and when that’s true performs readahead, so that the scan is made faster.

This is the second part of a blogpost about Postgresql database block internals. If you found this blogpost, and are interested in getting started with it, please read the first part, and then continue with this post.
I am doing the investigations on Oracle Linux 7u3 with postgres 9.6 (both the latest versions when this blogpost was written).

In the first part I talked about the pageinspect extension, and investigated the page header and line pointer array. This blogpost looks at the actual tuples, including the index, and how these are stored in the pages.

The block structures are explained in the main documentation at: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
We can see the tuple metadata and raw data using the heap_page_items function:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

As has been described in the first blogpost, the ‘lp’ entries fetch their data from the line pointer array in the page header, and the ‘t’ entries fetch the data from the actual tuple.

This is the raw block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

In the previous blog post I described how the offset of the rows can be found. This a description of the tuple header data in the documentation:

Field           Type            Length  Offset  Description
t_xmin          TransactionId   4 bytes 0       insert XID stamp
t_xmax          TransactionId   4 bytes 4       delete XID stamp
t_cid           CommandId       4 bytes 8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac          TransactionId   4 bytes 8       XID for VACUUM operation moving a row version
t_ctid          ItemPointerData 6 bytes 12      current TID of this or newer row version
t_infomask2     uint16          2 bytes 18      number of attributes, plus various flag bits
t_infomask      uint16          2 bytes 20      various flag bits
t_hoff          uint8           1 byte  22      offset to user data
All the details can be found in src/include/access/htup_details.h.

Let’s extract the fields from the raw block:

$ # xmin 8152+0
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8152 -N 2
   1760
$ # xmax 8152+4
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8156 -N 2
      0
$ # ctid 8152+12 (current tuple id), first 4 bytes is the block number, second 2 bytes tuple id
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x4 -j 8164 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8168 -N 2
     1
$ # t_infomask2 8152+18
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8170 -N 2
     2

What does ‘2’ mean? Once again, the source code to the rescue! A description of t_infomask2 can be found here: https://doxygen.postgresql.org/htup__details_8h_source.html At line 260 there is a description. t_infomask2 is a (binary, obviously) bit mask. Here is how you can look it up. I took the mask definition from the source code, and printed the binary value using ‘echo “ibase=16;obase=2;7FF” | bc’ in front of it:

     11111111111 #define HEAP_NATTS_MASK         0x07FF  /* 11 bits for number of attributes */
  10000000000000 #define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols
 100000000000000 #define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
1000000000000000 #define HEAP_ONLY_TUPLE         0x8000  /* this is heap-only tuple */
1110000000000000 #define HEAP2_XACT_MASK         0xE000  /* visibility-related bits */
1111111111111110 #define SpecTokenOffsetNumber       0xfffe

So, anything up to the number 2047 (0x7FF) in the infomask2 field is meant to describe the number of fields/attributes. In our case ‘2’ (binary 10) means there are two fields. Please mind additional bits can be set (for a HOT update for example) while the number of fields is still described. With a bitmask, every position functions independent from the other positions.

Now let’s look at the next field, t_infomask:

$ # t_infomask 8152+20
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8172 -N 2
  2050

This too is a bitmask. The description is in https://doxygen.postgresql.org/htup__details_8h_source.html starting from line 173:

               1 #define HEAP_HASNULL            0x0001  /* has null attribute(s) */
              10 #define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
             100 #define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
            1000 #define HEAP_HASOID             0x0008  /* has an object-id field */
           10000 #define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-shared locker */
          100000 #define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
         1000000 #define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
        10000000 #define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax, if valid, is only a locker */
                    /* xmax is a shared locker */
                 #define HEAP_XMAX_SHR_LOCK  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
                 #define HEAP_LOCK_MASK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                          HEAP_XMAX_KEYSHR_LOCK)
       100000000 #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
      1000000000 #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
                 #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
     10000000000 #define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
    100000000000 #define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */
   1000000000000 #define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
  10000000000000 #define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */
 100000000000000 #define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
1000000000000000 #define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
                 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
1111111111110000 #define HEAP_XACT_MASK          0xFFF0  /* visibility-related bits */

The value for t_infomask is 2050, this is binary (echo “obase=2;2050” | bc):

    100000000010

Which means: 10: has variable-width attributes and 100000000000: xmax is invalid (this means xmax is not set).

I created an index on this table too, called ‘pk_mytable’. The below text looks at an ordinary index. An index uses the same block size, the same block header, but different contents, quite obviously, because an index needs to maintain a balanced tree (btree) structure with ordered entries.

The first page of an index is a ‘metapage’ (a page that essentially points to the index root block). This is how the metapage can be inspected:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    1 |     0 |        1 |         0

As described earlier, a heap page, index metapage or index page all are normal postgres pages, which means they all contain a header which can be inspected using the ‘page_header’ function:

test=# select * from page_header(get_raw_page('pk_mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576A10 |        0 |     0 |    48 |  8176 |    8176 |     8192 |       4 |         0

The bt_metap function tells us the root block is in block 1. Index entries can be listed using the bt_page_items function:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The index structure (like root/leaf blocks, next and previous block) can be investigated using the bt_page_stats function:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3

This tells us that this is a leaf block (type: l), we got four indexed tuples (live_items) in this leaf block, there is no previous index leaf page/left sibling (btpo_prev: 0), there is no next index leaf page/right sibling (btpo_next: 0), btpo is a union that either contains the tree level if not a leaf block or next transaction ID if this leaf page is deleted and btpo_flags, which is a bitmap. the bitmap explanation is available in the source code, I added the actual bits in front of the defines (this is how that is done on the linux prompt: ‘echo “obase=2;$((1 << 0 ))" | bc'):

             1 #define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
            10 #define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
           100 #define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
          1000 #define BTP_META        (1 << 3)    /* meta-page */
         10000 #define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
        100000 #define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
       1000000 #define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
      10000000 #define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */
10011011111011 #define MAX_BT_CYCLE_ID     0xFF7F

The btpo_flags field contains the value 3 (binary 11), which means this is a leaf block and this is the root block. This is how the BTPageOpaqueData struct looks like, with included offsets and byte sizes by me:

typedef struct BTPageOpaqueData
Bytes   {
4         BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
4         BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
          union
          {
4             uint32      level;      /* tree level --- zero for leaf pages */
              TransactionId xact;     /* next transaction ID, if deleted */
          }           btpo;
2         uint16      btpo_flags;     /* flag bits, see below */
2         BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
       } BTPageOpaqueData;

This struct is placed at the end of the block at the offset for ‘special’ when using the page_header function on the index block.

This is how the block contents look like for an index:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 e8 6b 57 01 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look closely, you see the header is the same, which is obvious because I just told that all pages contain the same page header. For index entries, the entries are allocated from the bottom up too, to allow the header (the line pointer array actually) to grow when additional entries are inserted into this page of the index. This is exactly the same line pointer array as we saw with a heap table, so we can extract them in the same way; first entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fe0

Now flip the 16th bit to get the offset number:

$ echo $((0x9fe0 & ~$((1<<15))))
8160

Get the length of the index entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
0020

Right-shift:

$ echo $((0x0020 >> 1))
16

The offset number from the line pointer array points to the index tuple data header, which contains two fields, described in the annotated source code:

typedef struct IndexTupleData
{
   ItemPointerData t_tid;      /* reference TID to heap tuple */
   /* ---------------
    * t_info is laid out in the following fashion:
    *
    * 15th (high) bit: has nulls
    * 14th bit: has var-width attributes
    * 13th bit: unused
    * 12-0 bit: size of tuple
    * ---------------
    */
    unsigned short t_info;      /* various info about tuple */
} IndexTupleData;               /* MORE DATA FOLLOWS AT END OF STRUCT */

So the header contains the block and tuple id in ItemPointerData to the row the index entry describes. The length of ItemPointerData is 6 bytes. And the header contains a bitmap called t_info, size 2 bytes. This means an index tuple header is 8 bytes. The header of an index is not described in the postgres online manual. Did you notice there is no xmin/xmax in the index structure? An index does not do multi-versioning like a table does!

Let’s fetch the index tuple header data directly. First the tuple id of the heap (block number and tuple offset number:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x4 -j 8160 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
     1

The next field in the header is t_info:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
    16
$ echo "obase=2;16" | bc
10000

The value in t_info has the highest bit set at the 5th position, while the description says that the bits 13/14/15 are used as a bitmask, so the only information in t_info is the size of the tuple: 16 bytes.

It is widely known that an regular btree index stores its entries ordered, which is why you can range scan the index leaf blocks, which is an optimised way of searching through data. But how is that organised in the index if I enter data unordered? In the table the data will be added unordered, bottom up to the block, because the table has no requirement to store the data ordered. But the index must provide the indexed value in an ordered way. The two options I see there are is the indexed value is added bottom up unordered, and the pointer to the entry is entered in the correct position in the line pointer array, or: the indexed value is placed at the correct position in the data area, AND the pointer is added at the correct position in the line pointer array. Let’s test this! We currently have the values 1-4 stored in the index, let’s add 6, and checkpoint:

test=# insert into mytable (id, f1) values (6, 'ffffffffff');
INSERT 0 1
test=# checkpoint;
CHECKPOINT

Now let’s look at the index block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 c8 81 64 01 00 00 00 00 2c 00 a0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

Here we see the new tuple with the value 6 is added at position 8096, and the corresponding line pointer entry is added (echo $((0x9fa0 & ~$((1<<15)))) = 8096). Now let's add 5 to mytable, and checkpoint:

test=# insert into mytable (id, f1) values (5, 'eeeeeeeeee');
INSERT 0 1
test=# checkpoint;

And dump the block contents again:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 00 85 64 01 00 00 00 00 30 00 90 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008080 00 00 00 00 06 00 10 00 05 00 00 00 00 00 00 00
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look in the index field data first (offset numbers 8080-8176), you see that the indexed field is added unordered. The number (field data) is at offset 8088, value 5, and at offset 8104, value 6 for the two rows we just inserted. If you look at the last two array members of the line pointer, you see that the ordering is happening there:

$ # 5th array member: 24 + (4 * 4)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 40 -N 2
 9f90
$ echo $((0x9f90 & ~$((1<<15))))
8080
$ # 6th array member: 24 + (4 * 5)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 44 -N 2
 9fa0
echo $((0x9fa0 & ~$((1<<15))))
8096

Now let’s fill up the table to 1000 rows to see what happens when a heap and an index block fill up:

test=# do $$
test$# begin
test$# for nr in 7..1000 loop
test$# insert into mytable (id, f1) values (nr, 'XXXXXXXXXX');
test$# end loop;
test$# end $$;
DO
test=# checkpoint;
CHECKPOINT

This is the first block of mytable:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/164E6A8 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

If you subtract upper with lower (792-764), you get the amount of free space in the first block: 28 bytes. I deliberately created all the rows with the same mount of data (an int and a varchar with 10 characters in it), let’s extract the first row:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp = 1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1779 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |       | \x010000001761616161616161616161

This shows in lp_len that the length of a single row is 39 bytes. This means that by default (when no fill factor is set), a table block will be filled up to 100%.

Let’s look at the index. First scan the metapage of the index ‘pk_mytable’:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1

The root of the index now is block 3 (obviously in the ‘root’ field). Let’s examine that page:

test=# select * from bt_page_stats('pk_mytable',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2

The old page block changed from being a combined root and leaf page to being a dedicated leaf page, and a new root page was created in page number 3 of pk_mytable! This also explains why there are only 3 items in the page (live_items), this is a block to guide the process to the leaf blocks. This also is visible in the btpo_flags, which now is 2: only the bit for root page is set.

Let’s look at the contents of the root page:

test=# select * from bt_page_items('pk_mytable',3);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (1,1) |       8 | f     | f    |
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00

A root/tree (called ‘branch’ in Oracle) index block contains pointers to index pages which can be tree pages or leaf blocks. Non-root tree pages are called ‘internal’ pages. The data field contains the the highest value of the left/lower value branch or leaf. This way, with a tree level of one (bt_page_stats.btpo = 1), it contains the highest value of the lower offset leaf page ctid is pointing at for this index. A leaf page always points to heap (table) pages. If this single tree page fills up, it is splitted and a new root is created to point to the two tree pages which are just created as a result of the split of the old root page. Because a tree page points to the highest value of the left/lower value branch or leaf page, there is no value for the left-most leaf, because there is no ‘left side’ block. This is also visible with the ‘itemlen’ field above when examining the index page using bt_page_items, only the header of 8 bytes is stored.

To get an idea of the id values stored in the leaf pages, take the hexadecimal value and convert it to decimal:

$ echo "ibase=16;016F" | bc
367
$ echo "ibase=16;02DD" | bc
733

This means pk_mytable leaf page 1 contains the values 1-367, leaf page 2 contains 368-733 and leaf page 4 contains 734-1000.

Let’s descent into the first leaf page, page 1 of pk_mytable, which was our old combined root and leaf page, and read the index meta data using bt_page_stats:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1

This shows this is a leaf page (type: l), but this time it is only a leaf page, which is visible with btpo_flags, only bit 1 is set indicating a leaf page. btpo_prev is set to 0, which indicates this is the left-most side of the index, there is no left block. There is a page with a higher value, btpo_next is set to 2, indicating block 2 of pk_mytable is the right side page.

There are 808 bytes of free space available in the page. An index entry for this index is 16 bytes, which means there is space available. The way this is handled is summarised in the documentation in the source code for the function _bt_findsplitloc. Essentially, if an index page fills up, the page contents are split between two pages, for which every page gets 50% of the page’s payload. In the case of the page being a leaf page and the right-most page (btpo_next=0), it is considered a special case, and the page is split leaving fillfactor% (default 90%) in the original page, and the remaining percentage in the new right-most page, optimising space usage and still leaving room for updates.

Now let’s look at the index tuples for the leftmost page:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,6)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
          7 | (0,5)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
          8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
          9 | (0,8)   |      16 | f     | f    | 08 00 00 00 00 00 00 00
...etc...

What is visible is the data is ordered, however the highest value in the index page is put in the first row, the lowest value starts at the second row. In the readme with the nbtree source this is explained:

On a page that is not rightmost in its tree level, the "high key" is 
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.  A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

Conclusion
In this post I moved beyond the page headers of a heap and looked at how tuple metadata looks like, notably xmin and xmax, the ctid and the infomasks which are bitmap fields to indicate specific status indicators per row, and how to extract these manually.

Next I moved to the (simple primary key) index on the id column. An index page partly looks the same as a heap block because it contains a common block header. However an index contains a metapage as the first page, and if all the index tuples fit in a single page has a combined root and leaf page. Once the combined root and leaf page grows out of the single page, a non-leaf root page is created which stores pointers to index leaf pages containing the index tuple values and the max values of the leaf pages, except for the leftmost leaf page.

Heap (table) pages are filled up to 100% by default, which can be changed with the fillfactor storage attribute. An index has a default fillfactor of 90%. If an index leaf page is filled up to fillfactor, it will split. However, if the right-most leaf page fills up (which means an increasing value, like a sequence or a timestamp), the split will be done 90% (fillfactor% actually)-10% (old-new page) instead of 50%-50%.

All index leaf pages outside of the rightmost page stores the max index field value as the first index tuple, and starts of with the lowest index field value as the second tuple, to optimise inserting new values.

The next post will look at what happens when row data gets changed. This is particularly exciting for me, because this is where postgres truly is different from my base reference, which is the Oracle database.

This blogpost is the result of me looking into how postgres works, and specifically the database blocks. The inspiration and essence of this blogpost comes from two blogs from Jeremiah Peschka: https://facility9.com/2011/03/postgresql-row-storage-fundamentals/ and https://facility9.com/2011/04/postgresql-update-internals/
I am using Oracle Linux 7u3 and postgres 9.6 (current versions when this blogpost was written).

Postgres is already installed, and a database cluster is already running. Let’s create a database ‘test’ for the sake of our tests:

$ createdb test

Once the database is created, logging on is done with ‘psql’:

$ psql -d test
psql (9.6.3)
Type "help" for help.

test=#

Once logged on, we need a table and index to investigate. Let’s create a very simple table with a primary key and insert some data:

test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

In order to look at the block structures there is a postgres extension called ‘pageinspect’. The extension is part of the ‘postgresql96-contrib’ RPM package. You can check if it’s installed on your machine by looking if the file ‘/usr/pgsql-9.6/share/extension/pageinspect.control’ exists.

Once you made sure the pageinspect operating system dependencies are met, you need to install it in the database. To verify if it’s installed or not, look in ‘pg_extension’. This is how it looks like on a fresh installed database cluster:

test=# select * from pg_extension;
 extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
---------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql |       10 |           11 | f              | 1.0        |           |

This means the pageinspect extension is not installed, install it in the following way:

test=# create extension pageinspect;
CREATE EXTENSION
test=# select * from pg_extension;
   extname   | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
-------------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql     |       10 |           11 | f              | 1.0        |           |
 pageinspect |       10 |         2200 | t              | 1.5        |           |

Now with the pageinspect extension installed, let’s look at the heap (table) block. In order to understand how a block looks like, go to the documentation: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html
The essence is a heap bock contains a header, immediately followed by ItemIdData (also known as line pointers, which are pointers to rows in the same block) growing top to bottom, and then the tuples allocated bottom to top, allowing the line pointer array to grow.

The table mytable in this case consists of one block, because the table tuples (rows) fit in one. We can look at the page header by using the page_header function together with the get_raw_page function. Mind the name of the table and the block number in the function get_raw_page:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576BA8 |        0 |     0 |    40 |  8032 |    8192 |     8192 |       4 |         0

Next we want to look at the ItemIdData/line pointer array and rows, which are combined in the heap_page_items function. The ItemIdData array fields are indicated by ‘lp’, the fields in the tuple are indicated by ‘t’:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

In case you wondered how to decipher the data:

test=# select chr(x'61'::integer);
 chr
-----
 a

In order to get a better understanding, it often helps to physically see the block contents (at least, that is how my brain works). This is how that can be done:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

(if xxd is not installed on your system, it’s in the vim-common package)
If you look at the header and page items results, you see the header and line pointer items on the top, then a star after offset 48, which indicates identical lines (all zero, indicating non-touched free space), and the row data allocated from the bottom (remember 0x61 is ‘a’, indicating the first added row is at the bottom).

Let’s work a bit on the raw data, to see how it works. What I found EXTREMELY helpful (for me coming from investigating a closed-source product like the Oracle database), is the source code available and fully searchable at http://doxygen.postgresql.org (thanks Jan and Devrim!!).

Let’s try to find the first record directly using the block contents.

First we need to find the line pointer array. In order to do that, I looked at the PageHeaderData struct in the source code:

Page header in the source code: https://doxygen.postgresql.org/structPageHeaderData.html
typedef struct PageHeaderData
  148 {
  149     /* XXX LSN is member of *any* block, not only page-organized ones */		size	/ offset
  150     PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog		8 bytes / 0
  151                                  * record for last change to this page */
  152     uint16      pd_checksum;    /* checksum */					2 bytes	/ 8
  153     uint16      pd_flags;       /* flag bits, see below */			2 bytes / 10
  154     LocationIndex pd_lower;     /* offset to start of free space */		2 bytes / 12
  155     LocationIndex pd_upper;     /* offset to end of free space */			2 bytes / 14
  156     LocationIndex pd_special;   /* offset to start of special space */		2 bytes / 16
  157     uint16      pd_pagesize_version;						2 bytes / 18
  158     TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */	4 bytes / 20
  159     ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */		4 bytes array / 24
  160 } PageHeaderData;

I added the size and offset numbers myself, in order to make it easier. Above we see the first array member should be at offset 24 from the start of the header. However, we need to understand how ItemIdData looks like: https://doxygen.postgresql.org/structItemIdData.html

typedef struct ItemIdData
   25 {
   26     unsigned    lp_off:15,      /* offset to tuple (from start of page) */
   27                 lp_flags:2,     /* state of item pointer, see below */
   28                 lp_len:15;      /* byte length of tuple */
   29 } ItemIdData;

Now let’s get the offset number of the first row straight from the line pointer array in the block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fd8

However, this number is too high, because it’s decimal 40920, binary: 1001 111 1101 1000; alias: the 16th bit is set. So we need to set the 16th bit to zero:

$ echo $((0x9fd8 & ~$((1<<15))))
8152

This is how to get the tuple length from the line pointer array:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
 004e

This number is too high too: it’s decimal 78, while we know the length is 39 (see above). This is because this includes a bit from the lp_flags field.
We need to right-shift this number to get the actual number:

$ echo $((0x004e >> 1))
39

So there you got it: by reading the block directly, I fetched the line pointer values of the first block (offset number and length), which are 8152 and length 39 bytes. Using the above commands you can easily read the next line pointer value by setting the correct number at ‘-j’, and doing the bit manipulation.

So, to conclude this blogpost: I’ve touched on subject of the pageinspect extension, the postgres searchable source in http://doxygen.postgresql.org, the documentation at https://www.postgresql.org/docs/9.6/static/storage-page-layout.html, and showed the page header, the line pointer array in the page header, and the rows in the block. These can all be seen using functions page_header, heap_page_items and get_raw_page. Then I showed how to fetch the raw block contents, to truly see the block, and showed how to fetch data directly using the encode and get_raw_page functions in postgres, and the xxd and od functions on the linux command line.

NB. Updated July 1st, 17:19 CET after comments of Jan Karremans.