Postgresql block internals, part 2

This is the second part of a blogpost about Postgresql database block internals. If you found this blogpost, and are interested in getting started with it, please read the first part, and then continue with this post.
I am doing the investigations on Oracle Linux 7u3 with postgres 9.6 (both the latest versions when this blogpost was written).

In the first part I talked about the pageinspect extension, and investigated the page header and line pointer array. This blogpost looks at the actual tuples, including the index, and how these are stored in the pages.

The block structures are explained in the main documentation at: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
We can see the tuple metadata and raw data using the heap_page_items function:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

As has been described in the first blogpost, the ‘lp’ entries fetch their data from the line pointer array in the page header, and the ‘t’ entries fetch the data from the actual tuple.

This is the raw block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

In the previous blog post I described how the offset of the rows can be found. This a description of the tuple header data in the documentation:

Field           Type            Length  Offset  Description
t_xmin          TransactionId   4 bytes 0       insert XID stamp
t_xmax          TransactionId   4 bytes 4       delete XID stamp
t_cid           CommandId       4 bytes 8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac          TransactionId   4 bytes 8       XID for VACUUM operation moving a row version
t_ctid          ItemPointerData 6 bytes 12      current TID of this or newer row version
t_infomask2     uint16          2 bytes 18      number of attributes, plus various flag bits
t_infomask      uint16          2 bytes 20      various flag bits
t_hoff          uint8           1 byte  22      offset to user data
All the details can be found in src/include/access/htup_details.h.

Let’s extract the fields from the raw block:

$ # xmin 8152+0
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8152 -N 2
   1760
$ # xmax 8152+4
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8156 -N 2
      0
$ # ctid 8152+12 (current tuple id), first 4 bytes is the block number, second 2 bytes tuple id
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x4 -j 8164 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8168 -N 2
     1
$ # t_infomask2 8152+18
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8170 -N 2
     2

What does ‘2’ mean? Once again, the source code to the rescue! A description of t_infomask2 can be found here: https://doxygen.postgresql.org/htup__details_8h_source.html At line 260 there is a description. t_infomask2 is a (binary, obviously) bit mask. Here is how you can look it up. I took the mask definition from the source code, and printed the binary value using ‘echo “ibase=16;obase=2;7FF” | bc’ in front of it:

     11111111111 #define HEAP_NATTS_MASK         0x07FF  /* 11 bits for number of attributes */
  10000000000000 #define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols
 100000000000000 #define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
1000000000000000 #define HEAP_ONLY_TUPLE         0x8000  /* this is heap-only tuple */
1110000000000000 #define HEAP2_XACT_MASK         0xE000  /* visibility-related bits */
1111111111111110 #define SpecTokenOffsetNumber       0xfffe

So, anything up to the number 2047 (0x7FF) in the infomask2 field is meant to describe the number of fields/attributes. In our case ‘2’ (binary 10) means there are two fields. Please mind additional bits can be set (for a HOT update for example) while the number of fields is still described. With a bitmask, every position functions independent from the other positions.

Now let’s look at the next field, t_infomask:

$ # t_infomask 8152+20
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8172 -N 2
  2050

This too is a bitmask. The description is in https://doxygen.postgresql.org/htup__details_8h_source.html starting from line 173:

               1 #define HEAP_HASNULL            0x0001  /* has null attribute(s) */
              10 #define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
             100 #define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
            1000 #define HEAP_HASOID             0x0008  /* has an object-id field */
           10000 #define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-shared locker */
          100000 #define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
         1000000 #define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
        10000000 #define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax, if valid, is only a locker */
                    /* xmax is a shared locker */
                 #define HEAP_XMAX_SHR_LOCK  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
                 #define HEAP_LOCK_MASK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                          HEAP_XMAX_KEYSHR_LOCK)
       100000000 #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
      1000000000 #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
                 #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
     10000000000 #define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
    100000000000 #define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */
   1000000000000 #define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
  10000000000000 #define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */
 100000000000000 #define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
1000000000000000 #define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
                 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
1111111111110000 #define HEAP_XACT_MASK          0xFFF0  /* visibility-related bits */

The value for t_infomask is 2050, this is binary (echo “obase=2;2050” | bc):

    100000000010

Which means: 10: has variable-width attributes and 100000000000: xmax is invalid (this means xmax is not set).

I created an index on this table too, called ‘pk_mytable’. The below text looks at an ordinary index. An index uses the same block size, the same block header, but different contents, quite obviously, because an index needs to maintain a balanced tree (btree) structure with ordered entries.

The first page of an index is a ‘metapage’ (a page that essentially points to the index root block). This is how the metapage can be inspected:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    1 |     0 |        1 |         0

As described earlier, a heap page, index metapage or index page all are normal postgres pages, which means they all contain a header which can be inspected using the ‘page_header’ function:

test=# select * from page_header(get_raw_page('pk_mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576A10 |        0 |     0 |    48 |  8176 |    8176 |     8192 |       4 |         0

The bt_metap function tells us the root block is in block 1. Index entries can be listed using the bt_page_items function:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The index structure (like root/leaf blocks, next and previous block) can be investigated using the bt_page_stats function:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3

This tells us that this is a leaf block (type: l), we got four indexed tuples (live_items) in this leaf block, there is no previous index leaf page/left sibling (btpo_prev: 0), there is no next index leaf page/right sibling (btpo_next: 0), btpo is a union that either contains the tree level if not a leaf block or next transaction ID if this leaf page is deleted and btpo_flags, which is a bitmap. the bitmap explanation is available in the source code, I added the actual bits in front of the defines (this is how that is done on the linux prompt: ‘echo “obase=2;$((1 << 0 ))" | bc'):

             1 #define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
            10 #define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
           100 #define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
          1000 #define BTP_META        (1 << 3)    /* meta-page */
         10000 #define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
        100000 #define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
       1000000 #define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
      10000000 #define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */
10011011111011 #define MAX_BT_CYCLE_ID     0xFF7F

The btpo_flags field contains the value 3 (binary 11), which means this is a leaf block and this is the root block. This is how the BTPageOpaqueData struct looks like, with included offsets and byte sizes by me:

typedef struct BTPageOpaqueData
Bytes   {
4         BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
4         BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
          union
          {
4             uint32      level;      /* tree level --- zero for leaf pages */
              TransactionId xact;     /* next transaction ID, if deleted */
          }           btpo;
2         uint16      btpo_flags;     /* flag bits, see below */
2         BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
       } BTPageOpaqueData;

This struct is placed at the end of the block at the offset for ‘special’ when using the page_header function on the index block.

This is how the block contents look like for an index:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 e8 6b 57 01 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look closely, you see the header is the same, which is obvious because I just told that all pages contain the same page header. For index entries, the entries are allocated from the bottom up too, to allow the header (the line pointer array actually) to grow when additional entries are inserted into this page of the index. This is exactly the same line pointer array as we saw with a heap table, so we can extract them in the same way; first entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fe0

Now flip the 16th bit to get the offset number:

$ echo $((0x9fe0 & ~$((1<<15))))
8160

Get the length of the index entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
0020

Right-shift:

$ echo $((0x0020 >> 1))
16

The offset number from the line pointer array points to the index tuple data header, which contains two fields, described in the annotated source code:

typedef struct IndexTupleData
{
   ItemPointerData t_tid;      /* reference TID to heap tuple */
   /* ---------------
    * t_info is laid out in the following fashion:
    *
    * 15th (high) bit: has nulls
    * 14th bit: has var-width attributes
    * 13th bit: unused
    * 12-0 bit: size of tuple
    * ---------------
    */
    unsigned short t_info;      /* various info about tuple */
} IndexTupleData;               /* MORE DATA FOLLOWS AT END OF STRUCT */

So the header contains the block and tuple id in ItemPointerData to the row the index entry describes. The length of ItemPointerData is 6 bytes. And the header contains a bitmap called t_info, size 2 bytes. This means an index tuple header is 8 bytes. The header of an index is not described in the postgres online manual. Did you notice there is no xmin/xmax in the index structure? An index does not do multi-versioning like a table does!

Let’s fetch the index tuple header data directly. First the tuple id of the heap (block number and tuple offset number:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x4 -j 8160 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
     1

The next field in the header is t_info:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
    16
$ echo "obase=2;16" | bc
10000

The value in t_info has the highest bit set at the 5th position, while the description says that the bits 13/14/15 are used as a bitmask, so the only information in t_info is the size of the tuple: 16 bytes.

It is widely known that an regular btree index stores its entries ordered, which is why you can range scan the index leaf blocks, which is an optimised way of searching through data. But how is that organised in the index if I enter data unordered? In the table the data will be added unordered, bottom up to the block, because the table has no requirement to store the data ordered. But the index must provide the indexed value in an ordered way. The two options I see there are is the indexed value is added bottom up unordered, and the pointer to the entry is entered in the correct position in the line pointer array, or: the indexed value is placed at the correct position in the data area, AND the pointer is added at the correct position in the line pointer array. Let’s test this! We currently have the values 1-4 stored in the index, let’s add 6, and checkpoint:

test=# insert into mytable (id, f1) values (6, 'ffffffffff');
INSERT 0 1
test=# checkpoint;
CHECKPOINT

Now let’s look at the index block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 c8 81 64 01 00 00 00 00 2c 00 a0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

Here we see the new tuple with the value 6 is added at position 8096, and the corresponding line pointer entry is added (echo $((0x9fa0 & ~$((1<<15)))) = 8096). Now let's add 5 to mytable, and checkpoint:

test=# insert into mytable (id, f1) values (5, 'eeeeeeeeee');
INSERT 0 1
test=# checkpoint;

And dump the block contents again:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 00 85 64 01 00 00 00 00 30 00 90 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008080 00 00 00 00 06 00 10 00 05 00 00 00 00 00 00 00
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look in the index field data first (offset numbers 8080-8176), you see that the indexed field is added unordered. The number (field data) is at offset 8088, value 5, and at offset 8104, value 6 for the two rows we just inserted. If you look at the last two array members of the line pointer, you see that the ordering is happening there:

$ # 5th array member: 24 + (4 * 4)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 40 -N 2
 9f90
$ echo $((0x9f90 & ~$((1<<15))))
8080
$ # 6th array member: 24 + (4 * 5)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 44 -N 2
 9fa0
echo $((0x9fa0 & ~$((1<<15))))
8096

Now let’s fill up the table to 1000 rows to see what happens when a heap and an index block fill up:

test=# do $$
test$# begin
test$# for nr in 7..1000 loop
test$# insert into mytable (id, f1) values (nr, 'XXXXXXXXXX');
test$# end loop;
test$# end $$;
DO
test=# checkpoint;
CHECKPOINT

This is the first block of mytable:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/164E6A8 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

If you subtract upper with lower (792-764), you get the amount of free space in the first block: 28 bytes. I deliberately created all the rows with the same mount of data (an int and a varchar with 10 characters in it), let’s extract the first row:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp = 1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1779 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |       | \x010000001761616161616161616161

This shows in lp_len that the length of a single row is 39 bytes. This means that by default (when no fill factor is set), a table block will be filled up to 100%.

Let’s look at the index. First scan the metapage of the index ‘pk_mytable’:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1

The root of the index now is block 3 (obviously in the ‘root’ field). Let’s examine that page:

test=# select * from bt_page_stats('pk_mytable',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2

The old page block changed from being a combined root and leaf page to being a dedicated leaf page, and a new root page was created in page number 3 of pk_mytable! This also explains why there are only 3 items in the page (live_items), this is a block to guide the process to the leaf blocks. This also is visible in the btpo_flags, which now is 2: only the bit for root page is set.

Let’s look at the contents of the root page:

test=# select * from bt_page_items('pk_mytable',3);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (1,1) |       8 | f     | f    |
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00

A root/tree (called ‘branch’ in Oracle) index block contains pointers to index pages which can be tree pages or leaf blocks. Non-root tree pages are called ‘internal’ pages. The data field contains the the highest value of the left/lower value branch or leaf. This way, with a tree level of one (bt_page_stats.btpo = 1), it contains the highest value of the lower offset leaf page ctid is pointing at for this index. A leaf page always points to heap (table) pages. If this single tree page fills up, it is splitted and a new root is created to point to the two tree pages which are just created as a result of the split of the old root page. Because a tree page points to the highest value of the left/lower value branch or leaf page, there is no value for the left-most leaf, because there is no ‘left side’ block. This is also visible with the ‘itemlen’ field above when examining the index page using bt_page_items, only the header of 8 bytes is stored.

To get an idea of the id values stored in the leaf pages, take the hexadecimal value and convert it to decimal:

$ echo "ibase=16;016F" | bc
367
$ echo "ibase=16;02DD" | bc
733

This means pk_mytable leaf page 1 contains the values 1-367, leaf page 2 contains 368-733 and leaf page 4 contains 734-1000.

Let’s descent into the first leaf page, page 1 of pk_mytable, which was our old combined root and leaf page, and read the index meta data using bt_page_stats:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1

This shows this is a leaf page (type: l), but this time it is only a leaf page, which is visible with btpo_flags, only bit 1 is set indicating a leaf page. btpo_prev is set to 0, which indicates this is the left-most side of the index, there is no left block. There is a page with a higher value, btpo_next is set to 2, indicating block 2 of pk_mytable is the right side page.

There are 808 bytes of free space available in the page. An index entry for this index is 16 bytes, which means there is space available. The way this is handled is summarised in the documentation in the source code for the function _bt_findsplitloc. Essentially, if an index page fills up, the page contents are split between two pages, for which every page gets 50% of the page’s payload. In the case of the page being a leaf page and the right-most page (btpo_next=0), it is considered a special case, and the page is split leaving fillfactor% (default 90%) in the original page, and the remaining percentage in the new right-most page, optimising space usage and still leaving room for updates.

Now let’s look at the index tuples for the leftmost page:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,6)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
          7 | (0,5)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
          8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
          9 | (0,8)   |      16 | f     | f    | 08 00 00 00 00 00 00 00
...etc...

What is visible is the data is ordered, however the highest value in the index page is put in the first row, the lowest value starts at the second row. In the readme with the nbtree source this is explained:

On a page that is not rightmost in its tree level, the "high key" is 
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.  A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

Conclusion
In this post I moved beyond the page headers of a heap and looked at how tuple metadata looks like, notably xmin and xmax, the ctid and the infomasks which are bitmap fields to indicate specific status indicators per row, and how to extract these manually.

Next I moved to the (simple primary key) index on the id column. An index page partly looks the same as a heap block because it contains a common block header. However an index contains a metapage as the first page, and if all the index tuples fit in a single page has a combined root and leaf page. Once the combined root and leaf page grows out of the single page, a non-leaf root page is created which stores pointers to index leaf pages containing the index tuple values and the max values of the leaf pages, except for the leftmost leaf page.

Heap (table) pages are filled up to 100% by default, which can be changed with the fillfactor storage attribute. An index has a default fillfactor of 90%. If an index leaf page is filled up to fillfactor, it will split. However, if the right-most leaf page fills up (which means an increasing value, like a sequence or a timestamp), the split will be done 90% (fillfactor% actually)-10% (old-new page) instead of 50%-50%.

All index leaf pages outside of the rightmost page stores the max index field value as the first index tuple, and starts of with the lowest index field value as the second tuple, to optimise inserting new values.

The next post will look at what happens when row data gets changed. This is particularly exciting for me, because this is where postgres truly is different from my base reference, which is the Oracle database.

Advertisements
1 comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: