This is the third part in a series of blogposts about how postgresql manages data in its blocks (called ‘pages’ in postgres speak). If you found this post and did not read the previous ones, it might be beneficial to read block internals (part 1) and block internals, part 2 first. In these blogposts I’ve shown how heap and index pages look like, and how these can be investigated, including looking at the raw block information.
This blogpost is intended to show the effects on pages when DML happens. This is inherently different from my personal reference of database implementation, which is the oracle database.
If you followed the previous two articles, you have a table mytable with one thousand rows. In order to simplify and be able to grasp the essence of the implications of data changes, drop mytable and recreate the simple situation of mytable with four rows again:
test=# drop table if exists mytable;
DROP TABLE
test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4
This should result in the table having four rows in the first page:
test=# select * from heap_page_items(get_raw_page('mytable',0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 8152 | 1 | 39 | 1790 | 0 | 0 | (0,1) | 2 | 2050 | 24 | | | \x010000001761616161616161616161
2 | 8112 | 1 | 39 | 1790 | 0 | 0 | (0,2) | 2 | 2050 | 24 | | | \x020000001762626262626262626262
3 | 8072 | 1 | 39 | 1790 | 0 | 0 | (0,3) | 2 | 2050 | 24 | | | \x030000001763636363636363636363
4 | 8032 | 1 | 39 | 1790 | 0 | 0 | (0,4) | 2 | 2050 | 24 | | | \x040000001764646464646464646464
And the second index page having the index tuples of these four rows:
test=# select * from bt_page_items('pk_mytable',1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
4 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
The next thing to do, is change the configuration of postgres to disable auto-vacuum. This should absolutely not be done on any postgres database serving something useful, but in order to see the effects of data changes, and the changes as a result of a vacuum, vacuum should not kick in automatically, because then we can’t predict what we will be looking at. This is governed by the parameter ‘autovacuum’, which is turned on by default:
test=# show autovacuum;
autovacuum
------------
on
Turn autovacuum off and reload the configuration:
test=# alter system set autovacuum=off;
ALTER SYSTEM
test=# select pg_reload_conf();
pg_reload_conf
----------------
t
test=# show autovacuum;
autovacuum
------------
off
Now let’s update id 1 and set f1 to ‘zzzzzzzzzz’:
test=# update mytable set f1 = 'zzzzzzzzzz' where id = 1;
UPDATE 1
Now let’s look at the page contents of the table mytable:
test=# select * from heap_page_items(get_raw_page('mytable',0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 8152 | 1 | 39 | 1790 | 1791 | 0 | (0,5) | 16386 | 258 | 24 | | | \x010000001761616161616161616161
2 | 8112 | 1 | 39 | 1790 | 0 | 0 | (0,2) | 2 | 2050 | 24 | | | \x020000001762626262626262626262
3 | 8072 | 1 | 39 | 1790 | 0 | 0 | (0,3) | 2 | 2050 | 24 | | | \x030000001763636363636363636363
4 | 8032 | 1 | 39 | 1790 | 0 | 0 | (0,4) | 2 | 2050 | 24 | | | \x040000001764646464646464646464
5 | 7992 | 1 | 39 | 1791 | 0 | 0 | (0,5) | 32770 | 10242 | 24 | | | \x01000000177a7a7a7a7a7a7a7a7a7a
This is a good example how read consistency and row versioning works in postgres. The update inserted a new tuple at offset 7992 (lp_off) in the page, and added a pointer to it in the line pointer array (lp 5). Mind the ‘the update inserted a new tuple’ which is fundamental! In the metadata of tuple ID 1, xmax is set to indicate the maximum transaction ID which should be able to see the this version of the tuple. The postgres documentation says ‘delete XID stamp’ for xmax. Tuple ID 5 is created with exactly the same XID for xmin as tuple 1 has for xmax, which is the transaction ID starting from which this tuple should be visible. Obviously, for this tuple xmax is set to 0, indicating no newer version exists. Also, the t_ctid (current tuple ID) for tuple 1 is set to 0,5 (block 0, tuple 5), indicating the next version of that tuple is tuple ID 5. To indicate what happened, infomask2 and infomask have been set.
infomask2 16386 means: heap has two attributes, heap HOT updated.
infomask 258 means: has variable attributes, xmin committed.
The new tuple has the following values for infomask2 and infomask:
infomask2 32770 means: heap has two attributes, is heap only tuple.
infomask 10242 means: has variable attributes, xmax invalid, this is an updated version of the row.
(see the blogpost postgresql block internals, part 2 for a description of the infomask2 and infomask bitfields)
It should be clear now the update caused the active version of the tuple with id = 1 to be tuple ID 5 instead tuple ID 1. Please mind tuple ID 1 is still entirely present (in the line pointer array and in the data part of the page).
How does this looks like in the index?
test=# select * from bt_page_items('pk_mytable',1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
4 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
Nothing changed!
Actually, this is exactly what ‘HOT’ (heap only tuple) means to describe in the above descriptions (infomask2, heap HOT updated). A new version of a tuple is created, but it is not covered by any index. This means when table/heap tuple with id=1 is accessed using the index, the index points to the original tuple ID (1) in the heap/table, which points to the new version of the tuple (tuple ID 5). This is an optimisation to lessen the amount of work done when updating a field that is not indexed, at the cost of following an additional pointer.
What happens if the tuple with id = 1 gets updated again?
test=# update mytable set f1 = 'yyyyyyyyyy' where id = 1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 8152 | 1 | 39 | 1790 | 1791 | 0 | (0,5) | 16386 | 1282 | 24 | | | \x010000001761616161616161616161
2 | 8112 | 1 | 39 | 1790 | 0 | 0 | (0,2) | 2 | 2050 | 24 | | | \x020000001762626262626262626262
3 | 8072 | 1 | 39 | 1790 | 0 | 0 | (0,3) | 2 | 2050 | 24 | | | \x030000001763636363636363636363
4 | 8032 | 1 | 39 | 1790 | 0 | 0 | (0,4) | 2 | 2050 | 24 | | | \x040000001764646464646464646464
5 | 7992 | 1 | 39 | 1791 | 1792 | 0 | (0,6) | 49154 | 8450 | 24 | | | \x01000000177a7a7a7a7a7a7a7a7a7a
6 | 7952 | 1 | 39 | 1792 | 0 | 0 | (0,6) | 32770 | 10242 | 24 | | | \x010000001779797979797979797979
infomask 2 for tuple ID 5 now says: heap has two attributes, is heap only tuple and additionally: heap was HOT updated.
infomask for tuple ID 5 now says: heap has variable attributes, xmin committed and additionally: this is an updated version of the tuple.
Another update essentially performs exactly the same transaction on tuple ID 5 as previously done on tuple ID 1: a new tuple is created containing the result of the update (linepointer array entry and data at the end of the page), xmax is set for tuple ID 5 and xmin is set for tuple ID 6 with the same transaction ID, and the current tuple id for tuple ID 5 is set to the new version for that tuple, which is tuple ID 6.
As can be seen by looking at the value for infomask2, tuple ID 6 now has the same value as tuple ID 5 had previously. Just like the tuple ID 5 previously, the new tuple with ID 6 is a ‘hot’ updated tuple too, which means the index was once again left untouched, so the index still points to heap tuple ID 1. This means that a session that uses the index to find the tuple, will be pointed to heap tuple ID 1, which points to tuple ID 5, which points to tuple ID 6 in this case.
It’s obvious that this tuple pointer chasing can easily lead to more work to be done to find a row/tuple, meaning lesser performance. It also is obvious that this construction exists to provide a way to produce an older version of a tuple for read consistency. However, this old version is not useful anymore after some time, which is when no query exists with a transaction ID lower than the value for xmax for the old tuples, and needs to be purged. This is where vacuum comes in. Let’s vacuum mytable, and read the heap/table page again:
test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 6 | 2 | 0 | | | | | | | | | |
2 | 8152 | 1 | 39 | 1790 | 0 | 0 | (0,2) | 2 | 2306 | 24 | | | \x020000001762626262626262626262
3 | 8112 | 1 | 39 | 1790 | 0 | 0 | (0,3) | 2 | 2306 | 24 | | | \x030000001763636363636363636363
4 | 8072 | 1 | 39 | 1790 | 0 | 0 | (0,4) | 2 | 2306 | 24 | | | \x040000001764646464646464646464
5 | 0 | 0 | 0 | | | | | | | | | |
6 | 8032 | 1 | 39 | 1792 | 0 | 0 | (0,6) | 32770 | 10498 | 24 | | | \x010000001779797979797979797979
There are a few things vacuum did, and some things it didn’t:
– The line pointer array entries all stayed.
– The first entry points to heap tuple ID 6. lp_flags=2 together with lp_len=0 means the lp_off field points to the active tuple ID. The reason the first row still exists, is because the index entry still points to it.
– The fifth line pointer entry is all zero, lp_flags=0 means ‘unused’.
If you look closely at the line pointer array entries and the tuple data offsets before and after vacuum, you see that the past versions of the tuple data do not exist anymore. In fact, if you look closely to the offset (lp_off) values of the tuples that are stored in the page, you see that ID 2 took the offset value from ID 1 from before vacuuming, indicating it moved all the way to the back of the page. In fact, vacuum compacted the data tuples, maximising free space in the page.
A logical next question is: okay, this is way better than how it was left for read consistency after the two transactions, but it is still not optimal. Can it be fully cleaned? The answer is yes, vacuum can be told to fully clean up, including the removing any non-functional line pointer array entries, and as a logical consequence have all index entries point directly to the heap tuple it ought to point to. This is called ‘vacuum full’.
Why not use vacuum full all the time? The reason is vacuum full requires a table lock for the entire duration of running vacuum. This is not desirable and acceptable in a lot of databases. If normal vacuum is run sufficiently, running vacuum full will not yield much additional benefit in most cases, because when normal vacuum is run sufficiently, the old row versions are cleaned up in time, preventing excessive amounts of old versions in a page. This is important because normal vacuum cleans up within a page, but it does not move tuples across pages, freeing up pages and thus operating system disk space, which is what vacuum full does. This is why autovacuum should be turned on, to clean up old versions in time.
When an update performs an update on a field that is covered by the index, obviously the update can not be ‘HOT’, because the index must cover the fields it is created on. Let’s update the id of id=1 to 5:
test=# update mytable set id=5 where id=1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 6 | 2 | 0 | | | | | | | | | |
2 | 8152 | 1 | 39 | 1814 | 0 | 0 | (0,2) | 2 | 2306 | 24 | | | \x020000001762626262626262626262
3 | 8112 | 1 | 39 | 1814 | 0 | 0 | (0,3) | 2 | 2306 | 24 | | | \x030000001763636363636363636363
4 | 8072 | 1 | 39 | 1814 | 0 | 0 | (0,4) | 2 | 2306 | 24 | | | \x040000001764646464646464646464
5 | 7992 | 1 | 39 | 1817 | 0 | 0 | (0,5) | 2 | 10242 | 24 | | | \x050000001779797979797979797979
6 | 8032 | 1 | 39 | 1816 | 1817 | 0 | (0,5) | 40962 | 8450 | 24 | | | \x010000001779797979797979797979
As you can see, tuple ID/ line pointer number 5, which was emptied by vacuum, now is reused. The chain of tuple ID’s now is: ID 1 points to ID 6. ID 6 has xmax set, and t_ctid points to ID 5, with xmin set to the previous version’s xmax value. The value of 2 for t_infomask2 for ID 5 shows no bits indicating a HOT update. How does this look like in the index?
test=# select * from bt_page_items('pk_mytable',1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
4 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
5 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00
Because this is not a HOT update, there is a change to the index, an index tuple is added for the new value: ‘5’, which points to heap page 0 tuple ID 5, which is the result of the update. It is interesting to see value ‘1’ still exists in the index, while value ‘5’ is the actual value and ‘1’ the old value of the same heap tuple. It is logical however if you realise there is no versioning data in the index, so in order to understand if a tuple should be visible requires the heap/table tuple to be read. Once again, if you vacuum the table, things get sorted:
test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 0 | 0 | 0 | | | | | | | | | |
2 | 8152 | 1 | 39 | 1814 | 0 | 0 | (0,2) | 2 | 2306 | 24 | | | \x020000001762626262626262626262
3 | 8112 | 1 | 39 | 1814 | 0 | 0 | (0,3) | 2 | 2306 | 24 | | | \x030000001763636363636363636363
4 | 8072 | 1 | 39 | 1814 | 0 | 0 | (0,4) | 2 | 2306 | 24 | | | \x040000001764646464646464646464
5 | 8032 | 1 | 39 | 1817 | 0 | 0 | (0,5) | 2 | 10498 | 24 | | | \x050000001779797979797979797979
6 | 0 | 0 | 0 | | | | | | | | | |
Tuple ID 1 now is empty, because it is relieved from its function of pointing to tuple ID 6, because tuple ID 6 gotten a new version, which means tuple ID 6 could be and is freed by vacuum. If you look in the index, you will see vacuum now freed up the tuple of the updated value:
test=# select * from bt_page_items('pk_mytable',1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
2 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
3 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
4 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00
The bt_page_items function does not reveal too much information about line pointer and data offsets, in other words, what vacuum did, let’s look at the raw block contents:
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 78 60 b5 09 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 05 00 00 00 00 00 00 00
*
0008128 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008144 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008160 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192
The interesting bit is the second star indicating a duplicate row, which is the index tuple pointing to the value ‘5’. In order to understand what this means, let’s read the line pointer array for the index tuples. Here’s a little script to do that:
for TUPLE_ID in 1 2 3 4 5 6; do
LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done
And this is the output:
TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000
TID: 5 data offset: 8096 length: 16 binary: 0000000000100000 1001111110100000
TID: 6 data offset: 0 length: 0 binary: 0000000000000000 0000000000000000
That’s weird! It looks like there is an additional line pointer entry with tuple that points to the value ‘5’, which are at offset 8112 (which is the row that second asterisk was put in for), and offset 8096. If you look at the line pointer status bits (the last bit of the first bitfield ‘0000000000100000’) and the first bit of the next field, it’s all ’01’ which means LP_NORMAL (normal, live line pointer entry).
After some reading up in the source code, it turns out the upper value in the page header is used to determine which line pointer array entries are valid. So the above code should be:
PAGE_HEADER_UPPER=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t d2 -j 14 -N 2 | awk '{ print $1 }' )
for TUPLE_ID in $( seq 1 1000 ); do
LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
[ $PAGE_HEADER_UPPER -gt $TRUE_DATAPTR ] && break
TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done
Which will only show the actual entries based on the upper value in the page header. This shows:
TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000
If you look back to the raw page, it should be clear that vacuum did cleaning up in the index. The tuple for value ‘1’ was deleted, meaning that the first entry in the line pointer array now is the value ‘2’. Also the data for the index tuples at the end of the block has been cleaned up, the data for the new first tuple is copied over the old data, the tuple data moved to the end. As a result, there was a copy left of the data for the tuple with the value ‘5’ in the tuple data area and in the line pointer array, which is not used, because the ‘upper’ value in the page header will tell if a line pointer entry is active or not.
Another scenario is important to understand. What if you insert a lot of rows in a table, and then start updating rows? As we have seen, by default postgres fills up a table page to 100%. In other words: this is not an unlikely scenario. Let’s investigate how that works! In order to begin with a clean slate, drop mytable and create it again and insert a thousand rows. My second blogpost contains an anonymous PL/SQL block that inserts rows in a loop.
If we look at the first page of mytable, we see that it is entirely filled:
test=# select * from page_header(get_raw_page('mytable',0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
0/9BC4120 | 0 | 0 | 764 | 792 | 8192 | 8192 | 4 | 0
There is upper-lower=28 bytes free actually, but the heap tuple length is:
test=# select lp_len from heap_page_items(get_raw_page('mytable',0)) where lp=1;
lp_len
--------
39
So the row length is 39 bytes. So indeed page 0 of mytable is entirely filled.
New let’s update f1 of the first row (id=1) to something. As we know, this will create a new version of the row, which can’t fit in the first block, where the row with id=1 is located:
test=# update mytable set f1='ZZZZZZZZZZ' where id=1;
UPDATE 1
Let’s look at the first row how postgres handled that:
test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
1 | 8152 | 1 | 39 | 1835 | 1836 | 0 | (5,76) | 2 | 258 | 24 | | | \x010000001758585858585858585858
I think the update is entirely as expected, which means xmax has a transaction ID set indicating a new version exists. However t_ctid is set to 5,76, in other words: the new version is created in page 5 tuple ID 76, which means outside of the page where to original tuple version exist. If you look at the metadata of the new tuple:
test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
76 | 5152 | 1 | 39 | 1836 | 0 | 0 | (5,76) | 2 | 10242 | 24 | | | \x01000000175a5a5a5a5a5a5a5a5a5a
You see a normal new version, and no bits are set in t_infomask2 indicating a heap only tuple (hot). This means this update did change the index:
test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (1,182) | 16 | f | f | 6f 01 00 00 00 00 00 00
2 | (5,76) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
4 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
It seems weird that two index entries exist for id=1 (item offset 2 and 3) in a primary key index. If you think a little longer, it makes perfect sense: there is no version consistency information in the index. Currently two versions of the heap tuple with id=1 exist in the heap/table so it’s logical the index points to them both. Let’s see what happens when the table is vacuumed:
test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
1 | 0 | 0 | 0 | | | | | | | | | |
test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
76 | 5152 | 1 | 39 | 1836 | 0 | 0 | (5,76) | 2 | 10498 | 24 | | | \x01000000175a5a5a5a5a5a5a5a5a5a
test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (1,182) | 16 | f | f | 6f 01 00 00 00 00 00 00
2 | (5,76) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
After vacuuming, the old version of the row is emptied, and the index entry for the old version is deleted.
There is an additional thing inherent to how postgres handles DML to any tuple (version), which needs to be taken into account. These is what is commonly called tuple ‘hint bits’. The hint bits are bits in t_infomask in the tuple header (let me emphasise: the header with every single tuple!). During DML on a tuple, the transaction obviously is ongoing, and as such the xmax bits for a deleted or old version of a tuple are reset, and the xmin bits of the new version or inserted tuple are reset. This is how a DML operation leaves the tuples it worked on.
The first process that visits a tuple after a DML operation reads t_xmin and t_xmax to understand which version of the tuple should be active. However, during reading the metadata, if a tuple does not have the transaction status set for t_xmax (HEAP_XMAX_COMMITTED or HEAP_XMAX_INVALID) or for t_xmin (HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID), it must investigate whether the transaction is still ongoing, rolled back or committed, and set the t_infomask bits accordingly.
In order to do this, the process needs to read pg_clog and potentially pg_subtrans. Reading pg_clog/pg_subtrans involves extra reading, and setting the bits means changes to the page, which need to be recorded in the WAL file. This means a select (read only) query can lead to writing to the WAL file! It should be noted that after the just described more expensive first read, which sets the hint bits for a tuple, the next readers can take advantage of that work and do not need to read up on the transaction state of a tuple again, unless a transaction performs DML again. Running vacuum on a table (whether by autovacuum or explicitly) will set the hint bits too.
Conclusion
This blogpost looks at updates to tuples. In postgres, an update means the previous version gets xmax set, and a new version of a tuple is created with xmin set to the xmax of the old version. If the update happened to a non-indexed field and the new tuple can be allocated in the same page, the new tuple can be ‘HOT’, alias a heap only tuple. This way no modification needs to be made to the index, which is an optimisation. However, when the changed tuple is requested via the index, it points to the tuple ID of the old version, which contains the tuple ID of the updated version of that tuple. Of course the updated version can also be updated and point to a newer version.
When a HOT updated table is vacuumed, the index entry of a HOT updated tuple remains untouched, and the tuple ID in the table to which the index points to will be a dedicated pointer to the current version (no matter how many HOT row versions existed prior to vacuuming), without space allocated in the data area. All allocated space of a line pointer entry that is cleaned by vacuum is compacted, meaning that all active data entries are defragmented and allocated from the bottom.
If an indexed entry is updated, the new version is created and an index entry is created for the update. Because the old version still exists for read consistency, the index entry is not deleted, it needs to be available for read consistency. If the table is vacuumed, the old version is emptied from the table, and the index entry is deleted.
Very much alike an indexed field update, if an update causes a new version to be created in another page than the original version of the tuple, the update is not HOT, and causes the index to be modified and the new version from the update is added.
Any DML to a row resets what is called the ‘hint bits’. Of course only the hint bits of either xmin or xmax which are relevant to the transaction are reset. The hint bits describe the state of xmin and xmax, and need to be set by the next process that touches the heap/page tuple. The setting of the hint bits require the process to read pg_clog and potentially pg_subtrans to find out the state of the transaction, and set the hint bits, which is a change to the page, which requires WAL logging.
Like this:
Like Loading...