This is the third part in a series of blogposts about how postgresql manages data in its blocks (called ‘pages’ in postgres speak). If you found this post and did not read the previous ones, it might be beneficial to read block internals (part 1) and block internals, part 2 first. In these blogposts I’ve shown what heap and index pages look like, and how they can be investigated, including looking at the raw block information.

This blogpost is intended to show the effects of DML on pages. This is inherently different from the database implementation I use as my personal reference, which is the Oracle database.

If you followed the previous two articles, you have a table mytable with one thousand rows. In order to simplify things and grasp the essence of the implications of data changes, drop mytable and recreate the simple situation of mytable with four rows again:

test=# drop table if exists mytable;
DROP TABLE
test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

This should result in the table having four rows in the first page:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

And the second index page having the index tuples of these four rows:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The next thing to do is change the postgres configuration to disable autovacuum. This should absolutely not be done on any postgres database serving something useful, but in order to see the effects of data changes, and of the changes that result from a vacuum, vacuum should not kick in automatically, because then we can’t predict what we will be looking at. This is governed by the parameter ‘autovacuum’, which is turned on by default:

test=# show autovacuum;
 autovacuum
------------
 on

Turn autovacuum off and reload the configuration:

test=# alter system set autovacuum=off;
ALTER SYSTEM
test=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
test=# show autovacuum;
 autovacuum
------------
 off

Now let’s update id 1 and set f1 to ‘zzzzzzzzzz’:

test=# update mytable set f1 = 'zzzzzzzzzz' where id = 1;
UPDATE 1

Now let’s look at the page contents of the table mytable:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |   1791 |        0 | (0,5)  |       16386 |        258 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1791 |      0 |        0 | (0,5)  |       32770 |      10242 |     24 |        |       | \x01000000177a7a7a7a7a7a7a7a7a7a

This is a good example of how read consistency and row versioning work in postgres. The update inserted a new tuple at offset 7992 (lp_off) in the page, and added a pointer to it in the line pointer array (lp 5). Mind the ‘the update inserted a new tuple’, which is fundamental! In the metadata of tuple ID 1, xmax is set to indicate the maximum transaction ID which should be able to see this version of the tuple. The postgres documentation says ‘delete XID stamp’ for xmax. Tuple ID 5 is created with exactly the same XID for xmin as tuple 1 has for xmax, which is the transaction ID starting from which this tuple should be visible. Obviously, for this tuple xmax is set to 0, indicating no newer version exists. Also, the t_ctid (current tuple ID) for tuple 1 is set to (0,5) (block 0, tuple 5), indicating the next version of that tuple is tuple ID 5. To indicate what happened, infomask2 and infomask have been set.
infomask2 16386 means: heap has two attributes, heap HOT updated.
infomask 258 means: has variable attributes, xmin committed.

The new tuple has the following values for infomask2 and infomask:
infomask2 32770 means: heap has two attributes, is heap only tuple.
infomask 10242 means: has variable attributes, xmax invalid, this is an updated version of the row.
(see the blogpost postgresql block internals, part 2 for a description of the infomask2 and infomask bitfields)
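
If you prefer to let SQL do the decoding, the relevant bits can be checked directly from heap_page_items with the bitwise AND operator. A small illustrative query (16384 and 32768 are the HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE values, 0x4000 and 0x8000, from htup_details.h):

test=# select lp, t_ctid, (t_infomask2 & 16384) > 0 as hot_updated, (t_infomask2 & 32768) > 0 as heap_only from heap_page_items(get_raw_page('mytable',0));
 lp | t_ctid | hot_updated | heap_only
----+--------+-------------+-----------
  1 | (0,5)  | t           | f
  2 | (0,2)  | f           | f
  3 | (0,3)  | f           | f
  4 | (0,4)  | f           | f
  5 | (0,5)  | f           | t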

It should be clear now that the update caused the active version of the tuple with id = 1 to be tuple ID 5 instead of tuple ID 1. Please mind tuple ID 1 is still entirely present (in the line pointer array and in the data part of the page).

How does this look in the index?

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

Nothing changed!

Actually, this is exactly what ‘HOT’ (heap only tuple) in the above descriptions (infomask2, heap HOT updated) is about. A new version of a tuple is created, but it is not referenced by any index. This means that when the table/heap tuple with id=1 is accessed via the index, the index points to the original tuple ID (1) in the heap/table, which in turn points to the new version of the tuple (tuple ID 5). This is an optimisation that lessens the amount of work done when updating a field that is not indexed, at the cost of following an additional pointer.
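
As an aside, you do not have to read raw pages to see whether updates are HOT: the statistics collector keeps per-table counters for this. A quick check (the numbers may lag a moment, because statistics are reported asynchronously); after the single HOT update above, both counters should read 1:

test=# select relname, n_tup_upd, n_tup_hot_upd from pg_stat_user_tables where relname = 'mytable';
 relname | n_tup_upd | n_tup_hot_upd
---------+-----------+---------------
 mytable |         1 |             1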

What happens if the tuple with id = 1 gets updated again?

test=# update mytable set f1 = 'yyyyyyyyyy' where id = 1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1790 |   1791 |        0 | (0,5)  |       16386 |       1282 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1791 |   1792 |        0 | (0,6)  |       49154 |       8450 |     24 |        |       | \x01000000177a7a7a7a7a7a7a7a7a7a
  6 |   7952 |        1 |     39 |   1792 |      0 |        0 | (0,6)  |       32770 |      10242 |     24 |        |       | \x010000001779797979797979797979

infomask2 for tuple ID 5 now says: heap has two attributes, is heap only tuple and additionally: heap was HOT updated.
infomask for tuple ID 5 now says: heap has variable attributes, xmin committed and additionally: this is an updated version of the tuple.

Another update essentially performs exactly the same actions on tuple ID 5 as were previously performed on tuple ID 1: a new tuple is created containing the result of the update (a line pointer array entry and data at the end of the page), xmax is set for tuple ID 5 and xmin is set for tuple ID 6 with the same transaction ID, and the current tuple ID for tuple ID 5 is set to the new version of that tuple, which is tuple ID 6.

As can be seen by looking at the value for infomask2, tuple ID 6 now has the same value as tuple ID 5 had previously. Just like tuple ID 5 previously, the new tuple with ID 6 is a heap only tuple too, which means the index was once again left untouched, so the index still points to heap tuple ID 1. This means that a session that uses the index to find the tuple will be pointed to heap tuple ID 1, which points to tuple ID 5, which in this case points to tuple ID 6.
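
To verify that this pointer chasing is transparent to queries, you can nudge the planner towards an index access on this tiny table (enable_seqscan=off merely discourages the sequential scan it would otherwise prefer) and see that only the current version of the row is returned:

test=# set enable_seqscan = off;
SET
test=# select * from mytable where id = 1;
 id |     f1
----+------------
  1 | yyyyyyyyyy
test=# reset enable_seqscan;
RESET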

It’s obvious that this tuple pointer chasing can easily lead to more work being done to find a row/tuple, meaning lower performance. It is also obvious that this construction exists to provide a way to produce an older version of a tuple for read consistency. However, such an old version stops being useful after some time, namely once no transaction exists anymore that is entitled to see it (no active transaction with a snapshot older than the xmax of the old tuple), and it then needs to be purged. This is where vacuum comes in. Let’s vacuum mytable, and read the heap/table page again:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1790 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1790 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1790 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  6 |   8032 |        1 |     39 |   1792 |      0 |        0 | (0,6)  |       32770 |      10498 |     24 |        |       | \x010000001779797979797979797979

There are a few things vacuum did, and some things it didn’t:
– The line pointer array entries all stayed.
– The first entry now redirects to heap tuple ID 6: lp_flags=2 together with lp_len=0 means this is a redirect line pointer, and lp_off contains the tuple ID it redirects to rather than a byte offset. The reason the first line pointer still exists is that the index entry still points to it.
– The fifth line pointer entry is all zero, lp_flags=0 means ‘unused’.

If you look closely at the line pointer array entries and the tuple data offsets before and after vacuum, you see that the past versions of the tuple data do not exist anymore. If you look closely at the offset (lp_off) values of the tuples that are stored in the page, you see that ID 2 took the offset value that ID 1 had before vacuuming, indicating it moved all the way to the back of the page. In other words, vacuum compacted the data tuples, maximising free space in the page.

A logical next question is: okay, this is way better than how it was left for read consistency after the two transactions, but it is still not optimal. Can it be fully cleaned? The answer is yes, vacuum can be told to fully clean up, including removing any non-functional line pointer array entries, and as a logical consequence having all index entries point directly to the heap tuples they ought to point to. This is called ‘vacuum full’.

Why not use vacuum full all the time? The reason is that vacuum full requires a table lock for the entire duration of the operation, which is not desirable or even acceptable in a lot of databases. If normal vacuum is run often enough, running vacuum full will not yield much additional benefit in most cases, because the old row versions are then cleaned up in time, preventing excessive amounts of old versions in a page. The distinction matters because normal vacuum cleans up within a page, but it does not move tuples across pages, freeing up pages and thus operating system disk space, which is what vacuum full does. This is why autovacuum should be turned on: to clean up old versions in time.
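
One way to see the difference at the operating system level is to check the relation sizes before and after: a normal vacuum will usually not shrink the data files, while vacuum full rewrites the table (and rebuilds its indexes). On the small example table used here, pg_relation_size should report one 8 kB page for the table and two pages (metapage plus root/leaf page) for the index:

test=# select pg_relation_size('mytable') as table_bytes, pg_relation_size('pk_mytable') as index_bytes;
 table_bytes | index_bytes
-------------+-------------
        8192 |       16384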

When an update changes a field that is covered by an index, the update obviously cannot be ‘HOT’, because the index must reflect the fields it is created on. Let’s update the row with id=1 and set id to 5:

test=# update mytable set id=5 where id=1;
UPDATE 1
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1814 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1814 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1814 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     39 |   1817 |      0 |        0 | (0,5)  |           2 |      10242 |     24 |        |       | \x050000001779797979797979797979
  6 |   8032 |        1 |     39 |   1816 |   1817 |        0 | (0,5)  |       40962 |       8450 |     24 |        |       | \x010000001779797979797979797979

As you can see, tuple ID / line pointer number 5, which was emptied by vacuum, is now reused. The chain of tuple IDs now is: ID 1 points to ID 6. ID 6 has xmax set, and its t_ctid points to ID 5, whose xmin is set to the previous version’s xmax value. The value of 2 for t_infomask2 for ID 5 shows no bits indicating a HOT update. How does this look in the index?

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          5 | (0,5) |      16 | f     | f    | 05 00 00 00 00 00 00 00

Because this is not a HOT update, there is a change to the index: an index tuple is added for the new value ‘5’, pointing to heap page 0 tuple ID 5, which is the result of the update. It is interesting to see that the value ‘1’ still exists in the index, while ‘5’ is the current value and ‘1’ the old value of the same heap tuple. It is logical however once you realise there is no versioning data in the index: to determine whether a tuple should be visible, the heap/table tuple has to be read. Once again, if you vacuum the table, things get sorted out:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |   8152 |        1 |     39 |   1814 |      0 |        0 | (0,2)  |           2 |       2306 |     24 |        |       | \x020000001762626262626262626262
  3 |   8112 |        1 |     39 |   1814 |      0 |        0 | (0,3)  |           2 |       2306 |     24 |        |       | \x030000001763636363636363636363
  4 |   8072 |        1 |     39 |   1814 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   8032 |        1 |     39 |   1817 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x050000001779797979797979797979
  6 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |

Tuple ID 1 is now empty, because it is relieved of its function of pointing to tuple ID 6: tuple ID 6 got a new version outside the HOT chain, which means tuple ID 6 could be, and is, freed by vacuum. If you look at the index, you will see vacuum removed the index tuple for the old value:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          3 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          4 | (0,5) |      16 | f     | f    | 05 00 00 00 00 00 00 00

The bt_page_items function does not reveal much about line pointers and data offsets, in other words about what vacuum actually did, so let’s look at the raw block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 78 60 b5 09 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 05 00 00 00 00 00 00 00
*
0008128 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008144 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008160 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

The interesting bit is the second asterisk, which indicates a duplicate row: the index tuple pointing to the value ‘5’ appears twice. In order to understand what this means, let’s read the line pointer array for the index tuples. Here’s a little script to do that:

for TUPLE_ID in 1 2 3 4 5 6; do
 # each line pointer entry is 4 bytes; the array starts at offset 24, directly after the page header
 LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
 # first 2 bytes of the entry: 15 bits of data offset plus one flag bit
 DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
 # second 2 bytes of the entry: one flag bit plus 15 bits of tuple length
 LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
 LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
 # mask off the flag bit to get the data offset, shift out the flag bit to get the length
 TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
 TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
 # also print both 16-bit fields in binary, for easy inspection of the flag bits
 LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
 LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
 echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done

And this is the output:

TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000
TID: 5 data offset: 8096 length: 16 binary: 0000000000100000 1001111110100000
TID: 6 data offset: 0 length: 0 binary: 0000000000000000 0000000000000000

That’s weird! It looks like there is an additional line pointer entry and index tuple pointing to the value ‘5’: one at offset 8112 (which is the row the second asterisk stands for) and one at offset 8096. If you look at the line pointer status bits (the last bit of the first bitfield ‘0000000000100000’ and the first bit of the next field), they all read ’01’, which means LP_NORMAL (a normal, live line pointer entry).

After some reading up in the source code, it turns out the upper value in the page header is used to determine which line pointer array entries are valid. So the above code should be:

PAGE_HEADER_UPPER=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t d2 -j 14 -N 2 | awk '{ print $1 }' )
for TUPLE_ID in $( seq 1 1000 ); do
 LP_OFFSET=$( echo "24+(4 * ($TUPLE_ID - 1) )" | bc )
 DATAPTR=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LP_OFFSET -N 2 | awk '{ print $1 }' )
 LENGTH_OFFSET=$( echo "$LP_OFFSET+2" | bc )
 LENGTH=$( psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j $LENGTH_OFFSET -N 2 | awk '{ print $1 }' )
 TRUE_DATAPTR=$( echo $(( 0x$DATAPTR & ~$((1<<15)))) )
 [ $PAGE_HEADER_UPPER -gt $TRUE_DATAPTR ] && break
 TRUE_LENGTH=$( echo $((0x$LENGTH >> 1)) )
 LP_BITS_2=$( printf "%016d" $( echo "ibase=16;obase=2;${DATAPTR^^}" | bc ) )
 LP_BITS_1=$( printf "%016d" $( echo "ibase=16;obase=2;${LENGTH^^}" | bc ) )
 echo "TID: $TUPLE_ID data offset: $TRUE_DATAPTR length: $TRUE_LENGTH binary: $LP_BITS_1 $LP_BITS_2"
done

This only shows the entries that are actually in use, based on the upper value in the page header:

TID: 1 data offset: 8160 length: 16 binary: 0000000000100000 1001111111100000
TID: 2 data offset: 8144 length: 16 binary: 0000000000100000 1001111111010000
TID: 3 data offset: 8128 length: 16 binary: 0000000000100000 1001111111000000
TID: 4 data offset: 8112 length: 16 binary: 0000000000100000 1001111110110000

If you look back at the raw page, it should be clear that vacuum cleaned up in the index. The tuple for value ‘1’ was deleted, meaning the first entry in the line pointer array now points to the value ‘2’. The data for the index tuples at the end of the block has been compacted as well: the remaining tuple data was moved towards the end of the page, copied over the old data. As a result, a stale copy of the data for the tuple with the value ‘5’ is left behind in the tuple data area and in the line pointer array; it is simply not used, because the ‘upper’ value in the page header tells whether a line pointer entry is active or not.

Another scenario is important to understand. What if you insert a lot of rows in a table, and then start updating rows? As we have seen, by default postgres fills up a table page to 100%. In other words: this is not an unlikely scenario. Let’s investigate how that works! In order to begin with a clean slate, drop mytable, create it again and insert a thousand rows. My second blogpost contains an anonymous PL/pgSQL (DO) block that inserts rows in a loop.
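
For completeness, the repopulation can look something like this (a sketch based on the loop from part 2; the exact block is in that post):

test=# drop table if exists mytable;
DROP TABLE
test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# do $$
test$# begin
test$# for nr in 1..1000 loop
test$# insert into mytable (id, f1) values (nr, 'XXXXXXXXXX');
test$# end loop;
test$# end $$;
DO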

If we look at the first page of mytable, we see that it is entirely filled:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/9BC4120 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

There are actually upper-lower = 792-764 = 28 bytes free, but look at the heap tuple length:

test=# select lp_len from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp_len
--------
     39

The row length is 39 bytes, which does not fit in the remaining 28 bytes, so page 0 of mytable is indeed entirely filled.
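
You can also let postgres do this arithmetic on the page header itself:

test=# select upper - lower as free_bytes from page_header(get_raw_page('mytable',0));
 free_bytes
------------
         28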

Now let’s update f1 of the first row (id=1). As we know, this will create a new version of the row, which can’t fit in the first block, where the row with id=1 is located:

test=# update mytable set f1='ZZZZZZZZZZ' where id=1;
UPDATE 1

Let’s look at the first row to see how postgres handled that:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1835 |   1836 |        0 | (5,76) |           2 |        258 |     24 |        |       | \x010000001758585858585858585858

The update is largely as expected: xmax has a transaction ID set, indicating a newer version exists. However, t_ctid is set to (5,76), in other words the new version is created in page 5, tuple ID 76, outside of the page where the original tuple version exists. Look at the metadata of the new tuple:

test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
 76 |   5152 |        1 |     39 |   1836 |      0 |        0 | (5,76) |           2 |      10242 |     24 |        |       | \x01000000175a5a5a5a5a5a5a5a5a5a

You see a normal new version, and no bits are set in t_infomask2 indicating a heap only tuple (HOT). This means this update did change the index:

test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (5,76)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00

It seems weird that two index entries exist for id=1 (item offsets 2 and 3) in a primary key index. If you think about it a little longer, it makes perfect sense: there is no version consistency information in the index. Currently two versions of the heap tuple with id=1 exist in the heap/table, so it’s logical that the index points to them both. Let’s see what happens when the table is vacuumed:

test=# vacuum mytable;
VACUUM
test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp=1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
test=# select * from heap_page_items(get_raw_page('mytable',5)) where lp=76;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
 76 |   5152 |        1 |     39 |   1836 |      0 |        0 | (5,76) |           2 |      10498 |     24 |        |       | \x01000000175a5a5a5a5a5a5a5a5a5a
test=# select * from bt_page_items('pk_mytable',1) where itemoffset <5;
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (5,76)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00

After vacuuming, the old version of the row is emptied, and the index entry for the old version is deleted.

There is an additional thing inherent to how postgres handles DML to any tuple (version), which needs to be taken into account. These are what are commonly called tuple ‘hint bits’. The hint bits are bits in t_infomask in the tuple header (let me emphasise: the header of every single tuple!). During DML on a tuple, the transaction obviously is still ongoing, and as such the xmax commit-status bits for a deleted or old version of a tuple are left unset, as are the xmin commit-status bits of the new version or inserted tuple. This is how a DML operation leaves the tuples it worked on.

The first process that visits a tuple after a DML operation reads t_xmin and t_xmax to understand which version of the tuple should be active. However, during reading the metadata, if a tuple does not have the transaction status set for t_xmax (HEAP_XMAX_COMMITTED or HEAP_XMAX_INVALID) or for t_xmin (HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID), it must investigate whether the transaction is still ongoing, rolled back or committed, and set the t_infomask bits accordingly.

In order to do this, the process needs to read pg_clog and potentially pg_subtrans. Reading pg_clog/pg_subtrans involves extra reading, and setting the bits means changes to the page, which need to be recorded in the WAL file. This means a select (read only) query can lead to writing to the WAL file! It should be noted that after the just described more expensive first read, which sets the hint bits for a tuple, the next readers can take advantage of that work and do not need to read up on the transaction state of a tuple again, unless a transaction performs DML again. Running vacuum on a table (whether by autovacuum or explicitly) will set the hint bits too.
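
If you want to observe this, the commit-status hint bits can be read back from t_infomask with the same pageinspect function used throughout this post (256, 1024 and 2048 are HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID; see part 2 for the full bit listing). A sketch of such a query; running it before and after a select or vacuum of freshly modified rows shows the bits being set:

test=# select lp, (t_infomask & 256) > 0 as xmin_committed, (t_infomask & 1024) > 0 as xmax_committed, (t_infomask & 2048) > 0 as xmax_invalid from heap_page_items(get_raw_page('mytable',0)) where lp <= 3;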

Conclusion
This blogpost looks at updates to tuples. In postgres, an update means the previous version gets xmax set, and a new version of the tuple is created with xmin set to the xmax of the old version. If the update happened to a non-indexed field and the new tuple can be allocated in the same page, the new tuple can be ‘HOT’, a.k.a. a heap only tuple. This way no modification needs to be made to the index, which is an optimisation. However, when the changed tuple is requested via the index, the index points to the tuple ID of the old version, which contains the tuple ID of the updated version of that tuple. Of course the updated version can itself be updated again and point to an even newer version.

When a HOT updated table is vacuumed, the index entry of a HOT updated tuple remains untouched, and the tuple ID in the table to which the index points becomes a dedicated redirect pointer to the current version (no matter how many HOT row versions existed prior to vacuuming), without space allocated in the data area. The space of the line pointer entries cleaned by vacuum is compacted, meaning that all active data entries are defragmented and allocated from the bottom of the page.

If an indexed field is updated, a new heap version is created and an index entry is created for the new value. Because the old version still needs to be available for read consistency, its index entry is not deleted. Once the table is vacuumed, the old version is emptied from the table, and the old index entry is deleted.

Much like an indexed-field update, if an update causes the new version to be created in another page than the original version of the tuple, the update is not HOT, and the index is modified: an entry pointing to the new version from the update is added.

Any DML to a row leaves what are called the ‘hint bits’ unset. Of course only the hint bits of either xmin or xmax that are relevant to the transaction are affected. The hint bits describe the commit state of xmin and xmax, and need to be set by the next process that touches the heap tuple. Setting the hint bits requires the process to read pg_clog and potentially pg_subtrans to find out the state of the transaction; setting them is a change to the page, which requires WAL logging.

This is the second part in a series of blogposts about PostgreSQL database block internals. If you found this post and are interested in getting started with it, please read the first part first, and then continue with this one.
I am doing the investigations on Oracle Linux 7u3 with postgres 9.6 (both the latest versions at the time this blogpost was written).

In the first part I talked about the pageinspect extension, and investigated the page header and line pointer array. This blogpost looks at the actual tuples, including the index, and how these are stored in the pages.

The block structures are explained in the main documentation at: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
We can see the tuple metadata and raw data using the heap_page_items function:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

As has been described in the first blogpost, the ‘lp’ entries fetch their data from the line pointer array in the page header, and the ‘t’ entries fetch the data from the actual tuple.

This is the raw block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

In the previous blogpost I described how the offset of the rows can be found. This is a description of the tuple header data in the documentation:

Field           Type            Length  Offset  Description
t_xmin          TransactionId   4 bytes 0       insert XID stamp
t_xmax          TransactionId   4 bytes 4       delete XID stamp
t_cid           CommandId       4 bytes 8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac          TransactionId   4 bytes 8       XID for VACUUM operation moving a row version
t_ctid          ItemPointerData 6 bytes 12      current TID of this or newer row version
t_infomask2     uint16          2 bytes 18      number of attributes, plus various flag bits
t_infomask      uint16          2 bytes 20      various flag bits
t_hoff          uint8           1 byte  22      offset to user data
All the details can be found in src/include/access/htup_details.h.

Let’s extract the fields from the raw block:

$ # xmin 8152+0
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8152 -N 2
   1760
$ # xmax 8152+4
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t d2 -j 8156 -N 2
      0
$ # ctid 8152+12 (current tuple id), first 4 bytes is the block number, second 2 bytes tuple id
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x4 -j 8164 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8168 -N 2
     1
$ # t_infomask2 8152+18
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8170 -N 2
     2

What does ‘2’ mean? Once again, the source code to the rescue! A description of t_infomask2 can be found here: https://doxygen.postgresql.org/htup__details_8h_source.html At line 260 there is a description. t_infomask2 is a (binary, obviously) bit mask. Here is how you can look it up. I took the mask definition from the source code, and printed the binary value using ‘echo “ibase=16;obase=2;7FF” | bc’ in front of it:

     11111111111 #define HEAP_NATTS_MASK         0x07FF  /* 11 bits for number of attributes */
  10000000000000 #define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols
 100000000000000 #define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
1000000000000000 #define HEAP_ONLY_TUPLE         0x8000  /* this is heap-only tuple */
1110000000000000 #define HEAP2_XACT_MASK         0xE000  /* visibility-related bits */
1111111111111110 #define SpecTokenOffsetNumber       0xfffe

So, anything up to the number 2047 (0x7FF) in the infomask2 field describes the number of fields/attributes. In our case ‘2’ (binary 10) means there are two fields. Please mind that additional bits can be set (for a HOT update for example) while the number of fields is still described. With a bitmask, every position functions independently of the other positions.
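
The same masking can be done in SQL, which saves the manual binary conversion. A small illustrative query that extracts the attribute count with the 0x7FF (2047) mask:

test=# select lp, t_infomask2 & 2047 as natts from heap_page_items(get_raw_page('mytable',0));
 lp | natts
----+-------
  1 |     2
  2 |     2
  3 |     2
  4 |     2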

Now let’s look at the next field, t_infomask:

$ # t_infomask 8152+20
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t u2 -j 8172 -N 2
  2050

This too is a bitmask. The description is in https://doxygen.postgresql.org/htup__details_8h_source.html starting from line 173:

               1 #define HEAP_HASNULL            0x0001  /* has null attribute(s) */
              10 #define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
             100 #define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
            1000 #define HEAP_HASOID             0x0008  /* has an object-id field */
           10000 #define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-shared locker */
          100000 #define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */
         1000000 #define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
        10000000 #define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax, if valid, is only a locker */
                    /* xmax is a shared locker */
                 #define HEAP_XMAX_SHR_LOCK  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
                 #define HEAP_LOCK_MASK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                          HEAP_XMAX_KEYSHR_LOCK)
       100000000 #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
      1000000000 #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
                 #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
     10000000000 #define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
    100000000000 #define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */
   1000000000000 #define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
  10000000000000 #define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */
 100000000000000 #define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
1000000000000000 #define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
                 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
1111111111110000 #define HEAP_XACT_MASK          0xFFF0  /* visibility-related bits */

The value for t_infomask is 2050, this is binary (echo “obase=2;2050” | bc):

    100000000010

This means: 10: has variable-width attributes, and 100000000000: xmax is invalid (xmax is not set).

I created an index on this table too, called ‘pk_mytable’. The below text looks at an ordinary index. An index uses the same block size, the same block header, but different contents, quite obviously, because an index needs to maintain a balanced tree (btree) structure with ordered entries.

The first page of an index is a ‘metapage’ (a page that essentially points to the index root block). This is how the metapage can be inspected:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    1 |     0 |        1 |         0

As described earlier, a heap page, index metapage or index page all are normal postgres pages, which means they all contain a header which can be inspected using the ‘page_header’ function:

test=# select * from page_header(get_raw_page('pk_mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576A10 |        0 |     0 |    48 |  8176 |    8176 |     8192 |       4 |         0

The bt_metap function tells us the root block is in block 1. Index entries can be listed using the bt_page_items function:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          4 | (0,4) |      16 | f     | f    | 04 00 00 00 00 00 00 00

The index structure (like root/leaf blocks, next and previous block) can be investigated using the bt_page_stats function:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3

This tells us that this is a leaf block (type: l), that there are four indexed tuples (live_items) in this leaf block, that there is no previous index leaf page/left sibling (btpo_prev: 0), and that there is no next index leaf page/right sibling (btpo_next: 0). The field btpo is a union that contains either the tree level (zero for leaf pages) or the next transaction ID if the page has been deleted, and btpo_flags is a bitmap. The bitmap explanation is available in the source code; I added the actual bits in front of the defines (this is how that is done on the linux prompt: echo "obase=2;$((1 << 0 ))" | bc):

             1 #define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
            10 #define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
           100 #define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
          1000 #define BTP_META        (1 << 3)    /* meta-page */
         10000 #define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
        100000 #define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
       1000000 #define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
      10000000 #define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */
10011011111011 #define MAX_BT_CYCLE_ID     0xFF7F
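
Instead of converting by hand, the flag bits can also be tested straight from bt_page_stats; a small illustrative query for the two bits that matter here (1 and 2 are BTP_LEAF and BTP_ROOT from the listing above):

test=# select blkno, (btpo_flags & 1) > 0 as leaf, (btpo_flags & 2) > 0 as root from bt_page_stats('pk_mytable',1);
 blkno | leaf | root
-------+------+------
     1 | t    | t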

The btpo_flags field contains the value 3 (binary 11), which means this is a leaf block and also the root block. This is what the BTPageOpaqueData struct looks like, with offsets and byte sizes added by me:

typedef struct BTPageOpaqueData
Bytes   {
4         BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
4         BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
          union
          {
4             uint32      level;      /* tree level --- zero for leaf pages */
              TransactionId xact;     /* next transaction ID, if deleted */
          }           btpo;
2         uint16      btpo_flags;     /* flag bits, see below */
2         BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
       } BTPageOpaqueData;

This struct is placed at the end of the block at the offset for ‘special’ when using the page_header function on the index block.

This is how the block contents look like for an index:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 e8 6b 57 01 00 00 00 00 28 00 b0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look closely, you see the header is the same, which is obvious because, as just mentioned, all pages contain the same page header. Index entries too are allocated from the bottom up, to allow the header (the line pointer array, actually) to grow when additional entries are inserted into this page of the index. This is exactly the same line pointer array as we saw with a heap table, so we can extract the entries in the same way; the first entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fe0

Now flip the 16th bit to get the offset number:

$ echo $((0x9fe0 & ~$((1<<15))))
8160

Get the length of the index entry:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
0020

Right-shift:

$ echo $((0x0020 >> 1))
16

The offset number from the line pointer array points to the index tuple data header, which contains two fields, described in the annotated source code:

typedef struct IndexTupleData
{
   ItemPointerData t_tid;      /* reference TID to heap tuple */
   /* ---------------
    * t_info is laid out in the following fashion:
    *
    * 15th (high) bit: has nulls
    * 14th bit: has var-width attributes
    * 13th bit: unused
    * 12-0 bit: size of tuple
    * ---------------
    */
    unsigned short t_info;      /* various info about tuple */
} IndexTupleData;               /* MORE DATA FOLLOWS AT END OF STRUCT */

So the header contains, in ItemPointerData, the block and tuple ID of the heap row the index entry describes. The length of ItemPointerData is 6 bytes, and the header also contains a 2-byte bitmap called t_info, which means an index tuple header is 8 bytes. The index tuple header is not described in the postgres online manual. Did you notice there is no xmin/xmax in the index structure? An index does not do multi-versioning like a table does!

Let’s fetch the index tuple header data directly. First the tuple ID of the heap row (block number and tuple offset number):

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x4 -j 8160 -N 4
 00000000
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8164 -N 2
     1

The next field in the header is t_info:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t u2 -j 8166 -N 2
    16
$ echo "obase=2;16" | bc
10000

The value in t_info has its highest set bit at the 5th position, while the description says that only bits 13/14/15 are used as flags, so the only information in t_info here is the size of the tuple: 16 bytes.

It is widely known that a regular btree index stores its entries ordered, which is why you can range scan the index leaf blocks, an optimised way of searching through data. But how is that organised in the index if I enter data unordered? In the table the data will be added unordered, bottom up in the block, because the table has no requirement to store data ordered. But the index must provide the indexed values in an ordered way. The two options I see are: the indexed value is added bottom up, unordered, and the pointer to the entry is put in the correct position in the line pointer array, or: the indexed value is placed at the correct position in the data area AND the pointer is added at the correct position in the line pointer array. Let’s test this! We currently have the values 1-4 stored in the index; let’s add 6, and checkpoint:

test=# insert into mytable (id, f1) values (6, 'ffffffffff');
INSERT 0 1
test=# checkpoint;
CHECKPOINT

Now let’s look at the index block contents:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 c8 81 64 01 00 00 00 00 2c 00 a0 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

Here we see the new tuple with the value 6 is added at position 8096, and the corresponding line pointer entry is added (echo $((0x9fa0 & ~$((1<<15)))) = 8096). Now let's add 5 to mytable, and checkpoint:

test=# insert into mytable (id, f1) values (5, 'eeeeeeeeee');
INSERT 0 1
test=# checkpoint;

And dump the block contents again:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 00 85 64 01 00 00 00 00 30 00 90 1f
0000016 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00
0000032 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008080 00 00 00 00 06 00 10 00 05 00 00 00 00 00 00 00
0008096 00 00 00 00 05 00 10 00 06 00 00 00 00 00 00 00
0008112 00 00 00 00 04 00 10 00 04 00 00 00 00 00 00 00
0008128 00 00 00 00 03 00 10 00 03 00 00 00 00 00 00 00
0008144 00 00 00 00 02 00 10 00 02 00 00 00 00 00 00 00
0008160 00 00 00 00 01 00 10 00 01 00 00 00 00 00 00 00
0008176 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00
0008192

If you look at the index field data first (offset numbers 8080-8176), you see that the indexed field is added unordered. The number (field data) is at offset 8088, value 5, and at offset 8104, value 6, for the two rows we just inserted. If you look at the last two members of the line pointer array, you see that the ordering happens there:

$ # 5th array member: 24 + (4 * 4)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 40 -N 2
 9f90
$ echo $((0x9f90 & ~$((1<<15))))
8080
$ # 6th array member: 24 + (4 * 5)
$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('pk_mytable',1)" | xxd -p -r | od -A n -t x2 -j 44 -N 2
 9fa0
$ echo $((0x9fa0 & ~$((1<<15))))
8096

Now let’s fill up the table to 1000 rows to see what happens when a heap and an index block fill up:

test=# do $$
test$# begin
test$# for nr in 7..1000 loop
test$# insert into mytable (id, f1) values (nr, 'XXXXXXXXXX');
test$# end loop;
test$# end $$;
DO
test=# checkpoint;
CHECKPOINT

This is the first block of mytable:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/164E6A8 |        0 |     0 |   764 |   792 |    8192 |     8192 |       4 |         0

If you subtract lower from upper (792-764), you get the amount of free space in the first block: 28 bytes. I deliberately created all the rows with the same amount of data (an int and a varchar with 10 characters in it); let’s extract the first row:

test=# select * from heap_page_items(get_raw_page('mytable',0)) where lp = 1;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1779 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |       | \x010000001761616161616161616161

This shows in lp_len that the length of a single row is 39 bytes, which is more than the 28 bytes that are free. This means that by default (when no fillfactor is set), a table block is filled up to 100%.
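
For comparison, a lower fillfactor would leave free space in every page for new row versions. This is just to illustrate the syntax (mytable_ff is a hypothetical table name, not used elsewhere in this post):

test=# create table mytable_ff ( id int not null, f1 varchar(30) ) with (fillfactor = 90);
CREATE TABLE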

Let’s look at the index. First scan the metapage of the index ‘pk_mytable’:

test=# select * from bt_metap('pk_mytable');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1

The root of the index now is block 3 (obviously in the ‘root’ field). Let’s examine that page:

test=# select * from bt_page_stats('pk_mytable',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2

The old page changed from being a combined root and leaf page to being a dedicated leaf page, and a new root page was created in page number 3 of pk_mytable! This also explains why there are only 3 items in the page (live_items): this is a block that guides the search process to the leaf blocks. This is also visible in btpo_flags, which now is 2: only the bit for root page is set.

Let’s look at the contents of the root page:

test=# select * from bt_page_items('pk_mytable',3);
 itemoffset | ctid  | itemlen | nulls | vars |          data
------------+-------+---------+-------+------+-------------------------
          1 | (1,1) |       8 | f     | f    |
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00

A root/tree index page (called a ‘branch’ block in Oracle) contains pointers to index pages, which can be tree pages or leaf pages. Non-root tree pages are called ‘internal’ pages. The data field of an item contains the highest value of the branch or leaf page to its left (the page holding the lower values), while the ctid points to the page the item is the downlink for. With a tree level of one (bt_page_stats.btpo = 1), this means the data field holds the highest value covered by the previous leaf page. A leaf page always points to heap (table) pages. If this single tree page fills up, it is split, and a new root is created that points to the two tree pages which were just created as a result of the split of the old root page. Because a tree page item carries the highest value of the left/lower-value branch or leaf page, there is no value for the left-most leaf, because there is no ‘left side’ page. This is also visible in the ‘itemlen’ field when examining the index page using bt_page_items above: for that item only the 8-byte header is stored.

To get an idea of the id values stored in the leaf pages, take the hexadecimal value and convert it to decimal:

$ echo "ibase=16;016F" | bc
367
$ echo "ibase=16;02DD" | bc
733
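
The same conversion can also be done inside psql, using the hexadecimal bit string syntax that was already used with chr() earlier; a small sketch:

test=# select x'016f'::int as item2_data, x'02dd'::int as item3_data;
 item2_data | item3_data
------------+------------
        367 |        733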

This means pk_mytable leaf page 1 contains the id values below 367 (its high key), leaf page 2 the values from 367 up to but not including 733, and leaf page 4 the values from 733 up to 1000.

Let’s descend into the first leaf page, page 1 of pk_mytable, which was our old combined root and leaf page, and read the index meta data using bt_page_stats:

test=# select * from bt_page_stats('pk_mytable',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1

This shows it is a leaf page (type: l), but this time only a leaf page, which is visible in btpo_flags: only bit 1 is set, indicating a leaf page. btpo_prev is set to 0, which indicates this is the left-most page of the index; there is no page to its left. There is a page with higher values: btpo_next is set to 2, indicating block 2 of pk_mytable is the page to its right.

There are 808 bytes of free space available in the page. An index entry for this index is 16 bytes, which means there is still room for new entries. The way this is handled is summarised in the documentation in the source code for the function _bt_findsplitloc. Essentially, if an index page fills up, the page contents are split between two pages, each page getting roughly 50% of the page’s payload. When the page is a leaf page and also the right-most page (btpo_next=0), it is considered a special case, and the page is split leaving fillfactor% (default 90%) in the original page and the remaining percentage in the new right-most page, optimising space usage while still leaving room for updates.
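
To compare all index pages at once, bt_page_stats can be called for every non-metapage page in a single query; a sketch assuming pages 1 to 4 as seen above (output omitted here):

test=# select s.blkno, s.type, s.live_items, s.free_size, s.btpo_prev, s.btpo_next
       from generate_series(1,4) as g(blk), bt_page_stats('pk_mytable', g.blk) s;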

Now let’s look at the index tuples for the leftmost page:

test=# select * from bt_page_items('pk_mytable',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,6)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
          7 | (0,5)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
          8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
          9 | (0,8)   |      16 | f     | f    | 08 00 00 00 00 00 00 00
...etc...

What is visible is that the data is ordered; however, the first item does not hold the lowest value but the page’s ‘high key’ (an upper bound for the page), and the actual data items start at the second item with the lowest value. In the README that comes with the nbtree source this is explained:

On a page that is not rightmost in its tree level, the "high key" is 
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.  A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

Conclusion
In this post I moved beyond the page headers of a heap page and looked at what the tuple metadata looks like, notably xmin and xmax, the ctid and the infomasks, which are bitmap fields that hold specific status indicators per row, and how to extract these manually.

Next I moved to the (simple primary key) index on the id column. An index page partly looks the same as a heap block, because it contains the common block header. However, an index has a metapage as its first page and, as long as all index tuples fit in a single page, a combined root and leaf page. Once the combined root and leaf page grows out of the single page, a non-leaf root page is created, which stores pointers to the index leaf pages that contain the index tuples, together with the maximum values of the leaf pages, except for the leftmost leaf page.

Heap (table) pages are filled up to 100% by default, which can be changed with the fillfactor storage attribute. An index has a default fillfactor of 90%. If an index leaf page fills up, it is split, with roughly half of the entries going to each page. However, if the right-most leaf page fills up (which happens with an ever-increasing value, like a sequence or a timestamp), the split is done fillfactor% (90% by default) to 10% (old versus new page) instead of 50%-50%.

All index leaf pages except the rightmost one store the page’s highest value (the high key) as the first index tuple, and start with the lowest index field value as the second tuple, which avoids having to move the high key as new values are inserted.

The next post will look at what happens when row data gets changed. This is particularly exciting for me, because this is where postgres truly is different from my base reference, which is the Oracle database.

This blogpost is the result of me looking into how postgres works, and specifically into the database blocks. The inspiration for and essence of this blogpost come from two blogposts by Jeremiah Peschka: https://facility9.com/2011/03/postgresql-row-storage-fundamentals/ and https://facility9.com/2011/04/postgresql-update-internals/
I am using Oracle Linux 7u3 and postgres 9.6 (current versions when this blogpost was written).

Postgres is already installed, and a database cluster is already running. Let’s create a database ‘test’ for the sake of our tests:

$ createdb test

Once the database is created, logging on is done with ‘psql’:

$ psql -d test
psql (9.6.3)
Type "help" for help.

test=#

Once logged on, we need a table and index to investigate. Let’s create a very simple table with a primary key and insert some data:

test=# create table mytable ( id int not null, f1 varchar(30) );
CREATE TABLE
test=# alter table mytable add constraint pk_mytable primary key ( id );
ALTER TABLE
test=# insert into mytable ( id, f1 ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4

In order to look at the block structures there is a postgres extension called ‘pageinspect’. The extension is part of the ‘postgresql96-contrib’ RPM package. You can check whether it is installed on your machine by checking whether the file ‘/usr/pgsql-9.6/share/extension/pageinspect.control’ exists.
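
As a quick sketch, such a check on the operating system level could look like this (the yum line is only needed when the file is missing, and assumes the PGDG repository is configured):

$ ls /usr/pgsql-9.6/share/extension/pageinspect.control || sudo yum install -y postgresql96-contrib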

Once you have made sure the pageinspect operating system dependencies are met, you need to install the extension in the database. To verify whether it is installed, look in ‘pg_extension’. This is how that looks on a freshly installed database cluster:

test=# select * from pg_extension;
 extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
---------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql |       10 |           11 | f              | 1.0        |           |

This means the pageinspect extension is not installed; install it in the following way:

test=# create extension pageinspect;
CREATE EXTENSION
test=# select * from pg_extension;
   extname   | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
-------------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql     |       10 |           11 | f              | 1.0        |           |
 pageinspect |       10 |         2200 | t              | 1.5        |           |

Now with the pageinspect extension installed, let’s look at the heap (table) block. In order to understand what a block looks like, go to the documentation: https://www.postgresql.org/docs/9.6/static/storage-page-layout.html
The essence is that a heap block contains a header, immediately followed by the ItemIdData array (also known as line pointers, which point to the rows in the same block) growing from the top down, and the tuples allocated from the bottom up, leaving room in the middle for the line pointer array to grow.

The table mytable in this case consists of one block, because the table tuples (rows) fit in one. We can look at the page header by using the page_header function together with the get_raw_page function. Mind the name of the table and the block number in the function get_raw_page:

test=# select * from page_header(get_raw_page('mytable',0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1576BA8 |        0 |     0 |    40 |  8032 |    8192 |     8192 |       4 |         0

Next we want to look at the ItemIdData/line pointer array and rows, which are combined in the heap_page_items function. The ItemIdData array fields are indicated by ‘lp’, the fields in the tuple are indicated by ‘t’:

test=# select * from heap_page_items(get_raw_page('mytable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1760 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1760 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   1760 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   1760 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464

In case you wondered how to decipher the data:

test=# select chr(x'61'::integer);
 chr
-----
 a
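
The character portion of t_data can be decoded in one go as well; the last 20 hex digits of the first row are the ten ‘a’ characters (the 0x17 byte in front of them is the one-byte varlena length header). A sketch:

test=# select convert_from(decode('61616161616161616161', 'hex'), 'UTF8');
 convert_from
--------------
 aaaaaaaaaa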

In order to get a better understanding, it often helps to physically see the block contents (at least, that is how my brain works). This is how that can be done:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A d -t x1
0000000 00 00 00 00 a8 6b 57 01 00 00 00 00 28 00 60 1f
0000016 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00
0000032 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0008032 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008048 04 00 02 00 02 08 18 00 04 00 00 00 17 64 64 64
0008064 64 64 64 64 64 64 64 00 e0 06 00 00 00 00 00 00
0008080 00 00 00 00 00 00 00 00 03 00 02 00 02 08 18 00
0008096 03 00 00 00 17 63 63 63 63 63 63 63 63 63 63 00
0008112 e0 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0008128 02 00 02 00 02 08 18 00 02 00 00 00 17 62 62 62
0008144 62 62 62 62 62 62 62 00 e0 06 00 00 00 00 00 00
0008160 00 00 00 00 00 00 00 00 01 00 02 00 02 08 18 00
0008176 01 00 00 00 17 61 61 61 61 61 61 61 61 61 61 00
0008192

(if xxd is not installed on your system, it’s in the vim-common package)
If you look at the header and page items results, you see the header and line pointer items on the top, then a star after offset 48, which indicates identical lines (all zero, indicating non-touched free space), and the row data allocated from the bottom (remember 0x61 is ‘a’, indicating the first added row is at the bottom).

Let’s work a bit on the raw data, to see how it works. What I found EXTREMELY helpful (for me, coming from investigating a closed-source product like the Oracle database), is that the source code is available and fully searchable at http://doxygen.postgresql.org (thanks Jan and Devrim!!).

Let’s try to find the first record directly using the block contents.

First we need to find the line pointer array. In order to do that, I looked at the PageHeaderData struct in the source code:

Page header in the source code: https://doxygen.postgresql.org/structPageHeaderData.html
typedef struct PageHeaderData
  148 {
  149     /* XXX LSN is member of *any* block, not only page-organized ones */		size	/ offset
  150     PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog		8 bytes / 0
  151                                  * record for last change to this page */
  152     uint16      pd_checksum;    /* checksum */					2 bytes	/ 8
  153     uint16      pd_flags;       /* flag bits, see below */			2 bytes / 10
  154     LocationIndex pd_lower;     /* offset to start of free space */		2 bytes / 12
  155     LocationIndex pd_upper;     /* offset to end of free space */			2 bytes / 14
  156     LocationIndex pd_special;   /* offset to start of special space */		2 bytes / 16
  157     uint16      pd_pagesize_version;						2 bytes / 18
  158     TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */	4 bytes / 20
  159     ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */		4 bytes array / 24
  160 } PageHeaderData;

I added the size and offset numbers myself, to make it easier to follow. Above we see that the first array member should be at offset 24 from the start of the header. However, we also need to understand what ItemIdData looks like: https://doxygen.postgresql.org/structItemIdData.html

typedef struct ItemIdData
   25 {
   26     unsigned    lp_off:15,      /* offset to tuple (from start of page) */
   27                 lp_flags:2,     /* state of item pointer, see below */
   28                 lp_len:15;      /* byte length of tuple */
   29 } ItemIdData;

Now let’s get the offset number of the first row straight from the line pointer array in the block:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 24 -N 2
 9fd8

However, this number is too high: it’s decimal 40920, binary 1001 1111 1101 1000; in other words, the 16th bit is set. That bit is the first bit of the lp_flags field (lp_off itself is only 15 bits wide), so we need to set the 16th bit to zero:

$ echo $((0x9fd8 & ~$((1<<15))))
8152

This is how to get the tuple length from the line pointer array:

$ psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j 26 -N 2
 004e

This number is too high too: it’s decimal 78, while we know the length is 39 (see above). This is because the lowest bit of these two bytes is the second bit of the lp_flags field.
We need to right-shift the number once to get the actual length:

$ echo $((0x004e >> 1))
39

So there you have it: by reading the block directly, I fetched the line pointer values of the first tuple (offset number and length), which are offset 8152 and length 39 bytes. Using the above commands you can easily read the next line pointer by setting the correct number at ‘-j’, and doing the same bit manipulation.
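
To make reading an arbitrary line pointer a little less manual, the two reads and the bit manipulation can be combined. This is a hypothetical helper, not part of the original investigation, built only from the commands shown above (N is the 1-based line pointer number):

$ N=2
$ read off_hex len_hex <<< $(psql -d test -tA -c "select encode(get_raw_page::bytea, 'hex') from get_raw_page('mytable',0)" | xxd -p -r | od -A n -t x2 -j $((24 + (N - 1) * 4)) -N 4)
$ echo "lp $N: lp_off $((0x$off_hex & ~$((1<<15)))), lp_len $((0x$len_hex >> 1))"
lp 2: lp_off 8112, lp_len 39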

So, to conclude this blogpost: I’ve touched on the subject of the pageinspect extension, the searchable postgres source at http://doxygen.postgresql.org, the documentation at https://www.postgresql.org/docs/9.6/static/storage-page-layout.html, and showed the page header, the line pointer array in the page header, and the rows in the block. These can all be seen using the functions page_header, heap_page_items and get_raw_page. Then I showed how to fetch the raw block contents, to truly see the block, and how to fetch data directly using the encode and get_raw_page functions in postgres, and the xxd and od utilities on the Linux command line.

NB. Updated July 1st, 17:19 CET after comments of Jan Karremans.

This is a short blogpost meant as both an introduction for those who don’t know Ara and a guide on how to install Ara.
Ara means ‘Ansible Runtime Analysis’, and is a tool for recording and storing the metadata of Ansible playbook runs. It is very valuable, because it takes away a lot of guesswork and the need to add debug statements to your playbooks.

This is a guide on how to install Ara on Oracle Linux 7. I assume ansible is already installed. If you want to start fresh, add EPEL and yum install ansible and git. That’s all you need to begin!
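
On Oracle Linux 7 that boils down to something like this (as root):

rpm -i https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y ansible git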

First, become root and install ara using a playbook:

mkdir ansible
cd $_
mkdir roles
git clone https://git.openstack.org/openstack/ansible-role-ara roles/ara
echo "- hosts: localhost
  roles:
  - ara" > install_ara.yml
ansible-playbook install_ara.yml
useradd ansible
systemctl stop ara

Ara and a systemd unit (service) are created and started by the playbook. I create an ‘ansible’ user and stop the service: the ansible ara role creates a service running as root, and I don’t like running daemons as root if it’s not necessary.

The next part is executed as the user that is supposed to run ansible, which I created above:

echo "[defaults]
local_tmp = /home/ansible/.ansible/tmp
callback_plugins = /usr/lib/python2.7/site-packages/ara/plugins/callbacks
[ara]
dir = /home/ansible/.ara " > ~/.ansible.cfg

What this does is define the callback for ara, so playbook executions are recorded. By default the details are recorded in a sqlite3 database. This database (which is a file) is stored in /home/ansible/.ara, which is defined in the [ara] section using ‘dir’.

The next part is executed as root again:

pip install --upgrade ara
echo "[Unit]
Description=ARA
After=network.target

[Service]
Type=simple
User=ansible
Group=ansible
TimeoutStartSec=0
Restart=on-failure
RestartSec=10
RemainAfterExit=yes
ExecStart=/bin/ara-manage runserver -h 0.0.0.0 -p 9191

[Install]
WantedBy=multi-user.target" > /etc/systemd/system/ara.service
systemctl daemon-reload
systemctl start ara

This first executes pip (which was installed by the playbook above) to upgrade ara and its dependent packages. If this is not done, the website run by ara-manage will throw a 500 internal server error (!!). Then the echo overwrites the current ara webserver unit file, which runs the server as root, with a modified version that runs it as the ansible user. Next we inform systemd it needs to reload the config, and then we start the ara service again.

Now run an ansible playbook as the ansible user, and go to port 9191 with your browser. It will detail the entire run, and list all the plays and tasks, with all information! For the CLI die-hards, there is a CLI tool ‘ara’ which provides the same information as the website, but entirely text based. It can also generate static HTML files that can be passed on, though.

If you don’t want to record everything, remove or comment the callback_plugins setting in ~/.ansible.cfg. Uncomment or add it back if you want to record again. You can quite simply purge the history by stopping the ara service and removing the ~/.ara/ansible.sqlite file. Ara has an option to use mysql or postgresql instead of sqlite.
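
A purge could look something like this, executed as root (paths as configured above):

systemctl stop ara
rm /home/ansible/.ara/ansible.sqlite
systemctl start ara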

Addendum:
The CLI provides easy access to the results once you have figured out how to use it. This is an example:
First list all the playbook runs (I ran one playbook):

$ ara playbook list
+--------------------------------------+--------------------------------+---------------------+----------+----------+-----------------+
| ID                                   | Path                           | Time Start          | Duration | Complete | Ansible Version |
+--------------------------------------+--------------------------------+---------------------+----------+----------+-----------------+
| a152d1e5-fdab-4f50-b84a-de2d0b336124 | /home/ansible/ansible/test.yml | 2017-05-11 16:57:25 | 0:00:00  | True     | 2.3.0.0         |
+--------------------------------------+--------------------------------+---------------------+----------+----------+-----------------+

Then show the tasks in this playbook:

$ ara task list -b a152d1e5-fdab-4f50-b84a-de2d0b336124
+--------------------------------------+-----------------+------+------+---------+---------------------+----------+
| ID                                   | Name            | Path | Line | Action  | Time Start          | Duration |
+--------------------------------------+-----------------+------+------+---------+---------------------+----------+
| bf0bbc33-b0a4-4ac3-8242-58f5c8f1c4e5 | Gathering Facts | None | None | setup   | 2017-05-11 16:57:25 | 0:00:00  |
| 2b952b48-611f-424d-ba06-b94e22e87b1c | command         | None | 3    | command | 2017-05-11 16:57:26 | 0:00:00  |
+--------------------------------------+-----------------+------+------+---------+---------------------+----------+

Once you got the task IDs, you can list the task results:

$ ara result list --task 2b952b48-611f-424d-ba06-b94e22e87b1c
+--------------------------------------+-----------+---------+---------+---------------+---------------------+----------+
| ID                                   | Host      | Action  | Status  | Ignore Errors | Time Start          | Duration |
+--------------------------------------+-----------+---------+---------+---------------+---------------------+----------+
| d9278719-209e-41f8-8ecc-4ad2681dcdf6 | localhost | command | changed | False         | 2017-05-11 16:57:26 | 0:00:00  |
+--------------------------------------+-----------+---------+---------+---------------+---------------------+----------+

And then show the result in long format:

$ ara result show d9278719-209e-41f8-8ecc-4ad2681dcdf6 --long
+---------------+---------------------------------------------+
| Field         | Value                                       |
+---------------+---------------------------------------------+
| ID            | d9278719-209e-41f8-8ecc-4ad2681dcdf6        |
| Playbook ID   | a152d1e5-fdab-4f50-b84a-de2d0b336124        |
| Playbook Path | /home/ansible/ansible/test.yml              |
| Play ID       | 683dbe2b-84cc-4e17-b2ec-9b013e7d7612        |
| Play Name     | localhost                                   |
| Task ID       | 2b952b48-611f-424d-ba06-b94e22e87b1c        |
| Task Name     | command                                     |
| Host          | localhost                                   |
| Action        | command                                     |
| Status        | changed                                     |
| Ignore Errors | False                                       |
| Time Start    | 2017-05-11 16:57:26                         |
| Time End      | 2017-05-11 16:57:26                         |
| Duration      | 0:00:00                                     |
| Result        | {                                           |
|               |     "changed": true,                        |
|               |     "cmd": "echo \"test\"",                 |
|               |     "delta": "0:00:00.002718",              |
|               |     "end": "2017-05-11 16:57:26.616876",    |
|               |     "rc": 0,                                |
|               |     "start": "2017-05-11 16:57:26.614158",  |
|               |     "stderr": "",                           |
|               |     "stderr_lines": [],                     |
|               |     "stdout": "test",                       |
|               |     "stdout_lines": [                       |
|               |         "test"                              |
|               |     ]                                       |
|               | }                                           |
+---------------+---------------------------------------------+

This is how to install ara on OSX:

sudo pip install ara --user

I am not sure sudo is needed when installing with ‘--user’; I ended up with directories in my local user Library directory for use with python that were not readable by my user ‘frits.hoogland’. In order to correct that, I ran:

cd
sudo chown -R frits.hoogland Library 

Now fetch the ara directory in the python site-packages directory:

python -c "import os,ara; print(os.path.dirname(ara.__file__))"
/Users/frits.hoogland/Library/Python/2.7/lib/python/site-packages/ara

Now create an ansible config file to activate the ara callback for ansible-playbook runs and to point the ara server to the sqlite database:

echo "[defaults]
local_tmp = /Users/frits.hoogland/.ansible/tmp
callback_plugins = /Users/frits.hoogland/Library/Python/2.7/lib/python/site-packages/ara/plugins/callbacks
[ara]
dir = /Users/frits.hoogland/.ara" > .ansible.cfg

At this moment the ara callback is activated. If you don’t want the overhead of ara writing all information to the sqlite database, comment ‘callback_plugins’.

If you want to use the ara webserver, activate it using:

nohup ara-manage runserver -h localhost -p 9191 &

If you want to deactivate the ara webserver, stop it using:

pkill -f ara-manage

This blogpost is about how to install the semaphore user-interface for running ansible. Ansible is an automation language for automating IT infrastructures. It consists of command-line executables (ansible and ansible-playbook, for example) that can run a single task using a module (the ansible executable), or run multiple tasks using multiple modules in order to perform more complex setup requirements (the ansible-playbook executable). The downside of running IT tasks via the command line is that there is no logging by default, unless someone decides to save the standard output to a file, which, if multiple people start doing that by hand, will probably lead to a huge collection of text files that is hard to navigate. Also, when tasks are run from a common place, it’s an all or nothing situation: everybody has access to all the scripts, or to nothing.

This is where semaphore comes in: semaphore is a web-based user interface for running Ansible playbooks, for which you can define users, grant users access to certain templates (groups of playbooks, inventories (=groups of hosts) and ssh keys), which they can run, for which the output is saved in a database.

Semaphore is an alternative to Ansible Tower, which is a non-free product from RedHat. Semaphore is an open source project under active development, and is relatively “young”, which means it can (still) have “sharp edges”. Also Ansible Tower has more functionality.

Another feature of semaphore (and tower, for that matter) is that any manipulation done via the web interface can also be done via a REST API. This means that if you use another language or product for the planning and/or deployment of (virtual, cloud, etc.) machines which can do REST calls, or if you can craft REST calls with it, you can hand the management of running Ansible for provisioning a machine or a group of machines over to semaphore, so tracking and logging of the Ansible runs is taken care of.

However, I found the documentation on how to install semaphore in a real-life usable way less detailed than I would like. For that reason, I decided to write it down in this blogpost. Actually, most of the tasks are done using an Ansible playbook, so the installation is quite simple.

I am using a machine running Oracle Linux 7u3; I think these instructions are usable on any CentOS/RHEL/OL 7 version.

1. Install EPEL
EPEL has additional packages to the packages in the distribution repositories.

# rpm -i https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

2. Install Ansible and git

# yum install -y ansible git

3. Create a key pair for the root user, and add the public key to authorized_keys

# ssh-keygen -N '' -f /root/.ssh/id_rsa
# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys

4. Clone install_semaphore repository

# git clone https://FritsHoogland@gitlab.com/FritsHoogland/install_semaphore.git install_semaphore

You might want to change the admin_(username|email|desc|password) settings in the install_semaphore.yml file, which are the credentials for the semaphore superuser. If you like, you can change the mysql credentials too; however, I think the default settings are okay, because the mysql database is hidden behind the firewall.

5. Create inventory, disable known_hosts validation and run the semaphore install playbook

# cd ~/install_semaphore
# echo "localhost" > hosts
# export ANSIBLE_HOST_KEY_CHECKING=False
# ansible-playbook -i hosts install_semaphore.yml

Please mind this requires root to be allowed to sudo to any user.

Done!

Actually, this is a follow-up post to my performance deep dive into tablespace encryption. After having investigated how tablespace encryption works, this blogpost looks at the other encryption option: column encryption. A conclusion that can be shared upfront is that although they basically perform the same function, the implementation and performance consequences are quite different.

Column encryption gives you the ability to choose to encrypt per individual column, that’s kind of obvious. However, having to choose which columns to encrypt is what I see as the biggest downside of this encryption option. In most cases, especially with boxed applications, it is quite hard to figure out exactly which columns you want to encrypt in order to protect your sensitive data. Which columns exactly contain your primary sensitive data, and which columns contain secondary sensitive data (data derived from sensitive data)? Do you, when you have to apply encryption, know EXACTLY what is defined as sensitive data, and what isn’t? I bet there isn’t a clear technical description.

A logical reaction then would be ‘couldn’t I then just encrypt all columns’? Well, that is what tablespace encryption is for, isn’t it? To summarise: I think the correct use of column encryption is, in reality, hard to implement and thus very limited in usefulness; in most cases tablespace encryption should be used.

Okay…for this test I created a table with two columns, of which one is encrypted:

SQL> create table column_encryption (id number, a varchar2(10) encrypt);
SQL> insert into column_encryption values (1, 'AAAAAAAAAA');
SQL> commit;

The same table, but without encryption:

SQL> create table no_column_encryption (id number, a varchar2(10) );
SQL> insert into no_column_encryption values (1, 'AAAAAAAAAA');
SQL> commit;

And the same table with a lot of rows:

SQL> create table column_encryption_large (id number, a varchar2(10) encrypt);
SQL> begin
 	     for counter in 1..32000000 loop
 		     insert into column_encryption_large values ( counter, dbms_random.string('l',10) );
 	     end loop;
 end;
/

Let’s follow the path of the previous TDE post, and profile the execution of a SQL statement on the big table to see the impact of column encryption. The first test is a ‘select count(*) from column_encryption_large’ in one session, and ‘perf record -g -p PID’ in another. If you need more explanation on how to run it, please look at the previous blogpost. This is the output of ‘perf report --sort comm --max-stack 2’:

# perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 1418165467
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |--29.21%-- kdstf00000010000100kmP
            |--12.58%-- kdbulk
            |--3.32%-- gup_pte_range
            |--2.58%-- kdst_fetch0
            |--2.54%-- kcbgtcr
            |--2.25%-- __blk_bios_map_sg
            |--2.21%-- kcbhvbo
            |--2.18%-- unlock_page
            |--1.98%-- ktrgcm
            |--1.93%-- do_direct_IO
            |--1.86%-- kcbldrget
            |--1.52%-- kcoapl

This shows IO related functions, both on the Oracle and on the operating system level; kdstf, for example, is ‘kernel data scan table full’, and gup_pte_range, do_direct_IO, unlock_page and __blk_bios_map_sg are Linux kernel functions. Most notably, there are no encryption related functions, which is a big difference with tablespace encryption!
This is actually very logical if you understand the differences between column encryption and tablespace encryption. First let’s look at a block dump of a data block of a segment in an encrypted tablespace:

Block dump from cache:
Dump of buffer cache at level 4 for pdb=0 tsn=5 rdba=907
Block dump from disk:
Encrypted block <5, 907> content will not be dumped. Dumping header only.
buffer tsn: 5 rdba: 0x0000038b (1024/907)
scn: 0x0.4e9af4 seq: 0x01 flg: 0x16 tail: 0x9af40601
frmt: 0x02 chkval: 0xf23a type: 0x06=trans data

Yes…you read that right: the block is encrypted, so it will not be dumped. Luckily, you can set the undocumented parameter “_sga_clear_dump” to true to make Oracle dump the block:

SQL> alter session set "_sga_clear_dump"=true;
SQL> alter system dump datafile 5 block 907;

This will make Oracle dump the block. The dump will show the decrypted version of the tablespace level encrypted block:

Block header dump:  0x0000038b
 Object id on Block? Y
 seg/obj: 0x17bc3  csc: 0x00.4e9aed  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x388 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0007.01d.000001d0  0x00000987.0390.27  --U-    1  fsc 0x0000.004e9af4
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x0000038b
data_block_dump,data header at 0x7f140f335374
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f140f335374
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f8a
avsp=0x1f76
tosp=0x1f76
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f8a
block_row_dump:
tab 0, row 0, @0x1f8a
tl: 14 fb: --H-FL-- lb: 0x1  cc: 1
col  0: [10]  41 41 41 41 41 41 41 41 41 41
end_of_block_dump

For the count(*), there is no need to read the actual row data; the only thing needed is to read the row directory to fetch the number of rows (nrow=1, line 19 of the dump). However, to do that, the block must be decrypted.

Now look at a block dump of a column encrypted data block:

Block header dump:  0x0000032b
 Object id on Block? Y
 seg/obj: 0x1821d  csc: 0x00.676d7e  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x328 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.007.000078a9  0x00000117.2246.07  --U-    1  fsc 0x0000.00676d7f
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x0000032b
data_block_dump,data header at 0x7f140f333264
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f140f333264
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f5d
avsp=0x1f49
tosp=0x1f49
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f5d
block_row_dump:
tab 0, row 0, @0x1f5d
tl: 59 fb: --H-FL-- lb: 0x1  cc: 2
col  0: [ 2]  c1 02
col  1: [52]
 fd e0 87 66 55 f7 e6 43 de be 31 f6 71 4f 7f 4e f1 75 fb 88 98 9d 13 ed 8e
 cb 69 02 bc 29 51 bd 21 ea 22 04 6b 70 e9 ec 01 9d d6 e4 5a 84 01 1d 90 b0
 e9 01
end_of_block_dump

The block and the row directory can be read normally without any need for decryption. The only thing encrypted is the column (“a”). That perfectly explains the absence of any functions that indicate decryption, because there isn’t any decryption taking place!

Now let’s rewrite the SQL to touch the data, and thus involve decryption: ‘select avg(length(a)) from column_encryption_large’. This way the column value needs to be decrypted and read. This is how the output of a perf recording looks:

# perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 65K of event 'cycles'
# Event count (approx.): 229042607170
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |--24.73%-- ztchsh1h
            |--14.91%-- ztchsh1n
            |--6.10%-- y8_ExpandRijndaelKey
            |--5.90%-- ownGetReg
            |--5.50%-- __intel_ssse3_rep_memcpy
            |--4.99%-- ztchsh1f
            |--4.28%-- ztcxi
            |--2.60%-- ipp_is_GenuineIntel
            |--1.88%-- _intel_fast_memcpy
            |--1.74%-- _intel_fast_memcpy.P
            |--1.52%-- kspgip
            |--1.16%-- kgce_init

The functions starting with ‘ztc’ are probably related to security (“zecurity”), and also probably related to decryption. The function name “y8_ExpandRijndaelKey” is clearly related to encryption. When you look up the function address of “ownGetReg”, it’s close to the “y8_ExpandRijndaelKey” function. The last group of functions are memcpy related functions, that seems consistent with decrypting: moving data.

On the performance side, it’s clear that the majority of the time is spent in the functions ztchsh1h and ztchsh1n. In order to understand more about these functions, let’s expand the stack:

# perf report --sort comm
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 65K of event 'cycles'
# Event count (approx.): 229035032972
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |
            |--25.01%-- ztchsh1h
            |          |
            |          |--99.63%-- ztchsh1n
            |          |          |
            |          |          |--50.85%-- ztchsh1f
            |          |          |          ztchf
            |          |          |          ztcxf
            |          |          |          ztcx
            |          |          |          kztsmohmwl
            |          |          |          kztsmhmwl
            |          |          |          kzekmetc
            |          |          |          kzecsqen
            |          |          |          kzecctex
            |          |          |          evaopn2
            |          |          |          evaopn2
            |          |          |          qesaAggNonDistSS
            |          |          |          kdstf00001010000000km
            |          |          |          kdsttgr
            |          |          |          qertbFetch
            |          |          |          qergsFetch
            |          |          |          opifch2
            |          |          |          kpoal8
------------------------------------------------------
            |--14.90%-- ztchsh1n
            |          |
            |          |--85.25%-- ztchsh1f
            |          |          ztchf
            |          |          ztcxf
            |          |          ztcx
            |          |          kztsmohmwl
            |          |          kztsmhmwl
            |          |          kzekmetc
            |          |          kzecsqen
            |          |          kzecctex
            |          |          evaopn2
            |          |          evaopn2
            |          |          qesaAggNonDistSS
            |          |          kdstf00001010000000km
            |          |          kdsttgr
            |          |          qertbFetch
            |          |          qergsFetch
            |          |          opifch2
            |          |          kpoal8

I fetched the stack of the two functions in which the most time was spent. The most important thing to see is that the decryption now takes place as part of processing the fetched data (qesaAggNonDistSS probably has something to do with aggregating data, evaopn2 probably is a function to evaluate operands) rather than as part of performing the (logical) IO; mind the absence of the kcbgtcr function.

The reason for doing the decryption during operand evaluation rather than during doing the IO is because the data is stored encrypted in the block, and thus also in the buffer cache. So during IO time, there is no need to decrypt anything. This also has another rather important consequence: for every row that has an encrypted column that is processed, decryption needs to take place. There does not seem to be any caching of the decrypted value for column encryption, which is logical from a security point of view, but has a severe performance consequence.

When doing a pin tools debugtrace on the above SQL for the processing of a single row (the table ‘column_encryption’, rather than ‘column_encryption_large’), applying the sed filters, and then grepping for a selective set of functions, this is how the processing looks:

 | | | | > qergsFetch(0x294512030, 0x7f871c9fa2f0, ...)
 | | | | | > qeaeAvg(0x7f8717ce9968, 0xe, ...)
 | | | | | < qeaeAvg+0x000000000063 returns: 0
 | | | | | > qertbFetch(0x294512178, 0x7f871ca08a68, ...)
 | | | | | | | | | | > kcbgtcr(0x7ffe2f9b3ae0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > pread64@plt(0x100, 0x1f428c000, ...)
 | | | | | | | | | | < kcbgtcr+0x000000003221 returns: 0x1f428c014
 | | | | | | | | | | | | | | > kcbgtcr(0x7ffe2f9b35d0, 0, ...)
 | | | | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1f428c014
 | | | | | | > kdsttgr(0x7f871c9f9918, 0, ...)
 | | | | | | | > kdstf00001010000000km(0x7f871c9f9918, 0, ...)
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > pread64@plt(0x100, 0x2b1115000, ...)
 | | | | | | | | | | | < kcbgtcr+0x000000003221 returns: 0x1e4aa6014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1e4aa6076
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1e4aa6076
 | | | | | | | | > qesaAggNonDistSS(0x7ffe2f9b45d0, 0x7fff, ...)
 | | | | | | | | | > evaopn2(0x294511ef0, 0x294512030, ...)
 | | | | | | | | | | > evaopn2(0x294511e68, 0x294512030, ...)
 | | | | | | | | | | | | | | | | | | | > ztchsh1n(0x7ffe2f9b1ef8, 0x11c4e8d0, ...)
 | | | | | | | | | | | | | | | | | | | > ztchsh1f(0x7ffe2f9b1ef8, 0x7ffe2f9b3100, ...)
 --> 168 times in total of ztchsh1n or ztchsh1f
 | | | | | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2c0
 | | | | | | | | | | > evalen(0x294511ef0, 0x7f871c9fa2c0, ...)
 | | | | | | | | | | < evalen+0x000000000147 returns: 0x2
 | | | | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2d0
 | | | | | | | | | > qeaeAvg(0x7f8717ce9968, 0xb, ...)
 | | | | | | | | | < qeaeAvg+0x000000000063 returns: 0x7f8717ce99c9
 | | | | | | | | < qesaAggNonDistSS+0x000000000193 returns: 0x7fff
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1dec30014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1dec30072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1dec30072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1deca4014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1deca4072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1deca4072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1e4be0014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1e4be0072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1e4be0072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1dedb2014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1dedb2072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1dedb2072
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | < kdst_fetch0+0x0000000011c9 returns: 0
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0
 | | | | | | | < kdstf00001010000000km+0x00000000035d returns: 0x7fff
 | | | | | | < kdsttgr+0x00000000085f returns: 0x7fff
 | | | | | < qertbFetch+0x000000001301 returns: 0x7fff
 | | | | | > qeaeAvg(0x7f8717ce9968, 0x294511f78, ...)
 | | | | | < qeaeAvg+0x000000000063 returns: 0x2
 | | | | | | > evaopn2(0x294511f78, 0, ...)
 | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2e0
 | | | | < qergsFetch+0x000000000f25 returns: 0

This is the explain plan of the ‘select avg(length(a)) from column_encryption’ SQL:

----------------------------------------------------------------------------------------
| Id  | Operation	   | Name	       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |		       |       |       |     3 (100)|	       |
|   1 |  SORT AGGREGATE    |		       |     1 |    53 |	    |	       |
|   2 |   TABLE ACCESS FULL| COLUMN_ENCRYPTION |     1 |    53 |     3	 (0)| 00:00:01 |
----------------------------------------------------------------------------------------

If you look back to the grepped debugtrace, and the execution plan:
Line 1: the sort aggregate rowsource (qergsFetch).
Line 4: the table access full (qertbFetch).
Line 5: here a logical read (kcbgtcr) is issued, and because the block didn’t exist in the cache, it was physically read (line 6: pread64). This is the segment header; the “real” scan of data blocks has not started yet.
Line 11: this is the ultra fast full table scan (kdstf00001010000000km). My guess is this function is a full table scan function with certain decisions hard coded instead of made at runtime, so the CPU’s branch prediction runs into fewer branch mispredictions.
Line 12: this is the part of the full table scan that fetches (loads) the data (kdst_fetch). What is special here is that a multiblock read is done: the kcbgtcr function triggers a single physical read for multiple logical blocks, which are later fetched per block (the kdst_fetch and kcbgtcr functions starting from lines 32, 38, etc).
Line 19: this function executes row based functions and aggregates the results per block/fetch (qesaAggNonDistSS).
Line 20: as part of fetching the row and executing functions, the row value is evaluated first (evaopn2).
Line 22/23: here the column is decrypted (made visible by the ztchsh1n/ztchsh1f functions, which are not necessarily the decryption functions themselves).
Line 26/29: here probably the length (evalen) and average (qeaeAvg) row functions are executed.
Line 32: the next block is processed, but no rows are found, and thus no need to execute rowsource (qe*) functions afterwards.

So, what do we know at this point regarding column encryption?
– Columns that are encrypted are stored encrypted in the block in the buffer cache.
– Which means they have to be decrypted every time the column values are read. This is different from tablespace encryption, where the block is stored encrypted on disk and is decrypted whenever it is read into the buffer cache.
– Functions related to column encryption specifically (different functions than seen with tablespace encryption) take roughly 40% of the time in my case.

Can the time spent on column decryption be optimised?
There are multiple ways you can change the way Oracle applies column encryption. There are four encryption types: 3DES168, AES128, AES192 and AES256. The default is AES192.
Here are query timings of doing a select avg(length(a)) from TABLE on my “large” table with 32 million rows:

3DES168 4:53
AES256 1:09
AES192 1:06
AES128 1:03

A way to optimise column encryption is to skip the extra verification by specifying ‘nomac’ in the encryption definition of the column. This also saves space (by default, extra space is used per column value for the message authentication code (MAC) that is used by the verification). These are the timings with the ‘nomac’ addition to disable encryption verification:

3DES168 3:59
AES256 0:26
AES192 0:23
AES128 0:22

This shows a significant reduction of time. However, if no encryption at all is applied to the column, the query timing is 0:03.
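
For reference, this is roughly how the encryption algorithm and the ‘nomac’ option are specified at the column level; a sketch only, not necessarily the exact statement used for the timings above:

SQL> create table column_encryption_large (id number, a varchar2(10) encrypt using 'AES128' 'NOMAC');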

Internals background information
The ztchsh1* functions are related to verification (compare the tablespace encryption blogpost, where the most time consuming functions were verification related too). Once ‘nomac’ is specified in the encryption definition of the column, the ztchsh* functions vanish, and the top time consuming functions are y8_ExpandRijndaelKey and ownGetReg, which clearly are directly related to decryption. The performance gain of ‘nomac’ is smaller with 3DES168 encryption.

Conclusion
I think tablespace encryption is the encryption method of choice for a normal implementation. In most cases it will be too much work to exactly figure out which columns to encrypt. If you still consider column encryption, you also should be aware that the column value is stored encrypted in the block and (as a consequence) in the cache. Every use of the encrypted column involves encryption or decryption, for which the overhead is significant, even with ‘nomac’ specified to disable (additional) verification.

This is a run through of a performance investigation into Oracle tablespace encryption. These are the versions this test was performed on:

$ cat /etc/oracle-release
Oracle Linux Server release 6.8
$ /u01/app/oracle/product/12.1.0.2/dbhome_1/OPatch/opatch lspatches
24315824;Database PSU 12.1.0.2.161018, Oracle JavaVM Component (OCT2016)
24006101;Database Patch Set Update : 12.1.0.2.161018 (24006101)

In this test I created an encrypted tablespace:

SQL> create tablespace is_encrypted datafile size 10m autoextend on next 10m encryption default storage(encrypt);

(this assumes you have setup a master encryption key already)
And I created an encrypted simple heap table with one row:

SQL> create table t_is_encrypted ( f1 varchar2(100) ) tablespace is_encrypted;
SQL> insert into t_is_encrypted values ('AAAAAAAAAA');

Also, in order to do some performance tests, I created a simple heap table with a lot of rows:

SQL> create table t_encrypted_large ( id number, random_string varchar2( 100 ) ) tablespace is_encrypted;
SQL> begin
        for counter in 1..32000000 loop
                insert into t_encrypted_large values ( counter, dbms_random.string('l',50) );
        end loop;
end;
/

Let me first go back to basics a little and explain how tablespace encryption works: tablespace encryption sets the ‘E’ flag in the cache header of the blocks of a segment which contain data (block type# 6, ‘trans data’), which means the block will be UNencrypted in the buffer cache, but will be encrypted once the block is persisted to disk. When a block is read from disk into the cache, it must be decrypted first. Once a block is in the cache and decrypted, it can be read and used without the overhead of encryption, because it is not encrypted anymore.
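
Whether a tablespace is encrypted can be verified in the data dictionary; the ENCRYPTED column of DBA_TABLESPACES shows YES for the tablespace created above. A minimal sketch:

SQL> select tablespace_name, encrypted from dba_tablespaces where tablespace_name = 'IS_ENCRYPTED';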

A special case arises when direct path reads are used by Oracle. A direct path read means the blocks are read from disk into the process’ PGA, so there is no involvement of the buffer cache. This means that direct path reads ALWAYS must decrypt the data blocks with tablespace encryption, which means it will suffer the overhead of decryption. This inherent overhead applies to any direct path action (which means both direct reads and direct path writes).

Let’s first do a simple count(*) on table t_encrypted_large, and collect a wait profile. I am doing this with Tanel Poder’s snapper. First session, which collects the profile:

SQL> var snapper refcursor
SQL> @snapper all,begin 1 1 793

Now with the begin snapshot ready, head over to the other session, which will scan the t_encrypted_large table:

SQL> select count(*) from t_encrypted_large;

  COUNT(*)
----------
  32000000

Now end the snapper snapshot and look at where the time was spend:

SQL> @snapper all,end 1 1 793
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    793, SYS       , TIME, parse time elapsed                                        ,           140,     5.32us,      .0%, [          ],          ,           ,
    793, SYS       , TIME, PL/SQL execution elapsed time                             ,           113,     4.29us,      .0%, [          ],          ,           ,
    793, SYS       , TIME, DB CPU                                                    ,       4912502,   186.66ms,    18.7%, [@@        ],          ,           ,
    793, SYS       , TIME, sql execute elapsed time                                  ,       4915777,   186.78ms,    18.7%, [##        ],          ,           ,
    793, SYS       , TIME, DB time                                                   ,       4916441,   186.81ms,    18.7%, [##        ],          ,           ,       2.66 % unaccounted time
    793, SYS       , WAIT, Disk file operations I/O                                  ,            69,     2.62us,      .0%, [          ],         2,        .08,     34.5us average wait
    793, SYS       , WAIT, db file sequential read                                   ,           484,    18.39us,      .0%, [          ],         3,        .11,   161.33us average wait
    793, SYS       , WAIT, db file scattered read                                    ,           220,     8.36us,      .0%, [          ],         1,        .04,      220us average wait
    793, SYS       , WAIT, direct path read                                          ,          1641,    62.35us,      .0%, [          ],         4,        .15,   410.25us average wait
    793, SYS       , WAIT, SQL*Net message to client                                 ,             2,      .08us,      .0%, [          ],         3,        .11,      .67us average wait
    793, SYS       , WAIT, SQL*Net message from client                               ,      20701153,   786.56ms,    78.7%, [WWWWWWWW  ],         3,        .11,       6.9s average wait

The time spent executing SQL is 186.78ms, and during that time, 186.66ms was spent on CPU (DB CPU). The waits confirm this: the time in IO related waits is measured in microseconds. In case you wondered whether this output is sane: this system is using local NVMe disks, and you see the full table scan flipping from buffered multiblock IO (db file scattered read) to direct path multiblock IO (direct path read). The conclusion at this point is that practically all time was spent running on CPU. Of course this might be different if IO has higher latency (random IO latency of rotating media is between 5-12ms, reading from a SAN cache normally has a latency of around 1ms).

It is not possible to get a breakdown of where (ON) CPU time is spent inside the database. There are session counters, but these do not depict time. However, on the operating system level there is a useful tool: perf. In my experience, perf output is only really useful when the hardware PMU events are available from the operating system. This can be tested by specifying the event to use for perf. If you have the PMU events available, perf stat -e cycles sleep 1 will return something like:

$ perf stat -e cycles sleep 1

 Performance counter stats for 'sleep 1':

           736,481      cycles

       1.001210217 seconds time elapsed

If the PMU events are not available (because you are running in a virtual machine, for example, although some virtualisation products can expose the PMU events to the guest), it will look like:

$ perf stat -e cycles sleep 1

 Performance counter stats for 'sleep 1':

   <not supported>      cycles

       1.001025828 seconds time elapsed
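
As a side note, the events perf knows about on a system can be listed with perf list; when the PMU is exposed, cycles is listed with a [Hardware event] tag, and when it is not, the hardware events are simply absent. A small sketch:

$ perf list | grep -i 'hardware event'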

Back to the original purpose of this blogpost: the performance of tablespace encryption. Let’s get a breakdown of the functions the time is spent in while doing the ‘select count(*) from t_encrypted_large’. This is done by first starting perf record in one session, with the process ID of the server process specified, and then executing the ‘select count(*) from t_encrypted_large’ in that server process. Once the SQL has finished, the perf session can be stopped (ctrl-c), and the contents of the gathered perf data can be reported. This is how the recording is done:

$ perf record -g -p 57531
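
As a side note, the process ID of the server process (57531 here) can be looked up from the database session itself; a small sketch, assuming a dedicated server connection, run in the sqlplus session that will execute the SQL:

SQL> select spid from v$process where addr = (select paddr from v$session where sid = sys_context('userenv','sid'));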

Now the ‘select count(*) from t_encrypted_large’ is executed. Once the result comes back in the sqlplus session, go to the perf session and ctrl-c it:

^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.928 MB perf.data (5114 samples) ]

Now that the perf data is collected, display a summary of the data:

$ perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 5K of event 'cycles'
# Event count (approx.): 15445982213
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_57531_ao
            |--27.44%-- kdr4chk
            |--15.91%-- kdb4chk1
            |--10.14%-- cipher_loop_p3
            |--8.56%-- msort_with_tmp
            |--5.84%-- kdrrea2
            |--3.98%-- memcpy
            |--3.12%-- kd4_ent_cmp
            |--2.13%-- kdrrrh2
            |--1.99%-- kdstf00000010000100kmP
            |--1.53%-- __intel_ssse3_rep_memcpy
            |--1.42%-- __intel_new_memset
            |--1.40%-- blks_loop_p3
            |--1.14%-- y8_ExpandRijndaelKey
            |--1.06%-- kdbulk
            |--0.88%-- ownGetReg
            |--0.62%-- kghalf
             --12.86%-- [...]

What this shows is that 100% of the samples are from the process ‘oracle_57531_ao’, that 27.44% of the samples were spent in a function called ‘kdr4chk’, 15.91% in a function called ‘kdb4chk1’, and so on.

Just by looking at the function names, the top ones do not strike me as functions that do decryption. Function names like ‘cipher_loop_p3’ and ‘y8_ExpandRijndaelKey’ look more like functions doing some of the heavy lifting for decryption. In fact, the function names most of the time is spent in (kdr4chk and kdb4chk1) seem to indicate some kind of checking.

A first thought is to look at the database parameters for block checking, turn them off, and redo the test.

SQL> show parameter db_block_check

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_checking		     string	 OFF
db_block_checksum		     string	 FALSE

In fact, I had already turned db block checking and db block checksum off. So it is not the checking governed by these parameters; what can it be then?
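
For completeness, both parameters can be changed online with alter system; a minimal sketch (scope=both assumes an spfile is in use):

SQL> alter system set db_block_checking=off scope=both;
SQL> alter system set db_block_checksum=false scope=both;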

The best way I could come up with is getting a trace of the exact C functions the process executes. However, the number of C functions an executable goes through is typically gigantic. For that reason, when investigating the C functions it is best to use a very small table, in this case ‘t_is_encrypted’, which contains only one row, in order to minimise the processing.
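
For reference, this is roughly how such a one-row table in an encrypted tablespace can be set up. The tablespace name, column name and encryption algorithm below are my assumptions rather than the original test setup, a TDE keystore must already be configured and open, and the datafile clause assumes OMF is in use:

SQL> create tablespace ts_encrypted datafile size 100m encryption using 'AES256' default storage (encrypt);
SQL> create table t_is_encrypted ( f1 varchar2(10) ) tablespace ts_encrypted;
SQL> insert into t_is_encrypted values ('AAAAAAAAAA');
SQL> commit;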

However, by default there is no tooling on Linux that gives you the ability to trace the exact C functions a process is executing; perf, which was just used, samples execution rather than tracing it. The intel pin tools, however, give you the option to create a C function execution overview. Please use the link to see how you can download and link the pin tools, and then create a debugtrace of the execution of ‘select * from t_is_encrypted’ for yourself. If you are interested in the findings based on the debugtrace, read on!
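
For orientation, running such a trace boils down to attaching the debugtrace pintool to the server process, executing the SQL in sqlplus, and picking up the debugtrace.out file the tool writes in the current working directory. A sketch, assuming the pin kit is unpacked in the current directory and the DebugTrace example tool has been built in its default location (paths may differ in your installation):

$ ./pin -pid 57531 -t source/tools/DebugTrace/obj-intel64/debugtrace.so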

First of all, understanding the functions requires some experience. I don’t think there is any shortcut for gaining that knowledge: in order to learn how to interpret them you just have to study them and spend time working it out.

The output of debugtrace on ‘select * from t_is_encrypted’ (which fetches the one row in the table from an encrypted tablespace) looks like this:

         3 @@@ return underflow
Return 0x00007f225b355818 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x151
        14 @@@ return underflow
Return 0x000000000cf1036d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cf102c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cef19dd /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cef164a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000cf00486 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b2bdb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
       156 Tailcall 0x0000000005b2bdbe /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b2bda0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffe4ed10685, 0x7f225a965da0, ...)
       158 | Call 0x0000000005b2bda1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b32990 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
       172 | Return 0x0000000005b34d77 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffe4ed10685
       174 Return 0x0000000005b2bda7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffe4ed10685
       197 @@@ return underflow

That is not very readable. I created some simple sed filters to make the output easier to follow:

cat debugtrace.out | sed 's/^\ *\([0-9]*\)\(\ .*\)Call.*-> 0x.*:\(.*\)/\2> \3/' | sed 's/^\ *\([0-9]*\)\(\ .*\)Return.*:\(.*returns: .*\)/\2< \3/' | sed 's/^\ *\([0-9]*\)\(\ .*\)Tailcall.*-> 0x.*:\(.*\)/\2<> \3/' | grep -v underflow

If you run that against the same file, it gets better; the aim is to show the function call flow:

Return 0x00007f225b355818 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x151
Return 0x000000000cf1036d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
Return 0x000000000cf102c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
Return 0x000000000cef19dd /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
Return 0x000000000cef164a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
 > _intel_fast_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
 <> _intel_fast_memcpy.P(0x7ffe4ed10685, 0x7f225a965da0, ...)
 | > __intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
 | < __intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffe4ed10685
 < _intel_fast_memcpy.P+0x000000000007 returns: 0x7ffe4ed10685
Return 0x000000000cf004ed /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
 > kslwtectx(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | > sltrgftime64(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | | > sltrgftime64_cgt(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | | | > clock_gettime@plt(0x1, 0x7ffe4ed10230, ...)
 | | | | > clock_gettime(0x1, 0x7ffe4ed10230, ...)
 | | | | < clock_gettime+0x000000000054 returns: 0
 | | | < clock_gettime+0x00000000006a returns: 0
 | | < sltrgftime64_cgt+0x000000000058 returns: 0x68724a87136
 | < sltrgftime64+0x00000000005c returns: 0x68724a87136
 | > kslwt_end_snapshot(0x2c674c080, 0x2c674c080, ...)
 | < kslwt_end_snapshot+0x000000000235 returns: 0x2c674b128
 | > kslwt_update_stats_int(0x2c674c080, 0x2c674c080, ...)
 | | > kews_update_wait_time(0x6, 0x53696ef, ...)
 | | < kews_update_wait_time+0x00000000002b returns: 0x1
 | < kslwt_update_stats_int+0x000000000314 returns: 0x1b0
 | > kskthewt(0x68724a87136, 0x184, ...)
 | < kskthewt+0x0000000000d7 returns: 0x23
 < kslwtectx+0x0000000003c0 returns: 0xe923

The rows that start with ‘Return’ are functions that were already entered when debugtrace started; debugtrace could not determine their call depth, and reports an underflow in the output when the process returns from them.
The rows that start with ‘>’ mean the function after it is entered. The first two arguments are shown.
The rows that start with ‘<’ mean the function after it is returned from. The return code is shown.
The rows that start with ‘<>’ mean a ‘tail call’ was executed, which is a function that is executed as part of the return of another function.

To further explain this example: the above trace starts with the server process waiting on input. On linux this is done using a read function (the __read_nocancel function comes from the libpthread operating system library), which is a place a server process is expected to wait, and it is actually inside an oracle wait event (SQL*Net message from client). It becomes apparent the process was in a wait event because the code for ending a wait event (kslwtectx) is executed after the __read_nocancel function returns. A word of warning: in this output kslwtectx seems to be the outermost (left most) function. In reality this is not true; there are many more functions, but the process has not returned to them yet, so they are not known at that point in the trace. For the function a waiting database server process is sleeping in (__read_nocancel), the chain of callers is sntpread>ntpfprd>nsbasic_brc>nsbrecv>nioqrc>opikndf2>opitsk>opiino>opiodr>opidrv>sou2o>opimai_real>ssthrdmain>main.

Okay, now let’s look at the relevant output of the ‘select * from t_is_encrypted’. If you wade through the trace, it is best to first go to the line with ‘< kcbz_table_scan_read’. The kcbz_table_scan_read function performs a buffered multiblock read. ‘<’ means the return, and you need to go up from there to find the first important bit. Here’s the important bit:

 | | | | | | | | | | | > ksfd_clear_pgaobjsc(0, 0x2b, ...)
 | | | | | | | | | | | < ksfd_clear_pgaobjsc+0x000000000023 returns: 0
 | | | | | | | | | | | > _intel_fast_memcpy(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x1a5ad6000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x1a5ad6000
 | | | | | | | | | | | > _intel_fast_memcpy(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x182808000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x182808000
 | | | | | | | | | | | > _intel_fast_memcpy(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x178ed0000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x178ed0000
 | | | | | | | | | | | > _intel_fast_memcpy(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x12927c000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x12927c000
 | | | | | | | | | | | > _intel_fast_memcpy(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x15411c000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x15411c000
 | | | | | | | | | | | > kcbz_free_table_scan_read_buffer(0x2b1115000, 0xa000, ...)
 | | | | | | | | | | | | > kslgetl(0x60023a80, 0x1, ...)
 | | | | | | | | | | | | < kslgetl+0x000000000144 returns: 0x1
 | | | | | | | | | | | | > kslfre(0x60023a80, 0x8, ...)
 | | | | | | | | | | | | < kslfre+0x000000000125 returns: 0
 | | | | | | | | | | | < kcbz_free_table_scan_read_buffer+0x0000000000f3 returns: 0
 | | | | | | | | | | < kcbz_table_scan_read+0x000000000664 returns: 0x5

Going up from the kcbz_table_scan_read return, we see a call to kcbz_free_table_scan_read_buffer, which gets and frees a latch (‘cache table scan latch’). Above that we essentially see the table scan read buffer contents being copied into the buffer cache block by block, using an optimised version of the memcpy function. Luckily, debugtrace shows the first two arguments of a call; for memcpy these are a pointer to the destination as first argument and a pointer to the source as second. If you look closely at the memcpy arguments, you see that the destinations look random (“scattered” through the buffer cache, which is where ‘db file scattered read’ gets its name from), while the source starts at 0x2b1115000 and increases by exactly 0x2000. 0x2000 equals 8192, which is the block size. The reason for starting at this point is that the blocks are further processed in the buffer cache buffers in the order in which they were copied in kcbz_table_scan_read.

The first buffer copied is 0x1a5ad6000. The next sequence of functions:

 | | | | | | | | | | > kcbzvb(0x1a5ad6000, 0x5, ...)
 | | | | | | | | | | | > kcbhvbo(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | < kcbhvbo+0x0000000000cc returns: 0
 | | | | | | | | | | < kcbzvb+0x000000000084 returns: 0x1
 | | | | | | | | | | > kcbztek_trace_blk(0xd5e15b8, 0, ...)
 | | | | | | | | | | < kcbztek_trace_blk+0x00000000001d returns: 0x1a5fce498
 | | | | | | | | | | > _setjmp@plt(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | <> __sigsetjmp(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | <> __sigjmp_save(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | < __sigjmp_save+0x000000000010 returns: 0
 | | | | | | | | | | > kcbz_encdec_tbsblk(0x5, 0x38b, ...)

Next is the function ‘kcbzvb’, which probably is kernel cache buffers space management (Z) verify buffer. Then kcbztek_trace_blk. My current thinking is that this function reads the block to see if the encryption flag is set in the cache header, and whether it has already been decrypted. When a block is not encrypted, or is encrypted and already decrypted, kcbz_encdec_tbsblk is not called; in this case the block is encrypted and not decrypted yet, so we see the function kcbz_encdec_tbsblk being called. Inside kcbz_encdec_tbsblk a lot of function calls are done that strongly suggest they have something to do with keys and encryption, like kcbtse_get_tbskey, kcbtse_get_tbskey_info and a lot of functions starting with kgce (kgce_init, kgce_set_mode, kgce_set_key), most notably the function kgce_dec, which probably is kernel generic encryption – decryption. Here is how that function returns and what is executed next (still inside the function kcbz_encdec_tbsblk):

 | | | | | | | | | | | < kgce_dec+0x0000000000a6 returns: 0
 | | | | | | | | | | | > _intel_fast_memcpy(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x0000000025e8 returns: 0x1a5ad6014
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x1a5ad6014
 | | | | | | | | | | | > kcbhxoro(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | | > sxorchk(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | | < sxorchk+0x0000000000e6 returns: 0x46dd
 | | | | | | | | | | | < kcbhxoro+0x000000000013 returns: 0x46dd

The kgce_dec function returns, and my assumption is that the decryption of the block is now done. Next there is another memcpy. The arguments of memcpy are *dest, *src. The first address is the same buffer address we saw earlier, but at 0x1a5ad6014 instead of 0x1a5ad6000, so 0x14 (20) bytes past the start of the block. The source is 0x7f225a96c0f0; in most cases such high addresses are PGA memory addresses. So what I think happens here is: the block was copied into the buffer, then kcbztek_trace_blk found the encryption flag set and the block encrypted, so the kcbz_encdec_tbsblk function was called to decrypt the block, which is done in the PGA, and then the decrypted version of the block was copied into the buffer, overwriting the encrypted version.

The next function kcbhxoro probably is a function to check the block. This must mean the checking is done on the unencrypted block contents. Please mind we are still in kcbz_encdec_tbsblk!

Continuing the trace we see:

 | | | | | | | | | | | > kcbchk_ctx(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | > ktbdbchk(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | | > kdgchk(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | | | > kd4chk(0x1a5ad6000, 0x7fffffc7, ...)
 | | | | | | | | | | | | | | | > kdb4chk1(0x1a5ad6014, 0, ...)
 | | | | | | | | | | | | | | | | > kd4_alloc_ok(0xc9cee30, 0xc9cee80, ...)
 | | | | | | | | | | | | | | | | | > kcbtmal(0, 0x4, ...)
 | | | | | | | | | | | | | | | | | | > kghalf(0x7f225f080c80, 0x7f225f085980, ...)
 | | | | | | | | | | | | | | | | | | | > _intel_fast_memset(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | <> _intel_fast_memset.J(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | > __intel_memset(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | < __intel_memset+0x00000000031e returns: 0x7f225ef1ffb0
 | | | | | | | | | | | | | | | | | | | < _intel_fast_memset.J+0x000000000007 returns: 0x7f225ef1ffb0
 | | | | | | | | | | | | | | | | | | < kghalf+0x00000000061f returns: 0x7f225ef1ffb0

These are the apparent check functions! We also see they take the buffer address as first argument. This is where perf measured the majority of the time being spent (kdr4chk, not in this snippet, and kdb4chk1)! The last noteworthy things in the trace for handling this block are:

 | | | | | | | | | | | < kcbchk_ctx+0x0000000000ff returns: 0
 | | | | | | | | | | | > kgce_fini(0x7f225f080c80, 0x7ffe4ed03b98, ...)
 | | | | | | | | | | | | > skgce_end_ipp_f(0x7f225f080c80, 0x7ffe4ed03b98, ...)
 | | | | | | | | | | | | < skgce_end_ipp_f+0x000000000014 returns: 0
 | | | | | | | | | | | < kgce_fini+0x000000000037 returns: 0
 | | | | | | | | | | < kcbz_encdec_tbsblk+0x000000001f36 returns: 0

Here we see the return from the kcbchk_ctx call, which was identified above as a ‘check function’, then another call to a function in the encryption family (kgce_fini), which clearly indicates it is stopping the encryption/decryption functions (“fini”), and then the return from the kcbz_encdec_tbsblk function. There are some functions executed after kcbz_encdec_tbsblk, like kcbz_check_objd_typ. A little further on you will see the whole sequence starting again for the next block with the function kcbzvb.

This makes me wonder. We saw the block being read from disk into the table scan read buffer, copied from the read buffer into the cache, then being decrypted, then being checked. I turned db_block_checking and db_block_checksum off, so why would there be a need for checking? But let’s not get carried away; I made some assumptions. Am I right in my assumptions?

If you look at the logical flow it’s:
– table scan read buffer -> buffer cache buffer (kcbz_table_scan_read>_intel_fast_memcpy)
– identify if the block needs decrypting (kcbztek_trace_blk)
– decrypt buffer cache buffer (kcbz_encdec_tbsblk>kgce_dec)
– check buffer cache buffer (kcbz_encdec_tbsblk>kcbchk_ctx)

If I can identify the buffer, stop execution at kcbztek_trace_blk to look at the buffer and check whether it is encrypted, then stop at kcbchk_ctx, look at the same buffer again and see that it is decrypted, then my assumptions that the kgce_dec function is doing the decryption and that the checking is done on the unencrypted block contents are correct.

In order to do this, we can use gdb: have the function arguments of memcpy printed, break at the function kcbztek_trace_blk so the buffer can be examined, then continue, break at kcbchk_ctx, and examine the same buffer again. This is the gdb macro I created:

break _intel_fast_memcpy
  commands
    silent
    printf "_intel_fast_memcpy dest: %lx, src: %lx, size: %d\n", $rdi, $rsi, $rdx
    c
  end
break kcbztek_trace_blk
  commands
    silent
    print $rip
  end
break kcbchk_ctx
  commands
    silent
    print $rip
  end
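
This macro goes into a file, tde_verification.gdb, which is passed to gdb with -x below when attaching to the server process.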

Now create a database session and flush the buffer cache to make sure the small table t_is_encrypted is read from disk again, and thus has to be decrypted again.
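
The flush itself is an ordinary SQL command; this is what it looks like in the sqlplus session:

SQL> alter system flush buffer_cache;

System altered.

Next, attach gdb to the server process of that session and source the macro: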

$ gdb -p 57531 -x tde_verification.gdb
...
(gdb) set pagi off
(gdb) c
Continuing.

I turned pagination off (set pagi off) because there is a lot of memcpy going on, and otherwise gdb will pause its output whenever the screen is filled; then I used the continue (c) command, so we can use the sqlplus session again. Now execute ‘select * from t_is_encrypted’; this is what it looks like in the debugger:

_intel_fast_memcpy dest: 7f85093a7f0a, src: 7ffd589f6690, size: 7
_intel_fast_memcpy dest: 7f85093a7f11, src: 7ffd589f6624, size: 4
_intel_fast_memcpy dest: 7f85093a7f15, src: 7ffd589f6628, size: 4
_intel_fast_memcpy dest: 7f85093a7f19, src: 7ffd589f662c, size: 4
_intel_fast_memcpy dest: 7f85093a7f1d, src: 7ffd589f6630, size: 4
_intel_fast_memcpy dest: 7f85093bac08, src: 7ffd589f3238, size: 336
_intel_fast_memcpy dest: 7ffd589f0e30, src: 2b7f89c20, size: 456
_intel_fast_memcpy dest: 7ffd589f3238, src: 7f85093bac08, size: 336
$1 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>

This is actually reading the segment header, so we continue (c).

(gdb) c
Continuing.
_intel_fast_memcpy dest: 7f8508bf86a8, src: 7f8508bf8060, size: 712
_intel_fast_memcpy dest: 7f8508bf86c0, src: 7f8508bf8078, size: 264
_intel_fast_memcpy dest: 7f85093bac08, src: 7ffd589eb308, size: 336
_intel_fast_memcpy dest: 7ffd589e8f00, src: 2b7f89c20, size: 456
_intel_fast_memcpy dest: 7ffd589eb308, src: 7f85093bac08, size: 336
_intel_fast_memcpy dest: 1a5576000, src: 2b1115000, size: 8192
_intel_fast_memcpy dest: ee6b8000, src: 2b1117000, size: 8192
_intel_fast_memcpy dest: 12eb54000, src: 2b1119000, size: 8192
_intel_fast_memcpy dest: 200c56000, src: 2b111b000, size: 8192
_intel_fast_memcpy dest: 1a570e000, src: 2b111d000, size: 8192
$2 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>

Here we see kcbztek_trace_blk again, and we see memcpy executed 5 times with a size of 8192. These are the blocks being copied from the table scan read buffer! The buffer we are going to look at is 0x1a5576000. In order to examine the block, we print 8192 bytes starting from 0x1a5576000:

(gdb) x/8192bx 0x1a5576000
0x1a5576000:	0x06	0xa2	0x00	0x00	0x8b	0x03	0x00	0x00
0x1a5576008:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x01	0x16
0x1a5576010:	0x3a	0xf2	0x00	0x00	0x30	0x7f	0xf4	0x80
0x1a5576018:	0x5f	0xf1	0xd5	0xe9	0xd0	0x5a	0xeb	0x59
0x1a5576020:	0x5c	0x7c	0x90	0x87	0x21	0x46	0x44	0xf3
0x1a5576028:	0x66	0xa2	0x6a	0x3d	0xd4	0x97	0x6d	0x1e
0x1a5576030:	0xab	0xf8	0x06	0xd4	0x17	0x5c	0xc9	0x1a
0x1a5576038:	0xe6	0x86	0xf4	0x5a	0x20	0xbe	0x8d	0x22
0x1a5576040:	0x0c	0x2a	0x29	0x2b	0xa1	0x2a	0x3b	0x86
0x1a5576048:	0x4b	0x31	0x13	0xa7	0x4c	0xcd	0x4c	0xa4
0x1a5576050:	0xf1	0x1a	0xa3	0x59	0xf0	0x70	0x31	0xd5
...snip...
0x1a5577fb8:	0xf8	0x24	0x90	0x74	0xa3	0x23	0x65	0xf6
0x1a5577fc0:	0xee	0x76	0x6c	0xf5	0x8b	0xaf	0x23	0x7e
0x1a5577fc8:	0xc4	0x5b	0xbd	0x5f	0x26	0xb8	0x5e	0x3b
0x1a5577fd0:	0x65	0x96	0x48	0x88	0xb3	0x37	0x10	0x15
0x1a5577fd8:	0x3f	0xd8	0x62	0x54	0x4f	0xa3	0xf3	0x60
0x1a5577fe0:	0xfc	0x2e	0x9d	0xf4	0xc6	0x49	0xbc	0x4f
0x1a5577fe8:	0x85	0x3b	0x8a	0xd1	0xa3	0xc8	0x61	0x72
0x1a5577ff0:	0xfd	0xb8	0xe1	0x2e	0x65	0x92	0x0d	0x70
0x1a5577ff8:	0x50	0xd8	0x34	0x06	0x01	0x06	0xf4	0x9a

I can’t tell from looking at the hexadecimal numbers whether this block is encrypted or not. However, this block is supposed to contain a single row with ‘AAAAAAAAAA’, which means there should be a lot of empty space. I can’t find a repeating value, and the entire block is filled with seemingly random values; this is consistent with the block being encrypted. Now, if my assumption is correct, when we continue and execution breaks on kcbchk_ctx, the same buffer should look entirely different, because by then it has been decrypted by kgce_dec:

(gdb) c
Continuing.
_intel_fast_memcpy dest: 7ffd589eb3f0, src: 2a5e2ced8, size: 240
_intel_fast_memcpy dest: 7ffd589ea860, src: e7ef774, size: 15
_intel_fast_memcpy dest: 7ffd589eaa80, src: 7ffd589eb3c8, size: 16
_intel_fast_memcpy dest: 1a5576014, src: 7f8508ccc908, size: 8168
$3 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
(gdb) x/8192bx 0x1a5576000
0x1a5576000:	0x06	0xa2	0x00	0x00	0x8b	0x03	0x00	0x00
0x1a5576008:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x01	0x06
0x1a5576010:	0xdd	0x46	0x00	0x00	0x01	0x00	0x00	0x00
0x1a5576018:	0xc3	0x7b	0x01	0x00	0xed	0x9a	0x4e	0x00
0x1a5576020:	0x00	0x00	0x00	0x00	0x02	0x00	0x32	0x00
0x1a5576028:	0x88	0x03	0x00	0x00	0x07	0x00	0x1d	0x00
0x1a5576030:	0xd0	0x01	0x00	0x00	0x87	0x09	0x00	0x00
0x1a5576038:	0x90	0x03	0x27	0x00	0x01	0x20	0x00	0x00
0x1a5576040:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x00	0x00
0x1a5576048:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576050:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576058:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576060:	0x00	0x00	0x00	0x00	0x00	0x01	0x01	0x00
0x1a5576068:	0xff	0xff	0x14	0x00	0x8a	0x1f	0x76	0x1f
0x1a5576070:	0x76	0x1f	0x00	0x00	0x01	0x00	0x8a	0x1f
0x1a5576078:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576080:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576088:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
...snip...
0x1a5577fa8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fb0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fb8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fc0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fc8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fd0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fd8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fe0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fe8:	0x00	0x00	0x00	0x00	0x00	0x00	0x2c	0x01
0x1a5577ff0:	0x01	0x0a	0x41	0x41	0x41	0x41	0x41	0x41
0x1a5577ff8:	0x41	0x41	0x41	0x41	0x01	0x06	0xf4	0x9a

After continuing, we see a couple of memcpy’s, after which the debugger breaks execution when kcbchk_ctx is called. We execute the examine command again, and now the block looks very different. There are way more zeroed locations, the place of the different headers is visible because of the zeros in between, and from 0x1a5576078 it is all zeros until 0x1a5577fe8. There we see a repeating value, 0x41. Is that my ‘A’?

(gdb) p (char) 0x41
$2 = 65 'A'

Yes! These are the 10 A’s!

So this means that before the kcbchk_ctx function inside kcbz_encdec_tbsblk is entered, the block has already been decrypted, and that kcbchk_ctx performs something that takes a lot of time, which based on the function naming seems to be checking. Because the block is already decrypted when kcbchk_ctx is entered, it seems to me that this checking step is technically optional.

Further background analysis
I stated that with tablespace encryption the block is decrypted when read into the cache, that non-data blocks are not encrypted and that once the block is decrypted in the cache, it can be read without any encryption overhead because it is not encrypted (the block being decrypted in the cache is actually documented by Oracle). This actually can be proven with a debugger macro:

break pread64
  commands
    silent
    print $rip
    c
  end
break kgce_dec
  commands
    silent
    print $rip
    c
  end
break kcbchk_ctx
  commands
    silent
    print $rip
    c
  end
break kcbz_encdec_tbsblk
  commands
    silent
    print $rip
    c
  end
break kcbztek_trace_blk
  commands
    silent
    print $rip
    c
  end
break kdr4chk
  commands
    silent
    print $rip
    c
  end
break kcbgtcr
  commands
    silent
    print $rip
    c
  end
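
Attaching and sourcing works the same way as before; a small sketch, with the macro saved in a file called, say, tde_background.gdb (an arbitrary name chosen for this illustration):

$ gdb -p 57531 -x tde_background.gdb
...
(gdb) set pagi off
(gdb) c
Continuing.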

If you flush the buffer cache, and then attach to a foreground session with gdb and source the above macro, you get the following output:

$55 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$56 = (void (*)()) 0x7fd2c72b8100 <pread64>
$57 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$58 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$59 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$60 = (void (*)()) 0x7fd2c72b8100 <pread64>
$61 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$62 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$63 = (void (*)()) 0x3923750 <kgce_dec>
$64 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$65 = (void (*)()) 0xcdd26a0 <kdr4chk>
$66 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$67 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$68 = (void (*)()) 0x3923750 <kgce_dec>
$69 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$70 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$71 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$72 = (void (*)()) 0x3923750 <kgce_dec>
$73 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$74 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$75 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$76 = (void (*)()) 0x3923750 <kgce_dec>
$77 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$78 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$79 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$80 = (void (*)()) 0x3923750 <kgce_dec>
$81 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$82 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$83 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$84 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$85 = (void (*)()) 0xc9d9a20 <kcbgtcr>

Rows 1-4: First we see kcbgtcr (kernel cache buffers get consistent read); as I have made clear in the past, all (normal) reads in the oracle database start off as a logical read; but since we just flushed the cache, the segment header needs to be read from disk, which explains the pread64. After the single block read, the function kcbztek_trace_blk is executed, but because this is a non-data block, no encryption is applied to it, so there is no decryption to be done.
Rows 5-11: Then the multiblock read is initiated, which is executed via a pread64. After the multiblock read, the blocks are put in the buffer cache. For the first block, we see the kcbztek_trace_blk function being executed, and because this block is encrypted, the function kcbz_encdec_tbsblk is invoked, and then the actual decryption in the function kgce_dec. After performing decryption, the apparent checking function kcbchk_ctx is called, and inside that function kdr4chk is executed (which checks an individual row; we have one row).
Rows 12-15, 16-19, 20-23, 24-27: The other blocks that are read do not contain any rows; however, since they are data blocks, you see kcbztek_trace_blk finding the encryption flag, after which kcbz_encdec_tbsblk is called, which calls kgce_dec and then kcbchk_ctx. kdr4chk is not called, because there are no rows.

However, what is interesting is what will be called when we execute the same full table scan again over the 5 blocks now that they are in the buffer cache:

$86 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$87 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$88 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$89 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$90 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$91 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$92 = (void (*)()) 0xc9d9a20 <kcbgtcr>

This is exactly how it is documented to work! The foreground process can now simply use the consistent read mechanism to read the blocks in the cache; no encryption codepath is called. Please mind that when you change blocks in the cache, the change does not get any additional overhead either, because it works 100% the same as for a non-encrypted block. However, what does get a performance penalty is writing the blocks to disk (!). The database writer needs to encrypt the blocks before it can store them on disk! This means that DML intensive databases that switch to tablespace encryption might need a bigger buffer cache, because more time and resources are needed to write the blocks to disk, which might lead to more dirty buffers in the cache, and thus fewer buffers available for other usage.

Also mind that encryption functions of tablespace encryption get called as part of the IO codepath:

#0  0x00000000071d5320 in kcbz_encdec_tbsblk ()
#1  0x00000000012e8e7c in kcbzibmlt ()
#2  0x00000000012dd9d3 in kcbzib ()
#3  0x000000000c9dbda9 in kcbgtcr ()
#4  0x000000000c9c0ac5 in ktrget2 ()
#5  0x000000000ccbc786 in kdstgr ()
#6  0x000000000cd0a0e9 in qertbFetch ()
#7  0x000000000cb6e42b in opifch2 ()
#8  0x00000000021e49f3 in kpoal8 ()
#9  0x000000000cb75a5d in opiodr ()
#10 0x000000000cda421b in ttcpip ()
#11 0x0000000001baba3c in opitsk ()
#12 0x0000000001bb0401 in opiino ()
#13 0x000000000cb75a5d in opiodr ()
#14 0x0000000001ba7b8b in opidrv ()
#15 0x00000000026b31b1 in sou2o ()
#16 0x0000000000bbda7a in opimai_real ()
#17 0x00000000026bd8fc in ssthrdmain ()
#18 0x0000000000bbd9bc in main ()

#0 kcbz_encdec_tbsblk: the master encryption routine.
#1 kcbzibmlt: the kernel cache space management insert block multiblock routine.
#2 kcbzib: the kernel cache space management insert block routine.
#3 kcbgtcr: the consistent read call.
#4 ktrget2: the kernel transaction layer buffer get routine.
#5 kdstgr: the kernel data scan table get rows routine (full table scan).
#6 qertbFetch: the SQL execution rowsource tablescan fetch routine.
In other words: the decryption is in the IO codepath, or, put differently, it is executed as part of the IO routines.
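
For reference, a stack like the one above can be captured with gdb by breaking on the decryption routine and asking for a backtrace once the breakpoint fires (after a buffer cache flush and re-running the query); a sketch, assuming gdb is already attached to the server process:

(gdb) break kcbz_encdec_tbsblk
Breakpoint 1 at 0x71d5320
(gdb) c
Continuing.

Breakpoint 1, 0x00000000071d5320 in kcbz_encdec_tbsblk ()
(gdb) bt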

Update april 4, 2016:
Further investigation showed the checking (entering the kcbchk_ctx function) CAN be eliminated when the undocumented parameter “_disable_block_checking” is set to TRUE.
There is a bug logged for this issue: 12943305 – decryption performance can be improved – and it has a fix. This is reported to be fixed in 12.2, and patches are available for the different 12.1.0.2 PSUs.
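
For illustration only: an undocumented parameter like this is set with a regular alter system command. A sketch (underscore parameters should only be set after confirmation from Oracle support, and whether it can be changed online or needs scope=spfile plus a restart depends on the parameter):

SQL> alter system set "_disable_block_checking"=true scope=spfile;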
