Tag Archives: IO

This post is about one of the fundamentally important properties of a database: how IO is done. The test case I studied is a simple full table scan of a single large table. In both Oracle and postgres the table doesn’t have any indexes or constraints, which is not a realistic example, but that doesn’t change the principal topic of the study: doing a table scan.

I used a publicly available dataset from the US Bureau of Transportation Statistics called FAF4.5.1_database.zip.
The zipped file is 347MB, the unzipped size is 1.7GB.

In both cases Oracle Linux 7.7 (64 bit) is used, running in VirtualBox, with the storage being a USB3 SSD. Number of CPUs is 4, memory size is 6G. Filesystem type: xfs.
The Oracle version used is Oracle 19.5, the Postgresql version used is 12.1.
For Postgresql, the postgresql.conf file is not changed, except for max_parallel_workers_per_gather which is set to 0 to make postgres use a single process.
For Oracle, the parameter that I think is important is filesystemio_options='setall'. Oracle is used with filesystem based datafiles (so no ASM).
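Spelled out, these are the two settings (the postgres line goes into postgresql.conf; the Oracle parameter is static, so it is set in the spfile and requires a restart):

max_parallel_workers_per_gather = 0

alter system set filesystemio_options='setall' scope=spfile;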

This is the table definition for Oracle:

create table faf451 (
  fr_origin varchar2(3),
  dms_orig varchar2(3),
  dms_dest varchar2(3),
  fr_dest varchar2(3),
  fr_inmode varchar2(1),
  dms_mode varchar2(1),
  fr_outmode varchar2(1),
  sctg2 varchar2(2),
  trade_type varchar2(1),
  tons number,
  value number,
  tmiles number,
  curval number,
  wgt_dist number,
  year varchar2(4)
);

This is the table definition for Postgresql:

create table faf451 (
  fr_origin varchar(3),
  dms_orig varchar(3),
  dms_dest varchar(3),
  fr_dest varchar(3),
  fr_inmode varchar(1),
  dms_mode varchar(1),
  fr_outmode varchar(1),
  sctg2 varchar(2),
  trade_type varchar(1),
  tons double precision,
  value double precision,
  tmiles double precision,
  curval double precision,
  wgt_dist double precision,
  year varchar(4)
);

In order for the data to be easily loadable into postgres using copy from, I had to remove the "" (double double quotes) used for the empty numeric fields. In Oracle I could simply specify optionally enclosed by '"'. For Oracle I used an external table definition to load the data.
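As an illustration of the postgres side of the load, assuming the cleaned-up data is available as a CSV file at /tmp/faf451.csv (the path, and whether a header row is present, are assumptions):

copy faf451 from '/tmp/faf451.csv' with (format csv, header true);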

Now, before doing any benchmarks, I have an idea where this is going. Oracle uses direct IO (DIO), so linux page cache management and “double buffering” are avoided. Also, oracle will be doing asynchronous IO (AIO), which means submitting an IO is separated from waiting for the notification that the submitted IO is ready, and on top of that oracle will submit multiple IO requests at the same time. And again on top of that, oracle does multi-block IO, which means that instead of requesting each 8K database block individually, it groups adjacent blocks and requests them in one go, up to a combined size of 1MB, which means it can request up to 128 8K blocks in one IO. Postgres requests every block synchronously, so one 8K block at a time, waiting for each request to finish. That gave me a strong idea where this was going.

It should be noted that postgres explicitly is depending on the operating system page cache for buffering as a design principle. Because of DIO, blocks that are read by oracle are not cached in the operating system page cache.

I executed my benchmark in the following way (a sketch of a single batch is shown after the list):
– A run for every size is executed 5 times.
– At the start of every run for a certain size (so before every “batch” of 5 runs), the page cache is flushed: (echo 3 > /proc/sys/vm/drop_caches).
– Before each individual run, the database cache is flushed (systemctl restart postgresql-12 for postgres, alter system flush buffer_cache for oracle).
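Put together, a single batch for postgres roughly looks like this (the database name 'faf' and the count(*) query are illustrative, not taken from a published script):

echo 3 > /proc/sys/vm/drop_caches          # flush the linux page cache once per batch
for run in 1 2 3 4 5; do
  systemctl restart postgresql-12          # flush the postgres buffer cache
  time psql -d faf -c "select count(*) from faf451;"
done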

I started off with 2G from the dataset; to grow the table in postgres I simply performed a ‘copy from’ again to load the same dataset into the table. Oracle required a bit more work: oracle was able to store the same data in far fewer blocks, so the size became 1.18G. In order to have both postgres and oracle scan the same amount of data, I calculated roughly how many rows I needed to add to get the table to 2G, and saved the result as a separate 2G table, which I could then insert into the test table to grow it by 2G at a time. This way, in both oracle and postgres, I could test with a 2G table and add 2G at a time until I reached 20G.

These are the results. As you can see in the legend: oracle is orange, postgres is blue.
[Graph: postgres versus oracle scan results (image in the original post)]

What we see is that postgres is a bit slower on the first run of each batch of 5 for the smaller dataset sizes, which becomes less visible with larger datasets.
Also, postgres is way faster if the dataset fits into the page cache and has been read into it. This is logical, because postgres explicitly uses the page cache as a secondary cache, and the test is the only activity on this server, so the cache hasn’t been flushed by other activity.

What was totally shocking to me is that postgres performs on par with oracle, and both are roughly able to run at the maximum IO speed of my disk: 300MB/s, especially when the dataset is bigger, that is, beyond the page cache size.

It wasn’t shocking that oracle could reach the total bandwidth of the disk: oracle uses all the techniques to optimise IO for bandwidth. But how can postgres do the same, NOT deploying these techniques, reading 8K at a time??

The first thing to check is whether postgres is doing something other than I suspected. This can simply be checked with strace:

poll_wait(3, [{EPOLLIN, {u32=18818136, u64=18818136}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0!select count(*) from faf451"..., 8192, 0, NULL, NULL) = 34
lseek(20, 0, SEEK_END)                  = 335740928
lseek(20, 0, SEEK_END)                  = 335740928
kill(1518, SIGUSR1)                     = 0
pread64(5, "\f\0\0\0\310ILc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846061568) = 8192
pread64(5, "\f\0\0\0HcLc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846069760) = 8192
pread64(5, "\f\0\0\0\260|Lc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846077952) = 8192
pread64(5, "\f\0\0\0000\226Lc\0\0\0\0D\1\210\1\0 \4 \0\0\0\0\230\237\312\0000\237\312\0"..., 8192, 846086144) = 8192
…etc…

The above strace output shows only 4 rows of pread64() calls, but this goes on. So no “secret” optimisation there.
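For reference, the trace was taken by simply attaching strace to the backend process serving the session (4143 is that backend’s PID, which also shows up in the biosnoop output below):

# strace -p 4143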

Luckily, my VM has a new enough version of Linux for it to be able to use eBPF, so I can use biosnoop. Biosnoop is a tool to look at IO on one of the lower layers of the linux kernel, the block device interface (hence ‘bio’). This is the biosnoop output:

# /usr/share/bcc/tools/biosnoop
TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
0.000000000    postmaster     4143   sdb     R  66727776  708608     5.51
0.006419000    postmaster     4143   sdb     R  66731720  77824     11.06
0.006497000    postmaster     4143   sdb     R  66734432  786432    11.03
0.011550000    postmaster     4143   sdb     R  66731872  1310720   16.17
0.013470000    postmaster     4143   sdb     R  66729160  1310720   18.86
0.016439000    postmaster     4143   sdb     R  66735968  1310720   14.61
0.019220000    postmaster     4143   sdb     R  66738528  786432    15.20

Wow…so here it’s doing IOs of more than 1MB! So somewhere between postgres itself and the block device, the IOs magically grew to these sizes…that’s weird. The only thing that sits between postgres and the block device is the linux kernel, which includes page cache management.

To get insight into that, I ran ‘perf record -g -p PID’ during the scan, and then perf report to look at the recorded perf data. This is what I found:

Samples: 21K of event 'cpu-clock', Event count (approx.): 5277000000
  Children      Self  Command     Shared Object       Symbol                                                                  ◆
-   41.84%     3.63%  postmaster  libpthread-2.17.so  [.] __pread_nocancel                                                    ▒
   - 38.20% __pread_nocancel                                                                                                  ▒
      - 38.08% entry_SYSCALL_64_after_hwframe                                                                                 ▒
         - 37.95% do_syscall_64                                                                                               ▒
            - 35.87% sys_pread64                                                                                              ▒
               - 35.51% vfs_read                                                                                              ▒
                  - 35.07% __vfs_read                                                                                         ▒
                     - 34.97% xfs_file_read_iter                                                                              ▒
                        - 34.69% __dta_xfs_file_buffered_aio_read_3293                                                        ▒
                           - 34.32% generic_file_read_iter                                                                    ▒
                              - 21.10% page_cache_async_readahead                                                             ▒
                                 - 21.04% ondemand_readahead                                                                  ▒
                                    - 20.99% __do_page_cache_readahead                                                        ▒
                                       + 14.14% __dta_xfs_vm_readpages_3179                                                   ▒
                                       + 5.07% __page_cache_alloc                                                             ▒
                                       + 0.97% radix_tree_lookup                                                              ▒
                                       + 0.54% blk_finish_plug                                                                ▒

If you look at rows 13-15 you see that the kernel is performing readahead. This is an automatic function in the linux kernel which checks whether the requests are sequential in nature, and if so performs readahead, so that the scan is made faster.
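The maximum readahead window is a per block device setting, which can be inspected (and changed) like this (sdb is the device holding the database files in my setup):

# cat /sys/block/sdb/queue/read_ahead_kb
# blockdev --getra /dev/sdb     # same setting, expressed in 512-byte sectors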

The Oracle database log writer is the process that fundamentally influences database change performance. Under normal circumstances the log writer must persist the changes made to the blocks before the actual change is committed. Therefore, it’s vitally important to understand what the log writer is exactly doing. This is widely known by the Oracle database community.

The traditional method for looking at log writer performance is looking at the wait event ‘log file parallel write’ and the CPU time, and comparing that to the ‘log file sync’ alias “commit wait time”. If ‘log file parallel write’ and ‘log file sync’ roughly match, a commit is waiting on the log writer IO latency; if they don’t, it’s unclear, and things get vague.

Now we get to the gist of this blogpost: since Oracle 12, there are additional statistics in V$SESSTAT and V$SYSSTAT that describe where the log writer is spending its time. However, and sadly, these are not documented by Oracle. The statistics are:

– redo write gather time
– redo write schedule time
– redo write finish time
– redo write time (usec)
– redo write total time

Probably ‘redo write time (usec)’ doesn’t belong in this group of statistics, because its timing is as good as the same as ‘redo write finish time’. Also, the statistic ‘redo write time (usec)’ is updated by ‘kcrfw_postprocess_write’ right after the main writing function (kcrfw_do_write), whilst the other redo write timing statistics are updated/made visible by ksu_dispatch_tac outside of the actual writing functions, at the approximate rate of every 3 seconds.

All these statistics contain cumulative timing figures in microseconds (usec).
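A simple way to look at them is to query V$SYSSTAT directly (instance-wide cumulative values; the like-pattern may pick up a few related statistics too):

select name, value
from v$sysstat
where name like 'redo write%time%';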

All these timings start at the start of the function kcrfw_redo_write_driver. This is the main function that handles all the redo writing, including investigating the fixed area structure (kcrfs_) and the shared pool structures (made visible using X$KCRFSTRAND) to see if there are changes.

1. redo write gather time.
This times reading the structures that control allocation of the redo structures, validating that foreground processes are done writing their change vectors and applying them to the blocks, and updating the LWN SCN. Part of this also happens when there is nothing to write, but in that case the time is not added to this statistic; it is only added if something was found to write.

2. redo write schedule time.
This includes the time of 1, but now it found something to write, entered the writing function (kcrfw_do_write), and updated the ksfd IO related structures for writing.

3. redo write finish time.
This includes the time of 1 and 2, but this additionally times all the IO related functions for writing the changes it found in the public redo strands, as indicated by the shared pool and fixed area structures.

4. redo write time (usec).
This probably is not part of this group of statistics, but it works in exactly the same way: it’s a cumulative figure, it counts time since the start of the kcrfw_redo_write_driver and includes a tiny bit more code, it times until it has left the kcrfw_do_write function and enters the kcrfw_postprocess_write function.

5. redo write total time.
This includes the time of 1, 2 and 3, but additionally times the kcrfw_post function, which performs the signalling of the foreground processes that have committed and wait for a semop() signal.

This means that since Oracle 12 the database provides the timing information needed to see where the log writer is spending its time. This can prevent a lot of guesswork.

Recently I was triggered to look at the ‘automatic big table caching’ feature, introduced in Oracle version 12.1.0.2, by Roger Macnicol’s blogposts about Oracle database IO and caching or not caching (https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-1 https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-2). If you want to read something about the feature in general, search for the feature name; you’ll find several blogposts about it.

If you are not familiar with automatic big table caching, it’s a feature which allows you to dedicate a percentage of the buffer cache to scanning big tables. Big tables are tables that are not small :-), which technically means tables bigger (a higher number of blocks) than the “_small_table_threshold” parameter, which defaults to 2% of the buffer cache or 20 blocks, whichever is bigger. The choice to store blocks in the big table cache area depends on the segment “temperature”, which essentially is a kind of touch count. There is one other caveat of using this feature, which I currently find weird: once a database is operating in clustered mode (RAC), serial scans can not use the automatic big table caching anymore; it is then used exclusively for caching blocks scanned by parallel query workers.
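For completeness: the feature is switched on by giving the big table cache a percentage of the buffer cache, and its state can be inspected via two V$ views (the percentage below is just an example value):

alter system set db_big_table_cache_percent_target = '30';
select * from v$bt_scan_cache;
select * from v$bt_scan_obj_temps;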

The purpose of this blogpost is to investigate the IO implementation of using the big table caching feature. The available general read code paths that I am aware of in Oracle are:
– Single block reads, visible via the wait event ‘db file sequential read’, which always go to the buffer cache.
– Multiple single block reads, visible via the wait ‘db file parallel read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘db file scattered read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the process’ PGA.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the in-memory area.

For full scans of a segment, essentially the choices are to go buffered, or go direct. The question that I asked myself is: what code path is used when the automatic big table caching feature is used?

The reason for asking myself this is that the difference between a cached multi block read and a direct path read is significant. What this decision determines is whether multiple IO requests can be submitted for multiple extents, leveraging asynchronous IO, which is possible in the direct path read code path, or whether IO requests can only be submitted for a single extent at a time, which is what happens in the db file scattered read code path. To summarise the essence: can I only execute a single IO request (scattered read) or multiple IO requests (direct path read)?

The answer is very simple to find if you enable the big table caching feature, and then read a table that is considered big with an empty big table cache. A table that is read into the big table area shows the ‘db file scattered read’ wait event. What is interesting is when a table is scanned that does not fit in the big table area: when a big table is not read into the big table area, the wait event switches to ‘direct path read’. Or actually I should say: the general rules for choosing between buffered or direct path reads are applied, which potentially results in a direct path read.

I want to emphasise the importance of the IO code path decision once more. If you have an Oracle database server that uses modern high performance hardware for its disk subsystem, not choosing the correct multi block read code path can severely limit throughput.

For example, if your IO subsystem has an average IO time of 1ms, the maximal theoretical throughput for buffered IO is 1/0.001*1MB = 1000MB/s (this will be way lower in reality). Direct path reads start off with 2 IO requests in flight, which obviously doubles the amount compared to buffered IO, and this can, depending on an Oracle internal resource calculation, grow up to 32 IO requests at the same time.

This post is about the decision the Oracle database engine makes when it is using a full segment scan approach. The choices the engine has are to store the blocks that are physically read in the buffercache, or to read the blocks into the process’ PGA. The first choice is what I refer to as a ‘buffered read’, which places the blocks in the database buffercache so the process itself and other processes can bypass the physical read and use the blocks from the cache, until they are evicted from the cache. The second choice is what is commonly referred to as a ‘direct path read’, which places the blocks physically read into the process’ PGA, which means the read blocks are stored for only a short duration and are not shared with other processes.

There are some inherent performance differences between a buffered and a direct path read. A buffered read can only execute a single physical read request for a single range of blocks, wait for that request to finish, and fetch and process the result of the physical read request, after which it can execute the next physical read request. So there is a maximum of one outstanding IO for multiple (adjacent) Oracle blocks. A direct path read works differently: it asynchronously submits two physical IO requests, each for a distinct range of Oracle blocks, after which it waits for one or more IOs to finish. If an IO is returned, it is processed, and an IO for another range of Oracle blocks is submitted to restore the number of IOs in flight to two. If the database engine determines (based upon a non-disclosed mechanism) that enough resources are available, it can increase the number of physical IO requests in flight up to 32. Other differences include the maximum total size of an IO request, which is 1MB for buffered requests and 32MB for direct path requests (achieved by setting db_file_multiblock_read_count to 4096).

At this point it should be clear that there are differences between buffered and direct path reads, and that when full segment scans switch from direct path reads to buffered reads it could mean a significant performance difference. On top of this, if your database is using Exadata storage, this decision between buffered reads and direct path reads is even more important: only once the decision for direct path reads has been made can an Exadata smartscan be executed. I have actually witnessed cases where a mix of partitioning and HCC led to partitions so small that a direct path read was not chosen, which meant a smartscan was not considered anymore, meaning that instead of the cells decompressing the compressed blocks all in parallel, the process had to fetch them and do the decompression on the database layer.

There have been some posts on the circumstances of the decision. However, I have seen none that summarise the differences for the different versions. In order to investigate the differences between the different Oracle versions, I created a git repository at gitlab: https://gitlab.com/FritsHoogland/table_scan_decision. You can easily use the repository by cloning it: git clone https://gitlab.com/FritsHoogland/table_scan_decision.git, which will create a table_scan_decision directory in the current working directory.

Oracle version 11.2.0.2.12
Please mind this version is very old, and SHOULD NOT BE USED ANYMORE because it’s not an actively supported version. However, I do use this version, because this version has different behaviour than the versions that follow.

First determine the small table threshold of the database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531
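The @small_table_threshold script (from the repository mentioned above) essentially queries the hidden parameter, roughly like this (requires SYSDBA; a sketch, not necessarily the exact script):

select a.ksppinm, b.ksppstvl
from x$ksppi a, x$ksppcv b
where a.indx = b.indx
and a.ksppinm = '_small_table_threshold';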

Let’s create tables just below and just over 1531 blocks/small table threshold:

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1531 1531
...
    BLOCKS
----------
      1664

So the small table threshold is 1531. This means that the internal statistic used in the direct path decision, the medium table threshold, will be approximately 1531*5=7655. Let’s create tables just below and just over that number of blocks:

TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

For the other versions, trace event ‘nsmtio’ can be used to learn how the decision is made. However, this trace event does not exist in Oracle version 11.2.0.2. The workaround is to just execute a SQL trace and interpret the wait events. For a full table scan, the wait event ‘db file scattered read’ means a buffered read is done, and the wait event ‘direct path read’ means a direct path read was done (obviously).

TS@test > alter session set events 'sql_trace level 8';
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1531';
TS@test > select count(*) from table_1531;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that in my case, with Oracle version 11.2.0.2, the switching point is at 5 times _small_table_threshold.

Oracle 11.2.0.3.15
This version too should NOT BE USED ANYMORE because it is not in active support. This too is for reference.
Small table threshold for this database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

With the small table threshold being 1531, the medium table threshold should be approximately 1531*5=7655.

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1440 1440
...
    BLOCKS
----------
      1536
TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans. By doing that I ran into something peculiar with the ‘nsmtio’ event in this version (11.2.0.3 with the latest PSU). This event does exist for this version (which you can validate by running ‘oradebug doc component’), however, it does not yield any output. This means I have to revert to the previous method of running sql_trace at level 8 and interpreting the wait events.

TS@test > alter session set events 'trace[nsmtio]:sql_trace level 8'; -- no NSMTIO lines, only sql_trace!
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1440';
TS@test > select count(*) from table_1440;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that with Oracle version 11.2.0.3, the direct path read switching point seems to have moved from 5 times small table threshold to small table threshold itself.

Oracle 11.2.0.4.170718
This version is in active support!
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1538

With the small table threshold being 1538, the medium table threshold should be approximately 1538*5=7690.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1538 1538
...
    BLOCKS
----------
      1664
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7690 7690
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans.

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7693 blocks),
-- main event: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1538';
SQL> select count(*) from table_1538;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14422, objn: 14422
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14423, objn: 14423
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7690';
SQL> select count(*) from table_7690;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14424, objn: 14424
-- main event: direct path read

This shows that with Oracle version 11.2.0.4, the direct path read switching point is at small table threshold, a change introduced in 11.2.0.3.

Oracle version 12.1.0.2.170718
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1440

SQL>

With small table threshold being 1440, the medium table threshold is approximately 1440*5=7200.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7203 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20489, objn: 20489
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20490, objn: 20490
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20491, objn: 20491
-- main events: direct path read

This is in line with the switch in version 11.2.0.3 to small table threshold as the switching point between buffered reads and direct path reads.

Oracle 12.2.0.1.170814
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1444

SQL>

With small table threshold being 1444, the medium table threshold is approximately 1444*5=7220.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1504 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 7048 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 22502, objn: 22502
-- main events: direct path read

Hey! With 12.2.0.1 the direct path read switching point reverted back to pre-11.2.0.3 behaviour of switching on 5 times small table threshold instead of small table threshold itself.

Update!
Re-running my tests shows differences in the outcome between buffered and direct path reads. My current diagnosis is that the scan type determination uses a step based approach:

– The first determination of size is done with ‘NSMTIO: kcbism’ (kcb is medium). If islarge is set to 1, it means the segment is bigger than STT. If islarge is set to 0 it means the segment is smaller than STT, and the segment will be read buffered, and the line ‘qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown in the NSMTIO output.

– The next line is 'NSMTIO: kcbimd' (kcb is medium determination?) It shows the size of the segment (nblks), STT (kcbstt), MTT (kcbpnb) and is_large, which in my tests always is set to 0. Here, there are 4 options that I could find:

1) Segment size between STT and MTT and a buffered read is executed.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, which probably is externalised in the line 'NSMTIO: kcbcmt1'. The outcome can be a buffered read, for which the line 'qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown.

2) Segment size between STT and MTT and the direct path code path is chosen.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, probably externalised in the line 'NSMTIO: kcbcmt1'. If the costing determines it would be beneficial to use a direct path mechanism, it seems it switches to the direct path with cache determination code, which is also used for any table scan that is smaller than VLOT. Because of switching to that code, it will determine if the segment is bigger than VLOT: 'NSMTIO: kcbivlo', which of course in this case isn't true. Then, it will show the line 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]'

3) Segment size bigger than MTT but smaller than VLOT.
If the segment is between MTT and VLOT, the Oracle engine does not apply the costing mechanism (which means the kcbcmt1 line is not shown). It will determine if the segment is bigger than VLOT ('NSMTIO: kcbivlo'), which in this case it isn't, and then show 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]', after which the kcbdpc check described below decides between a buffered or direct path read.

4) Segment size bigger than VLOT.
If the segment is bigger than VLOT, the Oracle engine executes the functions kcbimd and kcbivlo, and the NSMTIO line for kcbivlo will show is_large 1 to indicate it’s a very large object (VLOT by default is ‘500’, meaning 500% or 5 times the total number of buffers in the buffer cache). The qertbFetch line will say ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’, and there is no kcbdpc to analyse choosing between a buffered or direct path read.

In the cases where ‘NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]' is shown, which is either a segment between STT and MTT which switched to this code path, or between MTT and VLOT, the code will apply a second determination and potential switching point from buffered to direct path or vice versa, which is shown with the line 'kcbdpc' (kcb direct path check). The outcome can be:

– NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ] to indicate it will use a buffered read.
– NSMTIO: kcbdpc:DirectRead to indicate it will use a direct path read.

I have verified the above 'decision tree' in 11.2.0.2, 11.2.0.3, 11.2.0.4, 12.1.0.2 and 12.2.0.1. It all seems to work this way consistently. I derived this working by looking at the NSMTIO tracing of 12.2, and then went back one version at a time. You will see that the lower the version, the less (nsmtio) tracing output there is; 11.2.0.4, for example, shows far less information and does not show the kcbcmt1 line, and of course 11.2.0.3 and 11.2.0.2 do not show NSMTIO lines at all. In order to verify the working, I used gdb and quite simply set breakpoints on the kcbism, kcbimd, kcbcmt1, kcbivlo and kcbdpc functions in the versions where this information was missing in the trace.
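For reference, attaching gdb and setting those breakpoints looks roughly like this (1234 is a placeholder for the PID of the session's server process):

$ gdb -p 1234
(gdb) break kcbism
(gdb) break kcbimd
(gdb) break kcbcmt1
(gdb) break kcbivlo
(gdb) break kcbdpc
(gdb) continue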

Still, at the kcbcmt1 point:
– 11.2.0.2 seems to quite consistently take MTT as the direct path switching point.
– 11.2.0.3-12.1.0.2 seem to quite consistently take STT as the direct path switching point.
– 12.2.0.1 varies.

Conclusion.
This article first explained the differences between buffered and direct path reads, and why this is important, and that it is even more important with Exadata for smartscans.

The next part shows how to measure the switching point. The most important message from this blog article is that from 11.2.0.3 up to 12.1.0.2 the direct path read switching point is the small table threshold, and that with Oracle database version 12.2.0.1 the direct path switching point changed back to the pre-11.2.0.3 behaviour, which means 5 times the small table threshold of the instance.
The update shows that between STT and MTT there is a cost based decision to go direct path or buffered. Once the direct path is chosen, it can still go buffered if the majority of the blocks are in the cache.

If you look closely at the output of the nsmtio lines for versions 11.2.0.3-12.1.0.2 for tables that had a size between small table threshold and medium table threshold, it looks a bit weird, because the nsmtio trace says ‘[MTT < OBJECT_SIZE < VLOT]', which would mean Oracle detected the object size to be between medium table threshold and very large object threshold, which was not true. I can't tell whether this is a bug in measuring the size that has since been fixed.
The text description in the NSMTIO qertbFetch line is bogus; it simply indicates a code path. ‘[- STT < OBJECT_SIZE < MTT]' means a buffered read is done, and can be chosen when the segment is smaller than STT or between STT and MTT; '[MTT < OBJECT_SIZE < VLOT]' means a direct path read is done, and can be chosen when the segment is between STT and MTT, or between MTT and VLOT.

I added the scripts and examples of the tracing events so you can measure this yourself in your environment.

This article is about the Oracle 12c in-memory option, and specifically looks at how the background worker processes do IO to populate the in-memory column store.

Hardware: Apple Macbook with VMWare Fusion 7.1.3.
Operating system: Oracle Linux 6.7, kernel: 3.8.13-118.el6uek.x86_64.
Database version: Oracle 12.1.0.2
Patch: opatch lspatches
19392604;OCW PATCH SET UPDATE : 12.1.0.2.1 (19392604)
19303936;Database Patch Set Update : 12.1.0.2.1 (19303936)

But first things first: let’s set up the in-memory option with a test table. The first thing to consider is the size of the in-memory area needed to store the objects. I only want a single table stored in the in-memory area, so I can simply look at the size of that object:

TS@fv12102 > select segment_name, round(bytes/power(1024,2)) "MB" from user_segments where segment_name = 'T2';

SEGMENT_NAME                                     MB
---------------------------------------- ----------
T2                                              168

So, that means that an in-memory area of 200MB should suffice, let’s set it and bounce the instance:

SYS@fv12102 AS SYSDBA> alter system set inmemory_size = 200m scope=spfile sid='*';

System altered.

SYS@fv12102 AS SYSDBA> startup force;
ORACLE instance started.

Total System Global Area 1048576000 bytes
Fixed Size		    2932336 bytes
Variable Size		  226492816 bytes
Database Buffers	  469762048 bytes
Redo Buffers		  139673600 bytes
In-Memory Area		  209715200 bytes
Database mounted.
Database opened.

The ‘In-Memory Area’ line in the startup output shows the in-memory area is created!

Now I need to set the inmemory attribute for my table T2. This code example shows the inmemory attribute for my T2 table without the in-memory set, then alters the table to belong in the in-memory area, and shows the inmemory attribute again:

TS@fv12102 > select inmemory from user_tables where table_name = 'T2';

INMEMORY
--------
DISABLED

TS@fv12102 > alter table t2 inmemory;

Table altered.

TS@fv12102 > select inmemory from user_tables where table_name = 'T2';

INMEMORY
--------
ENABLED

If we now look at V$IM_SEGMENTS, the view is empty:

SYS@fv12102 AS SYSDBA> select owner, segment_name, bytes normal_size, bytes_not_populated, inmemory_size, populate_status from v$im_segments;

no rows selected

This is because no in-memory priority was specified when we altered the table to inmemory, which means the default priority NONE is used, which means the table will only be loaded into memory once it is touched.
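If you don’t want to wait for the first touch, a priority other than NONE makes the background workers populate the segment right after it is marked inmemory (CRITICAL here is just an example):

alter table t2 inmemory priority critical;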

Now, in order to understand how the in-memory area is populated, we need to execute a sql trace including waits on the Wnnn process or processes that populate the in-memory area. However, that is not as straightforward as it seems: there are already a couple of these (background) processes in my instance, but more can be created if necessary. Luckily, we can use Oracle’s improvements to the diagnosability infrastructure and set sql trace with a filter on the process name:

SYS@fv12102 AS SYSDBA> alter system set events 'sql_trace {process: pname = w} level 8';

System altered.

Now that we enabled sql_trace for all Wnnn processes, let’s touch the table, and see what is in the trace!

First touch the table:

TS@fv12102 > select count(*) from t2;

  COUNT(*)
----------
   1000000

That should result in the T2 table being read in the in-memory area:

SYS@fv12102 AS SYSDBA> select owner, segment_name, bytes normal_size, bytes_not_populated, inmemory_size, populate_status from v$im_segments;

OWNER  SEGMENT_NAME		      NORMAL_SIZE BYTES_NOT_POPULATED INMEMORY_SIZE POPULATE_
------ ------------------------------ ----------- ------------------- ------------- ---------
TS     T2				176160768		    0	   53673984 COMPLETED

Bingo!

Now go to the trace directory, and look for w processes:

[oracle@bigmachine [fv12102] trace]$ ls -l *_w*
-rw-r----- 1 oracle oinstall   2145 Nov 14 08:52 fv12102_w000_30938.trc
-rw-r----- 1 oracle oinstall    166 Nov 14 08:52 fv12102_w000_30938.trm
-rw-r----- 1 oracle oinstall   8616 Nov 14 09:10 fv12102_w000_32380.trc
-rw-r----- 1 oracle oinstall    752 Nov 14 09:10 fv12102_w000_32380.trm
-rw-r----- 1 oracle oinstall  35401 Nov 14 09:10 fv12102_w002_31964.trc
-rw-r----- 1 oracle oinstall   2612 Nov 14 09:10 fv12102_w002_31964.trm
-rw-r----- 1 oracle oinstall 232444 Nov 14 09:10 fv12102_w001_30940.trc
-rw-r----- 1 oracle oinstall   7120 Nov 14 09:10 fv12102_w001_30940.trm

Because of its size, my guess was that the largest file would contain the trace output of the reading of the T2 table into the in-memory area. That is correct.

If a worker process was already running before being given work, we can see it sleeping for 5 seconds in the wait event ‘Space Manager: slave idle wait’, after which it does some (internal) housekeeping, which means reading its CPU usage using getrusage(), after which it sleeps on semtimedop() for 5 seconds again. This is how that looks when traced with waits:

WAIT #0: nam='Space Manager: slave idle wait' ela= 5000827 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7189216230

*** 2015-11-14 08:52:32.981
WAIT #0: nam='Space Manager: slave idle wait' ela= 5000350 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7194217089

*** 2015-11-14 08:52:37.982
WAIT #0: nam='Space Manager: slave idle wait' ela= 5001004 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7199218359

*** 2015-11-14 08:52:42.982
WAIT #0: nam='Space Manager: slave idle wait' ela= 5000000 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7204218487

When the worker starts working on populating the in-memory cache, it first reads some dictionary views (partobj$, compression$ and hist_head$), after which it begins to read the table into memory:

WAIT #140530969221688: nam='Disk file operations I/O' ela= 21 FileOperation=2 fileno=4 filetype=2 obj#=20434 tim=7207107727
WAIT #140530969221688: nam='direct path read' ela= 78 file number=4 first dba=22416 block cnt=8 obj#=20434 tim=7207125617
WAIT #140530969221688: nam='direct path read' ela= 113 file number=4 first dba=22425 block cnt=7 obj#=20434 tim=7207126486
WAIT #140530969221688: nam='direct path read' ela= 136 file number=4 first dba=22432 block cnt=8 obj#=20434 tim=7207127263
WAIT #140530969221688: nam='direct path read' ela= 60 file number=4 first dba=22441 block cnt=7 obj#=20434 tim=7207128054
WAIT #140530969221688: nam='direct path read' ela= 68 file number=4 first dba=22448 block cnt=8 obj#=20434 tim=7207128718
WAIT #140530969221688: nam='direct path read' ela= 114 file number=4 first dba=22457 block cnt=7 obj#=20434 tim=7207129417
WAIT #140530969221688: nam='direct path read' ela= 106 file number=4 first dba=22464 block cnt=8 obj#=20434 tim=7207130248
WAIT #140530969221688: nam='direct path read' ela= 100 file number=4 first dba=22473 block cnt=7 obj#=20434 tim=7207130878
WAIT #140530969221688: nam='direct path read' ela= 202 file number=4 first dba=22480 block cnt=8 obj#=20434 tim=7207132096
WAIT #140530969221688: nam='direct path read' ela= 173 file number=4 first dba=22489 block cnt=7 obj#=20434 tim=7207132857
WAIT #140530969221688: nam='direct path read' ela= 293 file number=4 first dba=22496 block cnt=8 obj#=20434 tim=7207134040

The event ‘Disk file operations I/O’ is the instrumentation of meta-operations which need to happen before a process can work on a file. In our case, FileOperation=2 means opening a file.

Then we see the event ‘direct path read’! That means that the worker uses the modern method of reading data, instead of the older method indicated by the ‘db file scattered read’ event.

The reason this is important is that the ‘scattered read’ method does not scale very well. The basic setup of the ‘scattered read’ method is:
1. Determine the chunk to read (limited by DB_FILE_MULTIBLOCK_READ_COUNT, extent borders, space management blocks and blocks already in the cache).
2. Read the chunk.
3. Repeat until the high water mark is reached.
Even if this is done via asynchronous IO, never more than one chunk is read at a time.

If we look at the ‘direct path read’ method, it does this, simplified:
1. Determine the number of IOs allowed in flight (default/minimum: two!).
2. Submit reads for chunks of the segment to scan, as many as determined in step 1.
3. Wait for at least one IO to return.
4. Repeat until the high water mark is reached.

Essentially, using the ‘direct path read’ method a single process can take advantage of the available bandwidth of modern storage and asynchronous IO and issue multiple IOs.

During tracing I noticed the worker processes writing to the database’s diagnostics destination, however in a directory ‘log’, to a file ‘imdb_ORACLE_SID.log’, which seems to contain information about the chunks read by the worker processes:

Sat Nov 14 09:51:10 2015
c:0o:20434d:20434/St.RDBA: 0xac00 No.blks: 896
c:0o:20434d:20434/St.RDBA: 0xaf84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xb384 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xb784 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xbb84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xbf84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xc384 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xc784 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xcb84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xcf84 No.blks: 644
c:0o:20434d:20434/rows buff: 460730

And to a directory ‘imdb’ inside the ‘log’ directory, which contains what looks like the same data in an XML format:

<msg time='2015-11-14T08:52:45.889+01:00' org_id='oracle' comp_id='rdbms'
 msg_id='kdmlScanAndCreateCU:1537:858585840' type='UNKNOWN' group='IM_populate'
 level='16' host_id='bigmachine.local' host_addr='127.0.0.1'>
 <txt>c:0o:20434d:20434/St.RDBA: 0x5790 No.blks: 8
 </txt>
</msg>

It reveals a code location, kdmlScanAndCreateCU, of which the first letters probably stand for Kernel Data in-Memory.

In most circumstances, and in my opinion, this information should be printed when some kind of debug flag is set, not ‘just because’, because writing these statistics costs time. I also don’t know if these files are maintained by ADR. If not, doing a lot of in-memory transactions can fill it up rather quickly.

In the previous post on the decision between buffered and direct path reads I showed that the decision depends on the version. Up to and including version 11.2.0.2 the size of a segment needs to be five times the small table threshold in order to be considered for direct path reads, and starting from 11.2.0.3 the database starts considering direct path reads from the small table threshold onwards. The lower limit just discussed is the small table threshold (or five times the small table threshold with the lower versions); the upper limit is called “very large object threshold” (VLOT) and is five times the size of the buffercache, which is the threshold beyond which a table scan always goes via direct path.

There seems to be an odd situation with 11.2.0.3 and up: if a table has a size between the small table threshold and five times the small table threshold, it is always read via direct path, while if the table is bigger than five times the small table threshold, it depends on the number of blocks in the cache, and the scan switches to buffered reads if 99% or more of the blocks are in the cache. This seems to be the opposite of what is logical; a smaller table is likely a better candidate for caching than a larger table.

The last thing the previous blogpost showed was the impact of dirty blocks on the buffered/direct path decision. Up to 11.2.0.2 a read that should go via direct path is done buffered when the amount of dirty blocks reaches around 75%. Starting from version 11.2.0.3, a read that should go via direct path is done buffered when the amount of dirty blocks for a table reaches approximately 50%. This is vital information for data warehouses!

A PL/SQL procedure was included so you can measure this yourself. If you have any doubts, follow the link above and paste the procedure to test it in your own environment.

The purpose of this blogpost is to further investigate the direct path or buffered read decision in various Oracle database versions when compression is used. I compressed the table using “COMPRESS FOR OLTP”, in order to see if compression changes the decision.
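The compressed test table can be created roughly along these lines (a sketch; using an existing table t2 as the row source is an assumption, not the actual test script):

create table t2_oltp compress for oltp
as select * from t2;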

The first measurement is reading, doing a full scan of a compressed table:

Version 11.2.0.1 (_STT 3979; _STT*5:19895)

TS@v11201 > @test_direct_path
table: T2_OLTP size: 64512 blocks, 10000000 rows.
cached - physical reads cache/direct: (10)		6419/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (20)		6487/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (30)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (40)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (50)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (60)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (70)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (80)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (90)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (100)		 6382/0
full scan - physical reads cache/direct:			  0/0

This pattern is repeated up to version 12.1.0.2: if 99% or more of the blocks are in the cache, the database switches from direct path to buffered reads for table scans. So for simple reading, the decision does not seem to be different from non-compressed tables.

When dirty blocks are included, it is different:

Version 11.2.0.1:

TS@v11201 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  0/63911

If a large number of the blocks, or all of them, in the cache are dirty, a full scan does not flip to buffered reads at 99%.

However, it is interesting to see that this changes with version 11.2.0.2:

TS@v11202 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  461/0

Starting from version 11.2.0.2, the threshold of 99% for a full table scan to switch from direct path to buffered seems to be true for clean blocks and dirty blocks. The same is true for versions 11.2.0.3 and 11.2.0.4.

However, starting from 12.1.0.1, a full scan on a compressed table is done via direct path even if all the blocks are in the cache and dirty, which seems identical to the version 11.2.0.1 behaviour:

TS@v12101 > @test_direct_path
table: T2_OLTP size: 65536 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/65174
full scan - physical reads cache/direct: (20)			 0/65174
full scan - physical reads cache/direct: (30)			 0/65174
full scan - physical reads cache/direct: (40)			 0/65174
full scan - physical reads cache/direct: (50)			 0/65174
full scan - physical reads cache/direct: (60)			 0/65174
full scan - physical reads cache/direct: (70)			 0/65174
full scan - physical reads cache/direct: (80)			 0/65174
full scan - physical reads cache/direct: (90)			 5/65174
full scan - physical reads cache/direct: (100)			  0/65174

So, the overall conclusion is that there is a difference between regular heap tables and compressed heap tables. I tested OLTP compression; the [nsmtio] tracing indicates this is the same for HCC compression (hybrid columnar compression). In general, for simple reading there doesn’t seem to be a difference between normal heap tables and compressed heap tables. However, I tested on idle systems; on “real” systems with more stress on buffercache management there might be more mechanisms in play.

When there are dirty blocks (blocks changed sitting in the buffercache waiting to be written by the database writer), there is different behaviour with respect to the buffered read or direct path read choice the instance makes. In general, with dirty blocks involved, compressed tables are less likely to be read into the buffercache.

It seems that with compressed tables and versions 11.2.0.1, and 12.1.0.1 and up, there is no switch to buffered reads even if all the blocks are in the cache but dirty. The versions in between, which are 11.2.0.2, 11.2.0.3 and 11.2.0.4, do switch to buffered reads when 99% or more of the blocks are in the cache, regardless of whether they are dirty.

Also, please mind I explicitly mentioned tables. For a full scan on an index, indicated by the ‘FAST FULL INDEX SCAN’ line in an explain plan, entirely different rules are in play.

I posted a fair amount of stuff on how Oracle is generating IOs, and especially large IOs, meaning more than one Oracle block, so > 8KB. This is typically what is happening when the Oracle database is executing a row source which does a full segment scan. Let’s start off with a quiz: what do you think is the maximum IO size the Oracle engine is capable of requesting from the operating system (so the IO size as can be seen at the SCI (system call interface) layer)? If you made up your answer, remember it, and read on!

The real intention of this blogpost is to describe what is going on in the Oracle database kernel, but also what is being done in the Linux kernel. Being a performance specialised Oracle DBA means you have to understand what the operating system does. I often see that it’s of the utmost importance to understand how an IO ends up as a request at the NAS or SAN head, so you understand what a storage admin is talking about.

Many people (including myself in the past) would state that the maximum IO size on Linux is 1MB. For the Linux 2.6 kernel and higher this statement is incorrect because there is no such thing as a single maximum IO size on Linux. There used to be one in the Linux 2.4 era, which was set with the maxphys parameter, but that time is long gone. In order to find out what it is now, let’s test and see!

First let’s get a Linux system and a big table!
The system I got is a VMWare Fusion VM, running Linux 3.8.13-44.1.5.el6uek.x86_64 (UEK3) on Oracle Linux 6u6. On top of that I am using the Oracle database and grid infrastructure version 12.1.0.2. This system is using udev for providing disk access (as opposed to asmlib).
The redundancy mode of ASM is external, although for reading (what I will be covering) this doesn’t matter.

Now let’s get a normal database session, and use a combination of sql_trace with waits (10046/8) and strace to see how the Oracle database interfaces with the kernel. Please mind I’ve prepared a heap table with no indexes on it, so a count(*) on it will always result in a full table scan. Also, the buffercache is sized small enough (or the table is created large enough, it depends on how you look at it) to have the session make the decision to do a direct path read instead of a buffered read. If you don’t know what that means: please search this blog on direct path reads, or even better, download my presentation ‘about multiblock reads’.

The direct path read decision is visible via the ‘direct path read’ wait event. If you get a full table scan operation and see ‘db file scattered read’ waits, you are doing a buffered scan.

In most cases, you will see a maximum IO size of 1MB, which seems to support the generally assumed 1MB operating-system-induced maximum IO size. Why? Well, because you probably set the DB_FILE_MULTIBLOCK_READ_COUNT parameter to 128, which means you have explicitly told the Oracle process not to do IOs larger than 1MB (128*8192). In this blogpost, I explain that Oracle can request IOs bigger than 1MB.

In the blogpost series on extra huge database IOs, I show that Oracle can do huge (1MB+) IOs, but the physical request size (what Oracle actually requests at the SCI layer, visible with the pread/pwrite/io_submit/io_getevents functions) is still 1MB. This limit is imposed by the physical storage structure which the database uses with ASM, called an allocation unit (often called ‘AU’). The default size of an allocation unit is 1MB. The allocation unit can be seen in both the database and the ASM instance with the following query:

SYS@+ASM AS SYSASM> select name, allocation_unit_size from v$asm_diskgroup;

NAME			       ALLOCATION_UNIT_SIZE
------------------------------ --------------------
DATA					    1048576

How about doing an unbuffered read on a filesystem? I’ve created a database on a filesystem (XFS, but this doesn’t matter AFAIK), and tried to set DB_FILE_MULTIBLOCK_READ_COUNT to its maximum value. I’ve done this by setting DB_FILE_MULTIBLOCK_READ_COUNT to 10000 (ten thousand) and then bouncing the database to see what the number becomes. In my case, the value became 4096. I think this is the limit for Oracle 12.1.0.2 on Linux x86_64, but I’d love to hear if you have gotten different results:

I set 10000:

SYS@fv12102 AS SYSDBA> select name, value from v$spparameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   10000

But Oracle limits this to 4096:

SYS@fv12102 AS SYSDBA> select name, value from v$parameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   4096

Okay. Let’s start our investigation at that point: a database set up with DB_FILE_MULTIBLOCK_READ_COUNT at 4096, alias 32MB (with a block size of 8KB), and a table with extents large enough to accommodate huge (32MB) IOs.

Fire up a regular database session, and enable sql trace at level 8:

$ sqlplus ts/ts@//localhost/v11204
...
SQL> alter session set events 'sql_trace level 8';

Now start another session as root on the database server, and find the PID of the server process belonging to the sqlplus session we just created above. Issue strace with the verbose write setting:

# strace -e write=all -e all -p PID
Process PID attached - interrupt to quit
read(14,

Okay, we are set up and ready to go, but there is one additional thing: the way direct path reads work, they would probably show few waits with fast IO. One way to get the waits back is to limit the IO capabilities of the process. Doing so is documented in this article.

Now issue the full table scan on a large table in sqlplus while strace is attached:

SQL> select count(*) from bigtab;

Now take a peek at the strace output!
The output first shows IOs as we expect:

io_getevents(139717184229376, 1, 128, {{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1
times(NULL)                             = 431386800
write(7, "\n*** 2014-11-24 13:09:28.028\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  34 2d 31 31 2d 32 34 20  .*** 201 4-11-24  |
 | 00010  31 33 3a 30 39 3a 32 38  2e 30 32 38 0a           13:09:28 .028.    |
lseek(7, 0, SEEK_CUR)                   = 31181
write(7, "WAIT #139717129509840: nam='dire"..., 130) = 130
 | 00000  57 41 49 54 20 23 31 33  39 37 31 37 31 32 39 35  WAIT #13 97171295 |
 | 00010  30 39 38 34 30 3a 20 6e  61 6d 3d 27 64 69 72 65  09840: n am='dire |
 | 00020  63 74 20 70 61 74 68 20  72 65 61 64 27 20 65 6c  ct path  read' el |
 | 00030  61 3d 20 33 39 30 37 33  30 20 66 69 6c 65 20 6e  a= 39073 0 file n |
 | 00040  75 6d 62 65 72 3d 34 20  66 69 72 73 74 20 64 62  umber=4  first db |
 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |
 | 00070  37 34 20 74 69 6d 3d 31  39 32 30 30 37 31 30 31  74 tim=1 92007101 |
 | 00080  39 39                                             99                |

What is visible here is, first, the reaping of an I/O request (with asynchronous IO on Linux this is typically the io_getevents() call). Take a close look at the arguments of the io_getevents() call (taken from the manpage of io_getevents):

int io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);

And then focus on the struct io_event:

struct io_event {
         __u64           data;           /* the data field from the iocb */
         __u64           obj;            /* what iocb this event came from */
         __s64           res;            /* result code for this event */
         __s64           res2;           /* secondary result */
};

The above description is taken from the annotated Linux kernel source (http://lxr.free-electrons.com/source/include/uapi/linux/aio_abi.h#L58), a site I use for navigating the Linux kernel source. What is important is that the third field (io_event.res) contains the size of the IO request. Having learned this, now look again at the io_getevents call. The size of the IO reaped above is 33554432, which is 33554432/1024/1024=32 MB. Yes, that’s a single IO of 32MB! Also, this is consistent with the wait line a little lower:

 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |

Block count = 4096 * 8192 (block size) = 33554432
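
To make the reaping of an IO result a bit more tangible, here is a minimal libaio sketch (my own illustration, not Oracle code; the file name and request size are assumptions) that submits a single O_DIRECT read and prints the io_event.res field of the reaped event, which is where the size of the completed IO comes back. It can be compiled with gcc and linked with -laio.

/* Minimal libaio sketch (illustration only, not Oracle code): submit one
 * O_DIRECT read and print io_event.res, which holds the number of bytes
 * transferred by the completed IO. File name and size are assumptions.
 * Compile with: gcc reap.c -o reap -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t sz = 1048576;                        /* 1MB request, for illustration */
    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, sz) != 0) { fprintf(stderr, "posix_memalign failed\n"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, sz, 0);               /* read sz bytes at offset 0 */

    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    struct io_event ev;                               /* reap the completion */
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }

    printf("reaped IO of %ld bytes\n", (long)ev.res); /* 1048576 if the full read completed */

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}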

So, I wonder what you thought was possible; the correct answer on my operating system (Linux x86_64) with Oracle 12.1.0.2 is 32MB. It turned out the big IOs in the ASM case were limited by the allocation unit size of 1MB.

The next thing I wondered is how this matches the maximum IO size of the disk devices as visible to the operating system. You can request 32MB, but a normal SCSI disk doesn’t do 32MB IOs. Of course in my case the SCSI disk really is a VMWare virtual disk device.

Let’s keep the 32MB IO in mind, and now dive from the top layer, the SCI (system call interface), where an IO enters the kernel, down to the bottom of the kernel from an IO perspective: the block device. The block device settings are found in /sys/block/DEVICE/queue. The maximum IO size the device is capable of is found in max_hw_sectors_kb. This is in kilobytes, and read only (you can’t change the hardware, right?). In my case this is:

[root@bigmachine queue]# cat max_hw_sectors_kb
4096

My disk supports a maximum IO size of 4MB! But this is not what is used; the actual setting is in max_sectors_kb:

[root@bigmachine queue]# cat max_sectors_kb
512

That’s half a megabyte!

So… we have (up to) 32MB-sized IO requests coming in, and a device that is set to 512KB IOs. This means that somewhere between the SCI and the device, there is a mechanism to scatter the request into chunks of the device’s maximum IO size (a 32MB request ends up as 64 IOs of 512KB at the device), and, once those IOs are done, to gather the results back into the original request.

There are a couple of layers in the Linux kernel through which the call travels (including common functions):

-SCI/system call interface: system_call, sys_io_submit…. (io_submit, do_io_submit, io_submit_one; these seem to be in the VFS layer)
-VFS/virtual filesystem: aio_run_iocb, do_aio_read, xfs_file_read_iter, generic_file_read_iter, xfs_vm_direct_IO, bio_*, kiocb_batch_refill
-Block layer: blk_finish_plug, blk_flush_plug_list, queue_unplugged, __blk_run_queue, blk_run_queue
-SCSI layer: scsi_*
-Device driver: mptspi_qcmd, mptscsih_qcmd, mpt_put_msg_frame

(note: there seems to be consensus that the above-mentioned layers exist, although different sources use different wording and different numbers of layers. Also, there doesn’t seem to be a very clear description of what is done by which layer, and of what typically defines a kernel layer. For some functions it is clear they belong to a certain layer (for example aio_run_iocb in Linux/fs/aio.c, blk_finish_plug in Linux/block/blk-core.c, etc.), while for some layers, like the SCI layer, there isn’t a clear layer definition to be derived from where the function is defined. Also please mind the SCSI layer is implemented as a driver, just like the actual device driver for the hardware. This is very understandable, but makes it a bit harder to see it in a layered way)

System Call Interface (SCI)
The request enters kernel space via the SCI. The function of the SCI is to let a process enter kernel mode to perform a kernel-mode task, like (but not limited to) doing I/O. The system call implementation on Linux makes use of a wrapper function in glibc, which executes the system call on behalf of the user’s system call request. The reason for mentioning this is that sometimes the glibc wrapper “hides” the real system call, for example when calling the semtimedop() function:

(gdb) break semtimedop
Breakpoint 1 at 0x3bb38eb090: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) c
Continuing.

Breakpoint 1, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Above is a gdb (GNU debugger) session which attaches to an Oracle background process, which I know is sleeping in the system call semtimedop() when idle. A breakpoint is set on the semtimedop function, and the execution of the attached process is resumed. It then breaks on the function, showing the source code at which the break happened. Instead of showing the actual semtimedop function, it shows the pseudo function in glibc which wraps this system call. This hides the arguments with which the semtimedop() function is called. My current workaround is to read the registers which “carry” the arguments (RDI, RSI, RDX, RCX, R8, R9 for the first 6 arguments in most cases).

Virtual File System (VFS)
The next layer is the virtual filesystem. Here we see functions specific to asynchronous or synchronous IO, and to doing direct IO or not, and also actual filesystem-specific functions (in my case xfs; when ext4 is used, you will see specific functions for that. I highly recommend XFS!). This layer also uses a structure called ‘request_queue’, which keeps track of the actual IO requests for a block device. Each individual request is a struct ‘request’, which contains one or more ‘bio’ structs describing the request; each ‘bio’ points to ‘bio_vec’ structures, which point to the pages that hold the disk request contents. This is all set up and created in kernel memory by the user process in system mode. It’s my assumption that the properties of the disk device (= maximum advertised IO size) are taken into account when the VFS filesystem implementation creates the requests and all necessary structs and memory areas. Please mind it’s important that enough memory is available to set up the necessary structures, and enough CPU to make this happen. Also, some of the crucial structures for doing IO (request, bio, bio_vec) seem to be set up in this layer. An IO can’t be done without a memory area for the IO request to hold the data to be sent to the device (alias a write), or a memory area for the IO request to hold the data fetched from the device (alias a read).

The funny thing is that when you use ASM (the simple version 11.2 ASM with a local ASM instance and local disk devices), you will still see some functions of the VFS layer, because you use a disk device which is opened using the local filesystem. Examples of these functions are: aio_run_iocb, do_aio_read.

Block Layer
The next layer is the block layer. Here the request queue is handled, and I/O scheduling is done. Oracle advises the deadline scheduler in all cases. The scheduler works by plugging a request queue, much like a plug in your bathtub, letting the requests enter the queue. Having multiple requests in a queue means they can be optimised by reordering them, and by merging adjacent requests up to the device’s advertised maximum IO size. Once a request’s timeout expires, or the requesting process finishes submitting IO, the queue is unplugged.

SCSI layer
The SCSI layer is responsible for communicating with SCSI devices to do IOs.

Device driver
The device driver layer is the layer that truly physically communicates with a device, and implements the device specific communication. In my case the functions start with mpt, which is the driver for LSI PCI adapters.

To see how IO flows through the block layer, there is a tool called blktrace. Actually this is a mini-suite of tools consisting of blktrace (tracing the IO requests through the block layer), blkparse (parsing the output of blktrace to make it human readable), btrace (a script combining blktrace and blkparse) and btt (a blktrace output post-processing tool), among others.

In order to use blktrace, the debug file system of the Linux kernel needs to be mounted. Here is how that is done:

# mount -t debugfs debugfs /sys/kernel/debug

If the kernel debugfs is not mounted, you get the following message:

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
Invalid debug path /sys/kernel/debug: 0/Success

I use blktrace in this article for looking at the IO requests to understand what is going on. The workflow for this use of blktrace is:
– create a trace file of the block flow using blktrace
– make the trace file human readable via blkparse or analyse via btt (block trace times)

Actually, you can parse the output of blktrace directly via blkparse using ‘blktrace -d DEVICE -o - | blkparse -i -’. To make that even simpler, the script ‘btrace’ was created to do exactly that.

Here’s what that looks like (depending on the number of processes using the device, the output can be huge; this is only a snippet):

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
...
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]
...

What is shown here, is the typical flow of an IO in the block layer:
Q – Queue. A request starts off sending a notification on the intent to queue at the given location.
G – Get request. A struct request is allocated.
P – Plug. When the block device queue is empty, the queue is plugged in order to receive further IOs and have the ability to optimise (merge and/or reorder) them before the data is sent to the device.
I – Insert. A request is sent to the IO scheduler for addition to the internal queue and later service by the driver. The request is fully allocated at this time.
U – Unplug. The start of sending requests to the driver.
D – Driver. A request has been sent to the driver and removed from the queue.
C – Complete. A previously issued request to the driver has been completed.

The main point here is that you can truly see how the IO requests flow through the block layer and are issued to the storage device; in other words, you can see how the block layer receives the IOs and exactly what is submitted to the driver as requests for the physical storage layer.

This is a microscopic view of the disk IOs. In most cases, when you want to gain information on block layer IO processing, another view on it is provided by processing blktrace output with btt. This is an example output of btt:

First capture IO events using blktrace:

[root@bigmachine ~]# blktrace -w 60 -d /dev/oracleasm/disk1 -o - | blkparse -d sdb.blkparse -i -

In this example I captured IOs for 60 seconds. You can leave out ‘-w 60’ and press interrupt (ctrl-c) when you think enough IO has been recorded. This produces a binary file ‘sdb.blkparse’, which can then be processed with btt (for example: btt -i sdb.blkparse):

This is the first part, the flow through the block layer until IO completion:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.239795347   3.002829973         238
Q2G               0.000000001   0.159337842   3.011192142         264
G2I               0.000000679   0.000001724   0.000011618         264
I2D               0.000000764   0.000007633   0.000153436         264
D2C               0.000000001   0.103328167   3.012509148         233
Q2C               0.000000001   0.270961298   3.012516496         233

Note: time is in milliseconds.
Q2Q – Time between IO requests.
Q2G – Time it takes for a request struct to be allocated.
G2I – Time it takes for the request to be inserted into the device’s queue.
I2D – Time spent in the device queue waiting to be issued to the driver.
D2C – Time spent between issuing to the driver and completion of the request. This includes the controller and the storage. This is the same figure as the ‘svctm’ column of iostat -x.
Q2C – Total time spent in the block layer and on physical IO. This is the same figure as the ‘await’ column of iostat -x.

The second part is the device overhead section:

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%
---------- | --------- --------- --------- --------- ---------
   Overall |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%

This is partly the same as the IO flow table above, but here each figure is expressed as a percentage of the total time the IO spent.
Q2G – Request struct allocation.
G2I – Insertion in the device queue.
Q2M – Total time until merge.
I2D – Time spent in the queue until it was dispatched to the driver.
D2C – Time spent doing the IO after submitting the request to the driver.

(the following applies to the ‘db file sequential read’ wait event; the details were investigated on, and are specific to, Oracle’s database implementation on Linux x86_64)

Exadata IO: this event is not used with Exadata storage; ‘cell single block physical read’ is used instead.
Parameters:
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I have never seen a db file sequential read event that read more than one block. Of course this could change in a newer release.

Implementation:
One of the important things to realise here is that regardless of the asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses a pread() system call, so synchronous IO, for reading the blocks covered by this event. If you consider what the purpose of fetching the single block is in most cases (fetching a single database block whose contents are necessary in order to continue processing), it should become apparent that issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* views list SMALL_READ_REQS and SMALL_SYNC_READ_REQS, as well as SMALL_READ_SERVICETIME and SMALL_SYNC_READ_LATENCY, to make a distinction between sync (pread()) reads and non-sync (thus asynchronous) reads using the io_submit()-io_getevents() call combination.
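
To illustrate what such a single block read looks like at the system call level, here is a small sketch (my own illustration under assumptions, not Oracle’s actual code; the datafile name and block number are made up) that fetches exactly one 8KB block with a synchronous pread() call:

/* Sketch (illustration only, not Oracle code): fetch a single 8KB block
 * synchronously with pread(), the call underlying 'db file sequential read'.
 * The file name and block number are made-up assumptions. */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t block_size = 8192;
    const off_t  block_nr   = 1000;                 /* arbitrary block number */

    int fd = open("/u01/oradata/db/users01.dbf", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(block_size);

    /* pread() reads one block at its offset and only returns once the data
     * (from the page cache or the disk) is available: synchronous IO. */
    ssize_t n = pread(fd, buf, block_size, block_nr * (off_t)block_size);
    printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}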

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait is always recorded; in other words, there is a strict relation between the event and the physical IO. Just to be complete: if a needed block is already in the Oracle database buffer cache, no wait event is triggered and the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used (filesystemio_options set to ‘none’ or ‘async’ when using a filesystem; ASM (alias “Oracle managed raw devices”) is inherently direct path IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!), filesystemio_options is obeyed)), the block could very well be coming from the Linux filesystem cache. In fact, without DIO a phenomenon known as ‘double buffering’ takes place, which means the IO doesn’t go to the visible disk devices directly, but takes a mandatory step in between, done at the kernel level, which means the data is put in the Linux filesystem cache too. It should be obvious that this extra work comes at the cost of extra CPU cycles, and is in almost any case unnecessary.

If you take a step back you should realise this event should take place for a limited number of blocks during execution. Because of the inherent single-block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the full IO latency penalty. Even with solid state disk devices, which have an inherently lower latency because there are no rotating parts and disk heads to be moved, chopping up an operation into tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficiently.

The time spent on ‘db file sequential read’ quite accurately times single block IO. This means there should be a direct relationship between ‘db file sequential read’ timings and counts and the IO statistics measured at the operating system level (iostat, sar and more).

strace is a Linux utility to profile system calls. Using strace you can see the system calls that a process executes, in order to investigate its inner workings or performance. In my presentation about multiblock reads I put the text ‘strace lies’. This is NOT correct. My current understanding is that strace does show every system call made by an executable. So… why did I make that statement? (editorial note: this article dives into the inner workings of Linux AIO)

During the Hotsos Symposium in Dallas I was chatting with Tanel Poder, and he asked me to look a little bit more into the Linux io_getevents() call and strace, because there might be an optimisation which keeps the call from truly issuing a system call, which would mean strace could be right. We started thinking about it a bit, and came to the conclusion it should be possible for the Linux AIO code to cut the corner and peek at the IOs before executing the io_getevents system call (as a spoiler: because the IO context is in userspace).

So, what to do to investigate this? Well, let’s just look at how it works. The Oracle executable executes io_getevents_0_4() in order to do the system call io_getevents(). The function io_getevents_0_4() comes from libaio (the Linux asynchronous IO library). After a small search, it appears libaio has a git source code repository, so we can peek into the source code directly from our browser!

If you browse to the source tree, you see the file io_getevents.c. If you click on it, you see the contents of this file, which has the function io_getevents_0_4() in it. This is a very simple function (actual function source code):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout)
{
	struct aio_ring *ring;
	ring = (struct aio_ring*)ctx;
	if (ring==NULL || ring->magic != AIO_RING_MAGIC)
		goto do_syscall;
	if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
		if (ring->head == ring->tail)
			return 0;
	}
	
do_syscall:	
	return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

If you look at line 7, you see ‘if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0)’. In other words: if timeout (the struct) is set to any value (not NULL), and if timeout->tv_sec (the seconds portion of the timeout struct) is set to 0 and timeout->tv_nsec (the nanoseconds portion of the timeout struct) is set to 0, we enter this branch. Inside the branch, we look at the struct ring, which was cast from the pointer ctx (the first argument passed to io_getevents_0_4()) as a struct aio_ring earlier in the function. If ring->head is the same as ring->tail, in other words: if the ring (buffer) is empty, we cut the corner and return 0, without executing the system call. In any other case, the function __io_getevents_0_4() is executed, which executes the system call.
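
To see this shortcut from the calling side, here is a small sketch (again my own illustration, not Oracle code) of the two ways io_getevents() can be invoked: with a zero timeout on an empty completion ring, which libaio can answer from userspace without issuing a system call, and with a NULL timeout, for which the real system call is always made:

/* Sketch (illustration only): the libaio shortcut described above. With a
 * zero timeout and an empty completion ring, io_getevents() can return 0
 * straight from userspace, so strace shows no system call for it.
 * Compile with: gcc shortcut.c -o shortcut -laio */
#include <libaio.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct io_event ev;
    struct timespec zero = { 0, 0 };   /* tv_sec = 0, tv_nsec = 0 */

    /* Nothing has been submitted: the ring is empty and the timeout is zero,
     * so io_getevents_0_4() takes the userspace shortcut and returns 0. */
    int r = io_getevents(ctx, 0, 1, &ev, &zero);
    printf("non-blocking reap returned %d events\n", r);

    /* With a NULL timeout the shortcut does not apply and the real system
     * call is issued; with min_nr = 1 and nothing submitted it would block
     * forever here, which is why it is commented out. */
    /* io_getevents(ctx, 1, 1, &ev, NULL); */

    io_destroy(ctx);
    return 0;
}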

A way to check if this truly is happening is using the gdb ‘catch syscall’ functionality. In my investigation, I executed ‘break io_getevents_0_4’, which breaks on the userland portion of the io_getevents() function, ‘catch syscall io_getevents’, which breaks when the system call truly is executed, and ‘break io_submit’ to understand how many getevents calls are executed per submit. I set up a testcase with a sqlplus session with the server process throttled to 1 IO per second (see my article on using cgroups to throttle IO), attached to the server process with gdb, and executed the following commands:

break io_submit
commands
silent
printf "io_submit\n"
c
end
break io_getevents_0_4
commands
silent
printf "io_getevents_0_4-libaio\n"
c 
end
catch syscall io_getevents
commands
silent
printf "io_getevents-syscall\n"
c
end

Next I executed a SQL which did a direct path full table scan. This is the result:

io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall

If you recall what is in the ‘about multiblock reads’ presentation: after the io_submit ‘phase’, Oracle executes up to 4 non-blocking io_getevents() calls to look for completed IO. In this case you see those calls being done in userland, but not making it to the system call, because of the shortcut in the io_getevents_0_4() code. After 4 times, Oracle executes io_getevents() with the timeout set to 600 seconds, which makes the call truly execute a system call. Please mind that ‘catch syscall’ triggers twice (once on entry and once on return, as can be seen from the two ‘io_getevents-syscall’ lines in the above example), but this is in reality only 1 system call. This proves the working of the io_getevents_0_4() code we looked into, and is the reason why I thought the strace utility lied.

This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kinds of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking at the “devio_stats” dump, there are a few other things worth noticing. The lines with statistics do not have a timestamp or other time indicator; it’s only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it dumps the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). If you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of 3 seconds.
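
To get a feel for the raw data the cell server samples, here is a small sketch (my own illustration, not cellsrv code; the device name is an assumption) that reads /proc/diskstats twice, 3 seconds apart, and prints the number of reads and writes the device completed in that interval, the same kind of delta the devio_stats figures are based on:

/* Sketch (illustration only, not cellsrv code): sample /proc/diskstats twice,
 * 3 seconds apart, and print the delta of completed reads and writes for one
 * device, similar to the per-interval figures in the devio_stats dump. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* /proc/diskstats fields per line: major minor name reads_completed
 * reads_merged sectors_read ms_reading writes_completed ... */
static int read_diskstats(const char *dev, unsigned long long *reads,
                          unsigned long long *writes)
{
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) return -1;

    char line[512], name[64];
    unsigned int major, minor;
    unsigned long long rd, rd_mrg, rd_sec, rd_ms, wr;
    int found = -1;

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu",
                   &major, &minor, name, &rd, &rd_mrg, &rd_sec, &rd_ms, &wr) == 8
            && strcmp(name, dev) == 0) {
            *reads = rd;
            *writes = wr;
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void)
{
    const char *dev = "sda";                 /* device name is an assumption */
    unsigned long long r1, w1, r2, w2;

    if (read_diskstats(dev, &r1, &w1) != 0) { fprintf(stderr, "device not found\n"); return 1; }
    sleep(3);                                /* the cell appears to sample every 3 seconds */
    if (read_diskstats(dev, &r2, &w2) != 0) { fprintf(stderr, "device not found\n"); return 1; }

    printf("%s: %llu reads, %llu writes completed in the last 3 seconds\n",
           dev, r2 - r1, w2 - w1);
    return 0;
}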

Two other things are listed in the statistics. The first is ‘trigerConfine’ (which probably should be “triggerConfine”), a mechanism for Oracle to manage underperforming disks.
The other is “DMWG”. As far as I am aware, DMWG means “Disk Media Working Group”, and it works with the concept of peers.

To get a better understanding of the difference between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. You can read ServiceTime as the svctm of iostat (or storage wait, as Bart calls it), and Latency as await (or host wait, as Bart calls it).