Archive

Tag Archives: latency

(the details are investigated and specific to Oracle’s database implementation on Linux x86_64)

Exadata IO: This event is not used with Exadata storage, ‘cell single block physical read’ is used instead.
Parameters:
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I haven’t seen a db file sequential read event that read more than one block ever. Of course this could change in a newer release.

Implementation:
One of the important things to realise here is that regardless of asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses a pread() systemcall, so synchronous IO for reading blocks which are covered with this event. If you realise what the purpose of fetching the single block is in most cases: fetching a single database block which contents are necessary in order to continue processing, it should become apparent that issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* view lists both SMALL_READ_REQS, SMALL_SYNC_READ_REQS and SMALL_READ_SERVICETIME, SMALL_SYNC_READ_LATENCY, to make a distinction between SYNC (pread()) reads and non-sync (thus asynchronous) calls, using the io_submit()-io_getevents() call combination.

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait always is recorded, in other words: there is a strict relation between the event and the physical IO. Just to be complete: if a block needed is already in the Oracle database buffer cache, no wait event is triggered and the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used (filesystemio_options is set to ‘none’ or ‘async’ when using a filesystem, ASM (alias “Oracle managed raw devices”) is inherently direct path IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!), then filesystemio_options is obeyed)), the block could very well be coming from the filesystem cache of linux. In fact, without DIO a phenomenon known as ‘double buffering’ takes place, which means the IO doesn’t happen to it’s visible disk devices directly, but it needs to take a mandatory step in between, done at the kernel level, which means the data is put in the filesystem cache of linux too. It should be obvious that this extra work comes at the cost of extra CPU cycles being used, and is in almost any case unnecessary.

If you take a step back you should realise this event should take place for a limited amount of blocks during execution. Because of the inherent single block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the IO latency penalty. Even with solid state disk devices, which have an inherently lower latency time because there are no rotating parts and disk heads to be moved, chopping up an operation in tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficient.

The time spend on ‘db file sequential read’ quite accurately times single block IO. This means a direct relationship between ‘db file sequential read’ timings and amount should exist with operating system measured IO statistics (iostat, sar and more).

This is the fourth post on a serie of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, dev/sdb) and the partition (/dev/sda3, dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking in the “devio_stats” dump, there are a few other things which are worthy to notice. The lines with statistics do not have timestamp or other time indicator, it’s only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it dumps the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). If you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind the cell measures the differences between two periods in time, which means the numbers are averages over a period of 3 seconds.

Two other things are listed in the statistics: ‘trigerConfine’ (which probably should be “triggerConfine”), which is a mechanism for Oracle to manage under performing disks.
The other thing is “DMWG”. At this moment I am aware DMWG means “Disk Media Working Group”, and works with the concept of peers.

To get a better understanding of what the difference is between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. You can exchange the ServiceTime for svctm of iostat or storage wait as Bart calls it, and Latency for await or host wait as Bart calls it.

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata brings (part of the) processing potentially closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally can be flash.

But with Exadata, you either do normal alias regular IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads are hopefully coming from the flashcache, which can be known by looking at v$sysstat/v$sesstat at the statistic (“cell flash cache read hits”), not directly by looking at the IO related views. To understand the composition of the response time of a smartscan, there is even lesser instrumentation in the database (for background, look at this blogpost, where is shown that the smartscan wait does not detail any of the steps done in a smartscan. In other words: if you experience performance differences on Exadata, and the waits point towards IO, there’s not much analysis which can be done to dig deeper.

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

- IO size by number of reads and writes
– IO size versus latency for reads and writes
– IO size versus pending IO count for reads and writes
– IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, the cellcli provides the name of the trace file where the requested dump has been written to. If we look inside this trace file, this is how an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd, is the name of the celldisk (CD_02_enkcel01 here) is below the first table (IO lengths) about this celldisk(!)

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!

Follow

Get every new post delivered to your Inbox.

Join 2,191 other followers

%d bloggers like this: