
Tag Archives: IO

This article is about the Oracle 12c in-memory option, and specifically looks at how the background worker processes do IO to populate the in-memory column store.

Hardware: Apple MacBook with VMware Fusion 7.1.3.
Operating system: Oracle Linux 6.7, kernel: 3.8.13-118.el6uek.x86_64.
Database version: Oracle 12.1.0.2
Patch: opatch lspatches
19392604;OCW PATCH SET UPDATE : 12.1.0.2.1 (19392604)
19303936;Database Patch Set Update : 12.1.0.2.1 (19303936)

But first things first: let’s set up the in-memory option with a test table. The first thing to do is create the in-memory area that will store the objects. I only want a single table stored in the in-memory area, so I can simply look at the size of that object to size the area:

TS@fv12102 > select segment_name, round(bytes/power(1024,2)) "MB" from user_segments where segment_name = 'T2';

SEGMENT_NAME                                     MB
---------------------------------------- ----------
T2                                              168

So, that means that an in-memory area of 200MB should suffice, let’s set it and bounce the instance:

SYS@fv12102 AS SYSDBA> alter system set inmemory_size = 200m scope=spfile sid='*';

System altered.

SYS@fv12102 AS SYSDBA> startup force;
ORACLE instance started.

Total System Global Area 1048576000 bytes
Fixed Size		    2932336 bytes
Variable Size		  226492816 bytes
Database Buffers	  469762048 bytes
Redo Buffers		  139673600 bytes
In-Memory Area		  209715200 bytes
Database mounted.
Database opened.

The ‘In-Memory Area’ line in the startup output shows the in-memory area is created!

Now I need to set the inmemory attribute for my table T2. This example shows the inmemory attribute of my T2 table before it is set, then alters the table to belong to the in-memory area, and shows the inmemory attribute again:

TS@fv12102 > select inmemory from user_tables where table_name = 'T2';

INMEMORY
--------
DISABLED

TS@fv12102 > alter table t2 inmemory;

Table altered.

TS@fv12102 > select inmemory from user_tables where table_name = 'T2';

INMEMORY
--------
ENABLED

If we now look at V$IM_SEGMENTS, the view is empty:

SYS@fv12102 AS SYSDBA> select owner, segment_name, bytes normal_size, bytes_not_populated, inmemory_size, populate_status from v$im_segments;

no rows selected

This is because no inmemory_priority was specified when we altered the table to inmemory, so the default priority NONE is used, meaning the table will only be loaded into memory once it is touched.
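As a side note: population at instance startup, without waiting for a first touch, can be requested by specifying a priority. A minimal sketch (the PRIORITY clause accepts NONE, LOW, MEDIUM, HIGH and CRITICAL):

TS@fv12102 > alter table t2 inmemory priority high;

Table altered.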

Now, in order to understand how the in-memory area is populated, we need to execute a SQL trace including waits on the Wnnn process or processes that populate the in-memory area. However, that is not as straightforward as it seems: there are a couple of these (background) processes already in my instance, and more can be created if necessary. Luckily, we can use Oracle’s improvements to the diagnosability infrastructure and set SQL trace with a filter on the process name:

SYS@fv12102 AS SYSDBA> alter system set events 'sql_trace {process: pname = w} level 8';

System altered.
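As an aside: the same filter syntax can be used to switch this tracing off again afterwards. A sketch:

SYS@fv12102 AS SYSDBA> alter system set events 'sql_trace {process: pname = w} off';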

Now that we enabled sql_trace for all Wnnn processes, let’s touch the table, and see what is in the trace!

First touch the table:

TS@fv12102 > select count(*) from t2;

  COUNT(*)
----------
   1000000

That should result in the T2 table being read into the in-memory area:

SYS@fv12102 AS SYSDBA> select owner, segment_name, bytes normal_size, bytes_not_populated, inmemory_size, populate_status from v$im_segments;

OWNER  SEGMENT_NAME		      NORMAL_SIZE BYTES_NOT_POPULATED INMEMORY_SIZE POPULATE_
------ ------------------------------ ----------- ------------------- ------------- ---------
TS     T2				176160768		    0	   53673984 COMPLETED

Bingo!

Now go to the trace directory, and look for Wnnn process trace files:

[oracle@bigmachine [fv12102] trace]$ ls -l *_w*
-rw-r----- 1 oracle oinstall   2145 Nov 14 08:52 fv12102_w000_30938.trc
-rw-r----- 1 oracle oinstall    166 Nov 14 08:52 fv12102_w000_30938.trm
-rw-r----- 1 oracle oinstall   8616 Nov 14 09:10 fv12102_w000_32380.trc
-rw-r----- 1 oracle oinstall    752 Nov 14 09:10 fv12102_w000_32380.trm
-rw-r----- 1 oracle oinstall  35401 Nov 14 09:10 fv12102_w002_31964.trc
-rw-r----- 1 oracle oinstall   2612 Nov 14 09:10 fv12102_w002_31964.trm
-rw-r----- 1 oracle oinstall 232444 Nov 14 09:10 fv12102_w001_30940.trc
-rw-r----- 1 oracle oinstall   7120 Nov 14 09:10 fv12102_w001_30940.trm

Because of its size, my guess was that the largest file (fv12102_w001_30940.trc) would contain the trace output of the reading of the T2 table into the in-memory area. That guess was correct.

If a worker process was started before there is work to do, we can see it sleep for 5 seconds at a time in the wait event ‘Space Manager: slave idle wait’, after which it does some (internal) housekeeping, which means reading its CPU usage using getrusage(), after which it sleeps on semtimedop() for 5 seconds again. This is how that looks when traced with waits:

WAIT #0: nam='Space Manager: slave idle wait' ela= 5000827 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7189216230

*** 2015-11-14 08:52:32.981
WAIT #0: nam='Space Manager: slave idle wait' ela= 5000350 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7194217089

*** 2015-11-14 08:52:37.982
WAIT #0: nam='Space Manager: slave idle wait' ela= 5001004 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7199218359

*** 2015-11-14 08:52:42.982
WAIT #0: nam='Space Manager: slave idle wait' ela= 5000000 Slave ID=1 p2=0 p3=0 obj#=10986 tim=7204218487

When the worker starts working on populating the in-memory cache, it first reads some dictionary views (partobj$, compression$ and hist_head$), after which it begins to read the table into memory:

WAIT #140530969221688: nam='Disk file operations I/O' ela= 21 FileOperation=2 fileno=4 filetype=2 obj#=20434 tim=7207107727
WAIT #140530969221688: nam='direct path read' ela= 78 file number=4 first dba=22416 block cnt=8 obj#=20434 tim=7207125617
WAIT #140530969221688: nam='direct path read' ela= 113 file number=4 first dba=22425 block cnt=7 obj#=20434 tim=7207126486
WAIT #140530969221688: nam='direct path read' ela= 136 file number=4 first dba=22432 block cnt=8 obj#=20434 tim=7207127263
WAIT #140530969221688: nam='direct path read' ela= 60 file number=4 first dba=22441 block cnt=7 obj#=20434 tim=7207128054
WAIT #140530969221688: nam='direct path read' ela= 68 file number=4 first dba=22448 block cnt=8 obj#=20434 tim=7207128718
WAIT #140530969221688: nam='direct path read' ela= 114 file number=4 first dba=22457 block cnt=7 obj#=20434 tim=7207129417
WAIT #140530969221688: nam='direct path read' ela= 106 file number=4 first dba=22464 block cnt=8 obj#=20434 tim=7207130248
WAIT #140530969221688: nam='direct path read' ela= 100 file number=4 first dba=22473 block cnt=7 obj#=20434 tim=7207130878
WAIT #140530969221688: nam='direct path read' ela= 202 file number=4 first dba=22480 block cnt=8 obj#=20434 tim=7207132096
WAIT #140530969221688: nam='direct path read' ela= 173 file number=4 first dba=22489 block cnt=7 obj#=20434 tim=7207132857
WAIT #140530969221688: nam='direct path read' ela= 293 file number=4 first dba=22496 block cnt=8 obj#=20434 tim=7207134040

The event ‘Disk file operations I/O’ is the instrumentation of meta-operations which need to happen before a process can work on a file. In our case, FileOperation=2 means opening a file.
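As a side note, the meaning of the p1/p2/p3 parameters of a wait event can be looked up in V$EVENT_NAME. A sketch for this event (output reformatted for readability):

SYS@fv12102 AS SYSDBA> select parameter1, parameter2, parameter3 from v$event_name where name = 'Disk file operations I/O';

PARAMETER1      PARAMETER2      PARAMETER3
--------------- --------------- ----------
FileOperation   fileno          filetype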

Then we see the event ‘direct path read’! That means the worker uses the modern method of reading data, instead of the older method indicated by ‘db file scattered read’.

The reason this is important is that the ‘scattered read’ method does not scale very well. The basic setup of the ‘scattered read’ method is:
1. Determine the chunk to read (limited by DB_FILE_MULTIBLOCK_READ_COUNT, extent borders, space management blocks and blocks already in the cache).
2. Read the chunk.
3. Repeat until the high water mark is reached.
Even if this is done via asynchronous IO, no more than one chunk is ever read at a time.

If we look at the ‘direct path read’ method, it works, simplified, like this:
1. Determine the number of IOs allowed in flight (default/minimum: two!).
2. Submit reads for chunks of the segment to scan; the number of IOs is the one determined in step 1.
3. Wait for at least one IO to return.
4. Repeat until the high water mark is reached.

Essentially, using the ‘direct path read’ method a single process can take advantage of the available bandwidth of modern storage and asynchronous IO, and issue multiple IOs at the same time.
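To make the difference concrete, here is a minimal sketch in C using libaio of the asynchronous pattern the ‘direct path read’ method relies on: submit several reads, then reap at least one completion while the other IOs stay in flight. This is not Oracle’s code; the datafile name is made up and error handling is omitted:

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

#define IN_FLIGHT 2             /* default/minimum number of IOs in flight */
#define CHUNK (1024 * 1024)     /* 1MB chunk */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb[IN_FLIGHT], *cbs[IN_FLIGHT];
	struct io_event ev[IN_FLIGHT];
	void *buf[IN_FLIGHT];
	int fd = open("datafile.dbf", O_RDONLY | O_DIRECT);   /* hypothetical file */

	io_setup(IN_FLIGHT, &ctx);
	for (int i = 0; i < IN_FLIGHT; i++) {
		posix_memalign(&buf[i], 4096, CHUNK);
		io_prep_pread(&cb[i], fd, buf[i], CHUNK, (long long)i * CHUNK);
		cbs[i] = &cb[i];
	}
	io_submit(ctx, IN_FLIGHT, cbs);             /* all chunks are now in flight */
	io_getevents(ctx, 1, IN_FLIGHT, ev, NULL);  /* wait for at least one IO */
	/* a real scanner would submit the next chunk here and repeat
	   until the high water mark is reached */
	io_destroy(ctx);
	return 0;
}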

During tracing I noticed the worker processes also write to the database’s diagnostics destination, in a directory ‘log’, to a file ‘imdb_ORACLE_SID.log’, which seems to contain information about the chunks read by the worker processes:

Sat Nov 14 09:51:10 2015
c:0o:20434d:20434/St.RDBA: 0xac00 No.blks: 896
c:0o:20434d:20434/St.RDBA: 0xaf84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xb384 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xb784 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xbb84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xbf84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xc384 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xc784 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xcb84 No.blks: 1020
c:0o:20434d:20434/St.RDBA: 0xcf84 No.blks: 644
c:0o:20434d:20434/rows buff: 460730

There is also a directory ‘imdb’ inside the ‘log’ directory, which contains what looks like the same data in XML format:

<msg time='2015-11-14T08:52:45.889+01:00' org_id='oracle' comp_id='rdbms'
 msg_id='kdmlScanAndCreateCU:1537:858585840' type='UNKNOWN' group='IM_populate'
 level='16' host_id='bigmachine.local' host_addr='127.0.0.1'>
 <txt>c:0o:20434d:20434/St.RDBA: 0x5790 No.blks: 8
 </txt>
</msg>

It reveals a code location, kdmlScanAndCreateCU, of which the first letters probably stand for Kernel Data in-Memory.

In my opinion, this information should only be printed when some kind of debug flag is set, not ‘just because’: writing these statistics costs time. I also don’t know whether these files are maintained by ADR. If not, doing a lot of in-memory transactions can fill the diagnostics destination rather quickly.

In the previous post on the decision between buffered and direct path reads I showed that the decision depends on the version. Up to and including version 11.2.0.2, the size of a segment needs to be five times the small table threshold in order to be considered for direct path reads; starting from 11.2.0.3, the database starts considering direct path reads from the small table threshold onwards. That is the lower limit; the upper limit is called “very large object threshold” (VLOT) and is five times the size of the buffer cache, the threshold above which a table scan always goes via direct path.
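If you want to look at these thresholds on your own system, the (undocumented, so use at your own risk) parameters involved can be queried as SYSDBA via the x$ tables. A sketch (_very_large_object_threshold is interpreted as a percentage of the buffer cache, so its default of 500 means five times its size):

SYS@fv12102 AS SYSDBA> select a.ksppinm, b.ksppstvl from x$ksppi a, x$ksppcv b where a.indx = b.indx and a.ksppinm in ('_small_table_threshold', '_very_large_object_threshold');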

There seems to be an odd situation with 11.2.0.3 and up: if a table has a size between the small table threshold and five times the small table threshold, it is always read via direct path, while if the table is bigger than five times the small table threshold, the choice depends on the number of blocks in the cache, and switches to buffered reads if the number of blocks in the cache is 99% or more. This seems to be the opposite of what is logical; a smaller table is likely a better candidate for caching than a larger one.

The last thing the previous blogpost showed was the impact of dirty blocks on the buffered/direct path decision. Up to 11.2.0.2 a read that should go via direct path is done buffered when the amount of dirty blocks reaches around 75%. Starting from version 11.2.0.3, a read that should go via direct path is done buffered when the amount of dirty blocks for a table reaches approximately 50%. This is vital information for data warehouses!

A PL/SQL procedure was included so you can measure this for yourself. If you have any doubts, follow the link above and paste the procedure to test it yourself.

The purpose of this blogpost is to further investigate the table direct path or buffered decision in various Oracle database versions when compression is used. I used compression using “COMPRESS FOR OLTP”, in order to see if compression changes the decisions.

The first measurement is reading, doing a full scan of a compressed table:

Version 11.2.0.1 (_small_table_threshold (_STT): 3979; _STT*5: 19895)

TS@v11201 > @test_direct_path
table: T2_OLTP size: 64512 blocks, 10000000 rows.
cached - physical reads cache/direct: (10)		6419/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (20)		6487/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (30)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (40)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (50)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (60)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (70)		6372/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (80)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (90)		6376/0
full scan - physical reads cache/direct:			  0/63911
cached - physical reads cache/direct: (100)		 6382/0
full scan - physical reads cache/direct:			  0/0

This pattern is repeated up to version 12.1.0.2: if the amount of blocks is 99% or more, the database switches from direct path to buffered reads for table scans. So for simple reading, the decision does not seem different than for non-compressed tables.

When dirty blocks are included, it is different:

Version 11.2.0.1:

TS@v11201 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  0/63911

If a large number of the blocks in the cache, or all of them, are dirty, a full scan does not flip to buffered reads at 99%.

However, it is interesting to see that this changes with version 11.2.0.2:

TS@v11202 > @test_direct_path.sql
table: T2_OLTP size: 64512 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/63911
full scan - physical reads cache/direct: (20)			 0/63911
full scan - physical reads cache/direct: (30)			 0/63911
full scan - physical reads cache/direct: (40)			 0/63911
full scan - physical reads cache/direct: (50)			 0/63911
full scan - physical reads cache/direct: (60)			 0/63911
full scan - physical reads cache/direct: (70)			 0/63911
full scan - physical reads cache/direct: (80)			 0/63911
full scan - physical reads cache/direct: (90)			 0/63911
full scan - physical reads cache/direct: (100)			  461/0

Starting from version 11.2.0.2, the threshold of 99% for a full table scan to switch from direct path to buffered seems to be true for clean blocks and dirty blocks. The same is true for versions 11.2.0.3 and 11.2.0.4.

However, starting from 12.1.0.1, a full scan on a compressed table is done via direct path even if all the blocks are in the cache and dirty, which seems identical to the version 11.2.0.1 behaviour:

TS@v12101 > @test_direct_path
table: T2_OLTP size: 65536 blocks, 10000000 rows.
--update
full scan - physical reads cache/direct: (10)			 0/65174
full scan - physical reads cache/direct: (20)			 0/65174
full scan - physical reads cache/direct: (30)			 0/65174
full scan - physical reads cache/direct: (40)			 0/65174
full scan - physical reads cache/direct: (50)			 0/65174
full scan - physical reads cache/direct: (60)			 0/65174
full scan - physical reads cache/direct: (70)			 0/65174
full scan - physical reads cache/direct: (80)			 0/65174
full scan - physical reads cache/direct: (90)			 5/65174
full scan - physical reads cache/direct: (100)			  0/65174

So, the overall conclusion is that there is a difference between regular heap tables and compressed heap tables. I tested OLTP compression; the [nsmtio] tracing indicates this is the same for HCC compression (hybrid columnar compression). In general, for simple reading there doesn’t seem to be a difference between normal heap tables and compressed heap tables. However, I tested on idle systems; on “real” systems with more stress on buffer cache management, more mechanisms might be in play.

When there are dirty blocks (changed blocks sitting in the buffer cache waiting to be written by the database writer), the instance behaves differently with respect to the buffered read or direct path read choice. In general, with dirty blocks involved, compressed tables are less likely to be read into the buffer cache.

It seems that with compressed tables in version 11.2.0.1 and in 12.1.0.1 and up, there is no switch to buffered reads, even if all the blocks are in the cache but dirty. The versions in between, 11.2.0.2, 11.2.0.3 and 11.2.0.4, do switch to buffered reads when 99% or more blocks are in the cache, regardless of whether they are dirty.

Also, please mind I explicitly mentioned tables. For a full scan on an index, indicated by the ‘FAST FULL INDEX SCAN’ line in an explain plan, entirely different rules are in play.

I have posted a fair amount of material on how Oracle generates IOs, and especially large IOs, meaning more than one Oracle block, so > 8KB. This is typically what happens when the Oracle database executes a row source which does a full segment scan. Let’s start off with a quiz: what do you think is the maximum IO size the Oracle engine is capable of requesting from the operating system (the IO size as can be seen at the SCI (system call interface) layer)? If you made up your answer, remember it, and read on!

The real intention of this blogpost is to describe what is going on in the Oracle database kernel, but also what is being done in the Linux kernel. Being a performance specialised Oracle DBA means you have to understand what the operating system does. I often see that it’s of the utmost importance to understand how an IO ends up as a request at the NAS or SAN head, so you understand what a storage admin is talking about.

Many people (including myself in the past) would state that the maximum IO size on Linux is 1MB. For the Linux 2.6 kernel and higher this statement is incorrect because there is no such thing as a single maximum IO size on Linux. There used to be one in the Linux 2.4 era, which was set with the maxphys parameter, but that time is long gone. In order to find out what it is now, let’s test and see!

First let’s get a Linux system and a big table!
The system I got is a VMware Fusion VM, running Linux 3.8.13-44.1.5.el6uek.x86_64 (UEK3) on Oracle Linux 6u6. On top of that I am using the Oracle database and grid infrastructure version 12.1.0.2. This system is using udev for providing disk access (as opposed to asmlib).
The redundancy mode of ASM is external, although for reading (what I will be covering) this doesn’t matter.

Now let’s get a normal database session, and use a combination of sql_trace with waits (10046/8) and strace to see how the Oracle database interfaces with the kernel. Please mind I’ve prepared a heap table with no indexes on it, so a count(*) on it always will result in a full table scan. Also, the buffercache is sized small enough (or the table is created large enough, it depends on how you look at it) to have the session make the decision to do a direct path read, instead of a buffered read. If you don’t know what that means: please search this blog on direct path reads, or even better, download my presentation ‘about multiblock reads’.

The direct path read decision is visible via the ‘direct path read’ wait event. If you get a full table scan operation and see ‘db file scattered read’ waits, you are doing a buffered scan.

In most cases, you will get a maximum value of 1MB if possible, which seems to support the generally assumed 1MB maximum operating system induced IO size. Why? Well, because you probably set the DB_FILE_MULTIBLOCK_READ_COUNT parameter to 128, which means you have explicitly told the Oracle process not to do IOs with a size of more than 1MB (128 * 8192 bytes). In this blogpost, I show that Oracle can request IOs bigger than 1MB.

In the blogpost series on extra huge database IOs, I show that Oracle can do huge (1MB+) IOs, but the physical request size (what Oracle actually requests at the SCI layer, visible with the pread/pwrite/io_submit/io_getevents functions) still is 1MB. This limit is imposed by the physical storage structure which the database uses with ASM, called the allocation unit (often called ‘AU’). The default size of an allocation unit is 1MB. The allocation unit size can be seen in both the database and the ASM instance with the following query:

SYS@+ASM AS SYSASM> select name, allocation_unit_size from v$asm_diskgroup;

NAME			       ALLOCATION_UNIT_SIZE
------------------------------ --------------------
DATA					    1048576
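For completeness: the allocation unit size is fixed at diskgroup creation time via the au_size attribute. A sketch (diskgroup and disk names made up):

SYS@+ASM AS SYSASM> create diskgroup data2 external redundancy disk '/dev/oracleasm/disk2' attribute 'au_size' = '4M';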

How about doing an unbuffered read on a filesystem? I’ve created a database on an (XFS, but this doesn’t matter AFAIK) filesystem, and tried to find the maximum value for DB_FILE_MULTIBLOCK_READ_COUNT. I did this by setting DB_FILE_MULTIBLOCK_READ_COUNT to 10000 (ten thousand), and then bouncing the database to see what the number has become. In my case, the value became 4096. I think this is the limit for Oracle 12.1.0.2 on Linux x86_64, but I would love to hear if you have gotten different results:

I set 10000:

SYS@fv12102 AS SYSDBA> select name, value from v$spparameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   10000

But Oracle limits this to 4096:

SYS@fv12102 AS SYSDBA> select name, value from v$parameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   4096

Okay. Let’s start our investigation at that point: a database which is set up with DB_FILE_MULTIBLOCK_READ_COUNT set to 4096 (32MB with a block size of 8KB), and a table with extents large enough to accommodate huge (32MB) IOs.

Fire up a regular database session, and enable sql trace at level 8:

$ sqlplus ts/ts@//localhost/v11204
...
SQL> alter session set events 'sql_trace level 8';

Now start another session as root on the database server, and find the PID of the server process of the sqlplus session we just created above. Issue strace with the verbose write setting:

# strace -e write=all -e all -p PID
Process PID attached - interrupt to quit
read(14,

Okay, we are set up and ready to go, but there is one additional thing: the way direct path reads work, they would probably show few waits when the IO capabilities are fast. One way to get the waits back is to limit the IO capabilities of the process. Doing so is documented in this article.

Now issue the full table scan on a large table in sqlplus while strace is attached:

SQL> select count(*) from bigtab;

Now take a peek at the strace output!
The output first shows IOs as we expect:

io_getevents(139717184229376, 1, 128, {{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1
times(NULL)                             = 431386800
write(7, "\n*** 2014-11-24 13:09:28.028\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  34 2d 31 31 2d 32 34 20  .*** 201 4-11-24  |
 | 00010  31 33 3a 30 39 3a 32 38  2e 30 32 38 0a           13:09:28 .028.    |
lseek(7, 0, SEEK_CUR)                   = 31181
write(7, "WAIT #139717129509840: nam='dire"..., 130) = 130
 | 00000  57 41 49 54 20 23 31 33  39 37 31 37 31 32 39 35  WAIT #13 97171295 |
 | 00010  30 39 38 34 30 3a 20 6e  61 6d 3d 27 64 69 72 65  09840: n am='dire |
 | 00020  63 74 20 70 61 74 68 20  72 65 61 64 27 20 65 6c  ct path  read' el |
 | 00030  61 3d 20 33 39 30 37 33  30 20 66 69 6c 65 20 6e  a= 39073 0 file n |
 | 00040  75 6d 62 65 72 3d 34 20  66 69 72 73 74 20 64 62  umber=4  first db |
 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |
 | 00070  37 34 20 74 69 6d 3d 31  39 32 30 30 37 31 30 31  74 tim=1 92007101 |
 | 00080  39 39                                             99                |

What is visible here is first the reaping of an IO request (with asynchronous IO on Linux this is typically the io_getevents() call). Take a close look at the arguments of the io_getevents() call (prototype taken from the manpage of io_getevents):

int io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);

And then focus on the struct io_event:

struct io_event {
         __u64           data;           /* the data field from the iocb */
         __u64           obj;            /* what iocb this event came from */
         __s64           res;            /* result code for this event */
         __s64           res2;           /* secondary result */
};

The above description is taken from the annotated Linux kernel source, as available here: http://lxr.free-electrons.com/source/include/uapi/linux/aio_abi.h#L58 (I use this site for navigating the Linux kernel source). What is important is that the third field (io_event.res) contains the size of the IO request. Having learned this, now look again at the io_getevents call. The size of the IO reaped above is 33554432, which is 33554432/1024/1024 = 32MB. Yes, that’s a single IO of 32MB! This is also consistent with the wait line a little lower:

 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |

Block count (4096) * block size (8192) = 33554432 bytes

So, I wonder what you thought was possible, the correct answer on my operating system (Linux x86_64) with Oracle 12.1.0.2 is 32MB. It turned out the big IOs in the ASM case were limited by the allocation unit size of 1MB.

The next thing I wondered is how this matches the maximum IO size of the disk devices as visible to the operating system. You can request 32MB, but a normal SCSI disk doesn’t do 32MB IOs. Of course, in my case the SCSI disk really is a VMware virtual disk device.

Let’s keep the 32MB IO in mind, and now dive from the top layer, the SCI (system call interface) where an IO enters the kernel, down to the bottom of the kernel from an IO perspective: the block device. The block device settings are found in /sys/block/DEVICE/queue. The maximum IO size the device is capable of is found in max_hw_sectors_kb. This is in kilobytes, and read-only (can’t change hardware, right?). In my case this is:

[root@bigmachine queue]# cat max_hw_sectors_kb
4096

My disk supports a maximum IO size of 4MB! But this is not what is used; the actual setting is in max_sectors_kb:

[root@bigmachine queue]# cat max_sectors_kb
512

That’s half a megabyte!

So… we got (up to) 32MB sized IO requests coming in, and a device that is set to 512KB IOs. This means that somewhere between the SCI and the device, there is a mechanism to scatter the request to the device’s maximum IO size, and, once the IO requests are done, to gather the IO results back into the original request.
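As a side note: max_sectors_kb is writable, up to the max_hw_sectors_kb value, so larger requests to the device are possible. A sketch (the setting is not persistent across reboots, and the effect should be tested before using it anywhere serious):

[root@bigmachine queue]# echo 4096 > max_sectors_kb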

There are a couple of layers in the Linux kernel through which the call travels (including common functions):

-SCI/system call interface: system_call, sys_io_submit…. (io_submit, do_io_submit, io_submit_one; these seem to be in the VFS layer)
-VFS/virtual filesystem: aio_run_iocb, do_aio_read, xfs_file_read_iter, generic_file_read_iter, xfs_vm_direct_IO, bio_*, kiocb_batch_refill
-Block layer: blk_finish_plug, blk_flush_plug_list, queue_unplugged, __blk_run_queue, blk_run_queue
-SCSI layer: scsi_*
-Device driver: mptspi_qcmd, mptscsih_qcmd, mpt_put_msg_frame

(note: there seems to be consensus that the above mentioned layers exist, although there is different wording and different numbering by different sources. Also, there doesn’t seem to be a very clear description of what is done by which layer, and of what typically defines a kernel layer. For some functions it is clear they belong to a certain layer (for example aio_run_iocb in Linux/fs/aio.c, blk_finish_plug in Linux/block/blk-core.c, etc.); for some layers, like the SCI layer, there isn’t a clear layer definition by looking at where the functions are defined. Also please mind the SCSI layer is implemented as a driver, just like the actual device driver for the hardware. This is very understandable, but makes it a bit harder to see it in a layered way.)

System Call Interface (SCI)
The request enters kernel space via the SCI. The function of the SCI is to elevate a process to system priority to perform a kernel mode task, like (but not limited to) doing IO. The system call implementation on Linux makes use of a wrapper function in glibc, which executes the system call on behalf of the user’s system call request. The reason for mentioning this is that sometimes the glibc wrapper ‘hides’ the real system call, for example when calling the semtimedop() function:

(gdb) break semtimedop
Breakpoint 1 at 0x3bb38eb090: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) c
Continuing.

Breakpoint 1, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Above is a gdb (GNU debugger) session which attaches to an Oracle background process which I know sleeps in the system call semtimedop() when idle. A breakpoint is set on the semtimedop function, and the execution of the attached process is resumed. It then breaks on the function, showing the source code at which the break happened. Instead of showing the actual semtimedop function, it shows the pseudo function in glibc which wraps this system call. This hides the arguments of the semtimedop() call. My current workaround is to read the CPU registers which “carry” the arguments (RDI, RSI, RDX, RCX, R8, R9 for the first 6 arguments in most cases).

Virtual File System (VFS)
The next layer is the virtual filesystem. Here we see functions specific to asynchronous or synchronous IO, and to doing direct IO or not, and also actual filesystem-specific functions (in my case xfs; when ext4 is used, you will see specific functions for that. I highly recommend XFS!). This layer also uses a structure called ‘request_queue’, which keeps track of the actual IO requests for a block device. Each individual request is a struct ‘request’, which contains one or more structs called ‘bio’ describing the request, each of which points to a structure called ‘bio_vec’, which points to the pages for storing the disk request contents. This is all set up and created in kernel memory by the user process in system mode. It’s my assumption that the properties of the disk device (= maximum advertised IO size) are taken into account when the VFS filesystem implementation creates the requests and all the necessary structs and memory areas. Please mind it’s important that enough memory is available to set up the necessary structures, and enough CPU to make this happen. Some of the crucial structures for doing IO (request, bio, bio_vec) are set up in this layer. An IO can’t be done without a memory area for the IO request to hold the data to be sent to the device (a write), or a memory area to hold the data fetched from the device (a read).

The funny thing is that when you use ASM (the simple version 11.2 ASM with a local ASM instance and local disk devices), you will still see some functions of the VFS layer, because you use a disk device which is opened via the local filesystem. Examples of these functions are aio_run_iocb and do_aio_read.

Block Layer
The next layer is the block layer. Here the request queue is handled, and IO scheduling is done. Oracle advises the deadline scheduler in all cases. The scheduler works by plugging a request queue, much like a plug in your bathtub, letting the requests enter the queue. Having multiple requests in a queue means they can be optimised by reordering them, and by merging adjacent requests up to the device’s advertised maximum IO size. Once a request’s timeout expires, or the requesting process finishes submitting IO, the queue is unplugged.

SCSI layer
The SCSI layer is responsible for communicating with SCSI devices to do IOs.

Device driver
The device driver layer is the layer that truly physically communicates with a device, and implements the device specific communication. In my case the functions start with mpt, which is the driver for LSI PCI adapters.

To see the flow of IO going through the block layer, there is a tool called blktrace. Actually this is a mini-suite of tools consisting of blktrace (tracing the IO requests through the block layer), blkparse (parsing the output of blktrace to make it human readable), btrace (a script combining blktrace and blkparse) and btt (a blktrace output post-processing tool), among others.

In order to use blktrace, the debug file system of the Linux kernel needs to be mounted. Here is how that is done:

# mount -t debugfs debugfs /sys/kernel/debug

If the kernel debugfs is not mounted, you get the following message:

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
Invalid debug path /sys/kernel/debug: 0/Success

I use blktrace in this article for looking at the IO requests to understand what is going on. The workflow for this use of blktrace is:
– create a trace file of the block flow using blktrace
– make the trace file human readable via blkparse or analyse via btt (block trace times)

Actually, you can parse the output of blktrace directly via blkparse using ‘blktrace -d DEVICE -o - | blkparse -i -’. To make that even simpler, the script ‘btrace’ was created to do exactly that.

Here’s how that looks (depending on the number of processes using the device, the output can be huge; this is only a snippet):

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
...
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]
...

What is shown here, is the typical flow of an IO in the block layer:
Q – Queue. A request starts off sending a notification on the intent to queue at the given location.
G – Get request. A struct request is allocated.
P – Plug. When the block device queue is empty, the queue is plugged in order to receive further IOs and have the ability to optimise (merge and/or reorder) them before the data is sent to the device.
I – Insert. A request is sent to the IO scheduler for addition to the internal queue and later service by the driver. The request is fully allocated at this time.
U – Unplug. The start of sending requests to the driver.
D – Driver. A request has been sent to the driver and removed from the queue.
C – Complete. A previously issued request to the driver has been completed.

The main point here is that you can truly see how the IO requests flow through the block layer and are issued to the storage device. In other words: you can see how the block layer receives the IOs, and what exactly is submitted to the driver as a request for the physical storage layer.

This is a microscopic view of the disk IOs. In most cases, when you want to gain information on block layer IO processing, another view is provided by processing blktrace output with btt. This is an example of btt output:

First capture IO events using blktrace:

[root@bigmachine ~]# blktrace -w 60 -d /dev/oracleasm/disk1 -o - | blkparse -d sdb.blkparse -i -

In this example I captured IOs for 60 seconds. You can exclude ‘-w 60’ and press interrupt (ctrl-c) when you deem the IO recording is enough. This produces a binary file ‘sdb.blkparse’, which can be processed by btt:
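The btt invocation to process that capture would look like this (a sketch; -i names the binary file produced above):

[root@bigmachine ~]# btt -i sdb.blkparse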

This is the first part, the flow through the block layer until IO completion:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.239795347   3.002829973         238
Q2G               0.000000001   0.159337842   3.011192142         264
G2I               0.000000679   0.000001724   0.000011618         264
I2D               0.000000764   0.000007633   0.000153436         264
D2C               0.000000001   0.103328167   3.012509148         233
Q2C               0.000000001   0.270961298   3.012516496         233

Note: time is in milliseconds.
Q2Q – Time between IO requests.
Q2G – Time it takes for a request struct to be allocated.
G2I – Time it takes for the request to be inserted in the device’s queue.
I2D – Time spent in the device queue waiting to be issued to the driver.
D2C – Time spent between issuing to the driver and completion of the request. This includes controller and storage. This is the same figure as the ‘svctm’ column of iostat -x.
Q2C – Total time spent in the block layer and physical IO. This is the same figure as the ‘await’ column of iostat -x.

The second part is the device overhead section:

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%
---------- | --------- --------- --------- --------- ---------
   Overall |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%

This is partly the same as the IO flow table above, expressed as a percentage of where the total time of the IO is spent.
Q2G – Request struct allocation.
G2I – Insertion in the device queue.
Q2M – Total time until merge.
I2D – Time spent in the queue until dispatch to the driver.
D2C – Time spent doing the IO after submitting the request to the driver.

(the following details were investigated on, and are specific to, Oracle’s database implementation on Linux x86_64)

Exadata IO: the ‘db file sequential read’ event is not used with Exadata storage; ‘cell single block physical read’ is used instead.
Parameters:
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I have never seen a ‘db file sequential read’ event that read more than one block. Of course this could change in a newer release.

Implementation:
One of the important things to realise here is that, regardless of asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses a pread() system call, thus synchronous IO, for reading the blocks covered by this event. If you realise what the purpose of fetching the single block is in most cases (fetching a single database block whose contents are necessary in order to continue processing), it should become apparent that issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* views list both SMALL_READ_REQS and SMALL_SYNC_READ_REQS, and both SMALL_READ_SERVICETIME and SMALL_SYNC_READ_LATENCY: to make a distinction between sync (pread()) reads and non-sync (thus asynchronous) reads using the io_submit()-io_getevents() call combination.
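A sketch of looking at these statistics per file type; V$IOSTAT_FILE carries both the request counts and the latency figures mentioned above:

SYS@fv12102 AS SYSDBA> select filetype_name, small_read_reqs, small_sync_read_reqs, small_sync_read_latency from v$iostat_file where filetype_name = 'Data File';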

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait is always recorded; in other words, there is a strict relation between the event and the physical IO. Just to be complete: if a needed block is already in the Oracle database buffer cache, no wait event is triggered when the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used (filesystemio_options set to ‘none’ or ‘async’ when using a filesystem; ASM (alias “Oracle managed raw devices”) is inherently direct path IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!), filesystemio_options is obeyed)), the block could very well be coming from the filesystem cache of Linux. In fact, without DIO a phenomenon known as ‘double buffering’ takes place: the IO doesn’t go to the visible disk devices directly, but takes a mandatory step in between, done at the kernel level, which means the data is put in the filesystem cache of Linux too. It should be obvious that this extra work comes at the cost of extra CPU cycles, and is in almost any case unnecessary.
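If you run your database on a filesystem and want to avoid double buffering (and get asynchronous IO as well), the usual recipe is ‘setall’. A sketch (it is a static parameter, so an instance restart is needed):

SYS@fv12102 AS SYSDBA> alter system set filesystemio_options = 'setall' scope=spfile sid='*';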

If you take a step back, you should realise this event should take place for a limited number of blocks during execution. Because of the inherent single block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the full IO latency penalty. Even with solid state disk devices, which have an inherently lower latency because there are no rotating parts and no disk heads to be moved, chopping an operation up into tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficiently.

The time spent on ‘db file sequential read’ quite accurately times single block IO. This means a direct relationship should exist between ‘db file sequential read’ timings and counts on the one hand, and operating system measured IO statistics (iostat, sar and more) on the other.

strace is a Linux utility to profile system calls. Using strace you can see the system calls that a process executes, in order to investigate its inner working or performance. In my presentation about multiblock reads I put up the text ‘strace lies’. This is NOT correct. My current understanding is that strace does show every system call made by an executable. So… why did I make that statement? (editorial note: this article dives into the inner workings of Linux AIO)

During the Hotsos symposium in Dallas I was chatting with Tanel Poder, and he asked me to look a little bit more into the Linux io_getevents() call and strace, because there might be an optimisation which keeps the call from truly issuing a system call, which would mean strace was right. We started thinking about it a bit, and came to the conclusion that it should be possible for the Linux AIO code to cut the corner and peek at the IOs before executing the io_getevents system call (as a spoiler: because the IO context is in userspace).

So, what to do to investigate this? Well, let’s just look at how it works. The Oracle executable executes io_getevents_0_4() in order to do the system call io_getevents(). The function io_getevents_0_4() comes from libaio (the Linux asynchronous IO library). After a small search, it appears libaio has a git source code repository, so we can peek at the source code directly from our browser!

If you browse to the source tree, you see the file io_getevents.c. If you click on it, you see the contents of this file, which has the function io_getevents_0_4() in it. This is a very simple function (actual function source code):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout)
{
	struct aio_ring *ring;
	ring = (struct aio_ring*)ctx;
	if (ring==NULL || ring->magic != AIO_RING_MAGIC)
		goto do_syscall;
	if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
		if (ring->head == ring->tail)
			return 0;
	}
	
do_syscall:	
	return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

If you look at line 7, you see ‘if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0)’. In other words: if the timeout struct is set (not NULL), and both timeout->tv_sec (the seconds portion) and timeout->tv_nsec (the nanoseconds portion) are set to 0, we enter this branch. Once inside, we look at the struct ring, which is cast from the pointer ctx passed to io_getevents_0_4() as its first argument. If ring->head is the same as ring->tail, in other words if the ring buffer is empty, we cut the corner and return 0 without executing the system call. In any other case, the function __io_getevents_0_4() is executed, which executes the system call.

A way to check whether this truly happens is using the gdb ‘catch syscall’ functionality. In my investigation, I executed ‘break io_getevents_0_4’, which breaks on the userland portion of the io_getevents() function, ‘catch syscall io_getevents’, which breaks when the system call truly is executed, and ‘break io_submit’ to understand how the getevents calls relate to the submits. I set up a testcase with a sqlplus session with the server process throttled to 1 IO per second (see my article on using cgroups to throttle IO), attached to the server process with gdb, and executed the following commands:

break io_submit
commands
silent
printf "io_submit\n"
c
end
break io_getevents_0_4
commands
silent
printf "io_getevents_0_4-libaio\n"
c 
end
catch syscall io_getevents
commands
silent
printf "io_getevents-syscall\n"
c
end

Next I executed a SQL which did a direct path full table scan. This is the result:

io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall

If you recall what is in the ‘about multiblock reads’ presentation: after the io_submit ‘phase’, Oracle executes up to 4 io_getevents() calls non-blocking to look for IOs. In this case you see those calls being done in userland, but not making it to the system call, because of the shortcut in the io_getevents_0_4() code. After 4 times, Oracle executes io_getevents() with the timeout set to 600 seconds, which makes the call truly execute a system call. Please mind that ‘catch syscall’ triggers twice (gdb breaks on both system call entry and return, as can be seen from the two ‘io_getevents-syscall’ lines in the above example), but this is in reality only 1 system call. This proves the working of the code in the io_getevents_0_4() function we looked into, and is the reason I thought the strace utility lied.
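For reference, here is a minimal sketch in C of the calling style that triggers the shortcut: a zeroed timespec lets the libaio wrapper answer from the userspace ring without entering the kernel. This is illustrative code, not Oracle’s:

#include <libaio.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	io_context_t ctx = 0;
	struct io_event events[128];
	struct timespec nowait = { 0, 0 };  /* zero timeout */
	int n;

	io_setup(128, &ctx);
	/* the ring is empty (nothing submitted); with a zeroed timeout the
	   libaio wrapper returns 0 here without entering the kernel, which
	   is exactly what hid these calls from strace */
	n = io_getevents(ctx, 1, 128, events, &nowait);
	printf("reaped without blocking: %d\n", n);
	/* with a timeout of {600, 0}, as Oracle uses after four empty peeks,
	   the wrapper falls through to __io_getevents_0_4() and truly issues
	   a (blocking) system call */
	io_destroy(ctx);
	return 0;
}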

This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kinds of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking in the “devio_stats” dump, there are a few other things worth noticing. The lines with statistics have no timestamp or other time indicator; it’s only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it contains the IO device statistics the cell keeps for the last 1800 seconds (30 minutes). However, if you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of 3 seconds.
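You can peek at the same source yourself. A sketch (in /proc/diskstats, after the device name, the first field is the number of reads completed and the fifth is the number of writes completed, with the fields in between covering merges, sectors and milliseconds spent):

[root@enkcel01 ~]# grep 'sda ' /proc/diskstats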

Two other things are listed in the statistics. One is ‘trigerConfine’ (which probably should be “triggerConfine”), a mechanism for Oracle to manage underperforming disks.
The other is “DMWG”. At this moment I am aware that DMWG means “Disk Media Working Group”, and that it works with the concept of peers.

To get a better understanding of the difference between the ServiceTime and Latency columns, see this excellent writeup on IO statistics by Bart Sjerps. You can equate ServiceTime to the svctm column of iostat (or ‘storage wait’ as Bart calls it), and Latency to await (or ‘host wait’ as Bart calls it).
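For comparison with what the cell keeps internally, the same two figures can be watched live with iostat’s extended statistics, which include the svctm and await columns. A sketch, reporting every 3 seconds to match the cell’s snapshot interval:

[root@enkcel01 ~]# iostat -x 3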

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata brings (part of) the processing potentially closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally can be flash.

But with Exadata, you either do normal (regular) IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads hopefully come from the flash cache, which can only be seen by looking at the statistic “cell flash cache read hits” in v$sysstat/v$sesstat, not directly in the IO related views. To understand the composition of the response time of a smartscan, there is even less instrumentation in the database (for background, look at this blogpost, which shows that the smartscan wait does not detail any of the steps done in a smartscan). In other words: if you experience performance differences on Exadata, and the waits point towards IO, there’s not much analysis which can be done to dig deeper.

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

– IO size by number of reads and writes
– IO size versus latency for reads and writes
– IO size versus pending IO count for reads and writes
– IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, the cellcli provides the name of the trace file where the requested dump has been written. If we look inside this trace file, this is how an IO latencies dump looks:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd is that the name of the celldisk (CD_02_enkcel01 here) is printed below the first table (IO lengths) about that celldisk(!)

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan, and took the iolstats dump again:

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!
