Recently I was involved in a project which required a lot of data to be extracted from Oracle. The amount of data was so huge that the filesystems filled up. Compressing the output (using tar with the j (bzip2) or z (gzip) option) is an obvious solution, but that can only be done after the files are created. This is why I proposed compressing the output so it never exists in uncompressed form.

This solution works with a so-called ‘named pipe’, which I know for sure exists on Linux and unix. A named pipe lets two processes transfer data between each other. This solution will look familiar to “older” Oracle DBAs: this is how exports from the “original” export utility (exp) were compressed.

I’ve created a small script with sqlplus embedded in it, which executes sqlplus commands using a “here document”:

mknod /tmp/oracle.pipe p

sqlplus / as sysdba << _EOF
set escape on

host nohup gzip -c < /tmp/oracle.pipe > /tmp/out1.gz \&
spool /tmp/oracle.pipe
select * from dual;
spool off

host nohup gzip -c < /tmp/oracle.pipe > /tmp/out2.gz \&
spool /tmp/oracle.pipe
select * from dual;
spool off

_EOF

rm /tmp/oracle.pipe

First a pipe is created (mknod filename p). As far as I know, this command is the same on Linux and the unixes. This pipe is removed as the last step of the script.

Inside the sqlplus script, I start gzip as an operating system command using the ‘host’ command. The line with the host command starts gzip with the pipe as input and a .gz file in /tmp as output, and puts the process in the background using ‘&’.
Next, the sqlplus spool command directs output to the pipe, and I execute a dummy sql (select * from dual).
With ‘spool off’, the output to the pipe is stopped, which makes the background gzip process terminate.
Because nothing is compressing to the pipe anymore, it can be reused for a second spool, and for more spools of course, each with its own gzip process.

The result is two gzipped files:

zcat -v /tmp/out*.gz
/tmp/out1.gz:	SQL> select * from dual;

D                                                                               
-                                                                               
X                                                                               

SQL> spool off
 84.2%
/tmp/out2.gz:	SQL> select * from dual;

D                                                                               
-                                                                               
X                                                                               

SQL> spool off
 84.2%
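
To make this a bit more concrete, below is a minimal sketch of how the same technique could be used for an actual extraction. Everything specific in it is made up for the example: the table name (my_big_table), the column list, the pipe location and the output filename; the sqlplus formatting settings are just one way to get clean spool output. Here gzip is started from the shell before sqlplus runs instead of via the ‘host’ command; that works equally well, because opening a named pipe simply blocks until both a reader and a writer have attached to it.

# create the named pipe and attach a background gzip as the reader
mknod /tmp/extract.pipe p
nohup gzip -c < /tmp/extract.pipe > /tmp/my_big_table.csv.gz &

# spool the query output into the pipe; gzip compresses it on the fly,
# so the uncompressed data never touches the filesystem
sqlplus -s / as sysdba << _EOF
set pagesize 0 linesize 32767 trimspool on feedback off
spool /tmp/extract.pipe
select col1 || ',' || col2 || ',' || col3 from my_big_table;
spool off
_EOF

# clean up the pipe once the spool has finished
rm /tmp/extract.pipe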

This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server keeps statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking at the “devio_stats” dump, there are a few other things worth noticing. The lines with statistics do not have a timestamp or any other time indicator; they contain only the statistics. The lines are displayed per device, with the newest line on top. The dump states that it contains the IO device statistics the cell keeps for the last 1800 seconds (30 minutes). However, if you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of 3 seconds.
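
Since the source of these numbers is /proc/diskstats, you can take similar interval snapshots yourself on the storage server. Below is a minimal sketch (plain bash; the device name and interval are example values) that prints the number of reads and writes completed on sda during a 3 second interval. It relies on the standard /proc/diskstats layout, where field 4 is the cumulative count of completed reads and field 8 the cumulative count of completed writes.

#!/bin/bash
# take two snapshots of the read/write completion counters of one device
# and print the difference, similar to what cellsrv does every ~3 seconds
DEV=sda        # example device
INTERVAL=3     # seconds between the two snapshots

snap() { awk -v d="$DEV" '$3 == d { print $4, $8 }' /proc/diskstats; }

read r1 w1 <<< "$(snap)"
sleep "$INTERVAL"
read r2 w2 <<< "$(snap)"

echo "device=$DEV reads=$((r2 - r1)) writes=$((w2 - w1)) in ${INTERVAL}s"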

Two other things are listed in the statistics. One is ‘trigerConfine’ (which probably should be “triggerConfine”), a mechanism for Oracle to manage underperforming disks.
The other is “DMWG”. At this moment all I am aware of is that DMWG means “Disk Media Working Group”, and that it works with the concept of peers.

To get a better understanding of the difference between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. ServiceTime corresponds to svctm in iostat, or “storage wait” as Bart calls it, and Latency corresponds to await, or “host wait” as Bart calls it.
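
For comparison with the numbers in the devio_stats dump, a regular iostat on the cell shows the same underlying figures. A minimal example (sda as example device, a 3 second interval to match the cell’s apparent snapshot interval, two samples; remember the first sample reports averages since boot):

# -x: extended statistics (includes svctm and await), -d: device report only
iostat -xd sda 3 2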

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata potentially brings (part of the) processing closer to the storage media, which will be rotating disks for most (Exadata) users, optionally complemented by flash.

But with Exadata, you either do normal, regular IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads hopefully come from the flash cache, which can only be seen by looking at the statistic “cell flash cache read hits” in v$sysstat/v$sesstat, not directly in the IO related views. To understand the composition of the response time of a smartscan, there is even less instrumentation in the database (for background, look at this blogpost, which shows that the smartscan wait does not detail any of the steps done in a smartscan). In other words: if you experience performance differences on Exadata, and the waits point towards IO, there is not much analysis that can be done in the database to dig deeper.

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

- IO size by number of reads and writes
- IO size versus latency for reads and writes
- IO size versus pending IO count for reads and writes
- IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, cellcli provides the name of the trace file the requested dump has been written to. If we look inside this trace file, this is what an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd is that the name of the celldisk (CD_02_enkcel01 here) is printed below the first table (IO lengths) about that celldisk(!)
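
If you want to isolate the tables of a single celldisk from the trace file, a simple text filter is enough. Below is a minimal sketch; the trace file name is a placeholder for whatever thread-logfile cellcli reported for your ‘iolstats’ dump, and because the celldisk name is printed below its tables, the lines of context before the match are what you want (adjust the number of context lines to taste).

# the thread-logfile cellcli reported for the iolstats dump (example name)
TRACEFILE=/opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc

# the celldisk name appears below its tables, so print the preceding lines as well
grep -B 40 "CDisk CD_02_enkcel01" $TRACEFILE | less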

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan and requested a new ‘iolstats’ dump:

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!

When you are administering one or more Exadatas, you probably have multiple databases running on different database or “compute” nodes. In order to understand what kind of IO you are doing, you can look at the statistics of your database, and see in the data dictionary what that instance (or instances, in the case of RAC) has been doing. When using Exadata there is a near 100% chance you are using either normal or high redundancy, and most people know about the “write amplification” that comes with ASM redundancy: the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy. This means there might be a difference between the IO you measure or think your database is doing, and what is actually done at the storage level.

But what if you want to know what is happening at the storage level, so at the level of the cell, or actually “cellsrv”, the process which makes IO flow to your databases? One option is to run “iostat -x”, but that gives a list that is quite hard to read (too many disk devices), and it doesn’t show you what the reason for the IO was: a redo write? A controlfile read? Archivelog IO? Knowing the reason would be especially useful when your IO behaves differently than you expect, and you’ve ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell. To generate a dump with an overview of what kinds of IOs are done, use “cellcli” locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for finding your dump in the thread-logfile: search (“/” when you use “less”) for the sequence number; using the above example with sequence #18, enter: /sequence\ #18

This is an example from a cell in the Enkitec lab, which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are counts of IOs; they say nothing about the size of the IOs. Also please mind these are the numbers of a single cell; you probably have 3, 7 or 14 cells.
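
Because the dump is plain text, it is easy to summarise per cell. Below is a minimal sketch that sorts the reasons of such a dump by total number of IOs (reads plus writes); the trace file name is the one from my example, so substitute your own thread-logfile.

# list the IO reasons sorted by total IOs (reads + writes), highest first;
# the numeric tests skip the header and separator lines of the dump
grep "Cache::dumpReasons" svtrc_15737_14.trc | \
  awk '$NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ {
         total = $(NF-1) + $NF
         # everything between the first field and the two counters is the reason text
         reason = ""
         for (i = 2; i <= NF-2; i++) reason = reason $i " "
         printf "%12d  %s\n", total, reason
       }' | sort -rn | head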

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these numbers may grow very big. In order to establish an easy baseline, the IO reason numbers can be reset, so you can start your test or proof-of-concept run and measure what actually happened at the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.

PS1: Thanks to Nikolay Kovachev for pointing out the ‘ioreasons’ resetstats parameter. Indeed ‘all’ is way too blunt.
PS2: The IO numbers seem to be the number of IO requests the cell has received from its clients (ASM and the database) for data, not for metadata. During a smartscan, metadata flows between the database and the cell server before data is actually served.

Exadata gets its performance by letting the storage (the Exadata storage server) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage servers in query processing means that the storage grid can process a smart scan request in a massively parallel way (depending on the number of storage servers participating).

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the ‘passthrough’ functionality in the storage server. The Exadata book explains that this functionality came with storage server version 11.2.2.3.0 and Oracle database version 11.2.0.3 with Exadata bundle patch 7. It also explains that ‘passthrough’ means the storage server deliberately starts sending non-storage-processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data, and needs to do more processing itself. The database-side statistic that shows this is happening is “cell physical IO bytes sent directly to DB node to balance CPU usage”, which is at the database level in v$sysstat and at the session level in v$sesstat.
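
For reference, this is a minimal sqlplus sketch of how that statistic could be queried; the statistic name is the one mentioned above, the rest is just one way to write the query.

# the backslashes before $ stop the shell from expanding v$sysstat in the unquoted here document
sqlplus / as sysdba << _EOF
-- system-wide value of the passthrough statistic
select name, value
from   v\$sysstat
where  name = 'cell physical IO bytes sent directly to DB node to balance CPU usage';

-- the same statistic for the current session
select sn.name, ms.value
from   v\$mystat ms, v\$statname sn
where  ms.statistic# = sn.statistic#
and    sn.name = 'cell physical IO bytes sent directly to DB node to balance CPU usage';
_EOF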

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few “underbar” or “undocumented” parameters. In order to get their current values on the Exadata storage server, the following command can be used in the “cellcli”:

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50

The “MPP” function is responsible for the passthrough functionality. I can’t find anywhere what “MPP” means; my guess is “Measurement (of) Performance (for) Passthrough”. These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU: “2” means every 200ms.
_cell_mpp_threshold seems to be the CPU usage threshold (in percent) above which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (unsure what the exact granularity is) which are sent to the database in passthrough mode.
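
To check the current values without reading through the whole cellparams dump, a grep on the trace file is enough; a minimal sketch, assuming the parameters appear in the dump in the form shown above (the trace file name is the one cellcli echoed in my example):

# pull the MPP related underbar parameters out of the cellparams dump
grep "_cell_mpp" /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc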

In order to get a good understanding of what MPP does, there is an MPP-specific dump which can be very useful for diagnosing MPP related matters. This dump is made on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header which is specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in that timestamp (10:54:05.994) is roughly 3 hours and 40 minutes earlier than the timestamp of the next messages, which is 14:36:04.910. The line with the three asterisks is the creation timestamp of the tracefile, which is when the thread we are using was created. The next line, which lists the LWPID, userId, etc., was also created just after that time.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line contains important information: during the approximately 30 minutes covered by this dump, 4232 blocks were processed by this cell, and 0 blocks were “executed” in “passthru” mode.

Next, the measurements which were taken every 200ms in the past are printed, giving an exact overview of the measured CPU busyness and the rate at which “pushback” alias “passthru” was applied.

To see what this means, let’s generate CPU busyness on the cell, and see if we can get the storage server to invoke “passthru”. There is a simple trick to let a fake process take 100% of its CPU thread with common Linux shell tools: ‘yes > /dev/null &’. The storage server which I use has 16 CPU threads, so I start 16 of these processes, to effectively monopolise every CPU thread, as shown below.
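
This is a minimal sketch of how that load could be generated and cleaned up again; 16 matches the number of CPU threads of my storage server, so adjust the count to yours.

# start one cpu-burning process per CPU thread
for i in $(seq 1 16); do
  yes > /dev/null &
done

# ... run the scan and take the mpp_stats dump here ...

# stop all the yes processes again
killall yes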

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described in this blog.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sent 2728 blocks out of the total of 13287 blocks via passthrough mode. Further, in the lines which contain the historical measurements, the “pushback rate” can be seen climbing up to 50%, because the CPU usage was above 90%.
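
To reduce the [MPP] history lines in the trace file to just the time, the cell CPU figure and the pushback rate, something like the following minimal sketch can be used; the trace file name is the one from the earlier mpp_stats dump in this post, so substitute the thread-logfile cellcli echoed for your dump.

# print date, time, cell CPU utilisation and pushback rate per history line
grep "Cell CPU History" /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc | \
  awk '{ printf "%s %s %s cpu=%s pushback=%s\n", $3, $4, $5, $10, $13 }'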

Please mind the techniques I’ve described here are done on one storage server, while a normal Exadata setup has 3 (eighth/quarter rack), 7 (half rack) or 14 (full rack) storage servers.

The purpose of this post is to show what the wait event ‘cell smart table scan’ means, based on reproducible investigation methods.

First of all, if you see the ‘cell smart table scan’ event: congratulations! This means you are using your Exadata the way it’s intended to be used: your full table scan is offloaded to the cells (storage layer), and potentially all kinds of optimisations are happening, like column filtering, predicate filtering, storage indexing, etc.

But what exactly is happening when you see the wait event ‘cell smart table scan’? Can we say anything based on this wait, like you can with other wait events?

If you have access to an exadata database machine, and have a table that is big enough to be smartscanned by the database, issue the following statements:
(Please mind that I’ve found these statements to be harmless, but execute these statements at your own risk. As with any execution of SQL: this uses resources on the Exadata, and could influence the performance)

SQL> alter session set tracefile_identifier='frits'; --make it easy to find the tracefile
SQL> alter session set events 'trace[libcell.client_library.*]';
SQL> alter session set events 'sql_trace level 8';
SQL> --- now issue a full table scan on a table 
SQL> select /*+ noparallel */ count(*) from bigtable1;

  COUNT(*)
----------
  63938560
SQL> exit

Now take the tracefile and look inside!

WAIT #139739525441360: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752114437651

*** 2013-10-14 07:02:01.516
WAIT #139739525441360: nam='SQL*Net message from client' ela= 7078360 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121516483
CLOSE #139739525441360:c=0,e=13,dep=0,type=1,tim=1381752121516628
=====================
PARSING IN CURSOR #139739525440176 len=57 dep=0 uid=0 oct=3 lid=0 tim=1381752121516762 hv=2246504030 ad='85b3e090' sqlid='gmm4k0y2ydvky'
select /*+ noparallel */ count(*) from bigtable1
END OF STMT
PARSE #139739525440176:c=0,e=83,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516760
EXEC #139739525440176:c=0,e=87,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516992
WAIT #139739525440176: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121517053
WAIT #139739525440176: nam='reliable message' ela= 2053 channel context=2700762248 channel handle=2700440152 broadcast message=2669087352 obj#=-1 tim=1381752121519402
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 330 name|mode=1263468545 2=65571 0=2 obj#=-1 tim=1381752121520594
WAIT #139739525440176: nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=6 filetype=2 obj#=-1 tim=1381752121520664

*** 2013-10-14 07:02:01.525
initializing OSS
ossnet_optip_trace: Host is not NUMA. Returning no IP preference

This is the start of my tracefile (but yours could be a bit different). First the usual SQL*Net waits, then the query which gets parsed and executed, then the waits for ‘enq: KO - fast object checkpoint’, which are an indication we are going to do direct path IO: the KO enqueue indicates the object is checkpointed to make sure that the version of the blocks on disk is current.

Then there’s ‘initializing OSS’. This is where the communication with the Exadata storage servers is started. OSS probably means ‘Oracle Storage Server’ in this context. The next line tells us something nice: apparently, the code in the Oracle database that behaves as a client of the Exadata storage server is NUMA (Non Uniform Memory Access) aware. However, on Exadata versions “dash two” (X2-2, X3-2), NUMA is turned off on the compute layer (the kernel boot line in /etc/grub.conf explicitly disables NUMA with “numa=off”).

Now let’s look in the tracefile, but to get a better idea of what is happening, let’s filter the tracefile information. Issue the following command on the tracefile:

grep -e ^oss tracefile.trc | grep -v done | less

The first part looks something like this:

ossnet_optip_trace: Host is not NUMA. Returning no IP preference
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_09_enkcel01 with guid 217e470e-d3f5298-1ffed28e-a00133d2
oss_open called for o/192.168.12.5/DATA_CD_00_enkcel03 with guid 78804872-173010d6-2c12c6d2-b7bc6780
oss_open called for o/192.168.12.5/DATA_CD_08_enkcel03 with guid fd1f4fce-15180efd-83fbde2f-8465401c
oss_open called for o/192.168.12.4/DATA_CD_00_enkcel02 with guid ead44d03-e64725e4-c25a79b5-bff05885
oss_open called for o/192.168.12.4//box/predicate
oss_open called for o/192.168.12.5/DATA_CD_05_enkcel03 with guid 20c344ef-53a0387b-c2dbd306-ad971476
oss_open called for o/192.168.12.4/DATA_CD_08_enkcel02 with guid e2294076-cce51fa-edaecc3f-a966ca8e
oss_open called for o/192.168.12.4/DATA_CD_09_enkcel02 with guid 3073459f-8f6928a0-c70f77dc-9ed9e840
oss_open called for o/192.168.12.5/DATA_CD_10_enkcel03 with guid 73b44c5a-a44544b2-82933066-8cf0410a

I’ve included the NUMA line again. Next we see ‘oss_open’ lines. For every griddisk there’s a line to indicate that the ‘disk’ is identified. Also, when a storage server is encountered for the first time in the smart scan, we see a line with “/box/predicate”, which probably sets up a communication channel to that storage server to push the predicate information, so the storage server can use it. It seems logical to me that the foreground process which is going to use the storage environment first gathers information about it.
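
To see how the oss_open calls are distributed over the storage servers, the griddisk lines can be counted per cell IP address. A minimal sketch, using the same filtered tracefile as above (the predicate lines are excluded so only griddisks are counted):

# count the oss_open calls per storage server; with / as field separator,
# the IP address is the second field of the o/ path
grep "^oss_open" tracefile.trc | grep -v "/box/predicate" | \
  awk -F/ '{ count[$2]++ } END { for (ip in count) print ip, count[ip] }'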

The next phase looks like this:

oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 4, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 2, opcode: 99
oss_ioctl open file: 4, opcode: 99
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_wait called for request: 0xdcb52e0
oss_wait called for request: 0xdcb59f0

This phase seems to execute “oss_ioctl”, and after every ioctl there seems to be a corresponding “oss_wait” line. “ioctl” probably means Input Output Control. I am not sure what happens during the ioctl phase. The number of ioctl requests (19 in my case) is too low for one ioctl per griddisk, and the number of extents is much higher too.

The next phase is executing “oss_cread”:

oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38

My guess (again) is that “cread” means compressed read, and it seems to be the indication that the database is reading the results of an offloaded query. When a non-offloaded query is executed, the OSS read function is “oss_read”, which is referred to as “BREAD” in the tracefile (block (mode) read). The offset of oss_cread seems to always be “0”, and the size seems to be “1048576” (1MB) in all cases. The line with ‘failed’ does not seem to indicate a problem.

The IOCTLs and CREADs seem to be mixed for a little while, after which only CREADs are left.

So, an offloaded query has an open, ioctl and cread phase. Now let’s add in the waits:

grep -e ^oss -e ^WAIT tracefile.trc | grep -v done | less

This is an example with different snippets to indicate what the ‘cell smart table scan’ means:

oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 526 cellhash#=3176594409 p2=0 p3=0 obj#=15909 tim=1381752121558781
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152
oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
WAIT #139739525440176: nam='cell smart table scan' ela= 9249 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122240642
oss_wait called for request: (nil)
oss_wait called for request: (nil)
WAIT #139739525440176: nam='cell smart table scan' ela= 197 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122250860

What we see is that during the entire “life” of a smartscan, across all three phases, the same wait event ‘cell smart table scan’ is used. This means there is little that can be derived from the wait event itself, except for the fact that a smartscan is being used.
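
Even though the wait event does not reveal the phase, the tracefile can still be used to see how much wait time was attributed to each cell. Below is a minimal sketch that sums the ela times of the ‘cell smart table scan’ waits per cellhash# (tracefile.trc as in the grep commands above):

# total time (microseconds) and number of 'cell smart table scan' waits per cell hash
grep "cell smart table scan" tracefile.trc | \
  awk '{ for (i = 1; i <= NF; i++) {
           if ($i == "ela=")         ela = $(i+1)
           if ($i ~ /^cellhash#=/)   { split($i, a, "="); cell = a[2] }
         }
         sum[cell] += ela; cnt[cell]++ }
       END { for (c in sum) printf "cellhash %s: waits=%d total_ela=%d us\n", c, cnt[c], sum[c] }'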

This is a quick post about how you can get the debuginfo packages on your Oracle Linux system in the easiest way imaginable: via yum.

I guess most people reading this are familiar with Oracle Linux, and do know how to install it, and how to use the public yum server to install and update packages on Linux right from Oracle’s free internet repository. If you do not know, follow the link and learn.

As a sidestep from the purpose of this blog article: during the ACE Director briefing prior to Oracle OpenWorld, Wim Coekaerts announced that the public-yum repository is now hosted on Akamai instead of somewhere “unofficial” in the Oracle infrastructure. This is really, really noticeable when using it now: previously I could not get beyond a speed of approximately 500K/s, now I get speeds of 10M/s.

Anyways, in some cases you want to look deep inside your Linux system (for example using gdb), and want gdb to understand the functions it encounters as much as possible. In order for gdb to understand functions, the debuginfo package of the executable you are running must be installed. Please mind some things simply do not have a corresponding debuginfo package, or even an rpm package; for example the Oracle database.

Finding debuginfo packages has been a painful process, at least for me. For Oracle Linux, the debuginfo packages are at http://oss.oracle.com/ol6/debuginfo. As far as I know, this location is not widely known (but that could be my ignorance).

But it gets even better! You can use this location as a yum location, and install the debuginfo packages via yum, if you add this piece of text in /etc/yum.repos.d as the file ‘debuginfo.repo’:

[ol6_debuginfo]
name=Oracle Linux 6 debuginfo
baseurl=http://oss.oracle.com/ol6/debuginfo
gpgkey=https://oss.oracle.com/ol6/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

A word of caution: I’ve found the yum metadata to lag behind the available packages, which means that I’ve had to install a package “directly” (via rpm) from the site instead of via yum, because the metadata didn’t contain that new package yet. So if you update your kernel (for example), and can’t find the corresponding debuginfo package via yum, take a look at the package list on the site yourself: it might be there, just not in the metadata at that time.
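
As an example of using the repository: this is a minimal sketch of how the debuginfo package of the running UEK kernel could be installed once the repo file above is in place. The package name kernel-uek-debuginfo is the usual naming for Oracle Linux UEK kernels, but check the package list on the site for the exact name and version that matches your system.

# install the debuginfo package matching the running UEK kernel via the new repo
yum install kernel-uek-debuginfo-$(uname -r)

# alternative, with the yum-utils package installed:
# debuginfo-install kernel-uek

# if yum cannot find the package because of stale metadata, look up the exact
# file name at http://oss.oracle.com/ol6/debuginfo and install it directly with rpm -ivh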
