
strace is a Linux utility to trace system calls. Using strace you can see the system calls that a process executes, in order to investigate its inner workings or its performance. In my presentation about multiblock reads I put the text ‘strace lies’. This is NOT correct. My current understanding is that strace does show every system call made by an executable. So…why did I make that statement? (editorial note: this article dives into the inner workings of Linux AIO)

During the Hotsos Symposium in Dallas I was chatting with Tanel Poder, and he asked me to look a bit more into the Linux io_getevents() call and strace, because there might be an optimisation which keeps the call from actually issuing a system call, which would mean strace is right after all. We started thinking about it a bit, and came to the conclusion that it should be possible for the Linux AIO code to cut a corner and peek at the IOs before executing the io_getevents() system call (spoiler: because the IO context is mapped into userspace).

So, what to do to investigate this? Well, let’s just look at how it works. The Oracle executable calls io_getevents_0_4() in order to issue the io_getevents() system call. The function io_getevents_0_4() comes from libaio (the Linux asynchronous IO library). After a short search, it appears libaio has a git source code repository, so we can peek into the source code directly from our browser!

If you browse to the source tree, you see the file io_getevents.c. If you click on it, you see the contents of this file, which has the function io_getevents_0_4() in it. This is a very simple function (actual function source code):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout)
{
	struct aio_ring *ring;
	ring = (struct aio_ring*)ctx;
	if (ring==NULL || ring->magic != AIO_RING_MAGIC)
		goto do_syscall;
	if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
		if (ring->head == ring->tail)
			return 0;
	}
	
do_syscall:	
	return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

If you look at line 7 of the function, you see ‘if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0)’. Before that, the pointer ctx, the first argument of io_getevents_0_4(), is cast to a struct aio_ring; if it is NULL or its magic value does not match AIO_RING_MAGIC, the function jumps straight to the system call. Then, if a timeout struct is passed (not NULL), and both the seconds portion (timeout->tv_sec) and the nanoseconds portion (timeout->tv_nsec) are set to 0, we enter this branch: if ring->head is the same as ring->tail, in other words if the ring (buffer) is empty, we cut the corner and return 0 without executing the system call. In any other case, the function __io_getevents_0_4() is executed, which issues the actual system call.

A way to check whether this is truly happening is to use the gdb ‘catch syscall’ functionality. In my investigation, I executed ‘break io_getevents_0_4’, which breaks on the userland portion of the io_getevents() function, ‘catch syscall io_getevents’, which breaks when the system call is truly executed, and ‘break io_submit’ to see how many io_getevents() calls follow each io_submit(). I set up a test case with a sqlplus session with the server process throttled to 1 IO per second (see my article on using cgroups to throttle IO), attached to the server process with gdb, and executed the following commands:

break io_submit
commands
silent
printf "io_submit\n"
c
end
break io_getevents_0_4
commands
silent
printf "io_getevents_0_4-libaio\n"
c 
end
catch syscall io_getevents
commands
silent
printf "io_getevents-syscall\n"
c
end

Next I executed a SQL statement which did a direct path full table scan. This is the result:

io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall

If you recall what is in the presentation about multiblock reads: after the io_submit ‘phase’, Oracle executes up to 4 non-blocking io_getevents() calls to look for completed IOs. In this case you see those calls being done in userland, but not making it to the system call, because of the shortcut in the io_getevents_0_4() code. After 4 times, Oracle executes io_getevents() with the timeout set to 600 seconds, which makes the call truly execute a system call. Please mind that ‘catch syscall’ triggers twice (as can be seen from the two ‘io_getevents-syscall’ lines in the above example, because gdb catches both the entry and the return of the system call), but in reality this is only 1 system call. This proves the working of the code in the io_getevents_0_4() function we looked into, and explains why I thought the strace utility lied.

For some time now, I have been using gdb to trace the inner workings of the Oracle database. The reason for using gdb instead of systemtap or Oracle’s port of DTrace is the lack of mature userspace tracing on the Linux versions I work with. I am doing this on Linux because most of my work happens on Linux.

In order to see with gdb the same information on Oracle’s system calls as strace shows, there is the Oracle debug info repository. This requires a bit of explanation. When strace is used on a process doing IO that Oracle executes asynchronously, the IO calls as seen with strace look something like this:

io_submit(140425370206208, 1, {{0x7fb7516c4bc0, 0, 0, 0, 257}}) = 1
io_getevents(140425370206208,1,128,{{0x7fb7516c45e8,0x7fb7516c45e8,106496,0}}, {600, 0}) = 1

This reveals exactly how Oracle used these calls. In case you wonder how to read them: Linux (as well as any other Unix-like operating system) provides man pages (manual pages) not only for the command line tools, but also for system calls, C library functions, and device and special files, among others. So if you wonder what the io_submit line means, type ‘man io_submit’, or, to be 100% sure you are looking at the manual page of the system call, type ‘man 2 io_submit’ to specify you want section 2: system calls.

When I use gdb, and break on io_submit and io_getevents, I get this information:

Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/libaio.so.1
Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/libaio.so.1
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()

I think everybody can spot that I get less information now. In fact, I now only know that the calls have happened; there is no additional information. In order to get part of the information back that was visible with strace, use the debuginfo package of libaio. The debuginfo package must match the version of the package it provides debug symbols for exactly, because it describes the executable or library based on physical code locations.
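On Oracle Linux and Red Hat style systems, installing such a debuginfo package usually looks something like this (a sketch; it assumes the debuginfo repository is enabled, and exact package and repository names can differ per distribution and version):

# install the debuginfo package matching the installed libaio version
yum install libaio-debuginfo
# or, using the helper from the yum-utils package:
debuginfo-install libaio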

Once the libaio-debuginfo package is installed, breaking on these libaio calls gives a great deal of information, resembling strace:

Breakpoint 1, io_submit (ctx=0x7ff8b626c000, nr=1, iocbs=0x7fffa5c31a80) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 2, io_getevents_0_4 (ctx=0x7ff8b626c000, min_nr=2, nr=128, events=0x7fffa5c37b68, timeout=0x7fffa5c38b70) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

This shows all the arguments which are used by the process that is traced with gdb. Please mind that gdb breaks on entering the call, so it doesn’t show a return code. And the return code of io_getevents() is what tells you the number of IOs that are ready, so that information is still not visible; it is visible with strace, which does provide the return code.

How about the Oracle userland calls? I use breaking on kslwtbctx() and kslwtectx() a lot, which indicate the start (kslwtbctx()) and end (kslwtectx()) of a wait event. When doing so, this is what it looks like:

Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/libaio.so.1
Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/libaio.so.1
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 4, 0x0000000007cf47b6 in kslwtbctx ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 5, 0x0000000007cfb4f2 in kslwtectx ()

Here we see the libaio functions again, together with the Oracle wait event functions. When looking at the calls this way, we can safely say that some calls are done outside of a wait, and one call is done inside of a wait. Because this measurement is done on a well-known piece of Oracle code (well known to me: executing a full table scan via direct path), I just know the wait is ‘direct path read’. But what if you do not know? Wouldn’t it be nice to know which wait event this is?

The simplest way to get more information on Oracle function calls would be to get the debug information for the Oracle database. However, since debug information references the source code, that will probably never be published. So, does that mean this is all we can get? No.

In order to get more information out of a function call, we need to dive a little deeper into the internals of Linux x86_64. When a function is called, the arguments are passed on via processor registers. This is implementation specific, and differs between 32-bit and 64-bit. An overview of how that works is summarised in this table. The important line is: “The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9, while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments. For system calls, R10 is used instead of RCX.”

So, this means that if I look at the CPU registers when breaking on a function, there might be something usable. I say “something usable” deliberately, because the Oracle function calls are not publicly documented (I think/hope they are inside Oracle development). I’ve done some investigation, and it turns out that at the END of a wait event, a few functions are called which have useful information stored in a CPU register:
a) First the function kslwtectx() is called to mark the ending of a wait event.
b) Then a function called kslwtrk_enter_wait_int is called, which stores the time the wait took in the register R13.
c) Next a function called kskthewt is called, which stores the number of the wait event (V$EVENT_NAME.EVENT#) in RSI.

If we combine that information in a little gdb macro, it looks like this:

break kslwtbctx
  commands
    silent
    printf "kslwtbctx\n"
    c
  end
break kslwtectx
  commands
    silent
    printf "kslwtectx -- "
    c
  end
break kslwtrk_enter_wait_int
  commands
    silent
    set $time=$r13
    c
  end
break kskthewt
  commands
    silent
    printf "wait: %d, time: %d\n", $rsi, $time
    c
  end

Put this in a text file, attach gdb to the process you want to trace, and load the file using gdb’s ‘source’ command.
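As an illustration, attaching to the checkpoint background process and loading the macro could look something like this (a sketch; the file name wait_events.gdb and the process name pattern ora_ckpt_SID are just examples, adjust them to your instance):

# attach gdb to the checkpoint process of instance SID
gdb -p $(pgrep -f ora_ckpt_SID)
# inside gdb: load the breakpoint commands and let the process continue
(gdb) source wait_events.gdb
(gdb) c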
This is what it looks like when you run it against a process (I attached it to the checkpoint process):

kslwtbctx
kslwtectx -- wait: 7, time: 2999054
kslwtbctx
kslwtectx -- wait: 81, time: 1979
kslwtbctx
kslwtectx -- wait: 81, time: 1050
kslwtbctx
kslwtectx -- wait: 81, time: 1216
kslwtbctx
kslwtectx -- wait: 81, time: 2031
kslwtbctx
kslwtectx -- wait: 83, time: 10443

If you want to learn more about this stuff, don’t forget I will be doing a hands-on session on using gdb as a pre-conference training during Collaborate 2014 in Las Vegas.

Recently I was involved in a project which required a lot of data to be extracted from Oracle. The size of the data was so huge that the filesystems filled up. Compressing the output afterwards (using tar with the j (bzip2) or z (gzip) option) is an obvious solution, but that can only be done after the files have been created. This is why I proposed compressing the output without it ever existing in uncompressed form.

This solution works with a so-called ‘named pipe’, something that I know for sure exists on Linux and the Unixes. A named pipe lets two processes transfer data between each other. This solution will look familiar to “older” Oracle DBAs: this is how exports made with the “original” export utility (exp) were compressed.

I’ve created a small script with sqlplus embedded in it, which executes sqlplus commands using a “here document”:

mknod /tmp/oracle.pipe p

sqlplus / as sysdba << _EOF
set escape on

host nohup gzip -c < /tmp/oracle.pipe > /tmp/out1.gz \&
spool /tmp/oracle.pipe
select * from dual;
spool off

host nohup gzip -c < /tmp/oracle.pipe > /tmp/out2.gz \&
spool /tmp/oracle.pipe
select * from dual;
spool off

_EOF

rm /tmp/oracle.pipe

First a pipe is created (mknod filename p). As far as I know, this command is the same on Linux and the Unixes. The pipe is removed in the last step of the script.

Inside the sqlplus script, I start the gzip operating system command using the ‘host’ command. The line with the host command starts gzip with the pipe as input and a .gz file in /tmp as output, and puts the process in the background using ‘&’.
Next, the sqlplus spool command directs output to the pipe, and I execute a dummy SQL (select * from dual).
With ‘spool off’, the output to the pipe is stopped. This makes the gzip process in the background stop too.
Because gzip is no longer compressing to the first file, the pipe can be used a second time, and more times of course.

The result is two gzipped files:

zcat -v /tmp/out*.gz
/tmp/out1.gz:	SQL> select * from dual;

D                                                                               
-                                                                               
X                                                                               

SQL> spool off
 84.2%
/tmp/out2.gz:	SQL> select * from dual;

D                                                                               
-                                                                               
X                                                                               

SQL> spool off
 84.2%
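The same mechanism can be demonstrated with plain shell commands, without sqlplus (a minimal sketch; the file names are arbitrary):

mkfifo /tmp/demo.pipe                       # equivalent to: mknod /tmp/demo.pipe p
gzip -c < /tmp/demo.pipe > /tmp/demo.gz &   # the reader compresses whatever comes out of the pipe
echo "this text never exists on disk in uncompressed form" > /tmp/demo.pipe
wait                                        # gzip finishes when the writer closes the pipe
zcat /tmp/demo.gz
rm /tmp/demo.pipe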

This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at measuring the kind of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server keeps statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking at the “devio_stats” dump, there are a few other things worth noticing. The lines with statistics do not have a timestamp or any other time indicator; it’s only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it contains the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). However, if you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot approximately every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of roughly 3 seconds.
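Because the source of these figures is /proc/diskstats, you can look at the same raw counters yourself; the cell essentially takes the difference between two snapshots like these (a sketch, using /dev/sda as an example):

# two snapshots of the raw kernel counters for sda, 3 seconds apart
grep ' sda ' /proc/diskstats; sleep 3; grep ' sda ' /proc/diskstats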

Two other things are listed in the statistics. One is ‘trigerConfine’ (which probably should be “triggerConfine”), a mechanism for Oracle to manage underperforming disks.
The other is “DMWG”. At this moment I only know that DMWG means “Disk Media Working Group”, and that it works with the concept of peers.

To get a better understanding of the difference between the ServiceTime and Latency columns, see this excellent writeup on IO statistics by Bart Sjerps. You can equate ServiceTime to svctm of iostat, or “storage wait” as Bart calls it, and Latency to await, or “host wait” as Bart calls it.

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata brings (part of the) processing potentially closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally can be flash.

But with Exadata, you either do normal, alias regular, IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads hopefully come from the flash cache, which can only be seen by looking at the statistic “cell flash cache read hits” in v$sysstat/v$sesstat, not directly in the IO related views. To understand the composition of the response time of a smartscan, there is even less instrumentation in the database (for background, look at this blogpost, where it is shown that the smartscan wait does not detail any of the steps done in a smartscan). In other words: if you experience performance differences on Exadata, and the waits point towards IO, there’s not much analysis which can be done to dig deeper.

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

- IO size by number of reads and writes
- IO size versus latency for reads and writes
- IO size versus pending IO count for reads and writes
- IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, cellcli provides the name of the trace file where the requested dump has been written. If we look inside this trace file, this is what an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd is that the name of the celldisk (CD_02_enkcel01 here) is printed below the first table (IO lengths) about this celldisk (!).

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan and requested the ‘iolstats’ dump again. This is what the statistics looked like:

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics had been reset (on the local cell/storage server only!), so the new dump only shows the IOs done since the reset. This makes diagnosing the physical IO subsystem of Exadata possible!

When you are administering one or more Exadatas, you probably have multiple databases running on different database or “computing” nodes. In order to understand what kind of IO you are doing, you can look inside the statistics of your database, and look in the data dictionary at what that instance or those instances (in case of RAC) have been doing. When using Exadata there is a near 100% chance you are using either normal or high ASM redundancy, and most people know about the “write amplification” this causes: the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy. This means there might be a difference between the IO you measure or think your database is doing, and what is actually done at the storage level.

But what if you want to know what is happening on the storage level, so on the level of the cell, or actually “cellsrv”, the process which makes IO flow to your databases? One option is to run “iostat -x”, but that gives a list which is quite hard to read (too many disk devices), and it doesn’t show you what the reason for the IO was: redo write? controlfile read? archivelog? Knowing the reason would especially be great if you want to understand what is happening when your IO behaves differently than you expect, and you’ve ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell. To generate a dump with an overview of what kind of IOs are done, use “cellcli” locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for finding your dump in the thread-logfile: when viewing the file with “less”, search (“/”) for the sequence number, in the above example: /sequence\ #18
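Alternatively, grep can point you at the right line directly (a sketch, using the trace file name echoed by cellcli above):

grep -n "Dump sequence #18" /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc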

This is the dump from a cell in the Enkitec lab, which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are numbers of IOs; they don’t say anything about the size of the IOs. Also please mind these are the numbers of a single cell; you probably have 3, 7 or 14 cells, and a way to trigger the dump on all of them at once is shown below.
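Triggering the dump on all cells from one of the database nodes can be done with dcli (a sketch; it assumes the common Exadata setup with a cell_group file listing all storage servers and passwordless ssh for the celladmin user, and relies on dcli’s -x option, which copies a file to the cells and, for a .scl extension, runs it through cellcli):

# dump_ioreasons.scl contains one line:
#   alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";
dcli -g cell_group -l celladmin -x dump_ioreasons.scl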

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these numbers may grow very big. In order to make an easy baseline, the IO reason numbers can be reset, so you can start off your test or proof-of-concept run and measure what actually has happened on the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.

PS1: Thanks to Nikolay Kovachev for pointing out the ‘ioreasons’ resetstats parameter. Indeed ‘all’ is way too blunt.
PS2: The IO numbers seem to be the number of IO requests the cell has gotten from its clients (ASM and database) for data, not for metadata. During a smartscan, metadata flows between the database and the cell server before data is actually served.

Exadata gets its performance by letting the storage (the Exadata storage servers) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage servers in query processing means that a storage grid can process a smart scan request massively parallel (depending on the number of storage servers participating).

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the ‘passthrough’ functionality in the storage server. In the Exadata book, it is explained that this functionality came with storage server version 11.2.2.3.0 and Oracle database version 11.2.0.3 with Exadata bundle patch 7. It also explains that the ‘passthrough’ functionality means that the storage server deliberately starts sending non-storage-processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data, and needs to do more processing. The database-side statistic to know this is happening is “cell physical IO bytes sent directly to DB node to balance CPU usage”, which is at the database level in v$sysstat and at the session level in v$sesstat.
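A quick way to check whether this has been happening on your system is to query that statistic, for example like this (a sketch; the statistic name is the one mentioned above):

sqlplus -s / as sysdba << _EOF
select name, value
from v\$sysstat
where name = 'cell physical IO bytes sent directly to DB node to balance CPU usage';
_EOF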

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few “underbar” or “undocumented” parameters. In order to get the current values of these Exadata storage server parameters, the following command can be used in the “cellcli”:

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following parameters:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50

The “MPP” function is responsible for the passthrough functionality. I can’t find anywhere what “MPP” means, my guess is “Measurement (of) Performance (for) Passthrough”. These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU; “2” appears to mean every 200ms.
_cell_mpp_threshold seems to be the CPU usage threshold (in percent) above which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (unsure what the exact granularity is) which are sent to the database in passthrough mode.

In order to get a good understanding of what MPP does, there is an MPP-specific dump which can be very beneficial for diagnosing MPP related matters. This dump is available on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header which is specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in that timestamp is 10:54:05.994, which is roughly 3 hours and 30 minutes earlier than the timestamp of the next messages, which is 14:36:04.910. The line with the three asterisks shows the creation timestamp of the tracefile, which is when the thread we are using was created. The next line, listing the LWPID, userId, etc., is created just after that time as well.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line contains important information: during the approximately 30 minutes covered by this dump, 4232 blocks were processed by this cell, and 0 blocks were “executed” in “passthru” mode.

Next, the measurements which were taken every 200ms in the past are printed, to give an exact overview of the measured CPU busyness and the rate at which “pushback”, alias “passthru”, was applied.

To see what this means, let’s generate CPU load on the cell, and see if we can get the storage server to invoke “passthru”. There is a simple trick to let a fake process take 100% of its CPU thread with common Linux shell tools: ‘yes > /dev/null &’. The storage server which I use has 16 CPU threads, so I start 16 of these processes, so that effectively every CPU thread is monopolised.
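Starting one ‘yes’ process per CPU thread, and cleaning them up again afterwards, can be done like this (a sketch; 16 matches the number of CPU threads of this particular storage server):

# start 16 CPU burners, one per CPU thread
for i in $(seq 1 16); do yes > /dev/null & done
# ... run the test ...
# and stop them again
killall yes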

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described above.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sent 2728 blocks out of the total of 13287 blocks in passthrough mode. Further, in the lines which contain the historical measurements, the “pushback rate” can be seen climbing up to 50%, because the CPU usage was above 90%.

Please mind the techniques I’ve described here are done on one storage server, while a normal Exadata setup has 3 (eighth/quarter rack), 7 (half rack) or 14 (full rack) storage servers.

Recently we upgraded an Exadata to the currently latest version, 11.2.3.2.0. The Exadata software itself consists of an image for the storage servers (the storage servers are essentially re-imaged), and a set of updates for the database/computing nodes, including: firmware for the ILOM (the lights-out management adapter), BIOS, LSI RAID adapter, Infiniband adapter, Linux kernel, drivers and mandatory packages, to name some.

One of the exceptional things this upgrade does is remove the hot-spare from the RAID set on the database/compute nodes. This is documented in MOS note 1468877.1, as ‘known issue 5: hotspare removed for compute nodes’. For some sites, this can actually be a good thing, if they are really tight on disk space on the compute nodes of Exadata. I must say that we have not encountered this situation. What this means is that the actual HDD configuration on the compute node is left to the customer, instead of having one mandatory configuration (3 disks in a RAID-5 configuration, and one hot-spare).

So if you decide to use the former hot-spare disk as an active part of the RAID configuration, you are effectively trading availability for diskspace. Please mind the RAID set itself already provides redundancy, even without the hot-spare!

On the other hand, I think in most configurations, it makes sense to convert the disk back to being hot-spare.

This is done in the following way:

a) Get an overview of the current disk configuration:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70

This shows the disk in slot number 3 being left “unconfigured”, but in “good” state (if a disk has gone bad because of errors, it will be removed from the RAID set and will show up as “unconfigured(bad)”!).
This is the state the upgrade to 11.2.3.2.0 leaves your system in.

Now let’s make the disk a hot-spare again!

b) Get the enclosure id:
/opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252

This means we know the enclosure id (252) and the slot number (3), which is the information needed for the MegaCli utility to revert the unconfigured disk to hot-spare again!

c) Revert the unconfigured disk back to hot-spare
/opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[252:3] -a0
Adapter: 0: Set Physical Drive at EnclId-252 SlotId-3 as Hot Spare Success.
Exit Code: 0x00

d) Check the disk configuration again:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Hotspare, Spun Up
Device Firmware Level: 0B70

It appears that the removal of the disk from the RAID set by the update to 11.2.3.2.0 generates an ASR message. At least it did at our site, despite this being an undocumented bug (7161023, ‘ASR generating false errors in relation to disks’) which is marked as resolved in 11.2.3.1.0 (?). Most sites I encounter have ASR set up, but do not have all messages sent additionally to local, onsite monitoring. I want to stress that it’s very important to have the ASR messages sent to your own monitoring too!

Oracle Support does not list all the details of an ASR message it has received. Instead, a Service Request is created with just enough information for Oracle itself (!!). In our case, the exact error message was NOT specified, only ‘compute server hard disk predictive failure’ and the node name.

Where do you look on an Exadata for that information? The first logical place is the ASR daemon. I didn’t spend too much time on it, but it seems to be more a proxy for messages than a database; I wasn’t able to find useful information about the systems which were using this daemon.

What are the sources for ASR with an Exadata? These are:

Computing node:
- “compmon daemon” / Linux level monitoring
- ILOM
Storage node:
- “cell daemon” / Linux level monitoring
- ILOM

For the computing node, it’s quite easy to see if there are any detected failed devices from the viewpoint of the ILOM:
(please mind ipmitool -I open only works on the local system)
# ipmitool -I open sunoem cli "show /SP/logs/event/list Severity==(Major,Critical,Down)"
Connected. Use ^D to exit.
-> show /SP/logs/event/list Severity==(Major,Critical,Down)

ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------

-> Session closed
Disconnected

This shows that no messages with the severity Major, Critical or Down are in the event log of the ILOM. Please mind that the logons to the ILOM have severity “Minor”. These are on most systems the vast majority of the messages, and they are not of interest for this investigation. If you want to know if something has failed, there is an even simpler command:
# ipmitool -I open sunoem cli "show faulty"

For the “compmon daemon”, grep the processlist for “compmon”:
# ps -ef | grep compmon
root 12812 1 0 Oct22 ? 00:00:11 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server

The most important part here is the directory: /opt/oracle.cellos/compmon
If you navigate to that directory, you will see a number of “state files”: asrs.state, traps.state and disks.state.
The disks.state file lists the disk status, including the firmware state, as shown in step a) above.
The most important file for the ASR message investigation is the traps.state file. This file lists the traps the daemon has sent to ASR. In our case:
1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:
Yes, this is pasted correctly: the Physicaldisk, Make Model and Disk Serial Number information is missing. It has not been omitted for safety reasons; it simply is not listed.
So, the failure which was sent was HALRT-02008 in our case.

For completeness: on the storage nodes, the ILOM layer can be investigated in exactly the same way as described for the computing layer. The Linux layer messages can be investigated with: # cellcli -e list alerthistory
32 2012-10-17T02:00:27+02:00 info "HDD disk controller battery on disk contoller at adapter 0 is going into a learn cycle. This is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours. The disk controller cache might go into WriteThrough caching mode during the learn cycle. Disk write throughput might be temporarily lower during this time. The message is informational only, no action is required."
33 2012-10-22T11:43:21+02:00 info "Factory defaults restored for Adapter 0"
34 2012-10-22T11:43:23+02:00 info "Factory defaults restored for Adapter 0"

An Oracle Exadata database machine consists of several parts: Intel based servers, Infiniband switches, a Cisco ethernet switch, a KVM switch and the hardware surrounding it, like cables, the 19″ rack and power distribution units. The Intel based servers are what “Exadata administrators” administer the most. The intention of this article is to let the reader gain a little more insight into Exadata specific administration of those servers.

Two server layers: computing and storage
The two layers have quite different properties: the Exadata computing layer is Linux with Oracle grid infrastructure and the Oracle database software installed, very much as you would do yourself (if you install it in a strict OFA way), and the storage layer is Linux too, but with specific Exadata storage software.

Computing layer
The computing layer filesystems use LVM on newer (X2) versions of Exadata, and plain partitions without LVM on the V2 version of Exadata. There are 3 filesystems on an X2 computing node:

[root@xxxxdb01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
30G 14G 15G 49% /
/dev/sda1 124M 16M 102M 14% /boot
/dev/mapper/VGExaDb-LVDbOra1
99G 29G 65G 32% /u01
tmpfs 81G 196M 81G 1% /dev/shm

These are two filesystems inside logical volumes, and the boot filesystem on a plain partition. The /boot partition contains the grub configuration, the kernel and the initrd (initial ramdisk). The root logical volume (LVDbSys1) contains the Linux operating system files, and the /u01 logical volume (LVDbOra1) contains the Oracle grid/cluster and database software.

A nice thing to know about the computing layer storage is that there is space left in the single volume group which is used for the logical volumes:

[root@emc1db01 ~]# vgdisplay
--- Volume group ---
VG Name VGExaDb
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 3
Max PV 0
Cur PV 1
Act PV 1
VG Size 557.62 GB
PE Size 4.00 MB
Total PE 142751
Alloc PE / Size 39424 / 154.00 GB
Free PE / Size 103327 / 403.62 GB <<-- 400GB space left!
VG UUID AZbMbf-papI-6FE2-hYlR-bzE1-G4pp-Bou81e
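Should you ever need that space, the free extents can be used to grow one of the existing logical volumes and its filesystem online, roughly like this (a sketch only; the size is arbitrary, the filesystem is assumed to be ext3/ext4, and deviating from the standard layout is a decision to take carefully):

# add 20GB to the /u01 logical volume and grow the filesystem on it
lvextend -L +20G /dev/VGExaDb/LVDbOra1
resize2fs /dev/mapper/VGExaDb-LVDbOra1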

The disk configuration on the computing layer is built upon a RAID 5 set of 3 disks with a hot spare. The RAID is hardware based (the disk controller is responsible for the RAID maintenance). This is visible through the MegaCli64 command:

[root@emc1db01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name : DBSYS
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 557.75 GB
State : Optimal
Stripe Size : 1.0 MB
Number Of Drives : 3
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disabled
Encryption Type : None

Number of Dedicated Hot Spares: 1
0 : EnclId - 252 SlotId - 3

Exit Code: 0x00

Let me emphasize that this RAID set of disks is only responsible for serving the operating system and accompanying files, and the grid/clusterware and database executables. All true database processing is done on the storage layer nodes.

Storage layer
The storage layer filesystems use neither LVM nor plain partitions (/dev/sd*, disks visible as SCSI devices to Linux) directly:

[root@xxxxcel01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md5 9.9G 3.4G 6.0G 37% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/md7 2.0G 623M 1.3G 33% /opt/oracle
/dev/md4 116M 37M 74M 34% /boot
/dev/md11 2.3G 222M 2.0G 10% /var/log/oracle

The devices used are Linux software RAID devices (‘md’ means multiple devices, a Linux device which is built from independent underlying devices). The reason for using Linux software RAID for the operating system and the Oracle storage software is to provide redundancy, because the underlying disks on the storage servers do not provide that. Redundancy/error protection for the database data on the storage servers is done using ASM normal redundancy, which is, put bluntly and simply, RAID 10 handled at the ASM layer.

A logical next question is: okay, what do these Linux software RAID/md devices look like? Well, the simplest way to see them is /proc/mdstat:

[root@emc1cel01 ~]# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[1] sda1[0]
120384 blocks [2/2] [UU]

md5 : active raid1 sdb5[1] sda5[0]
10482304 blocks [2/2] [UU]

md6 : active raid1 sdb6[1] sda6[0]
10482304 blocks [2/2] [UU]

md7 : active raid1 sdb7[1] sda7[0]
2096384 blocks [2/2] [UU]

md8 : active raid1 sdb8[1] sda8[0]
2096384 blocks [2/2] [UU]

md1 : active raid1 sdb10[1] sda10[0]
714752 blocks [2/2] [UU]

md11 : active raid1 sdb11[1] sda11[0]
2433728 blocks [2/2] [UU]

md2 : active raid1 sdb9[1] sda9[0]
2096384 blocks [2/2] [UU]

unused devices:

We see the md device name (md and a device number), the state (‘active’), the RAID type (‘raid1’) and the actual devices the md device is built from (sdb1[1] sda1[0] for example). On the next line are the number of blocks in the md device, the total number of disks and the number of active disks [2/2], and the status of the disks [UU]. “U” means up. A failed device (probably as a result of a failed disk, but it could also be the result of the clearing of a partition) is visible as “_”.
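More detail on a single md device, including the state of each underlying disk, can be obtained with mdadm (a sketch):

# show the members and their state for the root filesystem md device
mdadm --detail /dev/md5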

Okay, this is all strongly Linux related, but where’s the Exadata stuff?

Cellos
‘Cell’ is a name used for the storage server in the Exadata context. Both the computing nodes and the storage nodes have a directory ‘/opt/oracle.cellos’. This directory contains Oracle software which is embedded in Linux. The function of cellos is to maintain and check firmware versions, hardware configurations, and hardware activation/deactivation.

Validation functions are done by ‘cellos’, or rather by scripts inside the /opt/oracle.cellos directory, during the startup phase of Linux. One of the functions cellos can perform is flashing firmware. Strongly related to cellos are the Linux kernel and the OFED stack.

Imageinfo
The state of the cellos image is listed using the imageinfo utility (which exists both in /usr/local/bin and in /opt/oracle.cellos), which executes the /opt/oracle.cellos/patch/bin/imageinfo.sh script:

On a database/computing node:

[root@xxxxdb01 bin]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Image version: 11.2.2.2.0.101206.2
Image created: 2010-12-09 10:06:22 -0800
Image activated: 2011-01-28 14:19:59 -0800
Image image type: production
Image status: success
Internal version:
Image label: OSS_11.2.0.3.0_LINUX.X64_101206.2
Node type: COMPUTE
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

On a cell/storage node:

[root@xxxxcel01 bin]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Cell version: OSS_11.2.0.3.0_LINUX.X64_101206.2
Cell rpm version: cell-11.2.2.2.0_LINUX.X64_101206.2-1

Active image version: 11.2.2.2.0.101206.2
Active image created: 2010-12-09 09:37:46 -0800
Active image activated: 2011-01-28 14:33:44 -0800
Active image type: production
Active image status: success
Active internal version:
Active image label: OSS_11.2.0.3.0_LINUX.X64_101206.2
Active node type: STORAGE
Active system partition on device: /dev/md5
Active software partition on device: /dev/md7

In partition rollback: Impossible

Cell boot usb partition: /dev/sdac1
Cell boot usb version: 11.2.2.2.0.101206.2

Inactive image version: undefined
Rollback to the inactive partitions: Impossible

We see imageinfo knows whether it’s run on a computing or a storage node (‘Node type’ on the computing node, ‘Active node type’ on the storage node).

CheckHWnFWProfile
Another executable in /opt/oracle.cellos is ‘CheckHWnFWProfile’. The name of this executable is quite self-explanatory: it checks whether the configuration of the server it’s run on is valid, which means it is listed in one of the profiles (Exadata has different incarnations with different hardware, and Oracle calls each hardware and firmware combination a ‘profile’).

If it’s run without any option, it should return [SUCCESS], which means the hardware and firmware are correct for that version of cellos:

[root@xxxxdb01 ~]# /opt/oracle.cellos/CheckHWnFWProfile
[SUCCESS] The hardware and firmware profile matches one of the supported profiles

If you want more information about the server, you can list the configuration with the ‘-d’ option:

[root@xxxxdb01 ~]# /opt/oracle.cellos/CheckHWnFWProfile -d
[SystemManufacturer]
SUN MICROSYSTEMS
[SystemModel]
SUN FIRE X4170 M2 SERVER
[BIOSVendor]
American Megatrends Inc.
[BIOSVersion]
08040203
[BIOSDate]
09/14/2010
[ProcessorInformation]
Cores:24
Intel(R) Xeon(R) X5670 2.93GHz
step: 2
[ILOMVersion]
3.0.9.27.a r58740
[PhysicalMemoryGB]
96
[DiskControllerPCIeSlotNumber]
PCI-E Slot 0
[DiskControllerModel]
LSI MegaRAID SAS 9261-8i
[DiskControllerFirmwareRevision]
12.9.0-0049
[DiskControllerSilicon]
B4
[DiskControllerBatteryBackupUnitVersion]
iBBU08
[InfinibandHCAPCIeSlotNumber]
PCI-E Slot 2
[FansAndFanCount]
FM0 FM1 FM2 FM3 4
[PowersuppliesAndCount]
PS0:OK PS1:OK 2
[InfinibandHCAPCIeSlotWidth]
5Gbps,x8
[InfinibandHCADeviceID]
26428
[InfinibandHCAChipRevision]
B0
[InfinibandHCAImageType]
ConnectX
[InfinibandHCAFirmwareVersion]
2.7.8100
[SASBackplaneFirmwareVersion]
NotAvailable
[DiskControllerPCIeSlotWidth]
x8
[10GEtherModelPCISlot]
82599EB PCI-E Slot 1
82599EB PCI-E Slot 1
[InfinibandHCAHardwareRev]
50
[LightsOutFirmwareVersion]
Firmware Revision : 3.0
[Enclosure:Slot:DiskType:DiskMake:DiskModel:Diskfw]
ALL:ALL SAS SEAGATE ST930003SSUN300G 0B70

If ‘CheckHWnFWProfile’ results in an error, you can list your actual configuration with ‘-d’, and use the ‘-s’ option to list what the supported configuration for the profile is.

In my previous post I described how the IO resource manager (IORM) in the Exadata storage servers can be used not only to guarantee a minimum amount of IO a database can get (which is what is covered in most material available), but also to set a maximum amount of IO. This is what Oracle calls an inter-database resource manager plan. This is set and configured at the cell level using the cellcli with ‘alter iormplan dbplan’.

But Randy Johnson (who is one of the authors of the upcoming book ‘Expert Oracle Exadata’) commented this is not the only way to limit IO: this can be done using the database resource manager (DBRM) too. This means you can set a limit on the CPU resource usage on the database layer, which gets pushed to the storage layer and enforces IO limits according to the limit set on the database layer. Interesting!

Let’s try it!

Here’s a simple query without any resource management:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:00:44.00

Now create a simple resource manager plan, with one consumer group which is limited to 1%:

BEGIN
DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

DBMS_RESOURCE_MANAGER.CREATE_PLAN(
PLAN => 'TESTPLAN',
COMMENT => 'Testplan! Limit a user to 1 percent');

DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP (
CONSUMER_GROUP => 'LIMITED',
COMMENT => 'This is a limited group');

DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
PLAN => 'TESTPLAN',
GROUP_OR_SUBPLAN => 'LIMITED',
COMMENT => 'This group is limited to 1 percent',
MAX_UTILIZATION_LIMIT => 1);

DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
PLAN => 'TESTPLAN',
GROUP_OR_SUBPLAN => 'OTHER_GROUPS',
COMMENT => 'Mandatory group');

DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
(DBMS_RESOURCE_MANAGER.ORACLE_USER, 'CG', 'LIMITED');

DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

Mind the “MAX_UTILIZATION_LIMIT => 1”! Using this plan, no sessions in this database are limited, except for sessions of the Oracle user ‘CG’, which have a maximum utilization limit of 1 percent.

Now grant access to the resource group ‘LIMITED’ for the user ‘CG’:

BEGIN
DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group(
grantee_name => 'CG',
consumer_group => 'LIMITED',
grant_option => FALSE);
END;
/

And activate the resource manager plan:

ALTER SYSTEM SET RESOURCE_MANAGER_PLAN = 'TESTPLAN' sid='*';

And try the same simple SQL again:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:36:10.21

This is clearly limited!

The resource management is done on the cell/storage layer. This means the wait events do not indicate resource management (!). When looking at the waits, the process is mostly waiting on the ‘cell smart table scan’ event.
