Archive

Linux

This post really is about using LVM (Logical Volume Manager; an abstraction layer for disk devices) snapshots. A snapshot is a frozen image of a logical volume, which simply means “filesystem”. It’s not really “frozen”, LVM2 snapshots are read/write by default. But you can freeze a filesystem in time with a LVM snapshot.

The background of this really is Exadata (computing node) and upgrading, but has nothing unique to Exadata. So don’t let this bother you. But the idea of using LVM snapshots popped up when dealing with Exadata computing nodes and upgrades.

First of all: LVM is in development, which mean different Linux versions have different LVM options available to them. I am using the Exadata X2 Linux version: RHEL/OL 5u7 x86_64. I guess OL6 has more and more advanced features inside LVM, but with X2, OL5u7 is what I have to use. So the steps in this blogpost are done with this version. Any comments are welcome!

Second: if you want to experiment with this: most people allocate all space in the volume group to logical volumes upfront. A snapshot is a COW (Copy on Write) copy of a logical volume. This means a snapshot starts off with zero extra bytes (source and snapshot are equal), and grows as the source gets modified. This means you need to have free/available space in the volume group to facilitate the snapshot.

Then there is another caveat: the ‘/boot’ filesystem cannot be in LVM, so is a normal partition on most systems (also on Exadata). This means snapshots do not help if you want a backup of that filesystem. You need to use another trick.

Okay, here we go: you have a large modification upcoming and want to be able to restore your system to this moment in time.

1. Backup /boot filesystem

[root@localhost ~]# df /boot 
Filesystem           1K-blocks  Used Available Use%  
Mounted on /dev/sda1    248895 24037    212008  11% /boot
[root@localhost ~]# umount /boot
[root@localhost ~]# dd if=/dev/sda1 of=dev_sda1_backup
[root@localhost ~]# mount /boot

2. Create snapshot of logical volumes
In this example it’s only the root filesystem (which is a bit special, because this filesystem is set in grub.conf for the bootloader, and in /etc/fstab).

[root@localhost ~]# lvdisplay -v /dev/vg00/lv_root 
Using logical volume(s) on command line 
/dev/hdc: open failed: No medium found 
--- Logical volume --- 
LV Name /dev/vg00/lv_root 
VG Name vg00 
LV UUID wutwln-ffdB-QRlg-1LgL-XqKB-glvn-OpCowW 
LV Write Access read/write 
LV Status available # open 1 
LV Size 3.91 GB 
Current LE 125 
Segments 1 
Allocation inherit 
Read ahead sectors auto 
- currently set to 256 
Block device 253:0 

lvdisplay displays the properties of the logical volume to be snapshotted, I use this to spot the number of LE’s.

[root@localhost ~]# lvcreate -l 125 -s /dev/vg00/lv_root -n lv_root_snap 
/dev/hdc: open failed: No medium found 
Logical volume "lv_root_snap" created

The system now has been backed up in a way that we can revert to this situation.

Next would be to do the O/S upgrade, software upgrade or whatever. It goes horribly wrong, and you need to restore the system to the previous situation.

1. Rename current logical volume
In this case, I rename my lv_root logical volume to lv_root_old:

[root@localhost ~]# lvrename /dev/vg00/lv_root /dev/vg00/lv_root_old
/dev/hdc: open failed: No medium found 
Renamed "lv_root" to "lv_root_old" in volume group "vg00"

The logical volume (which we currently use) has been renamed.

2. Create new lv_root
This is the logical volume we are going to use to put the snapshot in.

[root@localhost ~]# lvcreate -l 125 -n lv_root /dev/vg00
Logical volume "lv_root" created

3. Populate the new lv_root with the snapshot contents

[root@localhost ~]# dd if=/dev/vg00/lv_root_snap of=/dev/vg00/lv_root 	
8192000+0 records in 
8192000+0 records out 
4194304000 bytes (4.2 GB) copied, 93.0858 seconds, 45.1 MB/s

4. Restore the /boot filesystem

[root@localhost ~]# umount /boot
[root@localhost ~]# dd if=dev_sda1_backup of=/dev/sda1
[root@localhost ~]# mount /boot

5. Reboot

[root@localhost ~]# shutdown -r now

This sequence of events enabled me to restore my system to a post-modification situation. Of course you should test this very thoroughly for your own situation, but this offers an elegant way, which has little external dependencies.

Post restore/cleaning up:
When the system is reverted to its old situation, we are left with a logical volume and a snapshot which probably are not of use anymore. These can be cleaned up the following way:

[root@localhost ~]# lvremove /dev/vg00/lv_root_old 
Do you really want to remove active logical volume lv_root_snap? [y/n]: y
Logical volume "lv_root_snap" successfully removed 
Do you really want to remove active logical volume lv_root_old? [y/n]: y 
Logical volume "lv_root_old" successfully removed

When you use Oracle ASM (Automatic Storage Management) for your database, the permissions on the block devices on the operating system layer which are used by ASM need to be changed. To be more precise, the owner and group need to be set to ‘oracle’ and ‘dba’ (Oracle documentation) in my case.

I used to do this in a very lazy way, using a simple ‘/bin/chown oracle.dba /dev/sdb’ in /etc/rc.local. This worked for me with RHEL/OL version 5. This has changed with RHEL/OL 6, because the system V startup system has changed to ‘upstart’. Also, the disk devices change ownership back in OL6 if you set it by hand to oracle.dba.

Oracle has its own resolution for generating block device files especially for ASM, called ASMLib. This generally works well, but what is exactly happening to generate the devices, and how to troubleshoot the generation of the devices is (as far as I know) not documented nor generally known. I’ve been bitten a few times on my test systems, which is the reason I don’t like ASMLib, and do not feel comfortable with ASMLib.

The linux system has functionality which equals the block device generation of ASMLib, which is called udev. Udev isn’t exactly over-documented. Also, there seems to be a difference between the tools (scsi_id and rule file syntax) in RHEL/OL 5 and RHEL/OL 6. But there is much written about it. And it’s very simple, once you know what to do.

I have a separate (virtual) disk which I want to use with ASM. After startup, the block device of this disk is called /dev/sdb. In order to let udev identify this disk, the linux utility ‘scsi_id’ can be used to fetch the unique SCSI identifier of the disk. This is of importance, because the correct disk must be specified to be used with ASM. This sounds a bit dumb (of course we need the correct disk), but Linux names the devices in the order at which they are seen by the kernel. This means it’s possible that /dev/sdb is another disk if someone decides to add a disk to the system, and reboots the server. (please mind most servers do not change very much, nor I have seen this happening)

And there’s another complication: VMWare doesn’t provide a unique device number by default. In order to let VMWare provide Linux a unique number, ‘disk.EnableUUID = “TRUE”‘ must be added to the virtual machine configuration file (“vmx file”). After a reboot, ‘scsi_id’ is able to provide a serial!

So at this point, you should have a disk or disks to be used with ASM. First grab the unique serial of the disk (this is the RHEL6/OL6 syntax of the scsi_id command!):

# scsi_id -g -u -d /dev/sdb
36000c29127c3ae0670242b058e863393

Next, we need to make a udev rule file. In this file, we specify this SCSI id (which we fetched with scsi_id), and specify how we want that device created:

# cd /etc/udev/rules.d/
# cat 99-oracle-asmdevices.rules
KERNEL=="sd*", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="36000c29127c3ae0670242b058e863393", NAME+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"

(There is no ’99-oracle-asmdevices.rules’ file by default, you need to create it yourself)

That’s all. When the kernel finds a “sd*” device, subsystem “block”, with the serial “36000c29127c3ae0670242b058e863393″, it creates a device file ‘/dev/oracleasm/disk1′, owner oracle, etc.

This is also convenient because the disk string of ASM is very easy to set now: ‘/dev/oracleasm/*’.

Please mind if the block devices have the correct permissions, it doesn’t matter to ASM if the device file names have swapped. The ASM instance scans the device headers for it’s contents.

Some of you might have experimented with, or used Oprofile, ltrace/strace, Systemtap, and wished you could look further into the userland process, like dtrace does, because dtrace allows you to profile and look into a process’ function calls.

If you’ve used oprofile/strace/systemtap/etc., you probably found out you can see all the kernel functions which are processed, but it does not get into userspace, or with very limited information. Perhaps the only tool which is different is gdb, which enables you to see userspace function names, but gdb is not a profiler, it’s a debugger. And it works best with (I assume it made for) debugging symbols in the executable. Most (all I presume) commercial executables, like the oracle database executable, do not have debugging symbols.

Some time ago, a tweet from Kevin Closson caught my attention, which showed a profile of a Linux system running Oracle, which showed the functions inside Oracle. This totally shocked me! That is something I want to be able to do on Linux too!!

It was done using the perf utility. The perf utility is delivered with the kernel (rpm) package on EL Linuxes, and the since I am using Oracle Linux, I can tell you it’s delivered with the ‘UEK’ kernels (the 2.6.32+ ones). I’ve eventually used the latest version of the UEK kernel (2.6.39-200.24.1.el6uek.x86_64), which you get when you subscribe to the ol6_UEK_latest yum channel. Newer versions of the kernel have more perf functions.

The ‘perf top’ sub-utility shows the kernel and userland functions which are active (by sampling active processes). It can be used in a variety of ways, some of them are: system-wide, per processor or per process. This allows you to do look beyond the Oracle wait interface. As Tanel Poder often states in his performance classes: the Oracle wait interface is a good starting point for performance investigation, but sometimes you need to look deeper. You always need to use the appropriate layer to look at when investigating issues.

Note when using virtualization: virtualization does not allow direct access to the performance monitoring hardware. So if you try to use perf on that (or oprofile for that matter), you get no output. You can get around that with perf by using a software event. I use the software event cpu-clock. This means I invoke perf with the option ‘-e cpu-clock’

For example: this statement uses a lot of CPU, and might (depending on Oracle version, operating system type and version) never return an answer. But it is perfect to have a statement that consumes approximately 1 CPU, and keeps doing that for some time:

select count(*) from dba_extents, dba_extents, dba_extents

A quick glimpse on the statistics profile using sql_monitor reveals:

Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets	|
=========================================
|     109 |	103 |	  6.23 |   9755 |
=========================================

Of course I could look in the sql monitoring plan details, which would reveal what row sources are active at the moment, but let’s see what ‘perf top -e cpu-clock -p shows:

   PerfTop:     108 irqs/sec  kernel: 0.9%  exact:  0.0% [1000Hz cpu-clock],  (target_pid: 12727)
--------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ ____________________________________________________

             1414.00 58.8% kxhrHash                  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              389.00 16.2% qerhjSplitProbe           /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              263.00 10.9% rworofprVeryFastUnpackRow /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              122.00  5.1% smbget                    /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               93.00  3.9% sorgetqbf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               86.00  3.6% qersoFetch                /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               31.00  1.3% qerhjInnerProbeHashTable  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

This shows 58.8% of the samples takes of this process are busy in the function ‘kxhrHash’. So it’s the hashing which this process is busy with. Another important and interesting aspect is: there is very little time spend in kernel mode: (kernel: 0.9%). But…the information about where the time is spend could have been read in the sql monitoring report: it reports CPU and waits per row source.

Where perf comes in to fulfill something which could not be done different, is when you got multiple processes, and you cannot setup perf for every distinct process, or you want to understand system-wide performance. To mimic a system doing OLTP, I’ve setup Kevin Closson’s SLOB in PIO mode, and use perf top -e cpu-clock without specifying a process ID, so I measure the entire system. I get the following profile:

  PerfTop:     984 irqs/sec  kernel:92.4%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
--------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                           DSO
             _______ _____ __________________________________ ______________________________________________________________________

            13102.00 88.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]                                                     
              772.00  5.2% copy_page                          [kernel.kallsyms]                                                     
              298.00  2.0% __do_softirq                       [kernel.kallsyms]                                                     
              114.00  0.8% scsi_request_fn                    [kernel.kallsyms]                                                     
               29.00  0.2% kcbgtcr                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               25.00  0.2% kmem_cache_alloc                   [kernel.kallsyms]                                                     
               25.00  0.2% finish_task_switch                 [kernel.kallsyms]                                                     
               22.00  0.1% kcbzib                             /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               17.00  0.1% sxorchk                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Conclusion: the majority of the processes spend time in kernel calls (kernel.kallsyms). This is not too surprising: I've setup SLOB in 'PIO mode', which means I've reduced the Oracle database buffer cache as much as possible. With Oracle version 11.2.0.3, the smallest buffer cache I can create is 4MB. Next I've let 10 readers do index range scans (that is one of the functions of SLOB: swamping your system with single block (8KB) IO's). The majority of the processes are spending time in the kernel function _raw_spin_unlock_irqrestore. This is how a system which is swamped with physical IO looks like with perf.

This all changes when I revert SLOB to LIO mode, and issue the same test (SLOB run with 10 readers):

   PerfTop:    1004 irqs/sec  kernel: 3.1%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________________________________________

              478.00 23.9% kcbgtcr                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              154.00  7.7% qetlbr                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              149.00  7.4% kcbz_fp_buf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              133.00  6.6% kdsgrp                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              104.00  5.2% kafger                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               94.00  4.7% kcbrls                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrgcm                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrget3                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Ah, we now spend only 3.1% in kernel mode, and the majority of the time is spend in kcbgtcr (quite probably meaning 'kernel cache buffer get consistent read')! I've modified the buffer cache to be able to contain all blocks (also known as SLOB 'LIO mode'), so once they are read, only a logical IO is needed to fetch the block.

Every now and then I am asked about the availability of the presentations I have delivered. Recently somebody asked about a presentation I delivered at the OUG Scotland about multiblock reads, and I promised to make it available. I’ve now uploaded a PDF version of all my old presentations them and put them in the ‘Whitepapers and presentation’ section.

Enjoy!

So, you’ve just gotten a fresh installed Linux system with Oracle Linux or Redhat Linux from the sysadmin. And with Oracle Linux you can not use the internet (forbidden by company laws is a common one), or you got Redhat Linux and can not use up2date for some reason. Most of the time, when installing Oracle products I am allowed to use the root account myself during the install. The DVD most of the time is still present in the drive.

You could mount the DVD and use ‘rpm’ directly to install packages off the DVD. If you get an error the rpm package has a dependency, you resolve the dependency, if that depended package has a dependency itself, you resolve that, etc. That’s something you could do. But there is an easier way!

Mounting the DVD

The first task is to mount the DVD again. The installation procedure does not add an entry in /etc/fstab to mount the DVD in an easy way, so I do that myself. The installer makes a generic device in /dev for the DVD, /dev/dvd, which is a symbolic link to the device that truly is the DVD drive. This is how the line for /etc/fstab looks like:

/dev/dvd                /media                  udf,iso9660 noauto,user,ro              0 0

First column is the device, second column is the directory the device get mounted to, the third column are the filesystem types, the fourth column are the mount options (noauto: the DVD does not get mounted automatically, it needs to be mounted explicitly, user: a regular user is allowed to mount the device, ro: readonly).
Next we need to mount the dvd using the command:

mount /media

It will briefly wait before returning the cursor, during which it mounts the DVD to the /media directory, and can be checked by looking at all the mounts:

# mount
/dev/mapper/vg00-lvroot on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/cciss/c0d0p1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sr0 on /media type iso9660 (ro,noexec,nosuid,nodev)

Setup the repository information

Okay, now that we have the DVD mounted, let’s create a DVD based yum repository. The repositories are listed in the directory /etc/yum.repos.d. Repositories can be grouped in files. A default installed system lists a few repositories:

# ls /etc/yum.repos.d
redhat.repo  rhel-debuginfo.repo  rhel-source.repo

Redhat.repo does not contain any repository (the remarks in the file suggest that a subscription manager probably will use this file), rhel-debuginfo.repo has two repositories in it with the debug-info packages (binaries with the debugging information in the binary, so you have all the debug information to get full information with for example gdb; normal binaries are ‘stripped’), rhel-source.repo contains a repository with all the source rpm’s in it. All these repositories are disabled (enabled=0).

To get yum use the DVD as a repository, add a file ‘rhel-dvd.repo’ in /etc/yum.repos.d with the following content:

[dvd]
name=Red Hat Enterprise Linux Installation DVD
baseurl=file:///media/Server
enabled=0

The added repository is named ‘dvd’, and is disabled. In order to use this repository, add ‘–enablerepo=dvd’ to use it. This way it doesn’t interfere with anything, like someone adding a web-based (“real”) repository.

Using the repository

Now you can add packages, which dependencies are resolved by yum using:

# yum install --enablerepo=dvd packagename

just like when yum is setup with a “real” web-based repository.

Another extremely handy feature of yum is searching for a specific file in the repository. For example, if you want to send files as attachment on linux, one way of attaching files is using the ‘uuencode’ executable to generate a file as mail attachment. ‘uuencode’ is not installed by default, nor a package with that name exists. To search in the yum repository for a file called ‘uuencode’, enter the following:

# yum provides --enablerepo=dvd */uuencode
Loaded plugins: product-id, security, subscription-manager
Updating Red Hat repositories.
sharutils-4.6.1-2.x86_64 : The GNU shar utilities for packaging and unpackaging shell archives.
Repo        : dvd
Matched from:
Filename    : /usr/bin/uuencode

This shows the package ‘sharutils’ contains a file ‘/usr/bin/uuencode’.

May 4th: some updates after discussion with Jeff Holt, Cary Millsap and Ron Christo of Method-R.

There’s all the documentation, and there all the terrific blogs about performance and Oracle wait events. But if you more or less start with this stuff, or encounter a wait event that is not (extensive enough) documented, or an event turns up and gives a response time you don’t expect, you need to understand what that wait event means. If you don’t know, it’s easy to get stuck at this point.

If you are familiar with waits, and just want to dig further, progress to “Get to know wait events”, if you want to get up speed with waits, read on.

What is a wait event?

This is the definition of the performance tuning guide in the Oracle documentation:

Statistics that are incremented by a server process/thread to indicate that it had to wait for an event to complete before being able to continue processing. Wait events are one of the first places for investigation when performing reactive performance tuning.

This description is quite good I think. If I would have to make a better description, I would say a wait event is essentially the registration of time spend on executing a part of the Oracle database server code, mostly doing something elementary. The granularity of wait events can differ between Oracle versions. Wait events can describe interaction with the operating system (reading or writing are good examples of this), but also interaction between processes (enqueue’s and ipc messaging are good examples of that). Needless to say, both interactions require operating system calls.

Also, a wait event is not necessary a bad thing. Every Oracle database process is either running on CPU or waiting. If your SQL*Plus session is not running, it is waiting for ‘SQL*Net message from client’, which means the database tries to tell you that it waits for the next instruction. Another interesting thing to note about wait events is that the execution of certain instrumented (covered by a wait event) parts if Oracle server code can not report a wait per kernel developer decision. This means, for example, that if doing a latch get read went efficiently, you will not get a ‘latch free’ or ‘latch: latch name‘ wait event, because the wait registration would take exponentially more time than the actual latch get.

How to get the wait events during execution

(skip this section if you are familiar with enabling and disabling sql trace at level 8)

The way to get all the waits of the execution of one or more SQL statements for a given process is to enable SQL trace, alias event 10046, at level 8. This can be done at the system level (meaning all processes) too, but unless you want to reverse engineer the interaction between ALL the process, this is not very useful. There much written about all the ways you can enable SQL trace at level 8. For the sake of simplicity, I use setting event 10046 at level 8 in the session itself. The session requires ‘alter session’ rights to be able to set it. This is how it’s done:

SQL> alter session set events '10046 trace name context forever, level 8';
Session altered.

And can be turned off the following way:

SQL> alter session set events '10046 trace name context off';
Session altered.

Get to know wait events: strace

At this point we know how to enable tracing, but want to dig deeper and understand the operating system interaction inside the waits.

To see what a process (for example your Oracle foreground process) is doing from the perspective of the Operating System, there’s the ‘strace’ executable. The ‘strace’ executable can display all the system calls an operating system process is executing if it has enough privileges to access it. It resembles SQL trace a bit, but at the Operating System/Linux layer. First, you determine the process ID of the session you want to use ‘strace’ on, then use another process to execute ‘strace’. This is what it looks like:

1. Setup a sqlplus session
$ sqlplus ts/ts@v11203

SQL*Plus: Release 11.2.0.3.0 Production on Wed Apr 25 23:03:22 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

TS@v11203 >

2. Attach to the sqlplus process with strace in a session on the linux host where the database is running:
$ strace -p 14128
Process 14128 attached - interrupt to quit
read(14,

3. Now execute something in the sqlplus session, for example ‘select * from dual’:
TS@v11203 > select * from dual;

D
-
X

TS@v11203 >

4. This results in the following list of system calls:

read(14, "\1/\6\21i \376\377\377\377\377\377\377\377\2\2"..., 8208) = 303
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372954
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\1r\6\20\27\2\374T\rD@\255\262t\t\313\242\1\247-8x"..., 370) = 370
read(14, "\25\6\3\5\"\2\17", 8208) = 21
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\254\6\4\1 \1\1{\5\2"..., 172) = 172
read(14, "\1\337\6\3^#)\4\4\376\377\377\377\377\377\377\3774"..., 8208) = 479
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "010\6\v\1\5\314\2\1p\5\207y"..., 304) = 304
read(14,

Wow! Nice! But what does it mean? Well, these are all the system calls used by the Oracle foreground process in order to execute the (simple) SQL. It’s hard to tell something about the execution of the SQL with this list. And be aware, this is only a ‘select * from dual’, so a ‘real’ SQL will probably give a list so long you do not dare to begin studying the list to get something useful out of it.

Well, how about the next chapter:

Get to know wait events: combining waits and strace

How about combining the sql trace output with the system calls? The ‘strace’ executable has quite some options, which can be very helpful:

- list the data written by the process (‘-e write=all’)
This option lists the data written by the session. The trace file is written by the server process, so it will list the contents of that.
- list a timestamp for every system call (‘-tt’)
- list the time spend doing the system call (‘-T’)
- save the output to a file (‘-o /path/to/file.txt’)

These are all helpful to get more information the system calls.

Now setup a session in the database, enable SQL trace at level 8, fetch the process ID of session and run ‘strace -ttT -e write=all -p PID’ in another session and run something. Yes, this generates a massive amount of output. But, as I use it, it’s to investigate what certain waits are doing. Here are some examples:


23:46:20.145793 pread(257, "\6\242\202\3\1\356O\6\2049\245\2\3226\264O\6"..., 1032192, 1996505088) = 1032192
23:46:20.148986 lseek(8, 0, SEEK_CUR) = 46126
23:46:20.149030 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 31 39 38 20 66 69 ad' ela= 3198 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 34 le#=5 bl ock#=664 |
| 00050 33 34 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 34 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 34 38 39 36 31 53903801 48961 |
23:46:20.149121 write(9, "J?e83~1z1\n", 10) = 10
| 00000 4a 3f 65 38 33 7e 31 7a 31 0a J?e83~1z 1. |
23:46:20.149163 write(8, "\n", 1) = 1
| 00000 0a . |
23:46:20.157813 pread(256, "\6\242\2\4\1\360O\6\2\4\217T\2\3226\264O\6"..., 1032192, 2000699392) = 1032192
23:46:20.161124 lseek(8, 0, SEEK_CUR) = 46252
23:46:20.161169 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 33 31 37 20 66 69 ad' ela= 3317 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 35 le#=5 bl ock#=665 |
| 00050 36 32 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 62 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 36 31 31 30 30 53903801 61100 |
23:46:20.161260 write(9, "J?Hv2~1z1\n", 10) = 10
| 00000 4a 3f 48 76 32 7e 31 7a 31 0a J?Hv2~1z 1. |
23:46:20.161624 lseek(8, 0, SEEK_CUR) = 46377
23:46:20.161659 write(8, "\n", 1) = 1
| 00000 0a

The pread() call is the actual IO call done by the database.
The lseek() call is the seek towards the end of the trace file, and sets the pointer for the upcoming write.
The write() call is the write in the trace file. You see the multiblock read (scattered read in the trace file) was done using a pread() call, and the time measured by strace () is line with the time measured by Oracle (ela= 3317), and the number of blocks in the trace (126) matches the number of bytes read 1032192 (126*8192).
The time spend in the system calls is between the ‘smaller’ and ‘bigger’ signs, and get removed by wordpress, so if you where searching for it in the above example, you wouldn’t find it
The write() to file descriptor 9 with the garbage in it is the write to the .trm file.
The lseek() sets the pointer in the file after the writes, and a write() is done to enter a carriage return (\n) to the tracefile.
This is a simple example, to show how a synchronous multiblock read is reported, let’s see a a-synchronous one!


00:17:27.557053 pread(256, "#\242\202\251\243O\6\1\4[\4"..., 8192, 1190150144) = 8192
00:17:27.558973 lseek(8, 0, SEEK_CUR) = 2659
00:17:27.559009 write(8, "WAIT #140244789954888: nam='db f"..., 124) = 124
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 62 20 66 54888: n am='db f |
| 00020 69 6c 65 20 73 65 71 75 65 6e 74 69 61 6c 20 72 ile sequ ential r |
| 00030 65 61 64 27 20 65 6c 61 3d 20 31 39 30 33 20 66 ead' ela = 1903 f |
| 00040 69 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 34 33 ile#=5 b lock#=43 |
| 00050 33 39 34 20 62 6c 6f 63 6b 73 3d 31 20 6f 62 6a 394 bloc ks=1 obj |
| 00060 23 3d 31 34 30 33 33 20 74 69 6d 3d 31 33 33 35 #=14033 tim=1335 |
| 00070 33 39 32 32 34 37 35 35 38 39 34 33 39224755 8943 |
00:17:27.559088 write(9, "N?Rb~22y1\n", 10) = 10
| 00000 4e 3f 52 62 7e 32 32 79 31 0a N?Rb~22y 1. |
00:17:27.559128 write(8, "\n", 1) = 1
| 00000 0a . |
00:17:27.559334 mmap(0x7f8d48e41000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48e41000
00:17:27.559450 mmap(NULL, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x1ec000) = 0x7f8d48c81000
00:17:27.559491 mmap(0x7f8d48c81000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48c81000
00:17:27.559576 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x2fc000) = 0x7f8d48a81000
00:17:27.559614 mmap(0x7f8d48a81000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48a81000
00:17:27.559834 io_submit(140244822339584, 1, {{0x7f8d48a861f8, 0, 0, 0, 256}}) = 1
00:17:27.560008 mmap(0x7f8d48ad1000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48ad1000
00:17:27.560080 io_submit(140244822339584, 1, {{0x7f8d48a86450, 0, 0, 0, 256}}) = 1
00:17:27.560272 io_getevents(140244822339584, 1, 128, {{0x7f8d48a86450, 0x7f8d48a86450, 122880, 0}}, {600, 0}) = 1
00:17:27.560363 io_getevents(140244822339584, 1, 128, {{0x7f8d48a861f8, 0x7f8d48a861f8, 106496, 0}}, {600, 0}) = 1
00:17:27.560865 lseek(8, 0, SEEK_CUR) = 2784
00:17:27.560902 write(8, "WAIT #140244789954888: nam='dire"..., 129) = 129
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 69 72 65 54888: n am='dire |
| 00020 63 74 20 70 61 74 68 20 72 65 61 64 27 20 65 6c ct path read' el |
| 00030 61 3d 20 35 38 36 20 66 69 6c 65 20 6e 75 6d 62 a= 586 f ile numb |
| 00040 65 72 3d 35 20 66 69 72 73 74 20 64 62 61 3d 34 er=5 fir st dba=4 |
| 00050 33 33 39 35 20 62 6c 6f 63 6b 20 63 6e 74 3d 31 3395 blo ck cnt=1 |
| 00060 33 20 6f 62 6a 23 3d 31 34 30 33 33 20 74 69 6d 3 obj#=1 4033 tim |
| 00070 3d 31 33 33 35 33 39 32 32 34 37 35 36 30 38 34 =1335392 24756084 |
| 00080 35 5 |

This advanced example shows how first the segment header is fetched using the ‘db file sequential read’ wait. Despite being on ASM (this is ASM!), this clearly shows the segment header is read synchronously using the pread() call.
The next system calls and waits are even more interesting!
First, memory is allocated using the mmap() call. This is the beginning of a full scan, and the stuff read from disk need to be buffered somewhere. Please mind the memory allocation is not part of any wait. That is no problem, the allocation of memory should be very fast.
Next, TWO asynchronous reads are initiated using the two io_submit() calls, mind the iocb (IO control block) addresses, 0x7f8d48a861f8 and 0x7f8d48a86450.
See the man page of the io_submit call on your linux system using ‘man io_submit’, also use the man page of io_getevents to learn more about that call
Next, the asynchronous reads are reaped using the io_getevents() calls. What is noteworthy here is, is that the two requests are reaped in REVERSE order (this is consistent for the first two asynchronous call pairs for every segment, this example is Oracle version 11.2.0.3). So first request 0x7f8d48a86450 is fetched, then 0x7f8d48a861f8.
After 0x7f8d48a861f8 is reaped, the wait is reported for the FIRST IO call (io_submit() & io_getevents() combination). The second request (0x7f8d48a86450) of the start of a full segment scan is not reported. Never. Further requests do not reverse reaping/’encapsulate’ IO requests (mostly, it can happen a few times more).

I hope you get the idea of digging deeper into waits by examining the system calls.

I am not sure if the above ‘encapsulation’ of IO calls is a bug. But think of this: if you issue multiple IO requests (like we saw above, multiple IO requests are issued!) you cannot report the waits of all these distinct IO requests, because you would get more time reported than has passed. This has some consequences for the above ‘direct path read’ waits: these do not report individual IO times (like ‘db file sequential read’ does). This is just an indication on time spend in the process it had to wait for results from its IO requests, rather than disk IO latency.

The above conclusions are the result of observation and logical reasoning from my side. If you have anything to add, please respond!

An Oracle Exadata database machine consists of several parts: Intel based servers, infiniband switches, a cisco ethernet switch, a KVM switch and the hardware surrounding it like cables, 19″ rack, power distribution units. The Intel based servers are what “Exadata administrators” are administering the most. The intention of this article is to let the reader gain a little more insight into Exadata specific administration on those.

Two server layers: computing and storage
The two layers have quite different properties: the Exadata computing layer is Linux with Oracle grid infrastructure and the Oracle database software installed, very much as you would do yourself (if you install it in a strict OFA way), and the storage layer is Linux too, but with specific Exadata storage software.

Computing layer
The computing layer filesystems are using LVM on newer (X2) versions of Exadata, and using plain partitions without LVM on the V2 version of Exadata. There are 3 filesystems on a X2 computing node:

[root@xxxxdb01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
30G 14G 15G 49% /
/dev/sda1 124M 16M 102M 14% /boot
/dev/mapper/VGExaDb-LVDbOra1
99G 29G 65G 32% /u01
tmpfs 81G 196M 81G 1% /dev/shm

These are two filesystems inside logical volume, and the boot partition on a plain partition. The /boot partition consists of the grub configuration, kernel and initrd (initial ramdisk). The root logical volume (LVDbSys1) contains the linux operating system files, and the /u01 logical volume (LVDbOra1) contains the Oracle grid/cluster and database software.

A nice thing to know about the computing layer storage is that there is space left in the single volume group which is used for the logical volumes:

[root@emc1db01 ~]# vgdisplay
--- Volume group ---
VG Name VGExaDb
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 3
Max PV 0
Cur PV 1
Act PV 1
VG Size 557.62 GB
PE Size 4.00 MB
Total PE 142751
Alloc PE / Size 39424 / 154.00 GB
Free PE / Size 103327 / 403.62 GB <<-- 400GB space left!
VG UUID AZbMbf-papI-6FE2-hYlR-bzE1-G4pp-Bou81e

The disk configuration on the computing layer is build upon a RAID 5 set of 3 disks with a hot spare. The RAID is hardware based (the diskcontroller is reponsible for the RAID maintenance). This is visible through the MegaCli64 command:

[root@emc1db01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name : DBSYS
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 557.75 GB
State : Optimal
Stripe Size : 1.0 MB
Number Of Drives : 3
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disabled
Encryption Type : None

Number of Dedicated Hot Spares: 1
0 : EnclId - 252 SlotId - 3

Exit Code: 0x00

Let me emphasize this disks/RAID-set only is responsible for serving the operating system and accompanying files, and the grid/clusterware and database executables. All true database processing is done on the storage layer nodes.

Storage layer
The storage layer filesystems are not using LVM, nor regular partitions (/dev/sd* for disks visible as SCSI devices for linux):

[root@xxxxcel01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md5 9.9G 3.4G 6.0G 37% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/md7 2.0G 623M 1.3G 33% /opt/oracle
/dev/md4 116M 37M 74M 34% /boot
/dev/md11 2.3G 222M 2.0G 10% /var/log/oracle

The devices used are linux software RAID devices (‘md’ means multiple devices, which is a linux device which is build from independed underlying devices). The reason for using linux software RAID for the operating system and Oracle storage software is to provide redundancy, because the underlying disks on the storage servers do not provide that. Redundancy/error protection for the databases data on the storage servers is done using ASM normal redundancy, which is, put blunt and simple, RAID 10 handled on the ASM layer.

A logical next question is: okay, how do these linux software RAID/MD devices look like? Well, the most simple way is to use /proc/mdstat:

[root@emc1cel01 ~]# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[1] sda1[0]
120384 blocks [2/2] [UU]

md5 : active raid1 sdb5[1] sda5[0]
10482304 blocks [2/2] [UU]

md6 : active raid1 sdb6[1] sda6[0]
10482304 blocks [2/2] [UU]

md7 : active raid1 sdb7[1] sda7[0]
2096384 blocks [2/2] [UU]

md8 : active raid1 sdb8[1] sda8[0]
2096384 blocks [2/2] [UU]

md1 : active raid1 sdb10[1] sda10[0]
714752 blocks [2/2] [UU]

md11 : active raid1 sdb11[1] sda11[0]
2433728 blocks [2/2] [UU]

md2 : active raid1 sdb9[1] sda9[0]
2096384 blocks [2/2] [UU]

unused devices:

We see the MD device name: md and a device number, the state (‘active’), the RAID type (‘raid1′) and the actual devices the md device is build from (sdb1[1] sda1[0] for example). On the next line the number of blocks in the md device, the total number of disks and the number of active disks [2/2], and the status of the disks [UU]. “U” means up. A failed device (probably as a result of a failed disk, but it could be the result of the clearing of a partition) is visible as “_”.

Okay, this is all strongly linux related, where’s the Exadata stuff?

Cellos
‘Cell’ is a name used for the storage server in the Exadata context. Both the computing nodes and the storage nodes have a directory ‘/opt/oracle.cellos’. This directory contains Oracle software which is embedded in Linux. The function of cellos is to maintain and check firmware versions and hardware configurations and hardware activation/deactivation.

Validation functions are done by ‘cellos’, or ‘scripts inside the /opt/oracle.cellos directory’ during startup phase of linux. One of the functions cellos can perform is flashing firmware. Strongly related to cellos are the linux kernel and the OFED stack.

Imageinfo
The state of the cellos image is listed using the imageinfo utility (which exists both in /usr/local/bin and in /opt/oracle.cellos), which executes the /opt/oracle.cellos/patch/bin/imageinfo.sh script:

On a database/computing node:

[root@xxxxdb01 bin]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Image version: 11.2.2.2.0.101206.2
Image created: 2010-12-09 10:06:22 -0800
Image activated: 2011-01-28 14:19:59 -0800
Image image type: production
Image status: success
Internal version:
Image label: OSS_11.2.0.3.0_LINUX.X64_101206.2
Node type: COMPUTE
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

On a cell/storage node:

[root@xxxxcel01 bin]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Cell version: OSS_11.2.0.3.0_LINUX.X64_101206.2
Cell rpm version: cell-11.2.2.2.0_LINUX.X64_101206.2-1

Active image version: 11.2.2.2.0.101206.2
Active image created: 2010-12-09 09:37:46 -0800
Active image activated: 2011-01-28 14:33:44 -0800
Active image type: production
Active image status: success
Active internal version:
Active image label: OSS_11.2.0.3.0_LINUX.X64_101206.2
Active node type: STORAGE
Active system partition on device: /dev/md5
Active software partition on device: /dev/md7

In partition rollback: Impossible

Cell boot usb partition: /dev/sdac1
Cell boot usb version: 11.2.2.2.0.101206.2

Inactive image version: undefined
Rollback to the inactive partitions: Impossible

We see imageinfo knows if it’s run on a computing or storage node (‘Active node type’).

CheckHWnFWProfile
Another executable in /opt/oracle.cellos is ‘CheckHWnFWProfile’. The name of this executabe is quite self-explanory: it checks if the configuration of the server it’s run on is valid, which means it is listed in one of the profiles (Exadata has different incarnations with different hardware, of which Oracle calls the hardware and firmware combination a ‘profile’).

If it’s run without any option, it should return [SUCCESS], which means the hardware and firmware are correct for that version of cellos:

[root@xxxxdb01 ~]# /opt/oracle.cellos/CheckHWnFWProfile
[SUCCESS] The hardware and firmware profile matches one of the supported profiles

If you want more information about the server, you can list the configuration with the ‘-d’ option:

[root@xxxxdb01 ~]# /opt/oracle.cellos/CheckHWnFWProfile -d
[SystemManufacturer]
SUN MICROSYSTEMS
[SystemModel]
SUN FIRE X4170 M2 SERVER
[BIOSVendor]
American Megatrends Inc.
[BIOSVersion]
08040203
[BIOSDate]
09/14/2010
[ProcessorInformation]
Cores:24
Intel(R) Xeon(R) X5670 2.93GHz
step: 2
[ILOMVersion]
3.0.9.27.a r58740
[PhysicalMemoryGB]
96
[DiskControllerPCIeSlotNumber]
PCI-E Slot 0
[DiskControllerModel]
LSI MegaRAID SAS 9261-8i
[DiskControllerFirmwareRevision]
12.9.0-0049
[DiskControllerSilicon]
B4
[DiskControllerBatteryBackupUnitVersion]
iBBU08
[InfinibandHCAPCIeSlotNumber]
PCI-E Slot 2
[FansAndFanCount]
FM0 FM1 FM2 FM3 4
[PowersuppliesAndCount]
PS0:OK PS1:OK 2
[InfinibandHCAPCIeSlotWidth]
5Gbps,x8
[InfinibandHCADeviceID]
26428
[InfinibandHCAChipRevision]
B0
[InfinibandHCAImageType]
ConnectX
[InfinibandHCAFirmwareVersion]
2.7.8100
[SASBackplaneFirmwareVersion]
NotAvailable
[DiskControllerPCIeSlotWidth]
x8
[10GEtherModelPCISlot]
82599EB PCI-E Slot 1
82599EB PCI-E Slot 1
[InfinibandHCAHardwareRev]
50
[LightsOutFirmwareVersion]
Firmware Revision : 3.0
[Enclosure:Slot:DiskType:DiskMake:DiskModel:Diskfw]
ALL:ALL SAS SEAGATE ST930003SSUN300G 0B70

If ‘CheckHWnFWProfile’ resulted in an error, you can list your actual configuration with ‘-d’, and use the ‘-s’ option to list what the profile or supported configuration is.

This is an investigation into an half rack database machine (the half rack database machine at VX Company). It’s an exadata/database V2, which means SUN hardware and database and cell (storage) software version 11.2.

I build a table (called ‘CG_VAR’), which consists of:
- bytes: 50787188736 (47.30 GB)
- extents: 6194
- blocks: 6199608

The table doesn’t have a primary key, nor any other constraints, nor any indexes. (of course this is not a real life situation)

No exadata optimisation

At first I disabled the Oracle storage optimisation using the session parameter ‘CELL_OFFLOAD_PROCESSING’:
alter session set cell_offload_processing=false;

Then executed: select count(*) from cg_var where sample_id=1;
The value ’1′ in the table ‘CG_VAR’ accounts for roughly 25%.

Execution plan:

Execution Plan
----------------------------------------------------------
Plan hash value: 2301354116
-------------------------------------------------------------------------------------
| Id  | Operation                  | Name   | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |        |     1 |     3 |  1692K  (1)| 05:38:34 |
|   1 |  SORT AGGREGATE            |        |     1 |     3 |            |    |
|*  2 |   TABLE ACCESS STORAGE FULL| CG_VAR |   395M|  1131M|  1692K  (1)| 05:38:34 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("SAMPLE_ID"=1)

Statistics:

592,STAT opened cursors cumulative                                                     1
592,STAT user calls                                                                    3
592,STAT session logical reads                                                   6178215
592,STAT CPU used when call started                                                15398
592,STAT CPU used by this session                                                  15398
592,STAT DB time                                                                   30667
592,STAT user I/O wait time                                                        15397
592,STAT non-idle wait time                                                        15424
592,STAT non-idle wait count                                                       96193
592,STAT session uga memory                                                      2105504
592,STAT session uga memory max                                                   982320
592,STAT session pga memory                                                      1572864
592,STAT session pga memory max                                                  1572864
592,STAT enqueue waits                                                                 3
592,STAT enqueue requests                                                              2
592,STAT enqueue conversions                                                           3
592,STAT enqueue releases                                                              2
592,STAT global enqueue gets sync                                                      5
592,STAT global enqueue releases                                                       2
592,STAT physical read total IO requests                                           48448
592,STAT physical read total multi block requests                                  48446
592,STAT physical read requests optimized                                              2
592,STAT physical read total bytes                                           50610774016
592,STAT cell physical IO interconnect bytes                                 50610774016
592,STAT ges messages sent                                                             4
592,STAT consistent gets                                                         6178215
592,STAT consistent gets from cache                                                  142
592,STAT consistent gets from cache (fastpath)                                       142
592,STAT consistent gets direct                                                  6178073
592,STAT physical reads                                                          6178073
592,STAT physical reads direct                                                   6178073
592,STAT physical read IO requests                                                 48448
592,STAT physical read bytes                                                 50610774016
592,STAT calls to kcmgcs                                                             142
592,STAT calls to get snapshot scn: kcmgss                                             1
592,STAT file io wait time                                                         17477
592,STAT Number of read IOs issued                                                 48448
592,STAT no work - consistent read gets                                          6178073
592,STAT table scans (long tables)                                                     1
592,STAT table scans (direct read)                                                     1
592,STAT table scan rows gotten                                               1596587000
592,STAT table scan blocks gotten                                                6178073
592,STAT session cursor cache hits                                                     1
592,STAT session cursor cache count                                                    1
592,STAT parse count (total)                                                           1
592,STAT execute count                                                                 1
592,STAT bytes sent via SQL*Net to client                                            528
592,STAT bytes received via SQL*Net from client                                       11
592,STAT SQL*Net roundtrips to/from client                                             2
592,STAT cell flash cache read hits                                                    2
592,WAIT enq: KO - fast object checkpoint                                            621
592,WAIT direct path read                                                      153970472
592,WAIT SQL*Net message to client                                                     3
592,WAIT SQL*Net message from client                                          1782253728
592,WAIT kfk: async disk IO                                                       264299
592,WAIT events in waitclass Other                                                  2236
592,TIME parse time elapsed                                                           20
592,TIME DB CPU                                                                153981591
592,TIME sql execute elapsed time                                              306726075
592,TIME DB time                                                               306726219

The executionplan is different with the operation ‘TABLE ACCESS STORAGE FULL’. This means the database is aware the segment is on cell/exadata storage.

We see here the SQL took 306,726,095 microseconds (306.726 seconds; time: parse time+sql execute elapsed time) for scanning 47.30 GB (!!), of which roughly half the time (153.970 seconds) was taken by the wait ‘direct path read’. The I/O wait-time is also in the statistic ‘user I/O wait time’: 15397.

I’ve executed the SQL before; there is no hard parse in the ‘time section’, there’s also no ‘parse count (hard)’ in the statistics section.

I’ve done 48448 IO requests (‘physical read total IO requests’), of which almost all IO’s (48446) where multiblock requests (‘physical read total multi block requests’). This means the average IO wait time is: 153,970,472/48,448 = 3,178 which is 3ms!

Using exadata optimisation, without storage indexes

In order to enable exadata optimisation, I reverted the session parameter ‘CELL_OFFLOAD_PROCESSING’ to ‘TRUE’.

One of the optimisation techniques of exadata is called ‘storage index’. A storage index is a list kept in the memory of the cellservers which keeps track of minimum and maximum values of fields per chunk of 1 MB (called ‘storage region’). Storage indexes are build automatically during usage of the ‘storage region’. A way to flush the storage index is to restart the cell servers. (Warning! Do not do this in a production environment!)

Execution plan:

Execution Plan
----------------------------------------------------------
Plan hash value: 2301354116
-------------------------------------------------------------------------------------
| Id  | Operation                  | Name   | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |        |     1 |     3 |  1692K  (1)| 05:38:34 |
|   1 |  SORT AGGREGATE            |        |     1 |     3 |            |    |
|*  2 |   TABLE ACCESS STORAGE FULL| CG_VAR |   395M|  1131M|  1692K  (1)| 05:38:34 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - storage("SAMPLE_ID"=1)
       filter("SAMPLE_ID"=1)

Statistics:

592,STAT opened cursors cumulative                                                     1
592,STAT user calls                                                                    3
592,STAT session logical reads                                                   6178215
592,STAT CPU used when call started                                                 3707
592,STAT CPU used by this session                                                   3707
592,STAT DB time                                                                    3856
592,STAT user I/O wait time                                                          156
592,STAT non-idle wait time                                                          156
592,STAT non-idle wait count                                                        2437
592,STAT session uga memory                                                     24054408
592,STAT session uga memory max                                                 21948904
592,STAT session pga memory                                                     25755648
592,STAT session pga memory max                                                 23527424
592,STAT enqueue waits                                                                 3
592,STAT enqueue requests                                                              2
592,STAT enqueue conversions                                                           3
592,STAT enqueue releases                                                              2
592,STAT global enqueue gets sync                                                      5
592,STAT global enqueue releases                                                       2
592,STAT physical read total IO requests                                           53714
592,STAT physical read total multi block requests                                  52661
592,STAT physical read total bytes                                           50610774016
592,STAT cell physical IO interconnect bytes                                  5575531904
592,STAT ges messages sent                                                             4
592,STAT consistent gets                                                         6178215
592,STAT consistent gets from cache                                                  142
592,STAT consistent gets from cache (fastpath)                                       142
592,STAT consistent gets direct                                                  6178073
592,STAT physical reads                                                          6178073
592,STAT physical reads direct                                                   6178073
592,STAT physical read IO requests                                                 53714
592,STAT physical read bytes                                                 50610774016
592,STAT calls to kcmgcs                                                             142
592,STAT calls to get snapshot scn: kcmgss                                             1
592,STAT cell physical IO bytes eligible for predicate offload               50610774016
592,STAT cell physical IO interconnect bytes returned by smart scan           5575531904
592,STAT cell session smart scan efficiency                                            9
592,STAT table scans (long tables)                                                     1
592,STAT table scans (direct read)                                                     1
592,STAT table scan rows gotten                                                387130468
592,STAT table scan blocks gotten                                                1503621
592,STAT cell scans                                                                    1
592,STAT cell blocks processed by cache layer                                    6549118
592,STAT cell blocks processed by txn layer                                      6549118
592,STAT cell blocks processed by data layer                                     6183339
592,STAT cell blocks helped by minscn optimization                               6549118
592,STAT cell simulated session smart scan efficiency                        50653913088
592,STAT cell IO uncompressed bytes                                          50653913088
592,STAT session cursor cache count                                                    1
592,STAT parse count (total)                                                           1
592,STAT execute count                                                                 1
592,STAT bytes sent via SQL*Net to client                                            528
592,STAT bytes received via SQL*Net from client                                      327
592,STAT SQL*Net roundtrips to/from client                                             2
592,WAIT enq: KO - fast object checkpoint                                            492
592,WAIT cell smart table scan                                                   1560351
592,WAIT SQL*Net message to client                                                     3
592,WAIT SQL*Net message from client                                             5605106
592,WAIT events in waitclass Other                                                  1590
592,TIME parse time elapsed                                                           46
592,TIME DB CPU                                                                 37064365
592,TIME sql execute elapsed time                                               38571803
592,TIME DB time                                                                38571968

The execution plan shows the ‘TABLE ACCESS STORAGE FULL’ again, to indicate cell storage, but now it was able to offload, which is visible in the predicate information section with the predicate ‘storage(“SAMPLE_ID”=1)’. Let’s see the differences:

NR ,TYP Statistic                                                                  1st           2nd          Diff
592,STAT opened cursors cumulative                                                     1              1              0
592,STAT user calls                                                                    3              3              0
592,STAT session logical reads                                                   6178215        6178215              0
592,STAT CPU used when call started                                                15398           3707         -11691
592,STAT CPU used by this session                                                  15398           3707         -11691
592,STAT DB time                                                                   30667           3856         -26811
592,STAT user I/O wait time                                                        15397            156         -15241
592,STAT non-idle wait time                                                        15424            156         -15268
592,STAT non-idle wait count                                                       96193           2437         -93756
592,STAT session uga memory                                                      2105504       24054408       21948904
592,STAT session uga memory max                                                   982320       21948904       20966584
592,STAT session pga memory                                                      1572864       25755648       24182784
592,STAT session pga memory max                                                  1572864       23527424       21954560
592,STAT enqueue waits                                                                 3              3              0
592,STAT enqueue requests                                                              2              2              0
592,STAT enqueue conversions                                                           3              3              0
592,STAT enqueue releases                                                              2              2              0
592,STAT global enqueue gets sync                                                      5              5              0
592,STAT global enqueue releases                                                       2              2              0
592,STAT physical read total IO requests                                           48448          53714           5266
592,STAT physical read total multi block requests                                  48446          52661           4215
592,STAT physical read requests optimized                                              2              0             -2
592,STAT physical read total bytes                                           50610774016    50610774016              0
592,STAT cell physical IO interconnect bytes                                 50610774016     5575531904   -45035242112
592,STAT ges messages sent                                                             4              4              0
592,STAT consistent gets                                                         6178215        6178215              0
592,STAT consistent gets from cache                                                  142            142              0
592,STAT consistent gets from cache (fastpath)                                       142            142              0
592,STAT consistent gets direct                                                  6178073        6178073              0
592,STAT physical reads                                                          6178073        6178073              0
592,STAT physical reads direct                                                   6178073        6178073              0
592,STAT physical read IO requests                                                 48448          53714           5266
592,STAT physical read bytes                                                 50610774016    50610774016              0
592,STAT calls to kcmgcs                                                             142            142              0
592,STAT calls to get snapshot scn: kcmgss                                             1              1              0
592,STAT file io wait time                                                         17477              0         -17477
592,STAT Number of read IOs issued                                                 48448              0         -48448
592,STAT cell physical IO bytes eligible for predicate offload                         0    50610774016    50610774016
592,STAT cell physical IO interconnect bytes returned by smart scan                    0     5575531904     5575531904
592,STAT cell session smart scan efficiency                                            0              9              9
592,STAT no work - consistent read gets                                          6178073              0       -6178073
592,STAT table scans (long tables)                                                     1              1              0
592,STAT table scans (direct read)                                                     1              1              0
592,STAT table scan rows gotten                                               1596587000      387130468    -1209456532
592,STAT table scan blocks gotten                                                6178073        1503621       -4674452
592,STAT cell scans                                                                    0              1              1
592,STAT cell blocks processed by cache layer                                          0        6549118        6549118
592,STAT cell blocks processed by txn layer                                            0        6549118        6549118
592,STAT cell blocks processed by data layer                                           0        6183339        6183339
592,STAT cell blocks helped by minscn optimization                                     0        6549118        6549118
592,STAT cell simulated session smart scan efficiency                                  0    50653913088    50653913088
592,STAT cell IO uncompressed bytes                                                    0    50653913088    50653913088
592,STAT session cursor cache hits                                                     1              0             -1
592,STAT session cursor cache count                                                    1              1              0
592,STAT parse count (total)                                                           1              1              0
592,STAT execute count                                                                 1              1              0
592,STAT bytes sent via SQL*Net to client                                            528            528              0
592,STAT bytes received via SQL*Net from client                                       11            327            316
592,STAT SQL*Net roundtrips to/from client                                             2              2              0
592,STAT cell flash cache read hits                                                    2              0             -2
592,WAIT enq: KO - fast object checkpoint                                            621            492           -129
592,WAIT direct path read                                                      153970472              0     -153970472
592,WAIT cell smart table scan                                                         0        1560351        1560351
592,WAIT SQL*Net message to client                                                     3              3              0
592,WAIT SQL*Net message from client                                          1782253728        5605106    -1776648622
592,WAIT kfk: async disk IO                                                       264299              0        -264299
592,WAIT events in waitclass Other                                                  2236           1590           -646
592,TIME parse time elapsed                                                           20             46             26
592,TIME DB CPU                                                                153981591       37064365     -116917226
592,TIME sql execute elapsed time                                              306726075       38571803     -268154272
592,TIME DB time                                                               306726219       38571968     -268154251

Let start at the time spend: the first run, this was: 306726075+20= 307 seconds, the second run took: 38571803+46= 39 seconds (!!). That is a reduction of 87%. That _is_ spectacular!

The waits have changed: the ‘direct path read’ is swapped for ‘cell smart table scan’.

In the (session) statistics also have some things to notice:

  • physical read total bytes: Is the same(!) It’s not strange: the size of the segment to read is known by the database, and must be read entirely from the database’s perspective.
  • cell physical IO bytes eligible for predicate offload: this statistic shows the amount of data which the cell server is able to process on behalf of the database, instead of the database processing and the cell server just delivering blocks. In this case, all bytes are processed on the cellserver (cell physical IO bytes eligible for predicate offload=physical read total bytes)
  • cell physical IO interconnect bytes: Here we see that the amount of data traffic between the database and the cell servers. Because of the offload, the amount is reduced immensely (50,610,774,016 versus 5,575,531,904) which is a reduction of 88.9% (!!)
  • cell physical IO interconnect bytes returned by smart scan: This is the amount of data which is caused by the cell smart scan (the statistic name is quite self explaining). If we compare this statistic with ‘cell physical IO interconnect bytes’, we see all data on the interconnect was caused by the smart scan.
  • cell scans: the number of scans offloaded to cellservers. one in this case.
Using exadata optimisation, with storage indexes

Upon the next execution of this statement, the execution plan stays the same, but the cellservers built the storage index.

Let’s see a comparison of the statistics of the previous run (without a storage index in place) with a second run, which will use the storage index, which is built up during the first run:

NR ,TYP Statistic                                                                  1st           2nd          Diff
592,STAT opened cursors cumulative                                                     1              1              0
592,STAT user calls                                                                    3              3              0
592,STAT session logical reads                                                   6178215        6178215              0
592,STAT CPU used when call started                                                 3707           3713              6
592,STAT CPU used by this session                                                   3707           3713              6
592,STAT DB time                                                                    3856           3745           -111
592,STAT user I/O wait time                                                          156             37           -119
592,STAT non-idle wait time                                                          156             37           -119
592,STAT non-idle wait count                                                        2437           1825           -612
592,STAT session uga memory                                                     24054408              0      -24054408
592,STAT session uga memory max                                                 21948904              0      -21948904
592,STAT session pga memory                                                     25755648              0      -25755648
592,STAT session pga memory max                                                 23527424              0      -23527424
592,STAT enqueue waits                                                                 3              3              0
592,STAT enqueue requests                                                              2              2              0
592,STAT enqueue conversions                                                           3              3              0
592,STAT enqueue releases                                                              2              2              0
592,STAT global enqueue gets sync                                                      5              5              0
592,STAT global enqueue releases                                                       2              2              0
592,STAT physical read total IO requests                                           53714          53765             51
592,STAT physical read total multi block requests                                  52661          52688             27
592,STAT physical read requests optimized                                              0          35423          35423
592,STAT physical read total bytes                                           50610774016    50610774016              0
592,STAT cell physical IO interconnect bytes                                  5575531904     5570438336       -5093568
592,STAT ges messages sent                                                             4              4              0
592,STAT consistent gets                                                         6178215        6178215              0
592,STAT consistent gets from cache                                                  142            142              0
592,STAT consistent gets from cache (fastpath)                                       142            142              0
592,STAT consistent gets direct                                                  6178073        6178073              0
592,STAT physical reads                                                          6178073        6178073              0
592,STAT physical reads direct                                                   6178073        6178073              0
592,STAT physical read IO requests                                                 53714          53765             51
592,STAT physical read bytes                                                 50610774016    50610774016              0
592,STAT calls to kcmgcs                                                             142            142              0
592,STAT calls to get snapshot scn: kcmgss                                             1              1              0
592,STAT cell physical IO bytes eligible for predicate offload               50610774016    50610774016              0
592,STAT cell physical IO bytes saved by storage index                                 0    37006434304    37006434304
592,STAT cell physical IO interconnect bytes returned by smart scan           5575531904     5570438336       -5093568
592,STAT cell session smart scan efficiency                                            9              0             -9
592,STAT table scans (long tables)                                                     1              1              0
592,STAT table scans (direct read)                                                     1              1              0
592,STAT table scan rows gotten                                                387130468      387130468              0
592,STAT table scan blocks gotten                                                1503621        1503621              0
592,STAT cell scans                                                                    1              1              0
592,STAT cell blocks processed by cache layer                                    6549118        2025039       -4524079
592,STAT cell blocks processed by txn layer                                      6549118        2025039       -4524079
592,STAT cell blocks processed by data layer                                     6183339        1666003       -4517336
592,STAT cell blocks helped by minscn optimization                               6549118        2025039       -4524079
592,STAT cell simulated session smart scan efficiency                        50653913088    13647896576   -37006016512
592,STAT cell IO uncompressed bytes                                          50653913088    13647896576   -37006016512
592,STAT session cursor cache hits                                                     0              1              1
592,STAT session cursor cache count                                                    1             1            0
592,STAT parse count (total)                                                           1              1              0
592,STAT execute count                                                                 1              1              0
592,STAT bytes sent via SQL*Net to client                                            528            528              0
592,STAT bytes received via SQL*Net from client                                      327            327              0
592,STAT SQL*Net roundtrips to/from client                                             2              2              0
592,WAIT enq: KO - fast object checkpoint                                            492            790            298
592,WAIT cell smart table scan                                                   1560351         366466       -1193885
592,WAIT SQL*Net message to client                                                     3              2             -1
592,WAIT SQL*Net message from client                                             5605106        8163815        2558709
592,WAIT events in waitclass Other                                                  1590           1723            133
592,TIME parse time elapsed                                                           46             23            -23
592,TIME DB CPU                                                                 37064365       37131355          66990
592,TIME sql execute elapsed time                                               38571803       37456254       -1115549
592,TIME DB time                                                                38571968       37457846       -1114122

We see here the storage index reduced an enormous amount of I/O on the cells (cell physical IO bytes saved by storage index: 37006434304), but didn’t result in a enormous reduction of execution time (38.5 seconds versus 37.4 seconds).

This isn’t too unexpected:
1/without exadata optimisation: total time: 307 s, IO wait (direct path read): 154 s.
2/with exadata optimisation: total time: 39 s, IO wait (cell smart table scan): 2 s.

I think you get the point now: after IO waits are brought down to 2 seconds, any further optimisation on IO (which is what the storage indexes are) only will reduce the 2 seconds further. And it did! The optimisation of the storage indexes is again spectacular, only has little impact on overall response time, because it influences a too little part of it (in fact, that is Amdahl’s law):

3/with exadata optimisation and storage indexes: total time: 37 s, IO wait (cell smart table scan): 0.3 s.

Conclusion

This is only an investigation in some very specific things of Exadata. There is much more to explore.

Using the hardware in the database machine, a single process can read 47.30 GB in 154 seconds. That is a very decent number, and sustainable with multiple processes, alias Oracle Parallel Query. That on it self is something of which many shops can benefit from. But, it gets really interesting when using the exadata optimisations. In this case, the total response time dropped from 307 to 37 seconds, with barely a wait on I/O (0.3 seconds for 47.30 GB!!)

For those of you who haven’t followed all the Oracle Exadata and database machine information, and want a brief introduction to the database machine: here it is!

The confusion

There is some confusion about ‘exadata’ and ‘the database machine’. If we look at the official product names, ‘Exadata’ is the storage server, and the ‘Database machine’ is the complete boxed machine, including exadata for storage. But…in the real world, in different kinds of papers (technical, advertising) exadata sometimes is used as an alias for the database machine.

What is the database machine?

Essentially, the database machine is a version of the Oracle database on balanced, optimised hardware from Oracle. But that is a very short description, and does harm to the unique feature of the database machine: the Oracle Exadata storage server. Exadata offers database offloading by execution some of the database functions on the storage layer. This is a schematic drawing of the architecture of the database machine (cell server is the internal name of the exadata software):

The database machine, a little more verbose

One of the basic design principles of the database machine is fault tolerancy. This means any component may fail, whilst the database still is able to function. So if you look at the schematic drawing above, it means there are multiple instances of the database (=Oracle RAC), there are TWO infiniband switches (in a quarter and half rack, there are three infiniband switches in a full rack), and there are multiple storage servers (3/quarter-, 7/half-, 14/full-rack). The redundancy of the database is NOT done by the storage servers, but is done using Oracle’s ASM (Automatic Storage Management) ‘normal redundancy’, which essentially means data is stored double.

It is possible to use a single server (non RAC) version of the database on the database machine, but that does harm to the architecture of the database machine (in my opinion).

The storage servers are fitted with 384 GB flash memory attached via PCI-X. This flash memory can be used as flash disks, but is configured as flash memory for caching of the storage server by default.

Last but not least the database is able to offload some database actions to the storage:
- Smart scans: row filtering, column filtering, join filtering, incremental backup filtering, data mining model scoring. Transparently.
- Storage indexes: stores min and max values of columns of database blocks every 1MB. Transparently.
- Hybrid Columnar Compression: database compression mechanism where data is stored by column, then compressed. (HCC is using several database blocks). Needs to be specified explicitly as a compression option during creation of modification.

Conclusion

I hope this article was able to get you a basic understanding of the architecture of the database machine. Perhaps it might even make you enthusiastic about it, because it offers several unique features:
- A hardware infrastructure purely built for database performance. Something encountered ‘in the wild’ very rarely.
- State of the art hardware.
- Common Oracle database, no need to alter anything when moving to the database machine.

The second installation step of the database machine aka Exadata by Oracle ACS (Advanced Customer Support) is configuring the database and storage (‘cell’) nodes/servers. The blades are delivered with default IP addresses, during this step they are configured to the IP addresses which fit in our environment. Also the cellservers are configured (‘LUN’s are carved’) to have storage for ASM.

The cellservers are configured with three diskgroups during a normal installation: DATA for data, RECO for the flash recovery area and a diskgroup for the clusterware (voting disks, cluster registry) called SYSTEMDG.

A RAC database is configured too. We have a half rack, which means 4 database nodes, so a 4 node RAC database is configured, called ‘dbm’. The database has no data in it, besides the data dictionary (obviously), and is using a ‘humble’ amount of memory (8GB on a 64GB machine).

Now it’s up to me to automate the creation (and deletion) RAC databases, adding and deleting instances of the RAC database, modifying the storage (to be able to test both half rack and quarter rack configurations) and also some optimising/configuration, like enabling hugepages, add rlwrap etc.

Busy, busy, busy :)

Follow

Get every new post delivered to your Inbox.

Join 1,047 other followers

%d bloggers like this: