Archive

Oracle

Recently we upgraded an Exadata to the currently latest version, 11.2.3.2.0. The Exadata software itself consists of an image for the storage servers (the storage servers are essentially re-imaged), and a set of updates for the database/computing nodes, including: firmware for ILOM (lights out adapter), BIOS, LSI RAID adapter, Infiniband adapter, linux kernel, drivers, mandatory packages, to name some.

One of the exceptional things this upgrade does, is remove the hot-spare out of the RAID set on the database/compute nodes. This is documented in MOS note: 1468877.1, as ‘known issue 5: hotspare removed for compute nodes’. For some sites, this actually can be a good thing, if they are really tight on disk space on the compute nodes of Exadata. I must say that we have not encountered this situation. What this means, is that the actual HDD configuration on the compute node is left to the customer, instead of having one mandatory configuration (having 3 disks in a RAID-5 configuration, and one hot-spare).

So if you decide to use the former hot-spare disk as an active part of the RAID configuration, you are effectively trading availability for diskspace. Please mind the RAID set itself already provides redundancy, even without the hot-spare!

On the other hand, I think in most configurations, it makes sense to convert the disk back to being hot-spare.

This is done in the following way:

a) Get an overview of the current disk configuration:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70

This shows the disk in slot number 3 being left “unconfigured”, but in “good” state (of a disk has gone bad because of errors, it will be removed from the RAID set, and will show up as “unconfigured(bad)”!)
This is the state the upgrade to 11.2.3.2.0 leaves your system.

New let’s make the disk hot spare again!

b) Get the inclosure id:
/opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252

This means we know the enclosure id (252) and the slot number (3), which is the information needed for the MegaCli utility to revert the unconfigured disk to hot-spare again!

c) Revert the unconfigured disk back to hot-spare
/opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[252:3] -a0
Adapter: 0: Set Physical Drive at EnclId-252 SlotId-3 as Hot Spare Success.
Exit Code: 0x00

d) Check the disk configuration again:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Hotspare, Spun Up
Device Firmware Level: 0B70

It appears that having the disk been removed from the RAID set by the update to 11.2.3.2.0 generates an ASR message. At least it did at our site, despite this being an undocumented bug (7161023,’ASR generating false errors in relation to disks’) which is marked resolved in 11.2.3.1.0 (?). Most sites I encounter have ASR setup, but not having all messages send additionally to local, onsite monitoring. I want to stress it’s very important to have the ASR messages sent to your own monitoring too!

Oracle Support does not list all the specifications from an ASR message it has gotten. Instead, a Service Request is made with enough information for Oracle itself (!!). In our case, the exact error message was NOT specified, only ‘compute server hard disk predictive failure’ and the node name.

Where do you look on an Exadata for that information? The first logical point is the ASR daemon. I didn’t spend too much time on it, but it seems that it’s more a proxy for messages than a database. I wasn’t able to find useful information about the systems which where using this daemon.

What are the sources for ASR with an Exadata? These are:

Computing node:
- “compmon daemon” / Linux level monitoring
- ILOM
Storage node:
- “cell daemon” / Linux level monitoring
- ILOM

For the computing node, it’s quite easy to see if there are any detected failed devices from the viewpoint of the ILOM:
(please mind ipmitool -I open only works on the local system)
# ipmitool -I open sunoem cli "show /SP/logs/event/list Severity==(Major,Critical,Down)"
Connected. Use ^D to exit.
-> show /SP/logs/event/list Severity==(Major,Critical,Down)

ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------

-> Session closed
Disconnected

This shows no messages with the severity Major, Critical or Down are in the eventlog in the ILOM. Please mind that the logons to the ILOM have severity “Minor”. These are in most system the vast majority of the messages, which are not of interest for this investigation. If you want to know if something has failed, there even a simpler command:
# ipmitool -I open sunoem cli "show faulty"

For the “compmon daemon”, grep the processlist for “compmon”:
# ps -ef | grep compmon
root 12812 1 0 Oct22 ? 00:00:11 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server

The most important part here is the directory: /opt/oracle/cellos/compmon
If you navigate to that directory, you will see a number of “state files”: asrs.state, traps.state and disks.state.
The disks.state lists the disk status as listed with a) with the firmware state.
The most important file for the ASR message investigation is the traps.state file. This file lists traps it has sent to ASR. In our case:
1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:
Yes, this is pasted correctly, it misses Physicaldisk, Make Model and Disk Serial Number information. This has not been omitted for safety, it just is not listed.
So, the failure which was sent was HALRT-02008 in our case.

For completeness, the ILOM layer can be investigated identically to the description of the ILOM handling on the computing layer. The Linux layer messages can be investigated with: # cellcli -e list alerthistory
32 2012-10-17T02:00:27+02:00 info "HDD disk controller battery on disk contoller at adapter 0 is going into a learn cycle. This is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours. The disk controller cache might go into WriteThrough caching mode during the learn cycle. Disk write throughput might be temporarily lower during this time. The message is informational only, no action is required."
33 2012-10-22T11:43:21+02:00 info "Factory defaults restored for Adapter 0"
34 2012-10-22T11:43:23+02:00 info "Factory defaults restored for Adapter 0"

If you are administering an Oracle Exadata database machine, which base operating system image (the operating system version with which it system came) is Linux and version 11.2.3.1.0 (current version is viewable with the command ‘imageinfo’, which needs root account privileges) or higher, and multiple users are accessing the system with password authentication, this blogpost might be an interesting read. Also, if you have witnessed temporary lockout of the oracle user, or other users: this blogpost describes the reason and a potential resolution.

I administer several Exadata database machines, which are not all delivered at the same time, so the base image version is different. Because I also administer the Linux operating system on the computing nodes in the Exadata database machines, I noticed the Linux settings slightly differ among the different Exadata database machine computing nodes. There is a positive side to this: apparently the team that maintains the base image does not only renew the packages on the image, but also get feedback about the settings, and change things to improve something. There is also a negative side to this: these changes are not documented anywhere (that I am aware of), so getting a new system always is a bit exciting, because things might have been changed. Or not…

I speak about ‘base image’ deliberately. After a system is delivered with a certain ‘base image’ (collection of kernel, executables, os-scripts and settings), the kernel, executables and os-scripts are renewed with an upgrade, but the settings remain the same. This blogpost is about a PAM (pluggable authentication modules) setting, which I encountered on base-image 11.2.3.1.0.

I witnessed the ‘oracle’ account being locked out temporarily on a system. The reason was a series of unsuccesful logon attempts. This could be something which complies with somebodies security standards (but who’s?). I think the ‘oracle’ account on an Oracle database system being locked out (albeit temporarily) is highly undesirable on most systems. As I’ve described earlier: this is a setting which the Exadata image engineering team decided to do with pluggable authentication modules.

Meet pam_tally2.so…
In the directory /etc/pam.d there are two files which configure the temporary lockout when a series of unsuccesful logon attempts have been made: ‘login’ and ‘sshd’. The actual line responsible for the lockout is:

auth       required     pam_tally2.so deny=5 onerr=fail lock_time=600

In most environments where no actual security compliancy rules are known, I disable this behaviour by commenting it with a hash (‘#’) sign as the first character on the line. Of course you could read the manpage of pam_tally2 (search on the internet for ‘man pam_tally2′) and configure it to your linking or to comply with your security rules.

Recently I patched an 11.2.0.2 grid infrastructure to an higher version. After the patching I started the grid infrastructure on that host, and ASM was unable to start. Looking in the alert.log file of the ASM instance it turned out that upon starting ASM, even before the contents of the pfile/spfile was displayed, the ASM crashed with the ORA-00600 error:

Sat Oct 13 14:35:07 2012
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 30960 free: 30960)
DFLT Huge Pages allocation successful (allocated: 67107111)
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
CELL communication is configured to use the following interface(s) for this instance
    1.1.1.1 
CELL interconnect IPC version: Oracle RDS/IP (generic)
IPC Vendor 1 Protocol 3
  Version 4.1
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on. 
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support  enabled
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_20405.trc  (incident=66599):
ORA-00600: internal error code, arguments: [kmgs_component_init_3], [60], [65], [17], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_66599/+ASM1_ora_20405_i66599.trc
Sweep [inc][66599]: completed
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

A quick google didn't result in any hits, nor searching the MOS knowledge base and bug database.

I decided to start the ASM instance manually, and got an interesting message on screen just prior to the instance crashing:

ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance

Upon investigation of the contents of the spfile, it turned out the parameter 'remote_os_authent' was set for the ASM instance. Removing this parameter returned normal, expect behaviour, in other words: not crashing the ASM instance.

I tried to replay/redo this failure in a VM on my laptop (using Oracle Grid Infrastructure 11.2.0.3 in non-clustered mode), but was UNABLE to crash ASM by setting 'remote_os_authent' to both true and false. So it could either be a 'cluster feature' or a feature specific to version 11.2.0.2.

The reason for this blogpost is to have a little bit of documentation on the OERI [kmgs_component_init_3], so if anybody encounters this problem, there is at least some clue to be found, or even a resolution.

ps. kmgs_component_init is in the VOS layer (operating system depended layer)

This post really is about using LVM (Logical Volume Manager; an abstraction layer for disk devices) snapshots. A snapshot is a frozen image of a logical volume, which simply means “filesystem”. It’s not really “frozen”, LVM2 snapshots are read/write by default. But you can freeze a filesystem in time with a LVM snapshot.

The background of this really is Exadata (computing node) and upgrading, but has nothing unique to Exadata. So don’t let this bother you. But the idea of using LVM snapshots popped up when dealing with Exadata computing nodes and upgrades.

First of all: LVM is in development, which mean different Linux versions have different LVM options available to them. I am using the Exadata X2 Linux version: RHEL/OL 5u7 x86_64. I guess OL6 has more and more advanced features inside LVM, but with X2, OL5u7 is what I have to use. So the steps in this blogpost are done with this version. Any comments are welcome!

Second: if you want to experiment with this: most people allocate all space in the volume group to logical volumes upfront. A snapshot is a COW (Copy on Write) copy of a logical volume. This means a snapshot starts off with zero extra bytes (source and snapshot are equal), and grows as the source gets modified. This means you need to have free/available space in the volume group to facilitate the snapshot.

Then there is another caveat: the ‘/boot’ filesystem cannot be in LVM, so is a normal partition on most systems (also on Exadata). This means snapshots do not help if you want a backup of that filesystem. You need to use another trick.

Okay, here we go: you have a large modification upcoming and want to be able to restore your system to this moment in time.

1. Backup /boot filesystem

[root@localhost ~]# df /boot 
Filesystem           1K-blocks  Used Available Use%  
Mounted on /dev/sda1    248895 24037    212008  11% /boot
[root@localhost ~]# umount /boot
[root@localhost ~]# dd if=/dev/sda1 of=dev_sda1_backup
[root@localhost ~]# mount /boot

2. Create snapshot of logical volumes
In this example it’s only the root filesystem (which is a bit special, because this filesystem is set in grub.conf for the bootloader, and in /etc/fstab).

[root@localhost ~]# lvdisplay -v /dev/vg00/lv_root 
Using logical volume(s) on command line 
/dev/hdc: open failed: No medium found 
--- Logical volume --- 
LV Name /dev/vg00/lv_root 
VG Name vg00 
LV UUID wutwln-ffdB-QRlg-1LgL-XqKB-glvn-OpCowW 
LV Write Access read/write 
LV Status available # open 1 
LV Size 3.91 GB 
Current LE 125 
Segments 1 
Allocation inherit 
Read ahead sectors auto 
- currently set to 256 
Block device 253:0 

lvdisplay displays the properties of the logical volume to be snapshotted, I use this to spot the number of LE’s.

[root@localhost ~]# lvcreate -l 125 -s /dev/vg00/lv_root -n lv_root_snap 
/dev/hdc: open failed: No medium found 
Logical volume "lv_root_snap" created

The system now has been backed up in a way that we can revert to this situation.

Next would be to do the O/S upgrade, software upgrade or whatever. It goes horribly wrong, and you need to restore the system to the previous situation.

1. Rename current logical volume
In this case, I rename my lv_root logical volume to lv_root_old:

[root@localhost ~]# lvrename /dev/vg00/lv_root /dev/vg00/lv_root_old
/dev/hdc: open failed: No medium found 
Renamed "lv_root" to "lv_root_old" in volume group "vg00"

The logical volume (which we currently use) has been renamed.

2. Create new lv_root
This is the logical volume we are going to use to put the snapshot in.

[root@localhost ~]# lvcreate -l 125 -n lv_root /dev/vg00
Logical volume "lv_root" created

3. Populate the new lv_root with the snapshot contents

[root@localhost ~]# dd if=/dev/vg00/lv_root_snap of=/dev/vg00/lv_root 	
8192000+0 records in 
8192000+0 records out 
4194304000 bytes (4.2 GB) copied, 93.0858 seconds, 45.1 MB/s

4. Restore the /boot filesystem

[root@localhost ~]# umount /boot
[root@localhost ~]# dd if=dev_sda1_backup of=/dev/sda1
[root@localhost ~]# mount /boot

5. Reboot

[root@localhost ~]# shutdown -r now

This sequence of events enabled me to restore my system to a post-modification situation. Of course you should test this very thoroughly for your own situation, but this offers an elegant way, which has little external dependencies.

Post restore/cleaning up:
When the system is reverted to its old situation, we are left with a logical volume and a snapshot which probably are not of use anymore. These can be cleaned up the following way:

[root@localhost ~]# lvremove /dev/vg00/lv_root_old 
Do you really want to remove active logical volume lv_root_snap? [y/n]: y
Logical volume "lv_root_snap" successfully removed 
Do you really want to remove active logical volume lv_root_old? [y/n]: y 
Logical volume "lv_root_old" successfully removed

When you use Oracle ASM (Automatic Storage Management) for your database, the permissions on the block devices on the operating system layer which are used by ASM need to be changed. To be more precise, the owner and group need to be set to ‘oracle’ and ‘dba’ (Oracle documentation) in my case.

I used to do this in a very lazy way, using a simple ‘/bin/chown oracle.dba /dev/sdb’ in /etc/rc.local. This worked for me with RHEL/OL version 5. This has changed with RHEL/OL 6, because the system V startup system has changed to ‘upstart’. Also, the disk devices change ownership back in OL6 if you set it by hand to oracle.dba.

Oracle has its own resolution for generating block device files especially for ASM, called ASMLib. This generally works well, but what is exactly happening to generate the devices, and how to troubleshoot the generation of the devices is (as far as I know) not documented nor generally known. I’ve been bitten a few times on my test systems, which is the reason I don’t like ASMLib, and do not feel comfortable with ASMLib.

The linux system has functionality which equals the block device generation of ASMLib, which is called udev. Udev isn’t exactly over-documented. Also, there seems to be a difference between the tools (scsi_id and rule file syntax) in RHEL/OL 5 and RHEL/OL 6. But there is much written about it. And it’s very simple, once you know what to do.

I have a separate (virtual) disk which I want to use with ASM. After startup, the block device of this disk is called /dev/sdb. In order to let udev identify this disk, the linux utility ‘scsi_id’ can be used to fetch the unique SCSI identifier of the disk. This is of importance, because the correct disk must be specified to be used with ASM. This sounds a bit dumb (of course we need the correct disk), but Linux names the devices in the order at which they are seen by the kernel. This means it’s possible that /dev/sdb is another disk if someone decides to add a disk to the system, and reboots the server. (please mind most servers do not change very much, nor I have seen this happening)

And there’s another complication: VMWare doesn’t provide a unique device number by default. In order to let VMWare provide Linux a unique number, ‘disk.EnableUUID = “TRUE”‘ must be added to the virtual machine configuration file (“vmx file”). After a reboot, ‘scsi_id’ is able to provide a serial!

So at this point, you should have a disk or disks to be used with ASM. First grab the unique serial of the disk (this is the RHEL6/OL6 syntax of the scsi_id command!):

# scsi_id -g -u -d /dev/sdb
36000c29127c3ae0670242b058e863393

Next, we need to make a udev rule file. In this file, we specify this SCSI id (which we fetched with scsi_id), and specify how we want that device created:

# cd /etc/udev/rules.d/
# cat 99-oracle-asmdevices.rules
KERNEL=="sd*", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="36000c29127c3ae0670242b058e863393", NAME+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"

(There is no ’99-oracle-asmdevices.rules’ file by default, you need to create it yourself)

That’s all. When the kernel finds a “sd*” device, subsystem “block”, with the serial “36000c29127c3ae0670242b058e863393″, it creates a device file ‘/dev/oracleasm/disk1′, owner oracle, etc.

This is also convenient because the disk string of ASM is very easy to set now: ‘/dev/oracleasm/*’.

Please mind if the block devices have the correct permissions, it doesn’t matter to ASM if the device file names have swapped. The ASM instance scans the device headers for it’s contents.

(Version edited after comments -> rman backup as copy)
(Version edited to include delete leftover datafile from rman)

Recently I was asked to rename a tablespace. The environment was Oracle version 11.2.0.3 (both database and clusterware/ASM).

This is the test case I build to understand how that works:
(I couldn’t find a clean, straightforward description how to do that, which is why I blog it here)

I created an empty tablespace ‘test1′ for test purposes:

SYS@v11203 AS SYSDBA> create bigfile tablespace test1 datafile size 10m;

(I use bigfile tablespaces only with ASM. Adding datafiles is such a labour intensive work, bigfile tablespaces elimenate that, when auto extent is correctly set)

A tablespace can be easily renamed with the alter tablespace rename command:

SYS@v11203 AS SYSDBA> alter tablespace test1 rename to test2;

This changes the Oracle data dictionary to reflect a new name. This doesn’t touch the underlying datafile:

SYS@v11203 AS SYSDBA> select * from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   FILE_ID TABLESPACE_NAME		       BYTES	 BLOCKS STATUS	  RELATIVE_FNO AUT   MAXBYTES  MAXBLOCKS INCREMENT_BY USER_BYTES USER_BLOCKS ONLINE_
---------- ------------------------------ ---------- ---------- --------- ------------ --- ---------- ---------- ------------ ---------- ----------- -------
+DATA/v11203/datafile/system.260.785186841
	 1 SYSTEM			   734003200	  89600 AVAILABLE	     1 YES 3.4360E+10	 4194302	 1280  732954624       89472 SYSTEM

+DATA/v11203/datafile/sysaux.261.785186847
	 2 SYSAUX			   629145600	  76800 AVAILABLE	     2 YES 3.4360E+10	 4194302	 1280  628097024       76672 ONLINE

+DATA/v11203/datafile/undotbs1.262.785186849
	 3 UNDOTBS1			   477102080	  58240 AVAILABLE	     3 YES 3.4360E+10	 4194302	  640  476053504       58112 ONLINE

+DATA/v11203/datafile/users.264.785186857
	 4 USERS			     5242880	    640 AVAILABLE	     4 YES 3.4360E+10	 4194302	  160	 4194304	 512 ONLINE

+DATA/v11203/datafile/ts.266.785323903
	 5 TS				   419430400	  51200 AVAILABLE	  1024 YES 3.5184E+13 4294967293	12800  418381824       51072 ONLINE

+DATA/v11203/datafile/iops.267.785326883
	 6 IOPS 			  5452595200	 665600 AVAILABLE	  1024 YES 3.5184E+13 4294967293	12800 5451546624      665472 ONLINE

+DATA/v11203/datafile/test1.268.789380535
	 7 TEST2			    10485760	   1280 AVAILABLE	  1024 NO	    0	       0	    0	 9437184	1152 ONLINE

To rename the datafile in ASM, offline the tablespace, copy the datafile using RMAN, rename the datafile in the Oracle data dictionary, and online the tablespace again:

Offline the tablespace:

SYS@v11203 AS SYSDBA> alter tablespace test2 offline;

Copy the datafile using RMAN:

RMAN> copy datafile '+DATA/v11203/datafile/test1.268.789380535' to '+DATA';

Starting backup at 23-JUL-12
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=28 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=+DATA/v11203/datafile/test1.268.789380535
output file name=+DATA/v11203/datafile/test2.269.789380645 tag=TAG20120723T082404 RECID=1 STAMP=789380644
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 23-JUL-12

Rename the datafile in the Oracle data dictionary:

SYS@v11203 AS SYSDBA> alter database rename file '+DATA/v11203/datafile/test1.268.789380535' to '+DATA/v11203/datafile/test2.269.789380645';

Please mind the ‘old’ filename is at ‘input datafile’, and the ‘new’ filename is at ‘output file name’ with the RMAN output.

Next, and finally: online the tablespace:

SYS@v11203 AS SYSDBA> alter tablespace test2 online;

(the old datafile is gone)

Update:
The RMAN copy command and data dictionary update could also be done with RMAN backup as copy and switch datafile:

Status of the database after renaming:

SYS@v11203 AS SYSDBA> select file_id, file_name, tablespace_name from dba_data_files;

   FILE_ID FILE_NAME					      TABLESPACE_NAME
---------- -------------------------------------------------- ------------------------------
	 1 +DATA/v11203/datafile/system.260.785186841	      SYSTEM
	 2 +DATA/v11203/datafile/sysaux.261.785186847	      SYSAUX
	 3 +DATA/v11203/datafile/undotbs1.262.785186849       UNDOTBS1
	 4 +DATA/v11203/datafile/users.264.785186857	      USERS
	 5 +DATA/v11203/datafile/ts.266.785323903	      TS
	 6 +DATA/v11203/datafile/iops.267.785326883	      IOPS
	 7 +DATA/v11203/datafile/test1.269.789411511	      TEST2

Offline the tablespace (my database is in NOARCHIVELOG, online backup (as copy) can only be done in ARCHIVELOG, when doing so, the datafile needs recovery):

RMAN> sql "alter tablespace test2 offline";

sql statement: alter tablespace test2 offline

Backup the datafile as copy. It is very convenient to use file number (file_id), this makes it much simpler to do this:

RMAN> backup as copy datafile 7 format '+DATA';

Starting backup at 23-JUL-12
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=+DATA/v11203/datafile/test1.269.789411511
output file name=+DATA/v11203/datafile/test2.268.789411665 tag=TAG20120723T170105 RECID=2 STAMP=789411665
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 23-JUL-12

Now switch to the backup copy:

RMAN> switch datafile 7 to copy;

datafile 7 switched to datafile copy "+DATA/v11203/datafile/test2.268.789411665"

And online the tablespace again:

RMAN> sql "alter tablespace test2 online";

sql statement: alter tablespace test2 online

Please mind this leaves the old datafile in place, so it needs to be removed explicitly:

RMAN> delete noprompt copy of datafile 7;

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=44 device type=DISK
List of Datafile Copies
=======================

Key     File S Completion Time Ckp SCN    Ckp Time       
------- ---- - --------------- ---------- ---------------
5       7    A 23-JUL-12       1779771    23-JUL-12      
        Name: +DATA/v11203/datafile/test1.268.789415137

deleted datafile copy
datafile copy file name=+DATA/v11203/datafile/test1.268.789415137 RECID=5 STAMP=789415216
Deleted 1 objects

An alternative is to do this with asmcmd, but it’s far less elegant:

$ asmcmd rm -f +DATA/v11203/datafile/test1.269.789411511

Some of you might have experimented with, or used Oprofile, ltrace/strace, Systemtap, and wished you could look further into the userland process, like dtrace does, because dtrace allows you to profile and look into a process’ function calls.

If you’ve used oprofile/strace/systemtap/etc., you probably found out you can see all the kernel functions which are processed, but it does not get into userspace, or with very limited information. Perhaps the only tool which is different is gdb, which enables you to see userspace function names, but gdb is not a profiler, it’s a debugger. And it works best with (I assume it made for) debugging symbols in the executable. Most (all I presume) commercial executables, like the oracle database executable, do not have debugging symbols.

Some time ago, a tweet from Kevin Closson caught my attention, which showed a profile of a Linux system running Oracle, which showed the functions inside Oracle. This totally shocked me! That is something I want to be able to do on Linux too!!

It was done using the perf utility. The perf utility is delivered with the kernel (rpm) package on EL Linuxes, and the since I am using Oracle Linux, I can tell you it’s delivered with the ‘UEK’ kernels (the 2.6.32+ ones). I’ve eventually used the latest version of the UEK kernel (2.6.39-200.24.1.el6uek.x86_64), which you get when you subscribe to the ol6_UEK_latest yum channel. Newer versions of the kernel have more perf functions.

The ‘perf top’ sub-utility shows the kernel and userland functions which are active (by sampling active processes). It can be used in a variety of ways, some of them are: system-wide, per processor or per process. This allows you to do look beyond the Oracle wait interface. As Tanel Poder often states in his performance classes: the Oracle wait interface is a good starting point for performance investigation, but sometimes you need to look deeper. You always need to use the appropriate layer to look at when investigating issues.

Note when using virtualization: virtualization does not allow direct access to the performance monitoring hardware. So if you try to use perf on that (or oprofile for that matter), you get no output. You can get around that with perf by using a software event. I use the software event cpu-clock. This means I invoke perf with the option ‘-e cpu-clock’

For example: this statement uses a lot of CPU, and might (depending on Oracle version, operating system type and version) never return an answer. But it is perfect to have a statement that consumes approximately 1 CPU, and keeps doing that for some time:

select count(*) from dba_extents, dba_extents, dba_extents

A quick glimpse on the statistics profile using sql_monitor reveals:

Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets	|
=========================================
|     109 |	103 |	  6.23 |   9755 |
=========================================

Of course I could look in the sql monitoring plan details, which would reveal what row sources are active at the moment, but let’s see what ‘perf top -e cpu-clock -p shows:

   PerfTop:     108 irqs/sec  kernel: 0.9%  exact:  0.0% [1000Hz cpu-clock],  (target_pid: 12727)
--------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ ____________________________________________________

             1414.00 58.8% kxhrHash                  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              389.00 16.2% qerhjSplitProbe           /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              263.00 10.9% rworofprVeryFastUnpackRow /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              122.00  5.1% smbget                    /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               93.00  3.9% sorgetqbf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               86.00  3.6% qersoFetch                /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               31.00  1.3% qerhjInnerProbeHashTable  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

This shows 58.8% of the samples takes of this process are busy in the function ‘kxhrHash’. So it’s the hashing which this process is busy with. Another important and interesting aspect is: there is very little time spend in kernel mode: (kernel: 0.9%). But…the information about where the time is spend could have been read in the sql monitoring report: it reports CPU and waits per row source.

Where perf comes in to fulfill something which could not be done different, is when you got multiple processes, and you cannot setup perf for every distinct process, or you want to understand system-wide performance. To mimic a system doing OLTP, I’ve setup Kevin Closson’s SLOB in PIO mode, and use perf top -e cpu-clock without specifying a process ID, so I measure the entire system. I get the following profile:

  PerfTop:     984 irqs/sec  kernel:92.4%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
--------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                           DSO
             _______ _____ __________________________________ ______________________________________________________________________

            13102.00 88.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]                                                     
              772.00  5.2% copy_page                          [kernel.kallsyms]                                                     
              298.00  2.0% __do_softirq                       [kernel.kallsyms]                                                     
              114.00  0.8% scsi_request_fn                    [kernel.kallsyms]                                                     
               29.00  0.2% kcbgtcr                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               25.00  0.2% kmem_cache_alloc                   [kernel.kallsyms]                                                     
               25.00  0.2% finish_task_switch                 [kernel.kallsyms]                                                     
               22.00  0.1% kcbzib                             /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               17.00  0.1% sxorchk                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Conclusion: the majority of the processes spend time in kernel calls (kernel.kallsyms). This is not too surprising: I've setup SLOB in 'PIO mode', which means I've reduced the Oracle database buffer cache as much as possible. With Oracle version 11.2.0.3, the smallest buffer cache I can create is 4MB. Next I've let 10 readers do index range scans (that is one of the functions of SLOB: swamping your system with single block (8KB) IO's). The majority of the processes are spending time in the kernel function _raw_spin_unlock_irqrestore. This is how a system which is swamped with physical IO looks like with perf.

This all changes when I revert SLOB to LIO mode, and issue the same test (SLOB run with 10 readers):

   PerfTop:    1004 irqs/sec  kernel: 3.1%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________________________________________

              478.00 23.9% kcbgtcr                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              154.00  7.7% qetlbr                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              149.00  7.4% kcbz_fp_buf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              133.00  6.6% kdsgrp                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              104.00  5.2% kafger                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               94.00  4.7% kcbrls                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrgcm                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrget3                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Ah, we now spend only 3.1% in kernel mode, and the majority of the time is spend in kcbgtcr (quite probably meaning 'kernel cache buffer get consistent read')! I've modified the buffer cache to be able to contain all blocks (also known as SLOB 'LIO mode'), so once they are read, only a logical IO is needed to fetch the block.

Every now and then I am asked about the availability of the presentations I have delivered. Recently somebody asked about a presentation I delivered at the OUG Scotland about multiblock reads, and I promised to make it available. I’ve now uploaded a PDF version of all my old presentations them and put them in the ‘Whitepapers and presentation’ section.

Enjoy!

I’ve got a few speaking commitments in the next few months, which might be of interest to you:

Planboard DBA Symposium

This conference is held in Utrecht, NL on the 12th of June. Some of the speakers include: Alex Nuijten and Rob van Wijk. I’ll be presenting about multiblock reads, and try to explain some of the differences which where silently introduced with Oracle version 11.

OUG Scotland Conference 2012

This is a free conference and is held in Linlithgow, Scotland, UK on the 13th of June. Some of the speakers include: Tom Kyte, Jonathan Lewis, Mark Rittman and Jože Senegacnik to name a few. I’ll be presenting about multiblock reads.

E4: Enkitec Extreme Exadata Expo 2012

This is an Exadata centric conference and is held in Irving, Texas, USA on the 13th and 14th of August. Some of the speakers include: Maria Colgan, Jonathan Lewis, Karen Morton, Kerry Osborne and Tanel Põder, to name some. I’ll be presenting about the Online Transaction Processing (OLTP) capabilities of Exadata. This conference probably has the most publically known Exadata experts together in one place. So if you are serious about Exadata, this is the place to be!

May 4th: some updates after discussion with Jeff Holt, Cary Millsap and Ron Christo of Method-R.

There’s all the documentation, and there all the terrific blogs about performance and Oracle wait events. But if you more or less start with this stuff, or encounter a wait event that is not (extensive enough) documented, or an event turns up and gives a response time you don’t expect, you need to understand what that wait event means. If you don’t know, it’s easy to get stuck at this point.

If you are familiar with waits, and just want to dig further, progress to “Get to know wait events”, if you want to get up speed with waits, read on.

What is a wait event?

This is the definition of the performance tuning guide in the Oracle documentation:

Statistics that are incremented by a server process/thread to indicate that it had to wait for an event to complete before being able to continue processing. Wait events are one of the first places for investigation when performing reactive performance tuning.

This description is quite good I think. If I would have to make a better description, I would say a wait event is essentially the registration of time spend on executing a part of the Oracle database server code, mostly doing something elementary. The granularity of wait events can differ between Oracle versions. Wait events can describe interaction with the operating system (reading or writing are good examples of this), but also interaction between processes (enqueue’s and ipc messaging are good examples of that). Needless to say, both interactions require operating system calls.

Also, a wait event is not necessary a bad thing. Every Oracle database process is either running on CPU or waiting. If your SQL*Plus session is not running, it is waiting for ‘SQL*Net message from client’, which means the database tries to tell you that it waits for the next instruction. Another interesting thing to note about wait events is that the execution of certain instrumented (covered by a wait event) parts if Oracle server code can not report a wait per kernel developer decision. This means, for example, that if doing a latch get read went efficiently, you will not get a ‘latch free’ or ‘latch: latch name‘ wait event, because the wait registration would take exponentially more time than the actual latch get.

How to get the wait events during execution

(skip this section if you are familiar with enabling and disabling sql trace at level 8)

The way to get all the waits of the execution of one or more SQL statements for a given process is to enable SQL trace, alias event 10046, at level 8. This can be done at the system level (meaning all processes) too, but unless you want to reverse engineer the interaction between ALL the process, this is not very useful. There much written about all the ways you can enable SQL trace at level 8. For the sake of simplicity, I use setting event 10046 at level 8 in the session itself. The session requires ‘alter session’ rights to be able to set it. This is how it’s done:

SQL> alter session set events '10046 trace name context forever, level 8';
Session altered.

And can be turned off the following way:

SQL> alter session set events '10046 trace name context off';
Session altered.

Get to know wait events: strace

At this point we know how to enable tracing, but want to dig deeper and understand the operating system interaction inside the waits.

To see what a process (for example your Oracle foreground process) is doing from the perspective of the Operating System, there’s the ‘strace’ executable. The ‘strace’ executable can display all the system calls an operating system process is executing if it has enough privileges to access it. It resembles SQL trace a bit, but at the Operating System/Linux layer. First, you determine the process ID of the session you want to use ‘strace’ on, then use another process to execute ‘strace’. This is what it looks like:

1. Setup a sqlplus session
$ sqlplus ts/ts@v11203

SQL*Plus: Release 11.2.0.3.0 Production on Wed Apr 25 23:03:22 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

TS@v11203 >

2. Attach to the sqlplus process with strace in a session on the linux host where the database is running:
$ strace -p 14128
Process 14128 attached - interrupt to quit
read(14,

3. Now execute something in the sqlplus session, for example ‘select * from dual’:
TS@v11203 > select * from dual;

D
-
X

TS@v11203 >

4. This results in the following list of system calls:

read(14, "\1/\6\21i \376\377\377\377\377\377\377\377\2\2"..., 8208) = 303
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372954
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\1r\6\20\27\2\374T\rD@\255\262t\t\313\242\1\247-8x"..., 370) = 370
read(14, "\25\6\3\5\"\2\17", 8208) = 21
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\254\6\4\1 \1\1{\5\2"..., 172) = 172
read(14, "\1\337\6\3^#)\4\4\376\377\377\377\377\377\377\3774"..., 8208) = 479
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "010\6\v\1\5\314\2\1p\5\207y"..., 304) = 304
read(14,

Wow! Nice! But what does it mean? Well, these are all the system calls used by the Oracle foreground process in order to execute the (simple) SQL. It’s hard to tell something about the execution of the SQL with this list. And be aware, this is only a ‘select * from dual’, so a ‘real’ SQL will probably give a list so long you do not dare to begin studying the list to get something useful out of it.

Well, how about the next chapter:

Get to know wait events: combining waits and strace

How about combining the sql trace output with the system calls? The ‘strace’ executable has quite some options, which can be very helpful:

- list the data written by the process (‘-e write=all’)
This option lists the data written by the session. The trace file is written by the server process, so it will list the contents of that.
- list a timestamp for every system call (‘-tt’)
- list the time spend doing the system call (‘-T’)
- save the output to a file (‘-o /path/to/file.txt’)

These are all helpful to get more information the system calls.

Now setup a session in the database, enable SQL trace at level 8, fetch the process ID of session and run ‘strace -ttT -e write=all -p PID’ in another session and run something. Yes, this generates a massive amount of output. But, as I use it, it’s to investigate what certain waits are doing. Here are some examples:


23:46:20.145793 pread(257, "\6\242\202\3\1\356O\6\2049\245\2\3226\264O\6"..., 1032192, 1996505088) = 1032192
23:46:20.148986 lseek(8, 0, SEEK_CUR) = 46126
23:46:20.149030 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 31 39 38 20 66 69 ad' ela= 3198 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 34 le#=5 bl ock#=664 |
| 00050 33 34 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 34 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 34 38 39 36 31 53903801 48961 |
23:46:20.149121 write(9, "J?e83~1z1\n", 10) = 10
| 00000 4a 3f 65 38 33 7e 31 7a 31 0a J?e83~1z 1. |
23:46:20.149163 write(8, "\n", 1) = 1
| 00000 0a . |
23:46:20.157813 pread(256, "\6\242\2\4\1\360O\6\2\4\217T\2\3226\264O\6"..., 1032192, 2000699392) = 1032192
23:46:20.161124 lseek(8, 0, SEEK_CUR) = 46252
23:46:20.161169 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 33 31 37 20 66 69 ad' ela= 3317 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 35 le#=5 bl ock#=665 |
| 00050 36 32 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 62 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 36 31 31 30 30 53903801 61100 |
23:46:20.161260 write(9, "J?Hv2~1z1\n", 10) = 10
| 00000 4a 3f 48 76 32 7e 31 7a 31 0a J?Hv2~1z 1. |
23:46:20.161624 lseek(8, 0, SEEK_CUR) = 46377
23:46:20.161659 write(8, "\n", 1) = 1
| 00000 0a

The pread() call is the actual IO call done by the database.
The lseek() call is the seek towards the end of the trace file, and sets the pointer for the upcoming write.
The write() call is the write in the trace file. You see the multiblock read (scattered read in the trace file) was done using a pread() call, and the time measured by strace () is line with the time measured by Oracle (ela= 3317), and the number of blocks in the trace (126) matches the number of bytes read 1032192 (126*8192).
The time spend in the system calls is between the ‘smaller’ and ‘bigger’ signs, and get removed by wordpress, so if you where searching for it in the above example, you wouldn’t find it
The write() to file descriptor 9 with the garbage in it is the write to the .trm file.
The lseek() sets the pointer in the file after the writes, and a write() is done to enter a carriage return (\n) to the tracefile.
This is a simple example, to show how a synchronous multiblock read is reported, let’s see a a-synchronous one!


00:17:27.557053 pread(256, "#\242\202\251\243O\6\1\4[\4"..., 8192, 1190150144) = 8192
00:17:27.558973 lseek(8, 0, SEEK_CUR) = 2659
00:17:27.559009 write(8, "WAIT #140244789954888: nam='db f"..., 124) = 124
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 62 20 66 54888: n am='db f |
| 00020 69 6c 65 20 73 65 71 75 65 6e 74 69 61 6c 20 72 ile sequ ential r |
| 00030 65 61 64 27 20 65 6c 61 3d 20 31 39 30 33 20 66 ead' ela = 1903 f |
| 00040 69 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 34 33 ile#=5 b lock#=43 |
| 00050 33 39 34 20 62 6c 6f 63 6b 73 3d 31 20 6f 62 6a 394 bloc ks=1 obj |
| 00060 23 3d 31 34 30 33 33 20 74 69 6d 3d 31 33 33 35 #=14033 tim=1335 |
| 00070 33 39 32 32 34 37 35 35 38 39 34 33 39224755 8943 |
00:17:27.559088 write(9, "N?Rb~22y1\n", 10) = 10
| 00000 4e 3f 52 62 7e 32 32 79 31 0a N?Rb~22y 1. |
00:17:27.559128 write(8, "\n", 1) = 1
| 00000 0a . |
00:17:27.559334 mmap(0x7f8d48e41000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48e41000
00:17:27.559450 mmap(NULL, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x1ec000) = 0x7f8d48c81000
00:17:27.559491 mmap(0x7f8d48c81000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48c81000
00:17:27.559576 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x2fc000) = 0x7f8d48a81000
00:17:27.559614 mmap(0x7f8d48a81000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48a81000
00:17:27.559834 io_submit(140244822339584, 1, {{0x7f8d48a861f8, 0, 0, 0, 256}}) = 1
00:17:27.560008 mmap(0x7f8d48ad1000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48ad1000
00:17:27.560080 io_submit(140244822339584, 1, {{0x7f8d48a86450, 0, 0, 0, 256}}) = 1
00:17:27.560272 io_getevents(140244822339584, 1, 128, {{0x7f8d48a86450, 0x7f8d48a86450, 122880, 0}}, {600, 0}) = 1
00:17:27.560363 io_getevents(140244822339584, 1, 128, {{0x7f8d48a861f8, 0x7f8d48a861f8, 106496, 0}}, {600, 0}) = 1
00:17:27.560865 lseek(8, 0, SEEK_CUR) = 2784
00:17:27.560902 write(8, "WAIT #140244789954888: nam='dire"..., 129) = 129
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 69 72 65 54888: n am='dire |
| 00020 63 74 20 70 61 74 68 20 72 65 61 64 27 20 65 6c ct path read' el |
| 00030 61 3d 20 35 38 36 20 66 69 6c 65 20 6e 75 6d 62 a= 586 f ile numb |
| 00040 65 72 3d 35 20 66 69 72 73 74 20 64 62 61 3d 34 er=5 fir st dba=4 |
| 00050 33 33 39 35 20 62 6c 6f 63 6b 20 63 6e 74 3d 31 3395 blo ck cnt=1 |
| 00060 33 20 6f 62 6a 23 3d 31 34 30 33 33 20 74 69 6d 3 obj#=1 4033 tim |
| 00070 3d 31 33 33 35 33 39 32 32 34 37 35 36 30 38 34 =1335392 24756084 |
| 00080 35 5 |

This advanced example shows how first the segment header is fetched using the ‘db file sequential read’ wait. Despite being on ASM (this is ASM!), this clearly shows the segment header is read synchronously using the pread() call.
The next system calls and waits are even more interesting!
First, memory is allocated using the mmap() call. This is the beginning of a full scan, and the stuff read from disk need to be buffered somewhere. Please mind the memory allocation is not part of any wait. That is no problem, the allocation of memory should be very fast.
Next, TWO asynchronous reads are initiated using the two io_submit() calls, mind the iocb (IO control block) addresses, 0x7f8d48a861f8 and 0x7f8d48a86450.
See the man page of the io_submit call on your linux system using ‘man io_submit’, also use the man page of io_getevents to learn more about that call
Next, the asynchronous reads are reaped using the io_getevents() calls. What is noteworthy here is, is that the two requests are reaped in REVERSE order (this is consistent for the first two asynchronous call pairs for every segment, this example is Oracle version 11.2.0.3). So first request 0x7f8d48a86450 is fetched, then 0x7f8d48a861f8.
After 0x7f8d48a861f8 is reaped, the wait is reported for the FIRST IO call (io_submit() & io_getevents() combination). The second request (0x7f8d48a86450) of the start of a full segment scan is not reported. Never. Further requests do not reverse reaping/’encapsulate’ IO requests (mostly, it can happen a few times more).

I hope you get the idea of digging deeper into waits by examining the system calls.

I am not sure if the above ‘encapsulation’ of IO calls is a bug. But think of this: if you issue multiple IO requests (like we saw above, multiple IO requests are issued!) you cannot report the waits of all these distinct IO requests, because you would get more time reported than has passed. This has some consequences for the above ‘direct path read’ waits: these do not report individual IO times (like ‘db file sequential read’ does). This is just an indication on time spend in the process it had to wait for results from its IO requests, rather than disk IO latency.

The above conclusions are the result of observation and logical reasoning from my side. If you have anything to add, please respond!

Follow

Get every new post delivered to your Inbox.

Join 1,047 other followers

%d bloggers like this: