If you are not familiar with R, this is the description from the R site: R is a language and environment for statistical computing and graphics. I encountered R at the Erasmus university, where I am working on a project with DNA data which is put in an Oracle database (see: http://huvariome.erasmusmc.nl/ (It’s down at the moment)).

R is about using statistics, and helps greatly because of the ability to make graphics of statistical data, which you can use to correlate figures. An example is shown here by Greg Rahn (Structured data) and here an example of Stefan Moeding (blog, the name is a guess, I’ve composed it from the author’s name of the articles on the blog, and I’ve guessed the name of the site (www.moeding.net) is the surname).

I probably will blog about using database contents and R in the future (most probably active session history (ash) data). This blog post is about using R and Oracle database sqltrace with waits (also known as 10046 level 8 traces or 10046/8 traces).

Getting hands-on

If you want to experiment with this, the first you need is R itself. My advice would be to get and use Rstudio, which is a free and open source R integrated development environment (IDE).

Next you need a real Oracle tracefile. If you just want to experiment and don’t have a database at hand with which you can generate one, I have a sample trace file: v11203_ora_19295.trc. I tried to read the tracefile into R (R is very good at using CSV data), and process it to make it usable, but that was very hard. In order to make the tracefile in a CSV-like format, I use a small sed (stream editor; available on Mac OS-X, Linux, well…on most modern platforms except for Windows) script called waits.sed. This script outputs all WAIT lines in the tracefile in a way it can be easily used as CSV file.

This is the content of the waits.sed script:

# format WAIT rows from an oracle 10046/sql trace to
# 1:WAIT 2:cursor number 3:wait name 4:elapsed time 5:p1 6:p2 7:p3 8:object number 9:time
s/^\(WAIT\)\ #\([0-9]*\):\ nam='\(.*\)'\ ela=\ *\([0-9]*\)\ [0-9a-z\ #|]*=\([0-9]*\)\ [0-9a-z\ #|]*=\([0-9]*\)\ [0-9a-z\ #|]*=\([0-9]*\)\ obj#=\([0-9\-]*\)\ tim=\([0-9]*\)$/\1|\2|\3|\4|\5|\6|\7|\8|\9/

In order to make a file with only waits, use the following command:

sed -f waits.sed v11203_ora_19295.trc | grep ^WAIT > v11203_ora_19295.trc.waits

This creates a new file called v11203_ora_19295.trc.waits with only the waits in a CSV format.

Now we are at the point where we get to use R! Open RStudio, and open this R scipt: tracewaits.R. Now choose ‘Edit’,'Run code’,'Run all’. If the v11203_ora_19295.trc.waits file is in the correct directory, the ‘plots’ tab on the right lower side will display the following picture:
Alright. Not much to see yet! This tells us we got two cursors with wait lines in this tracefile. To choose the waits of the one cursor of interest in this tracefile, uncomment (remove the hash) in the following line in the R script:

#trc <- subset(trc, cursor == "47388340751480")

Now run the script again. Now the graph show something far more interesting:

What do we see here? The Y axis shows the value in the ‘p3′ column of the waits. The most waits are ‘direct path read’ (light blue), which means ‘p3′ is the number of blocks read per IO call. The X axis numbers are the seconds of the timestamp in the ‘tim’ column.

We see the number of blocks read per IO call is low for the first few IO calls, then builds up to 126, and later becomes 128 (the maximum in this case). What we see here is automatically allocated extents (of which the first few allocated extents are very small, then bigger, and so on)

Now let’s add the time waited for the waits to the graph, and see if there’s a correlation between blocks read per wait and time waited per wait:

Uncomment the following line:

#p <- p + geom_line(aes(tim, ela,colour="elapsed time")) + stat_smooth(aes(y=ela))

And run the script again. This adds the time waited (ela column) to the graph as a line, and a trendline (stat_smooth):

The “ela” or “elapsed time” line shows why the bar’s in the graph do not have a constant distance between each other: the IO time heavily fluctuates (please mind this is because of my overloaded laptop :) ). Also, when looking at the beginning, I can see the IO times (on my laptop at least) for the small IO’s is lower than the bigger IO’s:
(add the following line to the script to zoom in to the beginning:)

trc <- subset(trc, p2 <= 642)

Which gives this graph:

This clearly shows the bigger (126 blocks) IO’s have a higher response time.

I hope this encourages people to use R and to extend the waits.sed script (it’s a very simple script now, and ONLY handles waits, not parse, execute, fetch, etc.), and extend the R script. Using ‘facet_grid’ you can put multiple graphs of tracefiles/waits which are exactly aligned which each other, so you can see any potential correlation between the two. Also you can overlay multiple graphs, like shown in this example: a graph showing the number of blocks per IO and a graph show the elapsed time per IO.

Have fun using and experimenting with R!

An Oracle Exadata database machine consists of several parts: Intel based servers, infiniband switches, a cisco ethernet switch, a KVM switch and the hardware surrounding it like cables, 19″ rack, power distribution units. The Intel based servers are what “Exadata administrators” are administering the most. The intention of this article is to let the reader gain a little more insight into Exadata specific administration on those.

Two server layers: computing and storage
The two layers have quite different properties: the Exadata computing layer is Linux with Oracle grid infrastructure and the Oracle database software installed, very much as you would do yourself (if you install it in a strict OFA way), and the storage layer is Linux too, but with specific Exadata storage software.

Computing layer
The computing layer filesystems are using LVM on newer (X2) versions of Exadata, and using plain partitions without LVM on the V2 version of Exadata. There are 3 filesystems on a X2 computing node:

[root@xxxxdb01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
30G 14G 15G 49% /
/dev/sda1 124M 16M 102M 14% /boot
/dev/mapper/VGExaDb-LVDbOra1
99G 29G 65G 32% /u01
tmpfs 81G 196M 81G 1% /dev/shm

These are two filesystems inside logical volume, and the boot partition on a plain partition. The /boot partition consists of the grub configuration, kernel and initrd (initial ramdisk). The root logical volume (LVDbSys1) contains the linux operating system files, and the /u01 logical volume (LVDbOra1) contains the Oracle grid/cluster and database software.

A nice thing to know about the computing layer storage is that there is space left in the single volume group which is used for the logical volumes:

[root@emc1db01 ~]# vgdisplay
--- Volume group ---
VG Name VGExaDb
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 3
Max PV 0
Cur PV 1
Act PV 1
VG Size 557.62 GB
PE Size 4.00 MB
Total PE 142751
Alloc PE / Size 39424 / 154.00 GB
Free PE / Size 103327 / 403.62 GB <<-- 400GB space left!
VG UUID AZbMbf-papI-6FE2-hYlR-bzE1-G4pp-Bou81e

The disk configuration on the computing layer is build upon a RAID 5 set of 3 disks with a hot spare. The RAID is hardware based (the diskcontroller is reponsible for the RAID maintenance). This is visible through the MegaCli64 command:

[root@emc1db01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name : DBSYS
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 557.75 GB
State : Optimal
Stripe Size : 1.0 MB
Number Of Drives : 3
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disabled
Encryption Type : None

Number of Dedicated Hot Spares: 1
0 : EnclId - 252 SlotId - 3

Exit Code: 0x00

Let me emphasize this disks/RAID-set only is responsible for serving the operating system and accompanying files, and the grid/clusterware and database executables. All true database processing is done on the storage layer nodes.

Storage layer
The storage layer filesystems are not using LVM, nor regular partitions (/dev/sd* for disks visible as SCSI devices for linux):

[root@xxxxcel01 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md5 9.9G 3.4G 6.0G 37% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/md7 2.0G 623M 1.3G 33% /opt/oracle
/dev/md4 116M 37M 74M 34% /boot
/dev/md11 2.3G 222M 2.0G 10% /var/log/oracle

The devices used are linux software RAID devices (‘md’ means multiple devices, which is a linux device which is build from independed underlying devices). The reason for using linux software RAID for the operating system and Oracle storage software is to provide redundancy, because the underlying disks on the storage servers do not provide that. Redundancy/error protection for the databases data on the storage servers is done using ASM normal redundancy, which is, put blunt and simple, RAID 10 handled on the ASM layer.

A logical next question is: okay, how do these linux software RAID/MD devices look like? Well, the most simple way is to use /proc/mdstat:

[root@emc1cel01 ~]# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[1] sda1[0]
120384 blocks [2/2] [UU]

md5 : active raid1 sdb5[1] sda5[0]
10482304 blocks [2/2] [UU]

md6 : active raid1 sdb6[1] sda6[0]
10482304 blocks [2/2] [UU]

md7 : active raid1 sdb7[1] sda7[0]
2096384 blocks [2/2] [UU]

md8 : active raid1 sdb8[1] sda8[0]
2096384 blocks [2/2] [UU]

md1 : active raid1 sdb10[1] sda10[0]
714752 blocks [2/2] [UU]

md11 : active raid1 sdb11[1] sda11[0]
2433728 blocks [2/2] [UU]

md2 : active raid1 sdb9[1] sda9[0]
2096384 blocks [2/2] [UU]

unused devices:

We see the MD device name: md and a device number, the state (‘active’), the RAID type (‘raid1′) and the actual devices the md device is build from (sdb1[1] sda1[0] for example). On the next line the number of blocks in the md device, the total number of disks and the number of active disks [2/2], and the status of the disks [UU]. “U” means up. A failed device (probably as a result of a failed disk, but it could be the result of the clearing of a partition) is visible as “_”.

Okay, this is all strongly linux related, where’s the Exadata stuff?

Cellos
‘Cell’ is a name used for the storage server in the Exadata context. Both the computing nodes and the storage nodes have a directory ‘/opt/oracle.cellos’. This directory contains Oracle software which is embedded in Linux. The function of cellos is to maintain and check firmware versions and hardware configurations and hardware activation/deactivation.

Validation functions are done by ‘cellos’, or ‘scripts inside the /opt/oracle.cellos directory’ during startup phase of linux. One of the functions cellos can perform is flashing firmware. Strongly related to cellos are the linux kernel and the OFED stack.

Imageinfo
The state of the cellos image is listed using the imageinfo utility (which exists both in /usr/local/bin and in /opt/oracle.cellos), which executes the /opt/oracle.cellos/patch/bin/imageinfo.sh script:

On a database/computing node:

[root@xxxxdb01 bin]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Image version: 11.2.2.2.0.101206.2
Image created: 2010-12-09 10:06:22 -0800
Image activated: 2011-01-28 14:19:59 -0800
Image image type: production
Image status: success
Internal version:
Image label: OSS_11.2.0.3.0_LINUX.X64_101206.2
Node type: COMPUTE
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

On a cell/storage node:

[root@xxxxcel01 bin]# imageinfo -all

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Cell version: OSS_11.2.0.3.0_LINUX.X64_101206.2
Cell rpm version: cell-11.2.2.2.0_LINUX.X64_101206.2-1

Active image version: 11.2.2.2.0.101206.2
Active image created: 2010-12-09 09:37:46 -0800
Active image activated: 2011-01-28 14:33:44 -0800
Active image type: production
Active image status: success
Active internal version:
Active image label: OSS_11.2.0.3.0_LINUX.X64_101206.2
Active node type: STORAGE
Active system partition on device: /dev/md5
Active software partition on device: /dev/md7

In partition rollback: Impossible

Cell boot usb partition: /dev/sdac1
Cell boot usb version: 11.2.2.2.0.101206.2

Inactive image version: undefined
Rollback to the inactive partitions: Impossible

We see imageinfo knows if it’s run on a computing or storage node (‘Active node type’).

CheckHWnFWProfile
Another executable in /opt/oracle.cellos is ‘CheckHWnFWProfile’. The name of this executabe is quite self-explanory: it checks if the configuration of the server it’s run on is valid, which means it is listed in one of the profiles (Exadata has different incarnations with different hardware, of which Oracle calls the hardware and firmware combination a ‘profile’).

If it’s run without any option, it should return [SUCCESS], which means the hardware and firmware are correct for that version of cellos:

[root@xxxxdb01 ~]# /opt/oracle.cellos/CheckHWnFWProfile
[SUCCESS] The hardware and firmware profile matches one of the supported profiles

If you want more information about the server, you can list the configuration with the ‘-d’ option:

[root@xxxxdb01 ~]# /opt/oracle.cellos/CheckHWnFWProfile -d
[SystemManufacturer]
SUN MICROSYSTEMS
[SystemModel]
SUN FIRE X4170 M2 SERVER
[BIOSVendor]
American Megatrends Inc.
[BIOSVersion]
08040203
[BIOSDate]
09/14/2010
[ProcessorInformation]
Cores:24
Intel(R) Xeon(R) X5670 2.93GHz
step: 2
[ILOMVersion]
3.0.9.27.a r58740
[PhysicalMemoryGB]
96
[DiskControllerPCIeSlotNumber]
PCI-E Slot 0
[DiskControllerModel]
LSI MegaRAID SAS 9261-8i
[DiskControllerFirmwareRevision]
12.9.0-0049
[DiskControllerSilicon]
B4
[DiskControllerBatteryBackupUnitVersion]
iBBU08
[InfinibandHCAPCIeSlotNumber]
PCI-E Slot 2
[FansAndFanCount]
FM0 FM1 FM2 FM3 4
[PowersuppliesAndCount]
PS0:OK PS1:OK 2
[InfinibandHCAPCIeSlotWidth]
5Gbps,x8
[InfinibandHCADeviceID]
26428
[InfinibandHCAChipRevision]
B0
[InfinibandHCAImageType]
ConnectX
[InfinibandHCAFirmwareVersion]
2.7.8100
[SASBackplaneFirmwareVersion]
NotAvailable
[DiskControllerPCIeSlotWidth]
x8
[10GEtherModelPCISlot]
82599EB PCI-E Slot 1
82599EB PCI-E Slot 1
[InfinibandHCAHardwareRev]
50
[LightsOutFirmwareVersion]
Firmware Revision : 3.0
[Enclosure:Slot:DiskType:DiskMake:DiskModel:Diskfw]
ALL:ALL SAS SEAGATE ST930003SSUN300G 0B70

If ‘CheckHWnFWProfile’ resulted in an error, you can list your actual configuration with ‘-d’, and use the ‘-s’ option to list what the profile or supported configuration is.

In my previous post I described how the IO resource manager (IORM) in the exadata storage can be used both to guarantee a minimum amount of IO a database can get (which is what is covered in most material available), but also to set a maximum amount of IO. This is what Oracle calls an inter-database resource manager plan. This is set and configured at the cell level using the cellcli with ‘alter iormplan dbplan’.

But Randy Johnson (who is one of the authors of the upcoming book ‘Expert Oracle Exadata’) commented this is not the only way to limit IO: this can be done using the database resource manager (DBRM) too. This means you can set a limit on the CPU resource usage on the database layer, which gets pushed to the storage layer and enforces IO limits according to the limit set on the database layer. Interesting!

Let’s try it!

Here’s a simple query without any resource management:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:00:44.00

Now create a simple resource manager plan, with one resource group which limits usage to 1%:

BEGIN
DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

DBMS_RESOURCE_MANAGER.CREATE_PLAN(
PLAN => 'TESTPLAN',
COMMENT => 'Testplan! Limit a user to 1 percent');

DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP (
CONSUMER_GROUP => 'LIMITED',
COMMENT => 'This is a limited group');

DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
PLAN => 'TESTPLAN',
GROUP_OR_SUBPLAN => 'LIMITED',
COMMENT => 'This group is limited to 1 percent',
MAX_UTILIZATION_LIMIT => 1);

DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
PLAN => 'TESTPLAN',
GROUP_OR_SUBPLAN => 'OTHER_GROUPS',
COMMENT => 'Mandatory group');

DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
(DBMS_RESOURCE_MANAGER.ORACLE_USER, 'CG', 'LIMITED');

DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

Mind the “MAX_UTILIZATION_LIMIT => 1″! Using this plan, all sessions in this database are not limited, except for the Oracle User ‘CG’, which has a maximal utilization limit of 1 percent.

Now grant access to the resource group ‘LIMITED’ for the user ‘CG’:

BEGIN
DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group(
grantee_name => 'CG',
consumer_group => 'LIMITED',
grant_option => FALSE);
END;
/

And activate the resource manager plan:

ALTER SYSTEM SET RESOURCE_MANAGER_PLAN = 'TESTPLAN' sid='*';

And try the same simple SQL again:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:36:10.21

This is clearly limited!

The resource management is done on the cell/storage layer. This means the waitevents do not indicate resource management (!). When looking at the waits, the process is mostly waiting for the ‘cell smart table scan’ event.

The Exadata IO Resource Manager (IORM) is a fundamental piece of the exadata database machine since the beginning. The function of the IORM is to guarantee a database and/or category (set via the database resource manager) gets the share of IO as defined in the resource plan at each storage server. The “share” here means the minimal amount of IO. This is widely known and documented in the Exadata documentation. Having a minimal amount of IO means you can guarantee a service level agreement (SLA).

But what if you want to define the maximal amount of IO a database may get? The case of a maximal amount of IO a database can get is that you can use a database on Exadata, and do not (or very limited) change the performance once more databases get onto the same database machine. So instead of having the minimal amount of IO a database must get guaranteed (which is what is documented in the exadata documentation), having a limit on the maximal amount a database can get?

This is possible through the ‘limit’ clause of the database resource plan (this option is not available to the category plan clause). This clause is not found in the documentation available to me, but is listed when calling the help function in cellcli for the ‘alter iormplan’ command.

How does such a plan look like?

CellCLI> list iormplan detail
name: dm01cel01_IORMPLAN
catPlan:
dbPlan: name=vxone,level=1,allocation=50,limit=10
name=other,level=1,allocation=50
status: active

This is the IO resource manager plan on one of the cells in a database machine.
What we see here is a plan which says we have a database plan, which says the database ‘vxone’ must always get 50% of all IO capacity at a minimum, and all others (name=other) also get 50%. But now the purpose of this blog: I’ve limited the database ‘vxone’ to only get 10%!

Does it really work? A good practice is to check all things you are told. Especially if it’s something which is widely unknown, and not in the documentation.

Let’s scan a large table without any limit set:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:00:45.97

This is the same SQL, but now with the limit set to 10% (as seen in the iormplan listed above):

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:02:27.28

It is clear the limit clause limits the IO capability of this database: scanning the same database using the same SQL on the same table differs! I’ve ran the tests a few times to see if (both) times are consistent, and they are!

During some testing I encountered an ORA-000214 during startup of an Oracle 11.2.0.2 database instance:

ORA-00214: control file '+RECO_XXXX/test/controlfile/current.334.755391511'
version 268 inconsistent with file
'+DATA_XXXX/test/controlfile/current.299.755390399' version 265

This is a RAC instance on Exadata, but all techniques in this article will work on any Oracle 11.2.x database using ASM.

This message means the database found two controlfiles which have a different version. If this message appears when the database is open, the database will crash. If an instance is startup after this message, the same error appears, and the database remains in nomount state. Further diagnosis: the control file version in the recovery area is more recent than the version in the data diskgroup (version 268 versus version 265).

The obvious and correct way of resolving this issue is to copy the latest control file version in the recovery area over the older one in the data diskgroup.

The most simple and obvious way is to use asmcmd:

[oracle@xxxxdb01 [test1] ~]$ +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
[oracle@xxxxdb01 [+ASM1] ~]$ asmcmd

ASMCMD> cp +RECO_XXXX/test/controlfile/current.334.755391511 +DATA_XXXX/test/controlfile/current.299.755390399
copying +RECO_XXXX/test/controlfile/current.334.755391511 -> +DATA_XXXX/test/controlfile/current.299.755390399
ASMCMD-08016: copy source->'+RECO_XXXX/test/controlfile/current.334.755391511' and target->'+DATA_XXXX/test/controlfile/current.299.755390399' failed
ORA-15056: additional error message
ORA-15046: ASM file name '+DATA_XXXX/test/controlfile/current.299.755390399' is not in single-file creation form
ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 410
ORA-06512: at line 3 (DBD ERROR: OCIStmtExecute)
ASMCMD>

This means asmcmd can not be used to copy the controlfile.

The resolution for this issue is to use rman to copy the controlfile:

[oracle@xxxxdb01 [test1] ~]$ rman target /

Recovery Manager: Release 11.2.0.2.0 - Production on Sat Jul 2 17:23:32 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.

connected to target database: TEST (not mounted)

RMAN> restore controlfile to '+DATA_XXXX' from '+RECO_XXXX/controlfile/current.334.755391511';

Starting restore at 02-JUL-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=847 instance=test1 device type=DISK

channel ORA_DISK_1: copied control file copy
Finished restore at 02-JUL-11

RMAN> exit

By only specifying the diskgroup, a new controlfile is created using OMF (Oracle Managed Files) naming.

At this point we have a valid, consistent control file in the data diskgroup, in order for the database to be able to use it, we need to do the following tasks:
- Gather the name of the new, consistent controlfile in the data diskgroup.
- Remove the old, inconsistent controlfile in the data diskgroup.
- Set the control_files parameter to use the new, copied version in the data diskgroup and use the copy in the recovery area.

Gather the name of the new controlfile in the data diskgroup:
[oracle@xxxxdb01 [+ASM1] ~]$ asmcmd
ASMCMD> find -t controlfile +DATA_EMC1/test *
WARNING:option 't' is deprecated for 'find'
please use 'type'

+DATA_XXXX/test/CONTROLFILE/current.299.755390399
+DATA_XXXX/test/CONTROLFILE/current.314.755457847
ASMCMD>
We know the old, inconsistent version from the errormessage: current.299.755390399. So the new version is current.314.755457847.

Remove the old one:
ASMCMD> rm +DATA_EMC1/test/CONTROLFILE/current.299.755390399

And set the parameter and bounce the instance to test if it works:
[oracle@xxxxdb01 [+ASM1] ~]$ test1
The Oracle base for ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1 is /u01/app/oracle
[oracle@xxxxdb01 [test1] ~]$ sq

SQL*Plus: Release 11.2.0.2.0 Production on Sat Jul 2 17:30:22 2011

Copyright (c) 1982, 2010, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> alter system set control_files='+DATA_XXXX/test/CONTROLFILE/current.314.755457847','+RECO_XXXX/test/controlfile/current.334.755391511' scope=spfile sid='*';

System altered.

SQL> startup force;
ORACLE instance started.

Total System Global Area 1043886080 bytes
Fixed Size 2233088 bytes
Variable Size 478153984 bytes
Database Buffers 423624704 bytes
Redo Buffers 139874304 bytes
Database mounted.
Database opened.
SQL>

The 17th of may 2011 Planboard is organising the 6th version of the Planboard DBA Symposium. The symposium is a whole day event, in Amstelveen near Amsterdam.

There is a wide range of Oracle database topics, including Apex, Oracle GoldenGate, Oracle E-Business suite and Oracle Exadata (presented by me :) !).

If you are interested, take a look at the full agenda!

It’s been a while since I’ve post the first part of this series.

Considerations

I created a little tool, snap, which is more or less a combination of Tanel Poder’s snapper and a script from Tom Kyte. I used this for the first part to give me all the details of database process I was using. The next step is invoking Oracle’s Parallel Query feature. Parallel Query means multiple processing are working together to execute a query.

When measuring the session with snap, I only get the statistics of “my” database session, not the statistics of all the Parallel Query slaves which are working for me. This means I get too little details measuring only my own session. The only way I could think of to measure all relevant statistics in an easy way, is to gather the same statistics as my snap tool, but globally. (global means all processes and in all instances of the database) The result is gsnap; global snap. In my case, I am the only user in the database when doing anything, so the global stats are about my session, the parallel query slaves I have invoked and the background processes. If multiple sessions are working in the database, you cannot use gsnap statistics to determine what your session has done, because it’s unknown which session contributed to the statistics.

Another interesting thing came up during the investigation: when looking in the SQL area (v$sqlarea for example), the statistics with the SQL are the statistics of all the session involved (parallel query slaves) accumulated for that instance. While this sounds logical, it means that the elapsed time with the SQL is not the same as the elapsed time the requesting session experienced. To get the average(!) ‘real’ elapsed time, you need to divide the elapsed time by the number of PX slaves used. The number of slaves can be found using the ‘PX_SERVERS_EXECUTIONS’ column. To get the number slaves used per execution (on average, it’s accumulated), divide by ‘EXECUTIONS’. In case of RAC, the number of executions is only available in the instance in which the SQL is executed, the other instances report ’0′.

In this part I investigate the execution of parallel query on a half rack exadata. This means 4 instances. When measuring the statistics with 4 instances, there is a lot of information. A real lot. For the sake of readability, I do not include all statistics, only the statistics I deem important or interesting.

Parallel query slave usage ‘spills’ to other instances in a RAC database at a certain point. When parallel is set to 2-16 the slaves remain local, when set to 17 all the instances are used. There are no other sessions in any of the instances. The parameter ‘PARALLEL_DEGREE_LIMIT’ is set to CPU. 16 is the number CPU threads on an Exadata V2 database node. (2 sockets, 8 core’s, 16 threads per database node)

In the first part, I have gotten a response time of 37 seconds. In fact, the table has changed a bit so it is different from the first post, but consider it roughly the same. (the same query now takes 33 seconds)

Parallel 16 – local node

As Kevin Closson points out in his comment, the query in the first part was CPU bound, and could be made faster if I could find extra CPU cycles or a faster CPU. Well, I am not going to change the hardware. But wait! I have 16 CPU threads, and the query only uses one. So the obvious next step is to use these threads. To do this, I am going to use Oracle’s Parallel Query functionality. This can be done in several ways, I choose to use the parallel hint in the SQL:

select /*+parallel(16)*/ count(*) from cg_var where sample_id=1;

Wow, the response time now is 13.32 seconds!

What is the result of this parallellisation?

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |   307K(100)|          |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |     3 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000 |     1 |     3 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |     3 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |          |   399M|  1141M|   307K  (1)| 01:01:26 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| CG_VAR   |   399M|  1141M|   307K  (1)| 01:01:26 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                                                                                                                          

   6 - storage(:Z>=:Z AND :Z<=:Z)
       filter("SAMPLE_ID"=1)                                                                                                                                                                                                                                                                                 

Note
-----
   - Degree of Parallelism is 16 because of hint

How do the statistics look like?


i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests			   127,557		    7,645.497
1,STAT physical read total multi block requests 	   127,445		    7,638.784
1,STAT physical read requests optimized 			14			 .839
1,STAT physical read total bytes		   133,276,368,896	    7,988,304,014.076
1,STAT physical write total IO requests 			24			1.439
1,STAT physical write total bytes			   372,736		   22,341.008
1,STAT cell physical IO interconnect bytes	     4,678,771,904	      280,435,704.328
1,STAT cell physical IO bytes eligible for pred    133,275,516,928	    7,988,252,948.914
1,STAT cell physical IO bytes saved by storage		14,680,064		  879,892.025
1,STAT cell smart IO session cache lookups		       192		       11.508
1,STAT cell smart IO session cache hits 		       192		       11.508
1,STAT cell physical IO interconnect bytes retu      4,676,846,784	      280,320,316.702
1,STAT cell session smart scan efficiency		       447		       26.792
1,STAT table scans (short tables)				 1			 .060
1,STAT table scans (long tables)			       208		       12.467
1,STAT table scans (rowid ranges)			       208		       12.467
1,STAT table scans (direct read)			       208		       12.467
1,STAT table scan rows gotten			       391,553,915	       23,468,914.533
1,STAT table scan blocks gotten 			 3,995,742		  239,496.335
1,STAT cell scans					       208		       12.467
1,STAT cell blocks processed by cache layer		16,270,709		  975,231.927
1,STAT cell blocks processed by txn layer		16,270,709		  975,231.927
1,STAT cell blocks processed by data layer		16,267,256		  975,024.961
1,STAT cell blocks helped by minscn optimizatio 	16,270,709		  975,231.927
1,STAT cell simulated session smart scan effici    133,261,361,152	    7,987,404,481.604
1,STAT cell IO uncompressed bytes		   133,261,361,152	    7,987,404,481.604
1,WAIT cell smart table scan				       185.188		       11.100
1,WAIT PX Idle Wait					       426.043		       25.536
1,WAIT PX Deq: Join ACK 					  .007			 .000
1,WAIT PX Deq Credit: send blkd 				  .003			 .000
1,WAIT PX Deq: Parse Reply					  .006			 .000
1,WAIT PX Deq: Execute Reply					13.104			 .785
1,WAIT enq: CF - contention					  .000			 .000
1,WAIT PX Deq: Signal ACK RSG					  .000			 .000
1,WAIT PX Deq: Signal ACK EXT					  .011			 .001
1,WAIT PX Deq: reap credit					  .003			 .000
1,WAIT PX Nsq: PQ load info query				  .200			 .012
1,WAIT PX Deq: Slave Session Stats				  .053			 .003
1,WAIT enq: PS - contention					  .009			 .001
1,WAIT enq: TO - contention					  .000			 .000
1,TIME repeated bind elapsed time				  .000			 .000
1,TIME parse time elapsed					  .009			 .001
1,TIME background cpu time					  .037			 .002
1,TIME PL/SQL execution elapsed time				  .001			 .000
1,TIME DB CPU							22.180			1.329
1,TIME sql execute elapsed time 			       207.183		       12.418
1,TIME DB time						       207.355		       12.428
1,TIME background elapsed time					  .305			 .018

As I’ve stated above, the query response time was 13.32 seconds. Simplistic profile:

 

elapsed time                        : 207
CPU                                 :  22
wait / cell smart table scan        : 185

 

Apparently, I’ve spend 207 seconds of total elapsed time (foreground and PX slaves). 207/16=12.94 a little less than the response time. (13.32)
The profile shows that by enlarging the CPU capacity with PX, the bottleneck shifts, probably towards storage. Are we pushing the storage? Lets see: total amount processed 133,276,368,896, IO eliminated by storage indexes: 14,680,064, that leaves 133,261,688,832.
We got 7 storage servers, so divide by 7: 133,261,688,832/7=19,037,384,119 = average amount of IO done by every cell for this query. If we take 19,037,384,119 and divide that by the total time the query took (13.32); 19,037,384,119/13.32=1,429,233,042. So approximately 1.4GB per second per cell is physically read. I managed to get 1.5GB/s out of a cell during other tests, so we are near the maximum.

This would mean that if I enlarge the parallelism number further, which means multiple instances are used, the response time will not significantly decrease further. Well, let’s see:

Parallel 64 – the entire cluster

select /*+parallel(64)*/ count(*) from cg_var where sample_id=1;

The response time is 12.88 seconds.

 

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       | 76776 (100)|          |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |     3 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000 |     1 |     3 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |     3 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |          |   399M|  1141M| 76776   (1)| 00:15:22 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| CG_VAR   |   399M|  1141M| 76776   (1)| 00:15:22 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                                                                                                                          

   6 - storage(:Z>=:Z AND :Z<=:Z)
       filter("SAMPLE_ID"=1)                                                                                                                                                                                                                                                                                 

Note
-----
   - Degree of Parallelism is 64 because of hint

 

Statistics:


---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests			    33,306		    1,977.687
1,STAT physical read total multi block requests 	    33,197		    1,971.214
1,STAT physical read requests optimized 			 3			 .178
1,STAT physical read total bytes		    34,602,745,856	    2,054,686,537.670
1,STAT physical write total IO requests 			14			 .831
1,STAT physical write total bytes			   240,640		   14,289.033
1,STAT cell physical IO interconnect bytes	     1,208,841,744	       71,780,166.462
1,STAT cell physical IO bytes eligible for pred     34,602,024,960	    2,054,643,731.376
1,STAT cell physical IO bytes saved by storage		 3,145,728		  186,791.100
1,STAT cell smart IO session cache lookups		       200		       11.876
1,STAT cell smart IO session cache hits 		       200		       11.876
1,STAT cell physical IO interconnect bytes retu      1,207,442,960	       71,697,107.658
1,STAT cell session smart scan efficiency		       450		       26.721
1,STAT table scans (short tables)				 1			 .059
1,STAT table scans (long tables)			       216		       12.826
1,STAT table scans (rowid ranges)			       216		       12.826
1,STAT table scans (direct read)			       216		       12.826
1,STAT table scan rows gotten			       101,106,534		6,003,634.369
1,STAT table scan blocks gotten 			 1,031,921		   61,274.738
1,STAT cell scans					       216		       12.826
1,STAT cell blocks processed by cache layer		 4,223,803		  250,806.430
1,STAT cell blocks processed by txn layer		 4,223,803		  250,806.430
1,STAT cell blocks processed by data layer		 4,223,503		  250,788.616
1,STAT cell blocks helped by minscn optimizatio 	 4,223,803		  250,806.430
1,STAT cell simulated session smart scan effici     34,598,936,576	    2,054,460,345.321
1,STAT cell IO uncompressed bytes		    34,598,936,576	    2,054,460,345.321
1,WAIT cell smart table scan				       192.931		       11.456
1,WAIT library cache pin					  .006			 .000
1,WAIT library cache lock					  .002			 .000
1,WAIT library cache: mutex X					  .000			 .000
1,WAIT PX Idle Wait					       448.150		       26.611
1,WAIT PX Deq: Join ACK 					  .007			 .000
1,WAIT PX Deq Credit: send blkd 				  .003			 .000
1,WAIT PX Deq: Parse Reply					  .006			 .000
1,WAIT PX Deq: Execute Reply					12.591			 .748
1,WAIT PX Deq: Execution Msg					 3.475			 .206
1,WAIT PX Deq: Signal ACK RSG					  .000			 .000
1,WAIT PX Deq: Signal ACK EXT					  .014			 .001
1,WAIT PX Deq: reap credit					  .016			 .001
1,WAIT PX Nsq: PQ load info query				  .200			 .012
1,WAIT PX Deq: Slave Session Stats				  .087			 .005
1,WAIT enq: PS - contention					  .024			 .001
1,WAIT enq: TO - contention					  .000			 .000
1,TIME repeated bind elapsed time				  .000			 .000
1,TIME parse time elapsed					  .013			 .001
1,TIME background cpu time					  .039			 .002
1,TIME PL/SQL execution elapsed time				  .001			 .000
1,TIME DB CPU							 6.539			 .388
1,TIME sql execute elapsed time 			       199.272		       11.833
1,TIME DB time						       199.499		       11.846
1,TIME background elapsed time					  .582			 .035
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
2,STAT physical read total IO requests			    31,750		    1,885.293
2,STAT physical read total multi block requests 	    31,666		    1,880.305
2,STAT physical read requests optimized 			10			 .594
2,STAT physical read total bytes		    32,994,410,496	    1,959,184,723.276
2,STAT physical write total IO requests 			12			 .713
2,STAT physical write total bytes			   196,608		   11,674.444
2,STAT cell physical IO interconnect bytes	     1,153,927,168	       68,519,377.838
2,STAT cell physical IO bytes eligible for pred     32,993,722,368	    1,959,143,862.723
2,STAT cell physical IO bytes saved by storage		10,485,760		  622,637.001
2,STAT cell smart IO session cache lookups		       190		       11.282
2,STAT cell smart IO session cache hits 		       190		       11.282
2,STAT cell physical IO interconnect bytes retu      1,152,649,216	       68,443,493.954
2,STAT cell session smart scan efficiency		       451		       26.780
2,STAT cell scans					       206		       12.232
2,STAT cell blocks processed by cache layer		 4,026,750		  239,105.563
2,STAT cell blocks processed by txn layer		 4,026,750		  239,105.563
2,STAT cell blocks processed by data layer		 4,026,283		  239,077.832
2,STAT cell blocks helped by minscn optimizatio 	 4,026,750		  239,105.563
2,STAT cell simulated session smart scan effici     32,983,310,336	    1,958,525,603.638
2,STAT cell IO uncompressed bytes		    32,983,310,336	    1,958,525,603.638
2,WAIT cell smart table scan				       193.646		       11.499
2,WAIT library cache pin					  .006			 .000
2,WAIT library cache lock					  .006			 .000
2,WAIT library cache: mutex X					  .000			 .000
2,WAIT PX Idle Wait					       460.706		       27.356
2,WAIT PX Deq Credit: send blkd 				  .006			 .000
2,WAIT PX Deq: Execution Msg					 2.922			 .173
2,WAIT enq: CO - master slave det				  .001			 .000
2,WAIT PX Deq: reap credit					  .007			 .000
2,WAIT PX Deq: Slave Session Stats				  .114			 .007
2,WAIT enq: PS - contention					  .012			 .001
2,TIME repeated bind elapsed time				  .000			 .000
2,TIME parse time elapsed					  .020			 .001
2,TIME background cpu time					  .062			 .004
2,TIME DB CPU							 6.462			 .384
2,TIME sql execute elapsed time 			       199.424		       11.842
2,TIME DB time						       199.637		       11.854
2,TIME background elapsed time					  .282			 .017
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
3,STAT physical read total IO requests			    33,001		    1,959.576
3,STAT physical read total multi block requests 	    32,902		    1,953.697
3,STAT physical read requests optimized 			 1			 .059
3,STAT physical read total bytes		    34,282,340,352	    2,035,661,085.805
3,STAT physical write total IO requests 			12			 .713
3,STAT physical write total bytes			   196,608		   11,674.444
3,STAT cell physical IO interconnect bytes	     1,201,407,024	       71,338,698.054
3,STAT cell physical IO bytes eligible for pred     34,281,635,840	    2,035,619,252.381
3,STAT cell physical IO bytes saved by storage		 1,048,576		   62,263.700
3,STAT cell smart IO session cache lookups		       198		       11.757
3,STAT cell smart IO session cache hits 		       198		       11.757
3,STAT cell physical IO interconnect bytes retu      1,200,112,688	       71,261,841.300
3,STAT cell session smart scan efficiency		       451		       26.780
3,STAT cell scans					       214		       12.707
3,STAT cell blocks processed by cache layer		 4,185,106		  248,508.630
3,STAT cell blocks processed by txn layer		 4,185,106		  248,508.630
3,STAT cell blocks processed by data layer		 4,184,655		  248,481.850
3,STAT cell blocks helped by minscn optimizatio 	 4,185,106		  248,508.630
3,STAT cell simulated session smart scan effici     34,280,693,760	    2,035,563,312.338
3,STAT cell IO uncompressed bytes		    34,280,693,760	    2,035,563,312.338
3,WAIT cell smart table scan				       193.330		       11.480
3,WAIT library cache pin					  .002			 .000
3,WAIT library cache lock					  .003			 .000
3,WAIT library cache: mutex X					  .000			 .000
3,WAIT PX Idle Wait					       456.274		       27.093
3,WAIT PX Deq Credit: send blkd 				  .005			 .000
3,WAIT PX Deq: Execution Msg					 3.006			 .179
3,WAIT enq: CO - master slave det				  .004			 .000
3,WAIT PX Deq: reap credit					  .007			 .000
3,WAIT PX Deq: Slave Session Stats				  .086			 .005
3,WAIT enq: PS - contention					  .009			 .001
3,TIME repeated bind elapsed time				  .000			 .000
3,TIME parse time elapsed					  .011			 .001
3,TIME background cpu time					  .038			 .002
3,TIME DB CPU							 7.078			 .420
3,TIME sql execute elapsed time 			       199.347		       11.837
3,TIME DB time						       199.570		       11.850
3,TIME background elapsed time					  .449			 .027
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
4,STAT physical read total IO requests			    30,213		    1,794.027
4,STAT physical read total multi block requests 	    30,129		    1,789.039
4,STAT physical read total bytes		    31,398,838,272	    1,864,440,774.857
4,STAT physical write total IO requests 			12			 .713
4,STAT physical write total bytes			   196,608		   11,674.444
4,STAT cell physical IO interconnect bytes	     1,118,129,280	       66,393,724.606
4,STAT cell physical IO bytes eligible for pred     31,398,133,760	    1,864,398,941.433
4,STAT cell smart IO session cache lookups		       180		       10.688
4,STAT cell smart IO session cache hits 		       180		       10.688
4,STAT cell physical IO interconnect bytes retu      1,116,834,944	       66,316,867.852
4,STAT cell session smart scan efficiency		       442		       26.246
4,STAT cell scans					       196		       11.638
4,STAT cell blocks processed by cache layer		 3,833,202		  227,612.819
4,STAT cell blocks processed by txn layer		 3,833,202		  227,612.819
4,STAT cell blocks processed by data layer		 3,832,789		  227,588.296
4,STAT cell blocks helped by minscn optimizatio 	 3,833,202		  227,612.819
4,STAT cell simulated session smart scan effici     31,398,207,488	    1,864,403,319.350
4,STAT cell IO uncompressed bytes		    31,398,207,488	    1,864,403,319.350
4,WAIT cell smart table scan				       192.808		       11.449
4,WAIT library cache pin					  .007			 .000
4,WAIT library cache lock					  .002			 .000
4,WAIT library cache: mutex X					  .000			 .000
4,WAIT PX Idle Wait					       444.140		       26.373
4,WAIT PX Deq Credit: send blkd 				  .005			 .000
4,WAIT PX Deq: Execution Msg					 4.306			 .256
4,WAIT enq: CO - master slave det				  .004			 .000
4,WAIT PX Deq: reap credit					  .005			 .000
4,WAIT PX Deq: Slave Session Stats				  .082			 .005
4,WAIT enq: PS - contention					  .008			 .000
4,TIME repeated bind elapsed time				  .000			 .000
4,TIME parse time elapsed					  .019			 .001
4,TIME background cpu time					  .042			 .002
4,TIME DB CPU							 6.086			 .361
4,TIME sql execute elapsed time 			       198.045		       11.760
4,TIME DB time						       198.275		       11.773
4,TIME background elapsed time					  .538			 .032

Here we see the requests are spread over the available instances. Also, response time is only marginally lower (from 13.32 to 12.88), because the current limitation (physical disk access) hasn’t changed. The average amount of IO becomes:
Total amount of IO: 34,602,745,856+32,994,410,496+34,282,340,352+31,398,838,272=133,278,334,976
Storage index saved: 3,145,728+10,485,760+1,048,576+0=14,680,064
Amount of physical IO per cell: 133,263,654,912 (total amount with storage index IO saved substracted)/7/12.88=1,478,079,579 = 1.4GB/s, which is approximately the same as the previous SQL.

Let’s build the same profile (four times):

 

1- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
2- sql execute elapsed: 199, cpu time: 6, cell smart table scan: 194
3- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
4- sql execute elapsed: 198, cpu time: 6, cell smart table scan: 193

 

As expected: we are totally IO bound.

Can we push this further? Yes! Stay tuned for part 3!

This is an investigation into an half rack database machine (the half rack database machine at VX Company). It’s an exadata/database V2, which means SUN hardware and database and cell (storage) software version 11.2.

I build a table (called ‘CG_VAR’), which consists of:
- bytes: 50787188736 (47.30 GB)
- extents: 6194
- blocks: 6199608

The table doesn’t have a primary key, nor any other constraints, nor any indexes. (of course this is not a real life situation)

No exadata optimisation

At first I disabled the Oracle storage optimisation using the session parameter ‘CELL_OFFLOAD_PROCESSING’:
alter session set cell_offload_processing=false;

Then executed: select count(*) from cg_var where sample_id=1;
The value ’1′ in the table ‘CG_VAR’ accounts for roughly 25%.

Execution plan:

Execution Plan
----------------------------------------------------------
Plan hash value: 2301354116
-------------------------------------------------------------------------------------
| Id  | Operation                  | Name   | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |        |     1 |     3 |  1692K  (1)| 05:38:34 |
|   1 |  SORT AGGREGATE            |        |     1 |     3 |            |    |
|*  2 |   TABLE ACCESS STORAGE FULL| CG_VAR |   395M|  1131M|  1692K  (1)| 05:38:34 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("SAMPLE_ID"=1)

Statistics:

592,STAT opened cursors cumulative                                                     1
592,STAT user calls                                                                    3
592,STAT session logical reads                                                   6178215
592,STAT CPU used when call started                                                15398
592,STAT CPU used by this session                                                  15398
592,STAT DB time                                                                   30667
592,STAT user I/O wait time                                                        15397
592,STAT non-idle wait time                                                        15424
592,STAT non-idle wait count                                                       96193
592,STAT session uga memory                                                      2105504
592,STAT session uga memory max                                                   982320
592,STAT session pga memory                                                      1572864
592,STAT session pga memory max                                                  1572864
592,STAT enqueue waits                                                                 3
592,STAT enqueue requests                                                              2
592,STAT enqueue conversions                                                           3
592,STAT enqueue releases                                                              2
592,STAT global enqueue gets sync                                                      5
592,STAT global enqueue releases                                                       2
592,STAT physical read total IO requests                                           48448
592,STAT physical read total multi block requests                                  48446
592,STAT physical read requests optimized                                              2
592,STAT physical read total bytes                                           50610774016
592,STAT cell physical IO interconnect bytes                                 50610774016
592,STAT ges messages sent                                                             4
592,STAT consistent gets                                                         6178215
592,STAT consistent gets from cache                                                  142
592,STAT consistent gets from cache (fastpath)                                       142
592,STAT consistent gets direct                                                  6178073
592,STAT physical reads                                                          6178073
592,STAT physical reads direct                                                   6178073
592,STAT physical read IO requests                                                 48448
592,STAT physical read bytes                                                 50610774016
592,STAT calls to kcmgcs                                                             142
592,STAT calls to get snapshot scn: kcmgss                                             1
592,STAT file io wait time                                                         17477
592,STAT Number of read IOs issued                                                 48448
592,STAT no work - consistent read gets                                          6178073
592,STAT table scans (long tables)                                                     1
592,STAT table scans (direct read)                                                     1
592,STAT table scan rows gotten                                               1596587000
592,STAT table scan blocks gotten                                                6178073
592,STAT session cursor cache hits                                                     1
592,STAT session cursor cache count                                                    1
592,STAT parse count (total)                                                           1
592,STAT execute count                                                                 1
592,STAT bytes sent via SQL*Net to client                                            528
592,STAT bytes received via SQL*Net from client                                       11
592,STAT SQL*Net roundtrips to/from client                                             2
592,STAT cell flash cache read hits                                                    2
592,WAIT enq: KO - fast object checkpoint                                            621
592,WAIT direct path read                                                      153970472
592,WAIT SQL*Net message to client                                                     3
592,WAIT SQL*Net message from client                                          1782253728
592,WAIT kfk: async disk IO                                                       264299
592,WAIT events in waitclass Other                                                  2236
592,TIME parse time elapsed                                                           20
592,TIME DB CPU                                                                153981591
592,TIME sql execute elapsed time                                              306726075
592,TIME DB time                                                               306726219

The executionplan is different with the operation ‘TABLE ACCESS STORAGE FULL’. This means the database is aware the segment is on cell/exadata storage.

We see here the SQL took 306,726,095 microseconds (306.726 seconds; time: parse time+sql execute elapsed time) for scanning 47.30 GB (!!), of which roughly half the time (153.970 seconds) was taken by the wait ‘direct path read’. The I/O wait-time is also in the statistic ‘user I/O wait time’: 15397.

I’ve executed the SQL before; there is no hard parse in the ‘time section’, there’s also no ‘parse count (hard)’ in the statistics section.

I’ve done 48448 IO requests (‘physical read total IO requests’), of which almost all IO’s (48446) where multiblock requests (‘physical read total multi block requests’). This means the average IO wait time is: 153,970,472/48,448 = 3,178 which is 3ms!

Using exadata optimisation, without storage indexes

In order to enable exadata optimisation, I reverted the session parameter ‘CELL_OFFLOAD_PROCESSING’ to ‘TRUE’.

One of the optimisation techniques of exadata is called ‘storage index’. A storage index is a list kept in the memory of the cellservers which keeps track of minimum and maximum values of fields per chunk of 1 MB (called ‘storage region’). Storage indexes are build automatically during usage of the ‘storage region’. A way to flush the storage index is to restart the cell servers. (Warning! Do not do this in a production environment!)

Execution plan:

Execution Plan
----------------------------------------------------------
Plan hash value: 2301354116
-------------------------------------------------------------------------------------
| Id  | Operation                  | Name   | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |        |     1 |     3 |  1692K  (1)| 05:38:34 |
|   1 |  SORT AGGREGATE            |        |     1 |     3 |            |    |
|*  2 |   TABLE ACCESS STORAGE FULL| CG_VAR |   395M|  1131M|  1692K  (1)| 05:38:34 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - storage("SAMPLE_ID"=1)
       filter("SAMPLE_ID"=1)

Statistics:

592,STAT opened cursors cumulative                                                     1
592,STAT user calls                                                                    3
592,STAT session logical reads                                                   6178215
592,STAT CPU used when call started                                                 3707
592,STAT CPU used by this session                                                   3707
592,STAT DB time                                                                    3856
592,STAT user I/O wait time                                                          156
592,STAT non-idle wait time                                                          156
592,STAT non-idle wait count                                                        2437
592,STAT session uga memory                                                     24054408
592,STAT session uga memory max                                                 21948904
592,STAT session pga memory                                                     25755648
592,STAT session pga memory max                                                 23527424
592,STAT enqueue waits                                                                 3
592,STAT enqueue requests                                                              2
592,STAT enqueue conversions                                                           3
592,STAT enqueue releases                                                              2
592,STAT global enqueue gets sync                                                      5
592,STAT global enqueue releases                                                       2
592,STAT physical read total IO requests                                           53714
592,STAT physical read total multi block requests                                  52661
592,STAT physical read total bytes                                           50610774016
592,STAT cell physical IO interconnect bytes                                  5575531904
592,STAT ges messages sent                                                             4
592,STAT consistent gets                                                         6178215
592,STAT consistent gets from cache                                                  142
592,STAT consistent gets from cache (fastpath)                                       142
592,STAT consistent gets direct                                                  6178073
592,STAT physical reads                                                          6178073
592,STAT physical reads direct                                                   6178073
592,STAT physical read IO requests                                                 53714
592,STAT physical read bytes                                                 50610774016
592,STAT calls to kcmgcs                                                             142
592,STAT calls to get snapshot scn: kcmgss                                             1
592,STAT cell physical IO bytes eligible for predicate offload               50610774016
592,STAT cell physical IO interconnect bytes returned by smart scan           5575531904
592,STAT cell session smart scan efficiency                                            9
592,STAT table scans (long tables)                                                     1
592,STAT table scans (direct read)                                                     1
592,STAT table scan rows gotten                                                387130468
592,STAT table scan blocks gotten                                                1503621
592,STAT cell scans                                                                    1
592,STAT cell blocks processed by cache layer                                    6549118
592,STAT cell blocks processed by txn layer                                      6549118
592,STAT cell blocks processed by data layer                                     6183339
592,STAT cell blocks helped by minscn optimization                               6549118
592,STAT cell simulated session smart scan efficiency                        50653913088
592,STAT cell IO uncompressed bytes                                          50653913088
592,STAT session cursor cache count                                                    1
592,STAT parse count (total)                                                           1
592,STAT execute count                                                                 1
592,STAT bytes sent via SQL*Net to client                                            528
592,STAT bytes received via SQL*Net from client                                      327
592,STAT SQL*Net roundtrips to/from client                                             2
592,WAIT enq: KO - fast object checkpoint                                            492
592,WAIT cell smart table scan                                                   1560351
592,WAIT SQL*Net message to client                                                     3
592,WAIT SQL*Net message from client                                             5605106
592,WAIT events in waitclass Other                                                  1590
592,TIME parse time elapsed                                                           46
592,TIME DB CPU                                                                 37064365
592,TIME sql execute elapsed time                                               38571803
592,TIME DB time                                                                38571968

The execution plan shows the ‘TABLE ACCESS STORAGE FULL’ again, to indicate cell storage, but now it was able to offload, which is visible in the predicate information section with the predicate ‘storage(“SAMPLE_ID”=1)’. Let’s see the differences:

NR ,TYP Statistic                                                                  1st           2nd          Diff
592,STAT opened cursors cumulative                                                     1              1              0
592,STAT user calls                                                                    3              3              0
592,STAT session logical reads                                                   6178215        6178215              0
592,STAT CPU used when call started                                                15398           3707         -11691
592,STAT CPU used by this session                                                  15398           3707         -11691
592,STAT DB time                                                                   30667           3856         -26811
592,STAT user I/O wait time                                                        15397            156         -15241
592,STAT non-idle wait time                                                        15424            156         -15268
592,STAT non-idle wait count                                                       96193           2437         -93756
592,STAT session uga memory                                                      2105504       24054408       21948904
592,STAT session uga memory max                                                   982320       21948904       20966584
592,STAT session pga memory                                                      1572864       25755648       24182784
592,STAT session pga memory max                                                  1572864       23527424       21954560
592,STAT enqueue waits                                                                 3              3              0
592,STAT enqueue requests                                                              2              2              0
592,STAT enqueue conversions                                                           3              3              0
592,STAT enqueue releases                                                              2              2              0
592,STAT global enqueue gets sync                                                      5              5              0
592,STAT global enqueue releases                                                       2              2              0
592,STAT physical read total IO requests                                           48448          53714           5266
592,STAT physical read total multi block requests                                  48446          52661           4215
592,STAT physical read requests optimized                                              2              0             -2
592,STAT physical read total bytes                                           50610774016    50610774016              0
592,STAT cell physical IO interconnect bytes                                 50610774016     5575531904   -45035242112
592,STAT ges messages sent                                                             4              4              0
592,STAT consistent gets                                                         6178215        6178215              0
592,STAT consistent gets from cache                                                  142            142              0
592,STAT consistent gets from cache (fastpath)                                       142            142              0
592,STAT consistent gets direct                                                  6178073        6178073              0
592,STAT physical reads                                                          6178073        6178073              0
592,STAT physical reads direct                                                   6178073        6178073              0
592,STAT physical read IO requests                                                 48448          53714           5266
592,STAT physical read bytes                                                 50610774016    50610774016              0
592,STAT calls to kcmgcs                                                             142            142              0
592,STAT calls to get snapshot scn: kcmgss                                             1              1              0
592,STAT file io wait time                                                         17477              0         -17477
592,STAT Number of read IOs issued                                                 48448              0         -48448
592,STAT cell physical IO bytes eligible for predicate offload                         0    50610774016    50610774016
592,STAT cell physical IO interconnect bytes returned by smart scan                    0     5575531904     5575531904
592,STAT cell session smart scan efficiency                                            0              9              9
592,STAT no work - consistent read gets                                          6178073              0       -6178073
592,STAT table scans (long tables)                                                     1              1              0
592,STAT table scans (direct read)                                                     1              1              0
592,STAT table scan rows gotten                                               1596587000      387130468    -1209456532
592,STAT table scan blocks gotten                                                6178073        1503621       -4674452
592,STAT cell scans                                                                    0              1              1
592,STAT cell blocks processed by cache layer                                          0        6549118        6549118
592,STAT cell blocks processed by txn layer                                            0        6549118        6549118
592,STAT cell blocks processed by data layer                                           0        6183339        6183339
592,STAT cell blocks helped by minscn optimization                                     0        6549118        6549118
592,STAT cell simulated session smart scan efficiency                                  0    50653913088    50653913088
592,STAT cell IO uncompressed bytes                                                    0    50653913088    50653913088
592,STAT session cursor cache hits                                                     1              0             -1
592,STAT session cursor cache count                                                    1              1              0
592,STAT parse count (total)                                                           1              1              0
592,STAT execute count                                                                 1              1              0
592,STAT bytes sent via SQL*Net to client                                            528            528              0
592,STAT bytes received via SQL*Net from client                                       11            327            316
592,STAT SQL*Net roundtrips to/from client                                             2              2              0
592,STAT cell flash cache read hits                                                    2              0             -2
592,WAIT enq: KO - fast object checkpoint                                            621            492           -129
592,WAIT direct path read                                                      153970472              0     -153970472
592,WAIT cell smart table scan                                                         0        1560351        1560351
592,WAIT SQL*Net message to client                                                     3              3              0
592,WAIT SQL*Net message from client                                          1782253728        5605106    -1776648622
592,WAIT kfk: async disk IO                                                       264299              0        -264299
592,WAIT events in waitclass Other                                                  2236           1590           -646
592,TIME parse time elapsed                                                           20             46             26
592,TIME DB CPU                                                                153981591       37064365     -116917226
592,TIME sql execute elapsed time                                              306726075       38571803     -268154272
592,TIME DB time                                                               306726219       38571968     -268154251

Let start at the time spend: the first run, this was: 306726075+20= 307 seconds, the second run took: 38571803+46= 39 seconds (!!). That is a reduction of 87%. That _is_ spectacular!

The waits have changed: the ‘direct path read’ is swapped for ‘cell smart table scan’.

In the (session) statistics also have some things to notice:

  • physical read total bytes: Is the same(!) It’s not strange: the size of the segment to read is known by the database, and must be read entirely from the database’s perspective.
  • cell physical IO bytes eligible for predicate offload: this statistic shows the amount of data which the cell server is able to process on behalf of the database, instead of the database processing and the cell server just delivering blocks. In this case, all bytes are processed on the cellserver (cell physical IO bytes eligible for predicate offload=physical read total bytes)
  • cell physical IO interconnect bytes: Here we see that the amount of data traffic between the database and the cell servers. Because of the offload, the amount is reduced immensely (50,610,774,016 versus 5,575,531,904) which is a reduction of 88.9% (!!)
  • cell physical IO interconnect bytes returned by smart scan: This is the amount of data which is caused by the cell smart scan (the statistic name is quite self explaining). If we compare this statistic with ‘cell physical IO interconnect bytes’, we see all data on the interconnect was caused by the smart scan.
  • cell scans: the number of scans offloaded to cellservers. one in this case.
Using exadata optimisation, with storage indexes

Upon the next execution of this statement, the execution plan stays the same, but the cellservers built the storage index.

Let’s see a comparison of the statistics of the previous run (without a storage index in place) with a second run, which will use the storage index, which is built up during the first run:

NR ,TYP Statistic                                                                  1st           2nd          Diff
592,STAT opened cursors cumulative                                                     1              1              0
592,STAT user calls                                                                    3              3              0
592,STAT session logical reads                                                   6178215        6178215              0
592,STAT CPU used when call started                                                 3707           3713              6
592,STAT CPU used by this session                                                   3707           3713              6
592,STAT DB time                                                                    3856           3745           -111
592,STAT user I/O wait time                                                          156             37           -119
592,STAT non-idle wait time                                                          156             37           -119
592,STAT non-idle wait count                                                        2437           1825           -612
592,STAT session uga memory                                                     24054408              0      -24054408
592,STAT session uga memory max                                                 21948904              0      -21948904
592,STAT session pga memory                                                     25755648              0      -25755648
592,STAT session pga memory max                                                 23527424              0      -23527424
592,STAT enqueue waits                                                                 3              3              0
592,STAT enqueue requests                                                              2              2              0
592,STAT enqueue conversions                                                           3              3              0
592,STAT enqueue releases                                                              2              2              0
592,STAT global enqueue gets sync                                                      5              5              0
592,STAT global enqueue releases                                                       2              2              0
592,STAT physical read total IO requests                                           53714          53765             51
592,STAT physical read total multi block requests                                  52661          52688             27
592,STAT physical read requests optimized                                              0          35423          35423
592,STAT physical read total bytes                                           50610774016    50610774016              0
592,STAT cell physical IO interconnect bytes                                  5575531904     5570438336       -5093568
592,STAT ges messages sent                                                             4              4              0
592,STAT consistent gets                                                         6178215        6178215              0
592,STAT consistent gets from cache                                                  142            142              0
592,STAT consistent gets from cache (fastpath)                                       142            142              0
592,STAT consistent gets direct                                                  6178073        6178073              0
592,STAT physical reads                                                          6178073        6178073              0
592,STAT physical reads direct                                                   6178073        6178073              0
592,STAT physical read IO requests                                                 53714          53765             51
592,STAT physical read bytes                                                 50610774016    50610774016              0
592,STAT calls to kcmgcs                                                             142            142              0
592,STAT calls to get snapshot scn: kcmgss                                             1              1              0
592,STAT cell physical IO bytes eligible for predicate offload               50610774016    50610774016              0
592,STAT cell physical IO bytes saved by storage index                                 0    37006434304    37006434304
592,STAT cell physical IO interconnect bytes returned by smart scan           5575531904     5570438336       -5093568
592,STAT cell session smart scan efficiency                                            9              0             -9
592,STAT table scans (long tables)                                                     1              1              0
592,STAT table scans (direct read)                                                     1              1              0
592,STAT table scan rows gotten                                                387130468      387130468              0
592,STAT table scan blocks gotten                                                1503621        1503621              0
592,STAT cell scans                                                                    1              1              0
592,STAT cell blocks processed by cache layer                                    6549118        2025039       -4524079
592,STAT cell blocks processed by txn layer                                      6549118        2025039       -4524079
592,STAT cell blocks processed by data layer                                     6183339        1666003       -4517336
592,STAT cell blocks helped by minscn optimization                               6549118        2025039       -4524079
592,STAT cell simulated session smart scan efficiency                        50653913088    13647896576   -37006016512
592,STAT cell IO uncompressed bytes                                          50653913088    13647896576   -37006016512
592,STAT session cursor cache hits                                                     0              1              1
592,STAT session cursor cache count                                                    1             1            0
592,STAT parse count (total)                                                           1              1              0
592,STAT execute count                                                                 1              1              0
592,STAT bytes sent via SQL*Net to client                                            528            528              0
592,STAT bytes received via SQL*Net from client                                      327            327              0
592,STAT SQL*Net roundtrips to/from client                                             2              2              0
592,WAIT enq: KO - fast object checkpoint                                            492            790            298
592,WAIT cell smart table scan                                                   1560351         366466       -1193885
592,WAIT SQL*Net message to client                                                     3              2             -1
592,WAIT SQL*Net message from client                                             5605106        8163815        2558709
592,WAIT events in waitclass Other                                                  1590           1723            133
592,TIME parse time elapsed                                                           46             23            -23
592,TIME DB CPU                                                                 37064365       37131355          66990
592,TIME sql execute elapsed time                                               38571803       37456254       -1115549
592,TIME DB time                                                                38571968       37457846       -1114122

We see here the storage index reduced an enormous amount of I/O on the cells (cell physical IO bytes saved by storage index: 37006434304), but didn’t result in a enormous reduction of execution time (38.5 seconds versus 37.4 seconds).

This isn’t too unexpected:
1/without exadata optimisation: total time: 307 s, IO wait (direct path read): 154 s.
2/with exadata optimisation: total time: 39 s, IO wait (cell smart table scan): 2 s.

I think you get the point now: after IO waits are brought down to 2 seconds, any further optimisation on IO (which is what the storage indexes are) only will reduce the 2 seconds further. And it did! The optimisation of the storage indexes is again spectacular, only has little impact on overall response time, because it influences a too little part of it (in fact, that is Amdahl’s law):

3/with exadata optimisation and storage indexes: total time: 37 s, IO wait (cell smart table scan): 0.3 s.

Conclusion

This is only an investigation in some very specific things of Exadata. There is much more to explore.

Using the hardware in the database machine, a single process can read 47.30 GB in 154 seconds. That is a very decent number, and sustainable with multiple processes, alias Oracle Parallel Query. That on it self is something of which many shops can benefit from. But, it gets really interesting when using the exadata optimisations. In this case, the total response time dropped from 307 to 37 seconds, with barely a wait on I/O (0.3 seconds for 47.30 GB!!)

OakTable member!

Posted: August 23, 2010 in Oracle
Tags:

Just like Tim Hall I was nominated and approved as a member of the OakTable network. I am very happy to have been accepted to a group of such gifted individuals!

For those of you who haven’t followed all the Oracle Exadata and database machine information, and want a brief introduction to the database machine: here it is!

The confusion

There is some confusion about ‘exadata’ and ‘the database machine’. If we look at the official product names, ‘Exadata’ is the storage server, and the ‘Database machine’ is the complete boxed machine, including exadata for storage. But…in the real world, in different kinds of papers (technical, advertising) exadata sometimes is used as an alias for the database machine.

What is the database machine?

Essentially, the database machine is a version of the Oracle database on balanced, optimised hardware from Oracle. But that is a very short description, and does harm to the unique feature of the database machine: the Oracle Exadata storage server. Exadata offers database offloading by execution some of the database functions on the storage layer. This is a schematic drawing of the architecture of the database machine (cell server is the internal name of the exadata software):

The database machine, a little more verbose

One of the basic design principles of the database machine is fault tolerancy. This means any component may fail, whilst the database still is able to function. So if you look at the schematic drawing above, it means there are multiple instances of the database (=Oracle RAC), there are TWO infiniband switches (in a quarter and half rack, there are three infiniband switches in a full rack), and there are multiple storage servers (3/quarter-, 7/half-, 14/full-rack). The redundancy of the database is NOT done by the storage servers, but is done using Oracle’s ASM (Automatic Storage Management) ‘normal redundancy’, which essentially means data is stored double.

It is possible to use a single server (non RAC) version of the database on the database machine, but that does harm to the architecture of the database machine (in my opinion).

The storage servers are fitted with 384 GB flash memory attached via PCI-X. This flash memory can be used as flash disks, but is configured as flash memory for caching of the storage server by default.

Last but not least the database is able to offload some database actions to the storage:
- Smart scans: row filtering, column filtering, join filtering, incremental backup filtering, data mining model scoring. Transparently.
- Storage indexes: stores min and max values of columns of database blocks every 1MB. Transparently.
- Hybrid Columnar Compression: database compression mechanism where data is stored by column, then compressed. (HCC is using several database blocks). Needs to be specified explicitly as a compression option during creation of modification.

Conclusion

I hope this article was able to get you a basic understanding of the architecture of the database machine. Perhaps it might even make you enthusiastic about it, because it offers several unique features:
- A hardware infrastructure purely built for database performance. Something encountered ‘in the wild’ very rarely.
- State of the art hardware.
- Common Oracle database, no need to alter anything when moving to the database machine.