Archive

Tag Archives: exadata

I am excited to announce that I have accepted a role at Yugabyte as ‘developer advocate’. It’s not easy to leave the talented group of people that is Enkitec, of which many of them I call friends. In fact, the change is much bigger since Yugabyte is a database outside of the Oracle ecosystem, in which ecosystem there also are a lot of people that I have gotten to known and call friends.

So why the change? In essence, my reasons are the same as the ones Bryn Llewellyn mentioned in his blogpost about his move, although I come from a different background being a consultant and not working for Oracle.

Open source

You could argue that having spend investigating the inner working of the Oracle database for 20 or so years means I am discarding all this knowledge I build up. I don’t believe that’s true. Yes, I have build up a lot of knowledge about the Oracle database works, and parts of that knowledge might be known (usable?) to a small group. However, I am looking forward to working with software for which I can actually see how it’s made, and I don’t have to perform extensive research to figure out how it’s “probably” implemented. This is an important reason for me personally. In general I think there are much more reasons for companies for wanting to use open source software instead of proprietary, closed source software.

Distributed architecture & distributed SQL

The area I have spend a significant amount of time in, is assessing and investigating performance of the Oracle database. Oracle, like many traditional monolithic databases, use an architecture where the database itself is actually a single set of files containing the data, which is singular, and are served by processes that perform validation of requests to use that data, and perform performance features like caching so data can be used from memory, which means it doesn’t need to be re-read from the files so it can be served at memory latency. This architecture imposes limits on the scale of use, for which common solutions to overcome these limits with monolithic architectures are:
– Sharing the (still singular) datafiles with multiple machines (RAC).
– Make a copy of the data files, and spool the changes to the copy (DG/OGG/replication).
– (Manually) partition the data over multiple independent databases (sharding).
– Storing data redundantly over the available (still local) disks (RAID).
I am not saying these solutions are wrong. In fact, many of them are very cleverly architected and optimized over the years with the different architectures of databases and work really well.
However, there are a consequences that are inherently imposed by a monolithic architecture, such as scalability, flexibility and failure resistance.

The Yugabyte database provides a cloud ready, cloud vendor agnostic, failure resistant, linear scaling database (sorry for the abundance of hype words). Failure resistance is provided by keeping multiple copies of the data, which is configurable as ‘replication factor’. By keeping copies of the data, failure of a single Yugabyte node (called a ‘tablet server’) can be overcome. By spreading these copies of data over groups of tablet servers in cloud availability zones, the outage of an entire availability zone can be survived. When such a disaster strikes, the cluster will survive and provide normal (read and write functions) as long as 2/3rd of the nodes survive (in case of a replication factor of 3), without the need for any management to accommodate failover, like instantiating a replica/standby.

Each Yugabyte tablet server also serves as an API for data retrieval and manipulation, so with the addition of tablet servers, the processing power increases. A tablet server should use local, non-protected disks to take advantage of low latency, and by adding more tablet servers the work can be spread out further to these, increasing the amount of work that can be done. Aside from increasing processing for API-side processing, Yugabyte provides functionality which pushes down the work to the storage, which Exadata specialists will recognize as ‘smart scans’, alongside things like predicate pushdown and bloom filters.

Yugabyte has not invented yet another SQL and/or no-SQL dialect. Another point that I think is really strong is that Yugabyte reuses Postgresql as the SQL API. The Postgres API is “wire compatible” with Postgres, which means that any product that can talk to a Postgres database can talk to Yugabyte. In fact, instead of writing a Postgres layer, Yugabyte has re-used the Postgres source to use the query layers of postgres, and then provide the storage for postgres using Yugabyte. This also means a lot of the server side functionality of Postgres, such as PL/pgSQL can be used. Needless to say there are limitations inherent to the postgres connection with the Yugabyte storage layer, but most common postgres database functionality is available.

But that is not all! Aside and in parallel to Yugabyte’s Postgres API, it also provides a “NoSQL” APIs, which are compatible with Apache Cassandra and with Redis. Oh, and data storage is done via LSM (log-structured merge-tree), etc.

Conclusion

I hope at this point you can see that Yugabyte provides a fresh, modern approach to a database that provides a lot of advantages over more traditional, monolithic databases, for which I think a lot of cases that I witnessed over the years could significantly benefit from (performance-wise and availability-wise). I am really thrilled to be on the basis of building that further.

Also, a lot of technical reasons that I described are really just a summary, if you have gotten interested or want to learn more, I would urge you (or challenge you :-D) to have a look at https://www.yugabyte.com or read more at https://docs.yugabyte.com. Or if you want to see the code, head over to https://github.com/yugabyte/yugabyte-db!

Recently I was asked to analyse the security impact of the snmp daemon on a recent Exadata. This system was running Exadata image version 12.1.2.1.3. This blog article gives you an overview of a lot of the things that surround snmp and security.

First of all what packages are installed doing something with snmp? A list can be obtained the following way:

# rpm -qa | grep snmp
net-snmp-utils-5.5-54.0.1.el6_7.1.x86_64
net-snmp-libs-5.5-54.0.1.el6_7.1.x86_64
net-snmp-5.5-54.0.1.el6_7.1.x86_64
sas_snmp-14.02-0103.x86_64

Essentially the usual net-snmp packages and a package called ‘sas_snmp’.

A next important thing is how the firewall is configured. However, the default setting of the firewall on the compute nodes with exadata is the firewall turned off:

# iptables -L -v
Chain INPUT (policy ACCEPT 437M packets, 216G bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 343M packets, 748G bytes)
 pkts bytes target     prot opt in     out     source               destination

So if there is something running that listens to a network port that can benefit ‘attackers’, there is no firewall to stop them.

Next obvious question is what snmp processes are actually running:

# ps -ef |grep snmp
root       7088      1  0 Aug16 ?        00:51:32 /usr/sbin/snmpd -LS0-6d -Lf /dev/null -p /var/run/snmpd.pid
root      33443      1  0 03:14 ?        00:00:49 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf
root      33454  33443  0 03:14 ?        00:00:00 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf

The snmpd process is the net-snmp snmp daemon. However, there are two additional processes running with the name ‘snmp’ in them, one is owned by init, and a processes that this process has spawned. The name ‘lsi_mrdsnmpagent’ probably means LSI MegaRaid SNMP agent. That gives a fair hint this processes is doing something snmp related specifically for the LSI MegaRaid adapter, which is the disk controller.

Are there any open ports related to snmp processes?

# netstat -anp | grep snmp
tcp        0      0 127.0.0.1:199               0.0.0.0:*                   LISTEN      7088/snmpd
udp        0      0 0.0.0.0:161                 0.0.0.0:*                               7088/snmpd
udp        0      0 0.0.0.0:22917               0.0.0.0:*                               7088/snmpd

1. tcp port 199
This is support for the SMUX protocol (RFC 1227) to communicate with SMUX-based subagents. This is a deprecated feature in favour of AgentX. It is considered a bug (https://bugzilla.redhat.com/show_bug.cgi?id=110931) the daemon is still using this port. However the port is opened on localhost (127.0.0.1) and as such not reachable from outside of the machine, which means it is not a direct security problem.

2. udp port 161
This is the default snmpd port. This port is open to the outside world on the compute node, which can be seen from the address 0.0.0.0 in the above ‘source’ column. The port being open can be verified using another machine and the ‘nmap’ tool:

$ sudo nmap -Pn -sU -p 161 311.1.1.1
Password:

Starting Nmap 6.47 ( http://nmap.org ) at 2016-10-26 15:00 CEST
Nmap scan report for 311.1.1.1
Host is up (0.087s latency).
PORT    STATE SERVICE
161/udp open  snmp

The status ‘open’ shows this udp port does respond to requests.

3. udp port 22917 (in this case; this port number is random)
This is a random port that gets set for trapsink directive set in the /etc/snmp/snmpd.conf. A trap sink is the destination for snmp traps that get triggered. Although this udp port is in use, it does not respond to network traffic:

$ sudo nmap -Pn -sU -p 22917 311.1.1.1
Password:

Starting Nmap 6.47 ( http://nmap.org ) at 2016-10-26 15:22 CEST
Nmap scan report for 311.1.1.1
Host is up.
PORT      STATE         SERVICE
22917/udp open|filtered unknown

The status ‘open|filtered’ does mean the udp port does not respond to requests.

Now let’s look how the actual configuration file of the snmp daemon looks like on exadata. The configuration file is /etc/snmp/snmpd.conf:

snmp daemon configuration file:
trapcommunity public
trapsink 127.0.0.1 public
rocommunity public 127.0.0.1
rwcommunity public 127.0.0.1

access  RWGroup         ""      any       noauth    exact all all all
com2sec snmpclient      127.0.0.1               public
group   RWGroup                 v1                              snmpclient

pass .1.3.6.1.4.1.4413.4.1 /usr/bin/ucd5820stat
pass .1.3.6.1.4.1.3582 /usr/sbin/lsi_mrdsnmpmain

syscontact Root <root@localhost> (configure /etc/snmp/snmp.local.conf)
syslocation Unknown (edit /etc/snmp/snmpd.conf)

view    all             included      .1                80

The snmpd.conf file shows:
– trapsink destination (127.0.0.1, localhost) and community string (public).
– the ro and rw communities are set to ‘public 127.0.0.1’
In general it is advised to change the community strings to something unique to avoid being easy guessable. However, in this case there’s also a network description following the ro and rw community, which is: 127.0.0.1. This means snmp access is restricted to localhost.
This can be verified by running snmpwalk from another machine:

$ snmpwalk -v 2c -c public 311.1.1.1
Timeout: No Response from 311.1.1.1

This means there is no way to communicate to the snmp daemon from outside of the machine. We can see from the snmp daemon configuration file that access to the snmp deamon is limited to localhost.

It seems the LSI megaraid snmp agent works together with snmpd:

root      33443      1  0 03:14 ?        00:01:01 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf
root      33454  33443  0 03:14 ?        00:00:00 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf

Obviously it reads the snmpd.conf (-c /etc/snmp/snmpd.conf as seen above), but it has got a configuration file of its own. This configuration file of ‘lsi_mrdsnmpagent’ process can be seen when looking at the files inside the sas_snmp rpm package (rpm -ql sas_snmp), but the main evidence it is using the file can be derived from looking at the open file descriptors of the lsi_mrdsnmpagent process:

# ls -ls /proc/$(pgrep -f lsi_mrdsnmpagent | head -1)/fd
total 0
0 lr-x------. 1 root root 64 Oct 26 03:14 0 -> /dev/null
0 lr-x------. 1 root root 64 Oct 26 03:14 1 -> /etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf
0 l-wx------. 1 root root 64 Oct 26 03:14 2 -> /var/log/cellos/cron_daily_cellos.stderr (deleted)
0 lrwx------. 1 root root 64 Oct 26 03:14 3 -> socket:[2923149143]
0 l-wx------. 1 root root 64 Oct 26 03:14 4 -> /var/log/cellos/cellos.log (deleted)
0 l-wx------. 1 root root 64 Oct 26 03:14 5 -> /var/log/cellos/cellos.trc (deleted)
0 lr-x------. 1 root root 64 Oct 26 03:14 6 -> /etc/snmp/snmpd.conf
0 lr-x------. 1 root root 64 Oct 26 03:14 7 -> /etc/redhat-release
0 lr-x------. 1 root root 64 Oct 26 03:14 8 -> /dev/megaraid_sas_ioctl_node
0 lr-x------. 1 root root 64 Oct 26 03:14 9 -> pipe:[2919419375]

Line 4 shows ‘/etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf’! Let’s look inside that configuration file:

# cat /etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf
#################################################
# Agent Service needs the IP addresses to sent trap
# The trap destination may be specified in this file or
# using snmpd.conf file. Following indicators can be set
# on "TrapDestInd" to instruct the agent to pick the IPs
# as the destination.
# 1 - IPs only from snmpd.conf
# 2 - IPs from this file only
# 3 - IPs from both the files
#################################################
TrapDestInd 3
#############Trap Destination IP##################
# Add port no after IP address with no space after
# colon to send the SNMP trap message to custom port.
# Community is to be mentioned after IP. If no community
# is mentioned, default SNMP community 'public' shall be
# used. 'trapcommunity' token is also used in snmpd.conf.
# Alternatively, you can also use trapsink command
# in snmpd.conf to send the SNMP trap message to
# custom port, else default SNMP trap port '162' shall
# be used.
127.0.0.1	public
# 145.147.201.88:1234	public
# 145.146.180.20:3061	testComm
127.0.0.1:8162 public

It is a configuration file that works alongside the snmpd.conf configuration. What is important to see, is ‘TrapDestInd’, which is set at ‘3’, which means that traps are send to trap destinations set in the snmpd.conf file AND set in the sas_TrapDestionation.conf file. Two traps are defined in the file, 127.0.0.1 with community string public, which means it sends a trap to udp port 161 (at which the snmpd process is listening, as we saw earlier in the open ports list), but the most interesting thing here is there’s also a trap send to 127.0.0.1 at port 8162. That is a port number I do not know from the top of my head!

However, it’s simple to find out. The first thing to check is to see what process is running at port 8162:

# netstat -anp | grep 8162
udp        0      0 :::8162                     :::*                                    15233/java

That’s a java process! Let’s grep the process number to see if the full command line gives more clues what this java process is:

# ps -ef | grep 15233
dbmsvc    15233  15136  0 Aug16 ?        05:32:25 /usr/java/jdk1.7.0_80/bin/java -client -Xms256m -Xmx512m -XX:CompileThreshold=8000 -XX:PermSize=128m -XX:MaxPermSize=256m -Dweblogic.Name=msServer -Djava.security.policy=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server/lib/weblogic.policy -XX:-UseLargePages -XX:ParallelGCThreads=8 -Dweblogic.ListenPort=7878 -Djava.security.egd=file:/dev/./urandom -Xverify:none -da -Dplatform.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3 -Dwls.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server -Dweblogic.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server -Dweblogic.management.discover=true -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/patch_wls1036/profiles/default/sysext_manifest_classpath weblogic.Server

That’s java running weblogic, with the name ‘msServer’. That is something that is part of the daemons that serve dbmcli (alike the daemons that service cellcli on the cells)!

This actually makes sense. The daemons that manage the database server fetch hardware status information and hardware failures from the BMC using the IPMI device (/dev/ipmi0). However the LSI MegaRaid adapter can not provide its status in that way. So in order for the management daemons to keep track of events on the LSI MegaRaid adapter (hardware issues), a daemon that works together with the snmp daemon is setup, which sends snmp traps if something occurs. The management daemon has setup a port that listens for these traps.

I do not know if the community strings is processed by the management deamon processing the trap. However, the port number on which the daemon is listening for traps is defined in ‘/opt/oracle/dbserver/dbms/deploy/config/cellinit.ora’ with the directive BMC_SNMP_PORT.

Conclusion
As far as I can see, the reason the snmp deamon is running is to be able to run the LSI MegaRaid SNMP agent process, so it can send traps to the compute node’s management daemons. Since most Exadata compute nodes do not have the firewall enabled, udp port 161 is exposed. The settings of the snmp daemon itself limits access to localhost.

A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the Oracle engine must have made the decision to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication on how much potentially could be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and get an understanding when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides a (non documented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information which the tracing provides is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;

  COUNT(*)
----------
   1000000

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks.
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this means to express. It got a minus sign as first, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo here is by Oracle. Also this describes MTT, medium table threshold, which is 5 times small table threshold. It says that the threshold is MTT. This is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the exadata code base is always enabled. Probably in version 10 this had to be added by adding code, alike linking in functionality such as RAC and SDO (spatial), by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line which is described above. Here islarge is set to 1. This means the objects is considered too large for being a small segment.
The next line is the kcbimd function, based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0, this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1, this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold being 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;
    169999

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different than the total number of blocks from dba_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1’ which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down the kcbivlo line says it is not large here too (is_large 0), which means larger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0’.

An important point is the switch to considering doing a direct path read, alias a segment is considered medium sized, is simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line with segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment being bigger than MTT (5*STT) and smaller than VLOT, which is not true, the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database depended.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable for me are diomrc (quite probably direct IO multiblock read count) which is set to the multiblock read count value. The other one is dios (quite probably direct IO slots), which shows the starting value of the direct IO slots, which is the amount of IOs the database will issue asynchronously when starting a full segment scan. Fully automatic Oracle will measure throughput and CPU usage, and determine if more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on 11.2.0.3 and 12.1.0.2, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless there was 99% or more of the blocks in the local cache. This is quite contrary to popular believe that the threshold is 50% of the blocks in cache to switch to reading blocks into the buffer cache. In all honesty, I have presented on the switch point value being 50% too.

When adding in writes to the mix it gets even more interesting. I first done an update of approximately 40% of the blocks, and did not commit. When tracing a simple count(*) on the entire table (this is on 11.2.0.3, which gives less information) it shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, doing direct path reads, and chkpt is set to 1 (I think indicating the need to checkpoint), which seems logical, if my session wants to do a direct path read of modified blocks.

Now this is how it looks like when I update 50% of the table:
First select count(*) from table:
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd…I first do a direct path read, and the second time I am not doing a no-direct alias buffered read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so should be buffered.

Actually, a hint is visible in the first run. If we were to do a direct path read, how come ckpt: 0? I can not see how it would be possible to do a direct path scan when there are changes on blocks in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait events ‘cell multilbock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ from the nsmtio trace, this is actually doing a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle can and will not guarantee anything, but this way the tracing could tell the wrong story.

Conclusions

The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read for full table scan into the buffercache.
– A segment that is bigger than 500% of the buffercache is always scanned for read for full table scan via direct path.
– A segment that is sized between _small_table_threshold and 500% of the buffer cache is medium and could be full table scanned using direct path reads and using buffered reads.
– The tracing on 12.1.0.2 gives a hint there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and bigger than it. This is because of the function kcbcmt1 showing aging/timing information on the blocks when a segment is smaller than MTT.
– For a medium sized segment, a scan for reading a full table scan is done via direct path, unless there are more than 99% of the blocks in the cache.
– For a medium sized segment, a scan for reading a full table scan is done via the buffercache if the amount of blocks that is “dirty” is 50% or more.

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.

This post is about memory management on the operating system level of an Oracle database. The first question that might pop in your head is: isn’t this a solved problem? The answer is: yes, if you use Oracle’s AMM (Automatic Memory Management) feature, which let’s you set a limit for the Oracle datababase’s two main memory area’s: SGA and PGA. But in my opinion any serious, real life, usage of an Oracle database on Linux will be (severely) constrained in performance because of the lack of huge pages with AMM, and I personally witnessed very strange behaviour and process deaths with the AMM feature and high demand for memory.

This means that I strongly advise customers to use Oracle’s ASMM (Automatic Shared Memory Management) feature. In the newer versions of 11.2 I found this to be working very well. Earlier versions like 10.2 could suffer from an ever growing shared pool (which also means an ever shrinking buffer cache), especially when bind variables weren’t used. This still could happen, but it seems the SGA memory management feature in 11.2 handles this well in most cases. The ASMM feature means a fixed memory area is allocated for the SGA. SGA allocation has always been fixed outside of the AMM feature, as far as I know.

When ASMM doesn’t work, meaning the memory areas are getting sized wrong and performance is influenced by that, the last option is to size the memory area’s yourself. However, since version 11.2.0.2 Oracle will resize when the memory manager thinks it’s feasible. See Kurt van Meerbeek’s article about that.

That leaves the PGA (Process Global Area) as a memory area on itself. Most databases are using the automatic PGA memory management, which is enabled once the PGA_AGGREGATE_TARGET parameter is set to a non zero value. A common misunderstanding is this setting is actually limiting the overall PGA usage of an instance. The truth is automatic PGA memory management will make attempts to adhere to the PGA_AGGREGATE_TARGET value. These are the actual words in the official Oracle documentation: ‘attempts to adhere’!

This means sort memory, hash memory and bitmap memory will be actively limited in size per process by automatic PGA memory management, any attempt to allocate more than automatic PGA memory management allows will result in moving some contents of these memory areas to the assigned temporary tablespace of the database user, to make room for new data.

However, there are more memory area’s allocatable per process, which are never swapped to disk, thus always will stay in memory, and these could not be limited in an officially supported way prior to Oracle version 12. Two structures which are allocated in PGA and never swapped to disk are PL/SQL collections and PL/SQL tables. Creating and filling these requires the usage of PL/SQL (hence their names); the reason for mentioning this is that if your database is not used by PL/SQL but only SQL, you almost certainly will not run into the problem I describe below.

You might be thinking: wait a minute! Does this mean a developer can just create such a structure, and allocate whatever he/she likes, with all the consequences that it can have, like the operating system starting to swap, and can do that for every single process? Yes, this is what this means. This is why Oracle introduced a parameter called PGA_AGGREGATE_LIMIT with Oracle 12, to effectively limit the overall PGA heap size.

In case you wonder what this means, or even doubting my words, I have written a little program to demonstrate this behaviour.

This is the source code to create my test table T2:

exec dbms_random.seed('abracadabra');
create table t2
as
with generator as (
    select      rownum      id
    from        dual
    connect by
                rownum <= 1000
)
select
    rownum                                                id,
    trunc((rownum-1)/50)                            clustered,
    mod(rownum,20000)                               scattered,
    trunc(dbms_random.value(0,20000))               randomized,
    trunc(sysdate) + dbms_random.value(-180, 180)   random_date,
    dbms_random.string('l',6)                       random_string,
    lpad(rownum,10,0)                               vc_small,
    rpad('x',100,'x')                               vc_padding
from
    generator   g1,
    generator   g2
where
    rownum <= 1000000
;
exec dbms_stats.gather_table_stats(null,'T2');

This is a very smart way to generate a table. I actually borrowed this from Jonathan Lewis.

Next up, I created a small anonymous PL/SQL block to take the contents from the T2 table, and store them in a collection until I hit the limit in the variable ‘grow_until’.

declare
	type sourcetab is table of t2%ROWTYPE;
	c_tmp		sourcetab;
	c_def		sourcetab	:= sourcetab();
	v_b_p		number		:= 0;
	v_c_p		number		:= 0;
	v_b_u		number		:= 0;
	v_c_u		number		:= 0;
	grow_until	number		:= 700000000;
	p_a_t		number;
begin
	select value into v_b_p from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session pga memory max';
	select value into v_b_u from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session uga memory max';
	select value into p_a_t from v$parameter where name = 'pga_aggregate_target';
	select * bulk collect into c_tmp from t2;
	while v_c_p < grow_until loop
		for c in c_tmp.first .. c_tmp.last loop
			c_def.extend(1);
			c_def(c_def.last) := c_tmp(c);
			select value into v_c_p from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session pga memory max';
			select value into v_c_u from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session uga memory max';
			if v_c_p >= grow_until then
				exit;
			end if;
		end loop;
	end loop;
	dbms_output.put_line('vbp : '||v_b_p);
	dbms_output.put_line('vcp : '||v_c_p);
	dbms_output.put_line('vbu : '||v_b_u);
	dbms_output.put_line('vcu : '||v_c_u);
	dbms_output.put_line('pat : '||p_a_t);
end;
/

Please mind the session needs to have create table, create session granted, enough quota in the default tablespace and select on v_$mystat, v_$parameter and v_$statname granted.

This is run on an Oracle 12.1.0.2 database:

TS@v12102 > @pga_filler
vbp : 3535368
vcp : 700051976
vbu : 1103192
vcu : 4755704
pat : 524288000

PL/SQL procedure successfully completed.

The begin sizes of the UGA (vbu) and PGA (vbp) are 1’103’192 and 3’535’368. The PGA_AGGREGATE_TARGET size is set to 524’288’000 (500MB). I did set the grow_until variable to 700’000’000 (roughly 700MB), which is more than PGA_AGGREGATE_TARGET. After running this, it’s easy to spot the values of vcu (UGA allocation) and vcp (PGA allocation). vcu grew to 4’755’704 during the run, however vcp grew to 700’051’976, a little more than 700MB! This shows that the collection is stored in the PGA, and that the collection grew beyond the value set with PGA_AGGREGATE_TARGET.

This behaviour is consistent in versions 12.1.0.1, 11.2.0.4, 11.2.0.3, 11.2.0.2 and 11.2.0.1.

Let me emphasise once again that the above proof of concept code managed to allocate more memory than was set for the overall PGA usage of the entire instance. This can have an enormous, devastating impact on a consolidated database setup (meaning having multiple instances running on a single machine). Typically, once memory consumption of all the processes exceeds physically available memory, the operating system tries to use the swap device, to which it will swap memory pages in and out depending on memory usage of active (=on CPU) processes. Mild swapping shows as severely slowed-down processing (because a number of memory pages for processing need to be read from the swap device and placed in memory, from which the former contents need to be written to the swap device), heavy swapping shows as the machine coming down to a standstill.

Please mind that a diagnosis on the state of memory usage (alias swapping), just by looking at the amount of used swap (as can be seen in the ‘top’ output, or ‘swapon -s’) could be misleading. It’s also important to look at actual swapping in and out, as can be seen with ‘vmstat 1’ (si/so columns) or swap -W. I’ve found several systems which had been running for some time (approximately longer than a month) that had swap usage, sometimes up to 40%, while no ‘active swapping’, so memory pages being transfered to and from the swap device, was happening.

Luckily, starting with Oracle 12 you can actually limit overall PGA usage using the parameter PGA_AGGREGATE_LIMIT. The default value is the greater of (list from Oracle documentation):
a) 2GB
b) 200% of PGA_AGGREGATE_TARGET parameter (or lower if 200% > (90% of physical memory – total SGA size) but not below 100%)
c) 3MB * PROCESSES parameter
The parameter can not set below it’s default value, except when set in a pfile or spfile.

Let’s set the PGA_AGGREGATE_LIMIT to 600MB and see what happens when we start doing a large allocation again:

SQL> alter system set pga_aggregate_limit=600m scope=spfile;

System altered.

SQL> startup force;

Okay, let’s run the pga_filler.sql script again, and try to allocate 900MB. This means the “grow_until” variable must be set to 900000000.
PLEASE MIND this is done as a regular user, the SYS user and background processes other than job queue processes are not subject to the limiting.

TS@v12102 > @pga_filler
declare
*
ERROR at line 1:
ORA-01423: error encountered while checking for extra rows in exact fetch
ORA-00039: error during periodic action
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
ORA-06512: at line 21

Great! Exactly like we expect, right?
Well…yes, but let’s look at the alert.log

Sat Dec 13 15:08:57 2014
Errors in file /u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc  (incident=46599):
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
Incident details in: /u01/app/oracle/diag/rdbms/v12102/v12102/incident/incdir_46599/v12102_ora_4147_i46599.trc
Sat Dec 13 15:09:07 2014
Dumping diagnostic data in directory=[cdmp_20141213150907], requested by (instance=1, osid=4147), summary=[incident=46599].
Sat Dec 13 15:09:09 2014
Sweep [inc][46599]: completed
Sweep [inc2][46599]: completed

Okay, essentially, this tells us nothing interesting, except for the tracefile. Let’s look in/u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc, being the tracefile as indicated in the above alert.log snippet:

*** 2014-12-13 15:08:57.351
Process may have gone over pga_aggregate_limit
Just allocated 65536 bytes
Dumping short stack in preparation for potential ORA-4036
----- Abridged Call Stack Trace -----
ksedsts()+244<-ksm_pga_limit_short_stack()+1016<-ksm_check_over_limit()+469<-ksmarfg()+574<-kghgex()+1376<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghalf()+1003<-klmalf()+103
<-kllcqas()+194<-kcblasm()+108<-kxhfNewBuffer()+607<-qerhjSplitBuild()+632
----- End of Abridged Call Stack Trace -----
=======================================
PRIVATE MEMORY SUMMARY FOR THIS PROCESS
---------------------------------------
******************************************************
PRIVATE HEAP SUMMARY DUMP
781 MB total:
   781 MB commented, 646 KB permanent
   208 KB free (0 KB in empty extents),
     779 MB,   2 heaps:   "koh-kghu call  "            57 KB free held
------------------------------------------------------
Summary of subheaps at depth 1
779 MB total:
   778 MB commented, 110 KB permanent
    63 KB free (0 KB in empty extents),
     667 MB, 42786 chunks:  "pmuccst: adt/record       "
      83 MB, 5333 chunks:  "pl/sql vc2                "

Actually, this is the end of the tracefile. It seems that the pga limit dump (the text in between “Process may have gone over pga_aggregate_limit” to the private memory summary heap dumps) occurs several times before an actual ORA-4036 is triggered. In my private test instance, where I am obviously the only user process doing something, I get a pga limit dump approximately 20 times before the ORA-4036 is actually triggered:

sending 4036 interrupt
Incident 46599 created, dump file: /u01/app/oracle/diag/rdbms/v12102/v12102/incident/incdir_46599/v12102_ora_4147_i46599.trc
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT

Did you actually spot the oddity here?

Remember the PGA_AGGREGATE_LIMIT was set to 600M. Now look at the process’ PGA/Private heap summary dump above: it says 781M. Please mind the 781M is the PGA heap of a SINGLE process! When looking at the total PGA allocated for the entire instance, it’s even more:

SYS@v12102 AS SYSDBA> select value/power(1024,2) "MB" from v$pgastat where name = 'maximum PGA allocated';

	MB
----------
1041.16699

So…despite PGA_AGGREGATE_LIMIT set to 600M, according to the v$pgastat view, there’s 1041MB allocated for PGA. Please mind I haven’t looked into how accurate v$pgastat is, but I tend to believe this.

Summary.
I’ve seen PGA_AGGREGATE_TARGET being used as a calculation value for actual PGA usage of an instance. This is simply wrong. The actual amount of PGA memory allocated by the instance is highly depended on what is done, and can be less than PGA_AGGREGATE_TARGET, or more. Automatic PGA can control three per process memory area’s: the sort, hash and bitmap memory area’s. These are sized based on the setting of PGA_AGGREGATE_TARGET and the actual PGA memory usage instance wide. If more memory is needed for sort, hash or bitmap memory than is made available by the memory manager, excess memory needed is allocated in the temporary tablespace. Any other PGA memory allocation is always done, regardless of the setting of PGA_AGGREGATE_TARGET.

Starting with Oracle 12, it seems the actual PGA allocation now can actually be limited with the new parameter PGA_AGGREGATE_LIMIT. However, during some simple testing it shows that actually more memory is allocated than set with PGA_AGGREGATE_LIMIT as limit. I haven’t tested it in more situations, this post is meant to grow awareness that the actual limit as set by PGA_AGGREGATE_LIMIT might not be that hard as you would expect.

Please mind, PGA_AGGREGATE_LIMIT seems to truly limit PGA usage instance wide, not limit the PGA heap per process, as event 10251 (PGA usage limiting way for Oracle 11.2) does. However, once again: PGA_AGGREGATE_LIMIT seems to try to be smart and actually does not limit at the exact size set, but beyond that.

The next post will introduce a way to limit PGA usage in Oracle 11.2. Stay tuned!

Last week I’ve gotten a question on how storage indexes (SI) behave when the table for which the SI is holding data is changed. Based on logical reasoning, it can be two things: the SI is invalidated because the data it’s holding is changed, or the SI is updated to reflect the change. Think about this for yourself, and pick a choice. I would love to hear if you did choose the correct one.

First let’s do a step back and lay some groundwork first. The tests done in this blogpost are done on an actual Exadata (V2 hardware), with Oracle version 11.2.0.4.6 (meaning bundle patch 6). The Exadata “cellos” (Cell O/S) version is 11.2.3.3.1.140529.1 on both the compute nodes and the storage nodes.

A storage index is a memory structure used by the cell daemon, which is the storage server process on the storage layer of Exadata. By default (although I’ve never seen it different yet) a SI contains minimum and maximum values for up to eight columns of the table it is describing. The memory structure is transient. It describes a region of one megabyte of a table. A storage index is populated during the fetching of data for a smart scan, based on the filter predicates of the query causing the smart scan. In essence the whole SI management is done automatically by the cell daemon (“essence” means you can play around with some undocumented settings on the database level, and there are some undocumented settings and events you can set on the cell server level).

Okay, back to the the original question. Let’s test this. First we need a table large enough to be smartscanned. I’ve got a table called ‘BIGTAB_NOHCC’ (as you probably guessed, there is a table ‘BIGTAB’ in the schema I am using, which is hybrid columnar compressed). This table consists of 2671736 blocks of 8KB, which means it got a total size of 21G. This is big enough for my database instance (big enough is relative to the buffer cache size) to get a smart scan.

For this test, I use the session statistics. In v$sesstat there is a statistic called ‘cell physical IO bytes saved by storage index’, which tells us how many bytes we saved from being scanned because of the use of storage indexes. I also show some output of a (single) storage server, although a normal Exadata rack typically will have 3, 7 or 14 storage servers in use.

First of all, in order to get an idea and be pretty sure my query will populate storage indexes, I stop and start the cell daemon on the storage servers. I use ‘service celld restart’ for that.

Next, in order to get storage index information for the table I use, I need some metadata. The metadata I need are:
a) data object id

select data_object_id from dba_objects where owner = 'MARTIN' and object_name = 'BIGTAB_NOHCC';

b) tablespace number

select ts# from v$tablespace t, dba_segments s, dba_objects o where t.name=s.tablespace_name and s.segment_name=o.object_name and o.data_object_id = 18716;

c) database unique id from the x$ksqdn

select ksqdngunid from x$ksqdn;

(thanks to the progeeking website for this information)

Next, I run the query on my BIGTAB_NOHCC table. Please mind it’s fundamental to have a filter predicate, and that the execution plan shows we are doing a full table scan. Only with a filter predicate the storage server has a reason to build a storage index:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

Now let’s see if the storage layer built a storage index:

CellCLI> alter cell events = "immediate cellsrv.cellsrv_storidx('dumpridx','all',18716,5,1126366144);
Dump sequence #1 has been written to /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Cell enkcel01 successfully altered

This is why it’s handy to have the aforementioned data on the table: you can dump the specific storage indexes for a single table. The three numbers after ‘all’ are data_object_id, tablespace number and the ksqdngunid from the x$ksqdn table.

Let’s look in the file which the cell event produced:

# less /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Trace file /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
ORACLE_HOME = /opt/oracle/cell
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.39-400.128.17.el5uek
Version:        #1 SMP Tue May 27 13:20:24 PDT 2014
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.3.1_LINUX.X64_140529.1

*** 2014-07-12 13:16:17.030

*** 2014-07-12 13:16:17.030
UserThread: LWPID: 27455 userId: 58 kernelId: 58 pthreadID: 0x7f2863fe7940
2014-07-12 13:21:36.499123 :0005C9DE: $$$ Dumping storage idx summary for griddisk DATA_CD_08_enkcel01:
2014-07-12 13:21:36.499217 :0005C9E0: Dump sequence #1:

*** 2014-07-12 13:21:36.499
2014-07-12 13:21:36.499212 :0005C9DF: 
***************************
2014-07-12 13:21:36.499249 :0005C9E1: Dumping RIDX summary for objd 18716, tsn 5, dbid 1126366144

2014-07-12 13:21:36.499249*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-12 13:21:36.499249*: RIDX: strt 0 end 2048 offset 6215958528 size 1048576 rgnIdx 5928 RgnOffset 0 scn: 0x099a.5cc294b8 hist: 2
2014-07-12 13:21:36.499249*: RIDX validation history: 
2014-07-12 13:21:36.499249*: 0:FullRead 1:Undef 2:Undef 3:Undef 4:Undef 5:Undef 6:Undef 7:Undef 8:Undef 9:Undef
2014-07-12 13:21:36.499249*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-12 13:21:36.499249*: lo: c3 c 34 11 0 0 0 0
2014-07-12 13:21:36.499249*: hi: c3 5a 5 18 0 0 0 0

Here you see the typical heading of an Oracle trace file, next the announcement of the dump (“Dumping storage idx summary”,”Dumping RIDX summary”). The real storage index information starts with “RIDX(0x7f27b4ddab64)”. Starting with that and until the “lo” and “hi” values, you are looking at an actual storage index which holds data for a single column. You can see which column by looking at the “Col id” in the square brackets: 1. It’s interesting to note that there is a scn (system change number) included. The storage index shows if there are NULL values in the column (in this case it says HASNONNULLVALUES, so we don’t have any NULL values in the 1MB chunk in the column this storage index describes), and, of course, the low and high values in the Oracle internal data format.

So, despite any indication on the database layer, the query built storage indexes! That should mean that executing the same query again will result in actually using the storage indexes which were just build:

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

Yes! This using the storage index we caused the be built up previously!

With this being established, we can try to get an answer to our question: how does the storage index react to DML on the table it is describing?

We build the storage index, and used it. Now let’s update the ‘id’ field for which the storage index was build, and redo our query test:

MARTIN:dbm011> update bigtab_nohcc set id = id + 1;

16000000 rows updated.

MARTIN:dbm011> commit;

Commit complete.

Okay, now let’s redo the select again, and take stats before and after on storage index usage!

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

What we see here with the storage index session statistic, is the update statement didn’t use the storage index (which is obvious, we updated the data, which happened on the database layer, so we didn’t use a smartscan) because this number is the same as the last time we looked at it before the update statement.
When we executed the select query on the table with a filter on the id column again, there is no storage index usage, because the storage index session statistic didn’t increase.

Actually, this query built new storage indexes. When this query is executed again, we can use these:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				   14534426624

(actually, it sometimes can take more than one execution for the storage indexes to be created again, my guess would be some heuristics are used to try to come up with the best candidates for storage indexes)

Let’s try another case: update a column in the table for which storage indexes are created on another column.
First make sure storage indexes are build:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

For completeness sake, I verified the storage indexes for this table by using the dump shown previously. A simple grep on the dump shows this query created storage indexes for only column 1, the id column:

grep -e RIDX_VALID -e 'Col id' /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_4.trc
014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddac40) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddad1c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddadf8) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddaed4) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddb08c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
...etc.

Now let’s update another field:

MARTIN@dbm011 > update bigtab_nohcc set spcol = spcol + 1;

16000000 rows updated.

MARTIN@dbm011 > commit;

Commit complete.

And query the storage index use, and do our query with filter predicate again:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

So…this strongly suggests the update invalidated our storage index, even when the column was not in the storage index.
Just for completeness sake, let’s issue the select statement again to see if the storage index was build up again:

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    9808936960

Yes, there it is again, the previous statement didn’t benefit from the storage indexes, but build them, and now we take advantage of it again.

Conclusion.
I ran these tests on a quarter rack without any other usage. The method I used for measuring how storage indexes behave was to execute on the database layer, and see the result on the storage layer and in database layer session statistics. It’s important to realise that despite this being quite strong evidence, there isn’t definite proof on my conclusion. Also, the behaviour described can change in future storage server versions.

My conclusion is that the storage server uses a system change number to validate if the storage indexes are valid. If not, storage indexes are not used. If blocks contained in a storage index progresses it’s system change number (in other words: are updated), the storage index is invalidated right away (during my tests, when the update starts, no or few storage indexes remain).

So, the ones that picked storage indexes being invalidated were right. As far as I could detect, there is no updating of storage indexes.

This is the fourth post on a serie of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, dev/sdb) and the partition (/dev/sda3, dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking in the “devio_stats” dump, there are a few other things which are worthy to notice. The lines with statistics do not have timestamp or other time indicator, it’s only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it dumps the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). If you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind the cell measures the differences between two periods in time, which means the numbers are averages over a period of 3 seconds.

Two other things are listed in the statistics: ‘trigerConfine’ (which probably should be “triggerConfine”), which is a mechanism for Oracle to manage under performing disks.
The other thing is “DMWG”. At this moment I am aware DMWG means “Disk Media Working Group”, and works with the concept of peers.

To get a better understanding of what the difference is between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. You can exchange the ServiceTime for svctm of iostat or storage wait as Bart calls it, and Latency for await or host wait as Bart calls it.

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata brings (part of the) processing potentially closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally can be flash.

But with Exadata, you either do normal alias regular IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads are hopefully coming from the flashcache, which can be known by looking at v$sysstat/v$sesstat at the statistic (“cell flash cache read hits”), not directly by looking at the IO related views. To understand the composition of the response time of a smartscan, there is even lesser instrumentation in the database (for background, look at this blogpost, where is shown that the smartscan wait does not detail any of the steps done in a smartscan. In other words: if you experience performance differences on Exadata, and the waits point towards IO, there’s not much analysis which can be done to dig deeper.

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

– IO size by number of reads and writes
– IO size versus latency for reads and writes
– IO size versus pending IO count for reads and writes
– IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, the cellcli provides the name of the trace file where the requested dump has been written to. If we look inside this trace file, this is how an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd, is the name of the celldisk (CD_02_enkcel01 here) is below the first table (IO lengths) about this celldisk(!)

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!

When you are administering an Exadata or more Exadata’s, you probably have multiple databases running on different database or “computing” nodes. In order to understand what kind of IO you are doing, you can look inside the statistics of your database, and look in the data dictionary what that instance or instances (in case of RAC) have been doing. When using Exadata there is a near 100% chance you are using either normal redundancy or high redundancy, of which most people know the impact of the “write amplification” of both normal and high redundancy of ASM (the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy). This means there might be difference in IOs between what you measure or think for your database is doing, and actually is done at the storage level.

But what if you want to know what is happening on the storage level, so on the level of the cell or actually “cellsrv”, which is the process which makes IO flow to your databases? One option is to run “iostat -x”, but that gives a list that is quite hard readable (too much disk devices); and: it doesn’t show you what the reason for the IO was: redo write? controlfile read? Archivelog? This would especially be great if you want to understand what is happening if your IO behaves different than you expect, and you’ve ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell; to generate a dump with an overview of what kind of IOs are done; use “cellcli” locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for searching your dump in thread-logfile: search (“/” when you use “less” for it), enter the following (using the above example, with sequence #18): “/sequence\ #18”, without ‘”‘.

This is an example from a cell in the Enkitec lab, which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are IOs, it doesn’t say anything about the size of the IOs. Also please mind these are numbers of a single cell, you probably have 3, 7 or 14 cells.

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these number may grow very big. In order to make an easy baseline, the IO reason numbers can be reset, so you can start off your test or proof-of-concept run and measure what actually has happened on the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.

PS1: Thanks to Nikolay Kovachev for pointing out the ‘ioreasons’ resetstats parameter. Indeed ‘all’ is way too blunt.
PS2: The IO numbers seem to be the number IO requests the cell has gotten from it’s clients (ASM and database) for data, not for metadata. During a smartscan metadata flows in between the database and the cell server before data is actually served.

Exadata gets its performance by letting the storage (the exadata storage server) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage server in query processing means that a storage grid can massively parallel (depending on the amount of storage servers participating) process a smart scan request.

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, this means a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the ‘passthrough’ functionality in the storage server. In the exadata book, it is explained that this functionality came with storage server version 11.2.2.3.0 and Oracle database version 11.2.0.3 and Exadata bundle patch 7. It also explains that the ‘passthrough’ functionality means that the storage server deliberately starts sending non-storage processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data, and needs to process more. The database-side statistic to know this is happening is “cell physical IO bytes sent directly to DB node to balance CPU usage” which is at the database level in v$sysstat and on the session level in v$sesstat.

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few “underbar” or “undocumented” parameters in the storage server. In order to get the (current) values of the Exadata storage server, the following command can be used on the “cellcli”:

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following parameters:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50

The “MPP” function is responsible for the passthrough functionality. I can’t find anywhere what “MPP” means, my guess is “Measurement (of) Performance (for) Passthrough”. These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU, “2” means per “200ms”.
_cell_mpp_threshold seems to be the CPU usage threshold after which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (unsure what the exact granularity is) which are sent to the database in passthrough mode.

In order to get a good understanding of what MPP does, there is a MPP specific dump which could be very beneficial to diagnose MPP related matters. This dump is available on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header which is specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in the timestamp is (10:54:05.994), which is is roughly 3 hours and 30 minutes earlier than the timestamp with the next messages, which is 14:36:04.910. The line with the three asterisks is the creation timestamp of the tracefile, which is when the thread which we are using was created. The next line is also created just after that time, which lists the LWPID, userId, etc.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line tells important information: during the approximate 30 minutes in this dump, 4232 blocks where processed by this cell, and 0 blocks where “executed” in “passthru” mode.

Next, the measurements which where taken every 200ms of the past are printed to get an exact overview of measured CPU business and the rate at which “pushback” alias “passthru” was applied.

To see what this means, let’s generate CPU business on the cell, and see if we can get the storage server to invoke “passthru”. There is a simple trick to let a fake process take 100% on it’s CPU thread with common linux shell tools: ‘yes > /dev/null &’. The storage server which I use has 16 CPU threads, so I start 16 of these processes, to effectively have a process that can monopolise every CPU thread.

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described in this blog.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sended 2728 blocks out of the total of 13287 blocks via passthrough mode. Further, in the lines which contain the historical measurements, the “pushback rate” can be seen climbing up to 50%, because the CPU usage was above 90%.

Please mind the techniques I’ve described here are done on one storage server, while a normal Exadata setup has 3 (8th/quarter rack), 7 (half rack) or 14 (full rack) storage servers.

Recently we upgraded an Exadata to the currently latest version, 11.2.3.2.0. The Exadata software itself consists of an image for the storage servers (the storage servers are essentially re-imaged), and a set of updates for the database/computing nodes, including: firmware for ILOM (lights out adapter), BIOS, LSI RAID adapter, Infiniband adapter, linux kernel, drivers, mandatory packages, to name some.

One of the exceptional things this upgrade does, is remove the hot-spare out of the RAID set on the database/compute nodes. This is documented in MOS note: 1468877.1, as ‘known issue 5: hotspare removed for compute nodes’. For some sites, this actually can be a good thing, if they are really tight on disk space on the compute nodes of Exadata. I must say that we have not encountered this situation. What this means, is that the actual HDD configuration on the compute node is left to the customer, instead of having one mandatory configuration (having 3 disks in a RAID-5 configuration, and one hot-spare).

So if you decide to use the former hot-spare disk as an active part of the RAID configuration, you are effectively trading availability for diskspace. Please mind the RAID set itself already provides redundancy, even without the hot-spare!

On the other hand, I think in most configurations, it makes sense to convert the disk back to being hot-spare.

This is done in the following way:

a) Get an overview of the current disk configuration:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70

This shows the disk in slot number 3 being left “unconfigured”, but in “good” state (of a disk has gone bad because of errors, it will be removed from the RAID set, and will show up as “unconfigured(bad)”!)
This is the state the upgrade to 11.2.3.2.0 leaves your system.

New let’s make the disk hot spare again!

b) Get the inclosure id:
/opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252

This means we know the enclosure id (252) and the slot number (3), which is the information needed for the MegaCli utility to revert the unconfigured disk to hot-spare again!

c) Revert the unconfigured disk back to hot-spare
/opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[252:3] -a0
Adapter: 0: Set Physical Drive at EnclId-252 SlotId-3 as Hot Spare Success.
Exit Code: 0x00

d) Check the disk configuration again:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Hotspare, Spun Up
Device Firmware Level: 0B70

It appears that having the disk been removed from the RAID set by the update to 11.2.3.2.0 generates an ASR message. At least it did at our site, despite this being an undocumented bug (7161023,’ASR generating false errors in relation to disks’) which is marked resolved in 11.2.3.1.0 (?). Most sites I encounter have ASR setup, but not having all messages send additionally to local, onsite monitoring. I want to stress it’s very important to have the ASR messages sent to your own monitoring too!

Oracle Support does not list all the specifications from an ASR message it has gotten. Instead, a Service Request is made with enough information for Oracle itself (!!). In our case, the exact error message was NOT specified, only ‘compute server hard disk predictive failure’ and the node name.

Where do you look on an Exadata for that information? The first logical point is the ASR daemon. I didn’t spend too much time on it, but it seems that it’s more a proxy for messages than a database. I wasn’t able to find useful information about the systems which where using this daemon.

What are the sources for ASR with an Exadata? These are:

Computing node:
– “compmon daemon” / Linux level monitoring
– ILOM
Storage node:
– “cell daemon” / Linux level monitoring
– ILOM

For the computing node, it’s quite easy to see if there are any detected failed devices from the viewpoint of the ILOM:
(please mind ipmitool -I open only works on the local system)
# ipmitool -I open sunoem cli "show /SP/logs/event/list Severity==(Major,Critical,Down)"
Connected. Use ^D to exit.
-> show /SP/logs/event/list Severity==(Major,Critical,Down)

ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------

-> Session closed
Disconnected

This shows no messages with the severity Major, Critical or Down are in the eventlog in the ILOM. Please mind that the logons to the ILOM have severity “Minor”. These are in most system the vast majority of the messages, which are not of interest for this investigation. If you want to know if something has failed, there even a simpler command:
# ipmitool -I open sunoem cli "show faulty"

For the “compmon daemon”, grep the processlist for “compmon”:
# ps -ef | grep compmon
root 12812 1 0 Oct22 ? 00:00:11 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server

The most important part here is the directory: /opt/oracle/cellos/compmon
If you navigate to that directory, you will see a number of “state files”: asrs.state, traps.state and disks.state.
The disks.state lists the disk status as listed with a) with the firmware state.
The most important file for the ASR message investigation is the traps.state file. This file lists traps it has sent to ASR. In our case:
1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:
Yes, this is pasted correctly, it misses Physicaldisk, Make Model and Disk Serial Number information. This has not been omitted for safety, it just is not listed.
So, the failure which was sent was HALRT-02008 in our case.

For completeness, the ILOM layer can be investigated identically to the description of the ILOM handling on the computing layer. The Linux layer messages can be investigated with: # cellcli -e list alerthistory
32 2012-10-17T02:00:27+02:00 info "HDD disk controller battery on disk contoller at adapter 0 is going into a learn cycle. This is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours. The disk controller cache might go into WriteThrough caching mode during the learn cycle. Disk write throughput might be temporarily lower during this time. The message is informational only, no action is required."
33 2012-10-22T11:43:21+02:00 info "Factory defaults restored for Adapter 0"
34 2012-10-22T11:43:23+02:00 info "Factory defaults restored for Adapter 0"

%d bloggers like this: