
This is a small note describing how Oracle implemented the situation that is covered by the 'db file parallel read' wait event. This event happens when Oracle knows it must read multiple blocks which are not adjacent (thus from random files and locations), and cannot continue processing with the result of a single block. In other words: it occurs when Oracle cannot process anything after reading just a single block (otherwise Oracle would read a single block, visible via the wait 'db file sequential read').

This is how it shows up if you enable sql trace:

WAIT #139658359011504: nam='db file parallel read' ela= 69997140 files=1 blocks=70 requests=70 obj#=76227 tim=1373200106669612

What this shows is Oracle issuing a request for 70 blocks. This has an interesting implication for monitoring and for looking at the time spent on the event 'db file parallel read': if you don't know the number of blocks for which an IO request was issued, it's impossible to say anything about the time. So just monitoring or looking at cumulative time spent in 'db file parallel read' doesn't say anything about IO latency; it only tells you where the Oracle process spent its time.

How did Oracle implement this? This is obviously port specific (which means the implementation will be different on different operating systems). My test environment is Oracle Linux 6u3 X64, Oracle 11.2.0.3 64 bit using ASM.

This is how the requests are asynchronously submitted to the operating system:

Breakpoint 2, io_submit (ctx=0x7f04c0c8d000, nr=70, iocbs=0x7fff86d965f0) at io_submit.c:23

So all the IO requests are submitted in one go!

The submission of the IO requests is not covered by a wait, which makes sense, because the io_submit() call is (or is supposed to be) non-blocking.

Next Oracle waits for ALL the IOs to finish, covered by the ‘db file parallel read’ wait event:

Breakpoint 13, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, io_getevents_0_4 (ctx=0x7f04c0c8d000, min_nr=70, nr=128, events=0x7fff86d9c798, timeout=0x7fff86d9d7a0) at io_getevents.c:46
Breakpoint 14, 0x0000000008fa1334 in kslwtectx ()

We see kslwtbctx, which indicates the start of a wait event, then an io_getevents() call with the following arguments:

'ctx' is the IO context. This is how Linux keeps track of groups of asynchronous IO requests.
'min_nr' is the minimal number of requests that must be ready for this call to return; the call waits until either that many requests have completed or 'timeout' is reached. If 'timeout' is set to zero, io_getevents() just peeks.
'nr' is the maximal number of requests that io_getevents() will "fetch".
'events' is an array of structs that holds the information about the iocbs (IO control blocks) of the completed requests.
'timeout' is a struct that sets the timeout of this call. For Oracle IO I see the timeout being 600 (seconds) most of the time.

The last line shows kslwtectx, indicating that the wait has ended.

This is yet another blog post on Oracle's direct path read feature, which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table are already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let's take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM).

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we multiply the small table threshold by five, we get 5055. This means the size of table T2 is more than enough, so it should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which reads in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at the small table threshold times five (5055), this index is much bigger than that. It is also bigger than table T2. Let's execute select count(id) from bigtable, and look at the statistic 'index fast full scans (direct read)':

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn't been a direct path read! This means the read has been done in the "traditional way". This is a bit counter-intuitive. I've traced the session, and indeed it's doing traditional multiblock reads, visible via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter "_very_large_object_threshold". The information found on the internet reports this value is in megabytes. A quick stroll through a number of different databases (all on 11.2.0.3) shows this parameter is quite probably statically set at "500".

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This is interesting, because it looks like this parameter does the same for full scans on index segments as "_small_table_threshold" does for full scans on table segments: the segment to be scanned needs to be bigger than five times the threshold.

There are also differences: small table threshold is set in blocks, (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), while very large object threshold seems to be fixed at 500. If my finding is correct, it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It's unknown to me whether the 50% limit for blocks in the cache and the 25% limit for dirty blocks apply here too.

May 4th: some updates after discussion with Jeff Holt, Cary Millsap and Ron Christo of Method-R.

There's all the documentation, and there are all the terrific blogs about performance and Oracle wait events. But if you are more or less starting out with this stuff, or encounter a wait event that is not documented (extensively enough), or an event turns up with a response time you don't expect, you need to understand what that wait event means. If you don't, it's easy to get stuck at this point.

If you are familiar with waits and just want to dig further, progress to "Get to know wait events"; if you want to get up to speed with waits, read on.

What is a wait event?

This is the definition from the performance tuning guide in the Oracle documentation:

Statistics that are incremented by a server process/thread to indicate that it had to wait for an event to complete before being able to continue processing. Wait events are one of the first places for investigation when performing reactive performance tuning.

This description is quite good, I think. If I had to make a better description, I would say a wait event is essentially the registration of time spent executing a part of the Oracle database server code, mostly doing something elementary. The granularity of wait events can differ between Oracle versions. Wait events can describe interaction with the operating system (reading or writing are good examples of this), but also interaction between processes (enqueues and IPC messaging are good examples of that). Needless to say, both interactions require operating system calls.

Also, a wait event is not necessarily a bad thing. Every Oracle database process is either running on CPU or waiting. If your SQL*Plus session is not running anything, it is waiting on 'SQL*Net message from client', which means the database is waiting for your next instruction. Another interesting thing to note about wait events is that the execution of certain instrumented (covered by a wait event) parts of the Oracle server code may not report a wait, per kernel developer decision. This means, for example, that if a latch get went efficiently, you will not get a 'latch free' or 'latch: latch name' wait event, because the wait registration would take exponentially more time than the actual latch get.

How to get the wait events during execution

(skip this section if you are familiar with enabling and disabling sql trace at level 8)

The way to get all the waits for the execution of one or more SQL statements of a given process is to enable SQL trace, alias event 10046, at level 8. This can be done at the system level (meaning all processes) too, but unless you want to reverse engineer the interaction between ALL the processes, this is not very useful. Much has been written about all the ways you can enable SQL trace at level 8. For the sake of simplicity, I set event 10046 at level 8 in the session itself. The session requires the 'alter session' privilege to be able to set it. This is how it's done:

SQL> alter session set events '10046 trace name context forever, level 8';
Session altered.

And can be turned off the following way:

SQL> alter session set events '10046 trace name context off';
Session altered.

Get to know wait events: strace

At this point we know how to enable tracing, but want to dig deeper and understand the operating system interaction inside the waits.

To see what a process (for example your Oracle foreground process) is doing from the perspective of the operating system, there's the 'strace' executable. The 'strace' executable can display all the system calls an operating system process is executing, if it has enough privileges to access it. It resembles SQL trace a bit, but at the operating system/Linux layer. First, determine the process ID of the session you want to run 'strace' on, then use another session to execute 'strace'. This is what it looks like:

1. Setup a sqlplus session
$ sqlplus ts/ts@v11203

SQL*Plus: Release 11.2.0.3.0 Production on Wed Apr 25 23:03:22 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

TS@v11203 >

2. Attach to the sqlplus process with strace in a session on the linux host where the database is running:
$ strace -p 14128
Process 14128 attached - interrupt to quit
read(14,

3. Now execute something in the sqlplus session, for example ‘select * from dual':
TS@v11203 > select * from dual;

D
-
X

TS@v11203 >

4. This results in the following list of system calls:

read(14, "\1/\6\21i \376\377\377\377\377\377\377\377\2\2"..., 8208) = 303
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372954
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\1r\6\20\27\2\374T\rD@\255\262t\t\313\242\1\247-8x"..., 370) = 370
read(14, "\25\6\3\5\"\2\17", 8208) = 21
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\254\6\4\1 \1\1{\5\2"..., 172) = 172
read(14, "\1\337\6\3^#)\4\4\376\377\377\377\377\377\377\3774"..., 8208) = 479
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "010\6\v\1\5\314\2\1p\5\207y"..., 304) = 304
read(14,

Wow! Nice! But what does it mean? Well, these are all the system calls used by the Oracle foreground process to execute this (simple) SQL. It's hard to tell anything about the execution of the SQL from this list. And be aware, this is only a 'select * from dual'; a 'real' SQL statement will probably give a list so long you wouldn't dare to start studying it to get something useful out of it.

Well, how about the next chapter:

Get to know wait events: combining waits and strace

How about combining the sql trace output with the system calls? The ‘strace’ executable has quite some options, which can be very helpful:

- list the data written by the process ('-e write=all')
This option lists the data written by the session. The trace file is written by the server process, so it will list the contents of that.
- list a timestamp for every system call ('-tt')
- list the time spent in the system call ('-T')
- save the output to a file ('-o /path/to/file.txt')

These are all helpful to get more information about the system calls.

Now set up a session in the database, enable SQL trace at level 8, fetch the process ID of the session, run 'strace -ttT -e write=all -p PID' in another session, and run something. Yes, this generates a massive amount of output. But, the way I use it, it's to investigate what certain waits are doing. Here are some examples:


23:46:20.145793 pread(257, "\6\242\202\3\1\356O\6\2049\245\2\3226\264O\6"..., 1032192, 1996505088) = 1032192
23:46:20.148986 lseek(8, 0, SEEK_CUR) = 46126
23:46:20.149030 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 31 39 38 20 66 69 ad' ela= 3198 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 34 le#=5 bl ock#=664 |
| 00050 33 34 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 34 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 34 38 39 36 31 53903801 48961 |
23:46:20.149121 write(9, "J?e83~1z1\n", 10) = 10
| 00000 4a 3f 65 38 33 7e 31 7a 31 0a J?e83~1z 1. |
23:46:20.149163 write(8, "\n", 1) = 1
| 00000 0a . |
23:46:20.157813 pread(256, "\6\242\2\4\1\360O\6\2\4\217T\2\3226\264O\6"..., 1032192, 2000699392) = 1032192
23:46:20.161124 lseek(8, 0, SEEK_CUR) = 46252
23:46:20.161169 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 33 31 37 20 66 69 ad' ela= 3317 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 35 le#=5 bl ock#=665 |
| 00050 36 32 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 62 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 36 31 31 30 30 53903801 61100 |
23:46:20.161260 write(9, "J?Hv2~1z1\n", 10) = 10
| 00000 4a 3f 48 76 32 7e 31 7a 31 0a J?Hv2~1z 1. |
23:46:20.161624 lseek(8, 0, SEEK_CUR) = 46377
23:46:20.161659 write(8, "\n", 1) = 1
| 00000 0a

The pread() call is the actual IO call done by the database.
The lseek() call is the seek towards the end of the trace file, and sets the pointer for the upcoming write.
The write() call is the write to the trace file. You see the multiblock read ('db file scattered read' in the trace file) was done using a pread() call; the time measured by strace is in line with the time measured by Oracle (ela= 3317), and the number of blocks in the trace (126) matches the number of bytes read, 1032192 (126*8192).
The time spent in each system call is shown between angle brackets, which WordPress stripped from the output, so if you were searching for it in the above example, you wouldn't find it.
The write() to file descriptor 9 with the garbage in it is the write to the .trm file.
The lseek() call sets the pointer in the file after the writes, and a write() is done to append a newline (\n) to the trace file.
This was a simple example showing how a synchronous multiblock read is reported; let's see an asynchronous one!


00:17:27.557053 pread(256, "#\242\202\251\243O\6\1\4[\4"..., 8192, 1190150144) = 8192
00:17:27.558973 lseek(8, 0, SEEK_CUR) = 2659
00:17:27.559009 write(8, "WAIT #140244789954888: nam='db f"..., 124) = 124
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 62 20 66 54888: n am='db f |
| 00020 69 6c 65 20 73 65 71 75 65 6e 74 69 61 6c 20 72 ile sequ ential r |
| 00030 65 61 64 27 20 65 6c 61 3d 20 31 39 30 33 20 66 ead' ela = 1903 f |
| 00040 69 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 34 33 ile#=5 b lock#=43 |
| 00050 33 39 34 20 62 6c 6f 63 6b 73 3d 31 20 6f 62 6a 394 bloc ks=1 obj |
| 00060 23 3d 31 34 30 33 33 20 74 69 6d 3d 31 33 33 35 #=14033 tim=1335 |
| 00070 33 39 32 32 34 37 35 35 38 39 34 33 39224755 8943 |
00:17:27.559088 write(9, "N?Rb~22y1\n", 10) = 10
| 00000 4e 3f 52 62 7e 32 32 79 31 0a N?Rb~22y 1. |
00:17:27.559128 write(8, "\n", 1) = 1
| 00000 0a . |
00:17:27.559334 mmap(0x7f8d48e41000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48e41000
00:17:27.559450 mmap(NULL, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x1ec000) = 0x7f8d48c81000
00:17:27.559491 mmap(0x7f8d48c81000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48c81000
00:17:27.559576 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x2fc000) = 0x7f8d48a81000
00:17:27.559614 mmap(0x7f8d48a81000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48a81000
00:17:27.559834 io_submit(140244822339584, 1, {{0x7f8d48a861f8, 0, 0, 0, 256}}) = 1
00:17:27.560008 mmap(0x7f8d48ad1000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48ad1000
00:17:27.560080 io_submit(140244822339584, 1, {{0x7f8d48a86450, 0, 0, 0, 256}}) = 1
00:17:27.560272 io_getevents(140244822339584, 1, 128, {{0x7f8d48a86450, 0x7f8d48a86450, 122880, 0}}, {600, 0}) = 1
00:17:27.560363 io_getevents(140244822339584, 1, 128, {{0x7f8d48a861f8, 0x7f8d48a861f8, 106496, 0}}, {600, 0}) = 1
00:17:27.560865 lseek(8, 0, SEEK_CUR) = 2784
00:17:27.560902 write(8, "WAIT #140244789954888: nam='dire"..., 129) = 129
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 69 72 65 54888: n am='dire |
| 00020 63 74 20 70 61 74 68 20 72 65 61 64 27 20 65 6c ct path read' el |
| 00030 61 3d 20 35 38 36 20 66 69 6c 65 20 6e 75 6d 62 a= 586 f ile numb |
| 00040 65 72 3d 35 20 66 69 72 73 74 20 64 62 61 3d 34 er=5 fir st dba=4 |
| 00050 33 33 39 35 20 62 6c 6f 63 6b 20 63 6e 74 3d 31 3395 blo ck cnt=1 |
| 00060 33 20 6f 62 6a 23 3d 31 34 30 33 33 20 74 69 6d 3 obj#=1 4033 tim |
| 00070 3d 31 33 33 35 33 39 32 32 34 37 35 36 30 38 34 =1335392 24756084 |
| 00080 35 5 |

This advanced example shows how first the segment header is fetched using the ‘db file sequential read’ wait. Despite being on ASM (this is ASM!), this clearly shows the segment header is read synchronously using the pread() call.
The next system calls and waits are even more interesting!
First, memory is allocated using the mmap() calls. This is the beginning of a full scan, and the data read from disk needs to be buffered somewhere. Please mind the memory allocation is not part of any wait. That is no problem; the allocation of memory should be very fast.
Next, TWO asynchronous reads are initiated using the two io_submit() calls, mind the iocb (IO control block) addresses, 0x7f8d48a861f8 and 0x7f8d48a86450.
See the man page of the io_submit() call on your Linux system using 'man io_submit'; also use the man page of io_getevents() to learn more about that call.
Next, the asynchronous reads are reaped using the io_getevents() calls. What is noteworthy here is that the two requests are reaped in REVERSE order (this is consistent for the first pair of asynchronous calls for every segment; this example is Oracle version 11.2.0.3). So first request 0x7f8d48a86450 is fetched, then 0x7f8d48a861f8.
After 0x7f8d48a861f8 is reaped, the wait is reported for the FIRST IO call (the io_submit() and io_getevents() combination). The second request (0x7f8d48a86450) at the start of a full segment scan is never reported. Further requests do not reverse the reaping order or 'encapsulate' IO requests (mostly; it can happen a few more times).

I hope you get the idea of digging deeper into waits by examining the system calls.

I am not sure if the above 'encapsulation' of IO calls is a bug. But think of this: if you issue multiple IO requests (as we saw above, multiple IO requests are issued!), you cannot report the waits of all these distinct IO requests, because you would get more time reported than has passed. This has a consequence for the above 'direct path read' waits: they do not report individual IO times (like 'db file sequential read' does). The wait is just an indication of the time the process spent waiting for results from its IO requests, rather than disk IO latency.

The above conclusions are the result of observation and logical reasoning from my side. If you have anything to add, please respond!

One of the advantages of ASM is doing non (operating system) buffered IO (also known as ‘DIO’ or Direct IO), and doing asynchronous IO (also known as ‘AIO’ or ‘asynch IO’).

This is an excerpt from “ASM 10gr2 Best Practices“:

The database file level access (read/write) of ASM files is similar to pre-10g, except that any database file name that begins with a "+", will automatically be handled and managed using the ASM code path. However, with ASM files, the database file access inherently has the characteristics of raw devices; i.e., un-buffered (direct IO) with kernelized asynchronous IO (KAIO).

What does this quote say?
  • If a database file name begins with a "+", it will automatically be handled and managed using the ASM code path.

I think this is true.

  • With ASM files (I read that as 'database files in ASM') the database file access inherently has the characteristics of raw devices; i.e., un-buffered (direct IO)

I am confident this is true too: there is no filesystem involved during the reading of an ASM file, so it can't be buffered in a filesystem cache.

  • with kernelized asynchronous IO (KAIO)

This is mostly true. If asynchronous IO is set up correctly, most IOs are done asynchronously, but some are synchronous.

How do you measure if asynchronous IO is being used?

The note on MOS (My Oracle Support) about asynchronous IO ('How To Check if Asynchronous I/O is Working On Linux', Doc ID 237299.1) has changed. It used to point to the Linux slabinfo (visible in /proc/slabinfo; these are kernel "slabs"), to look for allocations of "kiocb" (kernel IO callback) and "kioctx" (kernel IO context), which would allow someone to see if *some* process has initiated the usage of these slabs, which indicates asynchronous IO has been initialised. Mind "some process": this doesn't have to be the database. Not an extremely reliable way of telling whether asynchronous IO is used.

Now, the note also includes more accurate ways to detect the kind of IOs done:
- An example of a 'strace' of a process (the database writer, dbwr) doing asynchronous IO, which is easily visible by the asynchronous IO system calls: io_submit(), which issues an IO request, and io_getevents(), which reads the completion queue to verify the status of submitted IOs.
- An example of a 'strace' of the same process doing the same work using synchronous IO, which is also easily visible by the synchronous write call: pwrite().

When ASMLib is used, it gets a little more difficult: processes use synchronous IO (read()) to issue calls to ASMLib meta-devices, which do asynchronous IO on behalf of the calling process, depending on Oracle database settings. Please mind I haven't investigated this more in depth; this is what the Metalink document says!

So: the IO method is visible by tracing the system calls of a process:
io_submit, io_getevents = asynchronous IO
pread, pwrite = synchronous IO

disk_asynch_io

The usage of asynchronous IO depends on the database parameter 'disk_asynch_io'. If set to 'false', the database uses synchronous IO, even when ASM is used. So the above quote is NOT always true. With ASM and 'disk_asynch_io' set to 'false', the IO calls will be unbuffered, but synchronous. I have no databases on raw devices, but I am confident it will behave the same.

But whilst disk_asynch_io=false always means the IO is synchronous, when set to ‘true’ it does not mean asynchronous IO is always used.

filesystemio_options

By default, the parameter 'filesystemio_options' is empty after a database is created. On Linux (2.6 kernel), both 32 and 64 bit, a database (I've tested 11.2.0.1, but assume it's the same for lower versions) does synchronous, buffered IO if it's using the 'ext3' filesystem. In order to activate unbuffered (direct) IO, the parameter filesystemio_options needs to be set to 'directio'; in order to use asynchronous IO it needs to be set to 'asynch'; in order to use both, use 'setall'. Finally, set it to 'none' to disable both DIO and AIO.

If a database is created which uses ASM for storage, asynchronous IO is done (and direct IO, by design) whilst the parameter ‘filesystemio_options’ is empty.

Mostly asynchronous

I've mentioned 'mostly asynchronous' a number of times. What does that mean? In the current version (11.2.0.1) of the database, physical reads of data dictionary objects are ALWAYS done synchronously, regardless of the options set and the storage type.

Below are samples done on a VM in VMware Fusion, OEL5u5 x64, database 11.2.0.1.

1. Database on ext3 filesystem, filesystemio_options=none (or empty)

1.a. logon to the database, list the PID of the server process
$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jul 21 16:24:26 2010

Copyright (c) 1982, 2009, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select spid from v$session a, v$process b where a.paddr=b.addr and a.sid = (select distinct sid from v$mystat);

SPID
------------------------
6775

1.b. open a second session (on the database server), and start a strace:
$ strace -cp 6775
Process 6775 attached - interrupt to quit

1.c. switch back to the sqlplus session, and execute ‘select count(*) from dba_extents’ and exit:
SQL> select count(*) from dba_extents;

COUNT(*)
----------
7963

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

1.d. now switch to the strace session; it will summarise the system calls:

Process 7165 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
94.38 1.550058 304 5091 pread
2.57 0.042190 0 295641 getrusage
2.52 0.041446 364 114 munmap
0.45 0.007470 22 346 mmap
0.07 0.001130 0 10408 times

It tells us we spent 1.55 seconds on pread system calls. This is expected. But: this is synchronous IO. I guess most Oracle databases on linux just run with synchronous, buffered IO.

2. Database on ext3 filesystem, filesystemio_options=setall

2.a logon to the database, list the PID of the server process
$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jul 21 16:24:26 2010

Copyright (c) 1982, 2009, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select spid from v$session a, v$process b where a.paddr=b.addr and a.sid = (select distinct sid from v$mystat);

SPID
------------------------
7586

2.b. open a second session (on the database server), and start a strace:
$ strace -cp 7586
Process 7586 attached - interrupt to quit

2.c. switch back to the sqlplus session, and execute ‘select count(*) from dba_extents’ and exit:
SQL> select count(*) from dba_extents;

COUNT(*)
----------
7963

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

2.d. now switch to the strace session for the summary:

Process 7586 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
93.42 1.412650 278 5089 pread
3.26 0.049300 0 295641 getrusage
2.99 0.045252 393 115 munmap
0.23 0.003473 10 344 mmap
0.10 0.001463 0 10408 times

This tells us we still did pread(), in other words synchronous IO calls, despite filesystemio_options being set to ‘setall’.

3. Reading a regular table.

3.a. login to the database and create a table:
$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jul 21 16:48:54 2010

Copyright (c) 1982, 2009, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table test as select * from dba_source;

Table created.

3.b. list PID
SQL> select spid from v$session a, v$process b where a.paddr=b.addr and a.sid = (select distinct sid from v$mystat);

SPID
------------------------
15716

3.c. start a strace session in another window:
$ strace -cp 15716
Process 15716 attached - interrupt to quit

3.d. issue ‘select count(*) from test’ in the sqlplus session, and exit:
SQL> select count(*) from test;

COUNT(*)
----------
623146

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

3.e. now look at the strace session for the summary:
Process 14284 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.39 0.023942 73 326 io_submit
35.43 0.014781 336 44 pread
7.18 0.002997 9 317 io_getevents
0.00 0.000000 0 2 read
0.00 0.000000 0 2 write

Ah, now we have done asynchronous IO (io_submit, io_getevents), but there’s also ‘pread’, thus synchronous IO?
The pread() calls are the data dictionary lookups the session needed to do to gather all the metadata required for reading the ‘TEST’ table.

Let’s look into a database using ASM:

4.a. logon, verify ASM usage:
$ sq

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jul 21 17:12:50 2010

Copyright (c) 1982, 2009, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

SQL> select name from v$datafile;

NAME
--------------------------------------------------------------------------------
+DATA/bioinfo1/datafile/system.256.721389685
+DATA/bioinfo1/datafile/sysaux.257.721389685
+DATA/bioinfo1/datafile/undotbs1.258.721389685
+DATA/bioinfo1/datafile/users.259.721389685
+DATA/bioinfo1/datafile/original.266.721394641
+DATA/bioinfo1/datafile/dbfs.267.724598427

6 rows selected.

SQL>
This database is using ASM!
4.b. start a ‘strace -cp ‘ in another session.
4.c. execute ‘select count(*) from dba_extents’; exit
4.d. investigate summary of strace:
Process 5269 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
94.23 1.559458 279 5599 pread
3.89 0.064443 0 323685 getrusage
1.54 0.025487 167 153 munmap
0.28 0.004667 23 207 mmap
0.06 0.000955 0 11394 times

So we have just done synchronous IO, despite being on ASM.

Let’s create the table from dba_source again, and see if that will invoke AIO:
5.a. login, create table
$ sq

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jul 21 17:18:38 2010

Copyright (c) 1982, 2009, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

SQL> create table test as select * from dba_source;

Table created.

5.b. set ‘strace -cp ‘ again in another session
5.c. execute ‘select count(*) from test;’, then exit
5.d. investigate strace summary:

Process 5742 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
44.59 0.023152 71 327 io_submit
26.93 0.013983 215 65 pread
18.16 0.009431 63 149 munmap
9.62 0.004994 17 292 io_getevents
0.55 0.000283 283 1 mmap

Ah, now we are mostly doing asynchronous IO, again not for the data dictionary lookups, but we do for the table data itself!

A recent thread on the oracle-l maillist (http://www.freelists.org/list/oracle-l) called “REDO LOG Concepts” caught my attention. Someone asked about the transaction state of the (oracle) database when the online redologfiles have been deleted prior to a commit.

Most people know that on linux and unixes a file keeps on living (remains available) in the context of a process as long as it is kept open by that process, despite deletion by another process. The file is gone for all processes that try to access it after the deletion, but the deleted file will only really be disposed of once all processes that have it open close their file descriptor to it.

I guess everybody who does or has done linux/unix administration has encountered the situation at least once where a filesystem gets filled by some kind of logfile or tracefile which grows at a very fast pace, and deleting it does not release any space: the process filling the file is not stopped, so the file is held open, and so the file still exists, despite deletion.
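
This behaviour is easy to demonstrate outside Oracle; a minimal sketch on any Linux shell:

```shell
# Show that a deleted file stays alive while a descriptor is open:
f=$(mktemp)
echo "precious redo data" > "$f"
exec 3< "$f"    # open the file on descriptor 3
rm "$f"         # delete it; the inode survives while fd 3 is open
cat <&3         # prints: precious redo data
exec 3<&-       # closing the last descriptor disposes of the inode
```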

Some suggestions were made for recovery and for protecting against it (mirroring online redologfiles). I guess most people know recovery fundamentals, so I will not dive further into that matter.

A fellow oracle-l member (Stefan Knecht) however suggested:

I’ve run into this situation before as well. It happens, people accidentally delete files.

What would be interesting to research, is if you still can somehow recover those files. As they’re still there, and locked by the process accessing them (in your case LGWR), there might be a way to “steal” the filehandle from that process and just write the file to another location before it’s being closed. Though you would have to intercept the syscall closing the file.

That made me think: linux has the proc filesystem, which gives system information, but also process information. This process information is quite extensive. One of the sections (directory actually) is a list of all the file descriptors for a process. These file descriptors are shown as symlinks:


[oracle@centos50-oracle10203 bin]$ ./sqlplus '/as sysdba'

SQL*Plus: Release 10.2.0.3.0 - Production on Thu May 17 16:24:01 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 167772160 bytes
Fixed Size 1260696 bytes
Variable Size 130024296 bytes
Database Buffers 33554432 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning, OLAP and Data Mining options
[oracle@centos50-oracle10203 bin]$ ps -ef | grep pmon
oracle 8634 1 0 16:24 ? 00:00:00 ora_pmon_frits
oracle 8680 2642 0 16:24 pts/1 00:00:00 grep pmon
[oracle@centos50-oracle10203 bin]$ cd /proc/8634/fd
[oracle@centos50-oracle10203 fd]$ ls -ls
total 0
0 lr-x------ 1 oracle oracle 64 May 17 16:24 0 -> /dev/null
0 lr-x------ 1 oracle oracle 64 May 17 16:24 1 -> /dev/null
0 lrwx------ 1 oracle oracle 64 May 17 16:24 10 -> /oracle/db/10.2.0.3/dbs/hc_frits.dat
0 lrwx------ 1 oracle oracle 64 May 17 16:24 11 -> /oracle/db/10.2.0.3/admin/frits/adump/ora_8632.aud
0 lr-x------ 1 oracle oracle 64 May 17 16:24 12 -> /dev/zero
0 lr-x------ 1 oracle oracle 64 May 17 16:24 13 -> /dev/zero
0 lr-x------ 1 oracle oracle 64 May 17 16:24 14 -> /oracle/db/10.2.0.3/rdbms/mesg/oraus.msb
0 lrwx------ 1 oracle oracle 64 May 17 16:24 15 -> /oracle/db/10.2.0.3/dbs/hc_frits.dat
0 lrwx------ 1 oracle oracle 64 May 17 16:24 16 -> socket:[33865]
0 lrwx------ 1 oracle oracle 64 May 17 16:24 17 -> /oracle/db/10.2.0.3/dbs/lkFRITS
0 lr-x------ 1 oracle oracle 64 May 17 16:24 18 -> /oracle/db/10.2.0.3/rdbms/mesg/oraus.msb
0 lr-x------ 1 oracle oracle 64 May 17 16:24 2 -> /dev/null
0 lr-x------ 1 oracle oracle 64 May 17 16:24 3 -> /dev/null
0 lr-x------ 1 oracle oracle 64 May 17 16:24 4 -> /dev/null
0 l-wx------ 1 oracle oracle 64 May 17 16:24 5 -> /oracle/db/10.2.0.3/admin/frits/udump/frits_ora_8632.trc
0 l-wx------ 1 oracle oracle 64 May 17 16:24 6 -> /oracle/db/10.2.0.3/admin/frits/bdump/alert_frits.log
0 lrwx------ 1 oracle oracle 64 May 17 16:24 7 -> /oracle/db/10.2.0.3/dbs/lkinstfrits (deleted)
0 l-wx------ 1 oracle oracle 64 May 17 16:24 8 -> /oracle/db/10.2.0.3/admin/frits/bdump/alert_frits.log
0 l-wx------ 1 oracle oracle 64 May 17 16:24 9 -> pipe:[11466]

As you can see, these are the open files of the pmon process of an oracle 10.2.0.3 database just after startup.
(if you want to use ltrace, strace or gdb on a process, this is the place to find the mapping of file descriptor to a file, socket or pipe!)
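
A miniature version of the ‘steal the filehandle’ idea, using the current shell’s own fd directory in proc (a sketch; the target path is an example):

```shell
# Open a file, delete it, then copy it back via /proc/<pid>/fd.
f=$(mktemp)
echo "lost but not gone" > "$f"
exec 4< "$f"                        # keep a descriptor on the file
rm "$f"                             # the directory entry is gone...
cp "/proc/$$/fd/4" /tmp/recovered   # ...but /proc still reaches the inode
exec 4<&-
cat /tmp/recovered                  # prints: lost but not gone
```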

Can I use a file descriptor to get data out of the file? Probably, because it’s a symlink, which is perfectly normal to use. Well then, back to the suggestion made:

“As they’re still there, and locked by the process accessing them (in your case LGWR), there might be a way to “steal” the filehandle from that process and just write the file to another location before it’s being closed.”

A quick test on a test system (centos 4.4, oracle 10.2.0.3):


SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------------------------------------------------------------------------ ---
3 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_3_31l7kp85_.log NO
3 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_3_31l7kqsl_.log YES
2 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_2_31l7ko7c_.log NO
2 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_2_31l7koq3_.log YES
1 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log NO
1 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_1_31l7knrh_.log YES

6 rows selected.

SQL>

Let’s see which log group is the current one:


SQL> select * from v$log;

GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIM
---------- ---------- ---------- ---------- ---------- --- ----------------------------- ---------
1 1 38 52428800 2 NO INACTIVE 1198861 10-APR-07
2 1 39 52428800 2 NO INACTIVE 1238951 10-APR-07
3 1 40 52428800 2 NO CURRENT 1274534 17-MAY-07

SQL>

Log group 3 is the current one. Well, let’s delete one of the files of group 1 (I have two members in every log group; let’s pretend we have only one member):


[oracle@filetest bin]$ rm /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log

Now look at the file descriptors of the logwriter:


...
1 lrwx------ 1 root root 64 May 17 17:07 21 -> /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log (deleted)
...

And lets make a copy of the file using the file descriptor:


[root@filetest ~]# cat /proc/4703/fd/21 > o1_mf_1_31l7knbs_.log
[root@filetest ~]# ls -ls o1_mf_1_31l7knbs_.log
51260 -rw-r--r-- 1 root root 52429312 May 17 17:11 o1_mf_1_31l7knbs_.log

It looks like we’ve made a copy of a deleted file, using the file descriptor in the proc filesystem. Now let’s shut down the database and start it up. This will go wrong, because a member of online redolog group 1 can not be identified:


SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 167772160 bytes
Fixed Size 1260696 bytes
Variable Size 109052776 bytes
Database Buffers 54525952 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.

SQL>

Excuse me? Shouldn’t we get an error message saying an online logfile could not be identified? Let’s investigate further:


SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------- ----------------------------------------------------------------- ---
3 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_3_31l7kp85_.log NO
3 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_3_31l7kqsl_.log YES
2 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_2_31l7ko7c_.log NO
2 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_2_31l7koq3_.log YES
1 INVALID ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log NO
1 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_1_31l7knrh_.log YES

6 rows selected.

SQL>

Well, apparently not… I recall situations where a lost online redologfile would cause the startup to fail, and the fix was to copy an existing member of the group over the missing one. I guess this was altered in some version of the database. Anybody know which version?

Well, it makes sense. The status of the file is set to ‘INVALID’ and the surviving file(s) are used. A message is printed in the alert.log:


Thu May 17 17:43:13 2007
Errors in file /oracle/db/10.2.0.3/admin/frits/bdump/frits_lgwr_13960.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Thu May 17 17:43:13 2007
Errors in file /oracle/db/10.2.0.3/admin/frits/bdump/frits_lgwr_13960.trc:
ORA-00321: log 1 of thread 1, cannot update log file header
ORA-00312: online log 1 thread 1: '/oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log'

Now, let’s replace the missing one with our copy from the file descriptor:


[root@filetest ~]# cat o1_mf_1_31l7knbs_.log > /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log
[root@filetest ~]# chown oracle:oinstall /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log

Now switch logfiles until we reach the log group in which the invalid file resides (log group 1)…


SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------------------------------------------------------------------------ ---
3 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_3_31l7kp85_.log NO
3 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_3_31l7kqsl_.log YES
2 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_2_31l7ko7c_.log NO
2 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_2_31l7koq3_.log YES
1 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log NO
1 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_1_31l7knrh_.log YES

6 rows selected.

SQL>

Look! The file is valid again!

Once a logfile is invalid for whatever reason, oracle continues to work as long as there are valid members in the group. During a logswitch, the files in the log group where oracle switches to are identified, even if they are invalid, and reset to a valid state if they are usable again.

Also, it is possible to make a copy of an already deleted file (as root on centos 4.4 (centos 4.4 is a free version of RHEL4), and as the owner of the file or root on centos 5 (RHEL5)) using the file descriptor in proc.

In the 10.2.0.1 version of the oracle database a new heap checking mechanism, together with a new messaging system, was introduced. This new mechanism reports memory allocations above a threshold in the alert.log, together with a tracefile in the udump directory.

In certain situations it can be handy to know if large allocations are being done in the sga heap (shared pool), but only to troubleshoot memory allocation problems (which eventually appear as the infamous ORA-4031). For normal day-to-day work, I doubt the effectiveness of these messages.

The default limit is set at 2048K. It appears normal usage can result in messages from the memory manager. During my last installation (with all options set default, all options installed), this resulted in 125 messages from the memory manager.

Oracle gives a solution on metalink in note 330239.1:
Set _kgl_large_heap_warning_threshold to a “reasonably high” value. This parameter is set in bytes, and oracle recommends the value 8388608 (8192K).
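
A sketch of that change (a configuration fragment; note that hidden underscore parameters have to be double-quoted in ALTER SYSTEM, and scope=spfile means it takes effect at the next restart):

```shell
# Sketch following note 330239.1: raise the large-allocation warning
# threshold to 8388608 bytes (8192K). Underscore parameters need to
# be enclosed in double quotes.
sqlplus -S / as sysdba <<'EOF'
alter system set "_kgl_large_heap_warning_threshold"=8388608 scope=spfile;
EOF
```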

Oracle recently shipped the production version of Oracle XE, the free version of Oracle’s 10g release 2 database. This is an investigation from a DBA’s point of view. Is it comparable to a ‘normal’ 10g database, or is it something completely different?

Oracle Technet Link for more information and free download: Oracle XE Home

For linux, oracle ships the usual gzipped cpio for the normal database, but an rpm (in the beta version), and an rpm and a deb (!!) file for the XE production version. I use ubuntu on my testing machine, so I am very pleased that I can use a deb file (because ubuntu uses the debian package manager) and do not have to use ‘alien’ to convert the rpm to a deb.

Each package manager checks the environment to see if it meets the requirements. The oracle XE package checks the amount of swapspace quite drastically: I have 512M of memory and a swapspace of 1G, but it needed 1008M. I got around this by adding some filesystem-based swap:


root@bedrock:~# dd if=/dev/zero of=/extraswap bs=1k count=1024000
root@bedrock:~# mkswap /extraswap
root@bedrock:~# swapon /extraswap
root@bedrock:~# swapon -s
Filename Type Size Used Priority
/dev/mapper/vg00-swap partition 1048568 138096 -1
/extraswap file 1023992 0 -4

Well, with this in mind, the install should work:


root@bedrock:~# dpkg -i oracle-xe_10.2.0.1-1.0_i386.deb
(Reading database ... 70694 files and directories currently installed.)
Unpacking oracle-xe (from oracle-xe_10.2.0.1-1.0_i386.deb) ...
Setting up oracle-xe (10.2.0.1-1.0) ...
Executing Post-install steps...

You must run '/etc/init.d/oracle-xe configure' as the root user to configure the database.

root@bedrock:~#

This is easy! The software is installed, we only need to configure the database now, in order to use it!


root@bedrock:~# /etc/init.d/oracle-xe configure

Oracle Database 10g Express Edition Configuration
-------------------------------------------------
This will configure on-boot properties of Oracle Database 10g Express
Edition. The following questions will determine whether the database should
be starting upon system boot, the ports it will use, and the passwords that
will be used for database accounts. Press Enter to accept the defaults.
Ctrl-C will abort.

Specify the HTTP port that will be used for Oracle Application Express [8080]:

Specify a port that will be used for the database listener [1521]:

Specify a password to be used for database accounts. Note that the same
password will be used for SYS and SYSTEM. Oracle recommends the use of
different passwords for each database account. This can be done after
initial configuration:
Confirm the password:

Do you want Oracle Database 10g Express Edition to be started on boot (y/n) [y]:n

Starting Oracle Net Listener...Done
Configuring Database...Done
Starting Oracle Database 10g Express Edition Instance...Done
Installation Completed Successfully.
To access the Database Home Page go to "http://127.0.0.1:8080/apex"

My first question is: what do we have installed now?

Well, it appears we have a running database, together with a listener. Just that! The application (formerly known as ‘HTMLDB’, now called ‘Application Express’) runs entirely in the database:


oracle@bedrock:~$ ps -ef | grep oracle
oracle 20226 20225 0 13:13 pts/6 00:00:00 -su
oracle 22384 1 0 14:51 ? 00:00:00 /usr/lib/oracle/xe/app/oracle/product/10.2.0/server/bin/tnslsnr LISTENER -inherit
oracle 22515 1 0 14:53 ? 00:00:00 xe_pmon_XE
oracle 22517 1 0 14:53 ? 00:00:00 xe_psp0_XE
oracle 22519 1 0 14:53 ? 00:00:00 xe_mman_XE
oracle 22521 1 0 14:53 ? 00:00:00 xe_dbw0_XE
oracle 22523 1 0 14:53 ? 00:00:00 xe_lgwr_XE
oracle 22525 1 0 14:53 ? 00:00:00 xe_ckpt_XE
oracle 22527 1 0 14:53 ? 00:00:00 xe_smon_XE
oracle 22529 1 0 14:53 ? 00:00:00 xe_reco_XE
oracle 22531 1 0 14:53 ? 00:00:00 xe_cjq0_XE
oracle 22533 1 0 14:53 ? 00:00:00 xe_mmon_XE
oracle 22535 1 0 14:53 ? 00:00:00 xe_mmnl_XE
oracle 22537 1 0 14:53 ? 00:00:00 xe_d000_XE
oracle 22539 1 3 14:53 ? 00:00:05 xe_s000_XE
oracle 22541 1 1 14:53 ? 00:00:02 xe_s001_XE
oracle 22543 1 0 14:53 ? 00:00:01 xe_s002_XE
oracle 22545 1 0 14:53 ? 00:00:00 xe_s003_XE
oracle 22550 1 0 14:53 ? 00:00:00 xe_qmnc_XE
oracle 22579 1 0 14:54 ? 00:00:00 xe_q000_XE
oracle 22581 1 0 14:54 ? 00:00:00 xe_q001_XE
oracle 22622 20226 0 14:56 pts/6 00:00:00 ps -ef
oracle 22623 20226 0 14:56 pts/6 00:00:00 grep oracle

Where is the software installed?

If we examine the startup scripts, or list the contents of the oracle XE package, we see the software gets installed in /usr/lib/oracle, just as many other debian packages. That’s the debian default. I can live with that.

What’s the oracle home?

The ORACLE_HOME is /usr/lib/oracle/xe/app/oracle/product/10.2.0/server. That’s hidden quite far away from the root.

And how about the database of oracle XE?

The oracle XE database is called ‘XE’. That’s the db_name and the instance_name. The database is started with an spfile, which is located in $ORACLE_HOME/dbs. The controlfile is located in the directory with the datafiles, which is: /usr/lib/oracle/xe/oradata/XE. The online redologfiles are located in: /usr/lib/oracle/xe/app/flash_recovery_area/XE/onlinelog and are called o1_mf_20onfoz2_.log and o1_mf_20onfnc0_.log.

Besides the fact that only one database can be run with XE (starting up a second instance results in: ORA-44410: XE edition single instance violation error), I would still love to see all files related to an oracle instance kept in a single place. This eases administration, makes transportability easy, and makes backup and restore much easier. Now the files are quite scattered among the oracle tree.

It is a bad idea to have the spfile (or init.ora file), which is critical for starting up a database and *must* be present, in the oracle software tree. I strongly advise having a symlink in the dbs directory instead of the real file.
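
A sketch of that advice as a small helper (the function name and the example paths are my own, not part of the install):

```shell
# relocate_spfile: move a parameter file out of the software tree
# into a data directory and leave a symlink behind in its place.
relocate_spfile() {
  dbs="$1"; datadir="$2"; name="$3"
  mv "$dbs/$name" "$datadir/$name"
  ln -s "$datadir/$name" "$dbs/$name"
}
# Example for the XE layout:
# relocate_spfile "$ORACLE_HOME/dbs" /usr/lib/oracle/xe/oradata/XE spfileXE.ora
```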

The online redologfiles are reasonably sized for small installations without any batch load ever. For normal database use (yes, I know, this is express, and NOT a normal database) I would start at a size of 200M. Also, the number of redolog groups is quite small (2). Online redologfiles are *critical* to database transaction performance, and therefore need to be easily accessible for a DBA and need to have reasonable names. The current names (o1_mf_20onfoz2_.log and o1_mf_20onfnc0_.log) are weird, and the place is odd too (somewhere in the flash_recovery_area). I would suggest placing the online redologfiles somewhere close to the datafiles, but on a separate filesystem. The online redologfile names are randomly chosen during the install.
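
As a sketch of that suggestion (a configuration fragment; the paths and sizes here are examples, not part of the XE install):

```shell
# Sketch: add a redolog group with two 200M members on separate
# filesystems, with readable names. Repeat for more groups, then
# drop the small OMF groups once they become INACTIVE.
sqlplus -S / as sysdba <<'EOF'
alter database add logfile group 3
  ('/u02/oradata/XE/redo_g3_m1.log',
   '/u03/oradata/XE/redo_g3_m2.log') size 200m;
EOF
```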

How is the default database sized?


oracle@bedrock:~$ export ORACLE_HOME=/usr/lib/oracle/xe/app/oracle/product/10.2.0/server
oracle@bedrock:~$ export PATH=$PATH:$ORACLE_HOME/bin
oracle@bedrock:~$ export ORACLE_SID=XE
oracle@bedrock:~$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Sun Mar 5 15:20:54 2006

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Express Edition Release 10.2.0.1.0 - Production

SQL>show parameter sga_max_size

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
sga_max_size big integer 140M
SQL> show parameter pga_aggregate_target

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 41216K

Well, it is small, but the memory usage can be raised to 1G, according to the documentation.

Can we break it?

In essence, XE is a normal 10gr2 database, just limited. So all means to break it are also applicable to non-XE oracle databases! I very easily could get an unstartable database. Look at the following activity:


oracle@bedrock:~$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Sun Mar 5 15:24:51 2006

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Express Edition Release 10.2.0.1.0 - Production

SQL> alter system set cluster_database=true scope=spfile;

System altered.

SQL> shut
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORA-00439: feature not enabled: Real Application Clusters

This completely blocks the normal way to alter the spfile. Even a startup nomount cannot be issued, because the database cannot get to the nomount stage. The workaround is quite simple, though:


oracle@bedrock:~$ cd $ORACLE_HOME/dbs
oracle@bedrock:~/app/oracle/product/10.2.0/server/dbs$ strings spfileXE.ora | grep -v cluster_database > initXE.ora
oracle@bedrock:~/app/oracle/product/10.2.0/server/dbs$ rm spfileXE.ora
oracle@bedrock:~/app/oracle/product/10.2.0/server/dbs$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Sun Mar 5 16:31:44 2006

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 146800640 bytes
Fixed Size 1257668 bytes
Variable Size 67112764 bytes
Database Buffers 75497472 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.
SQL> create spfile from pfile;

File created.

SQL>
