Archive

Exadata

A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the Oracle engine must have made the decision to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication on how much potentially could be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and get an understanding when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides a (non documented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information which the tracing provides is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;

  COUNT(*)
----------
   1000000

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks.
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this means to express. It got a minus sign as first, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo here is by Oracle. Also this describes MTT, medium table threshold, which is 5 times small table threshold. It says that the threshold is MTT. This is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the exadata code base is always enabled. Probably in version 10 this had to be added by adding code, alike linking in functionality such as RAC and SDO (spatial), by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line which is described above. Here islarge is set to 1. This means the objects is considered too large for being a small segment.
The next line is the kcbimd function, based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0, this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1, this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold being 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;
    169999

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different than the total number of blocks from dba_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1′ which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down the kcbivlo line says it is not large here too (is_large 0), which means larger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0′.

An important point is the switch to considering doing a direct path read, alias a segment is considered medium sized, is simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line with segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment being bigger than MTT (5*STT) and smaller than VLOT, which is not true, the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database depended.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable for me are diomrc (quite probably direct IO multiblock read count) which is set to the multiblock read count value. The other one is dios (quite probably direct IO slots), which shows the starting value of the direct IO slots, which is the amount of IOs the database will issue asynchronously when starting a full segment scan. Fully automatic Oracle will measure throughput and CPU usage, and determine if more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on 11.2.0.3 and 12.1.0.2, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless there was 99% or more of the blocks in the local cache. This is quite contrary to popular believe that the threshold is 50% of the blocks in cache to switch to reading blocks into the buffer cache. In all honesty, I have presented on the switch point value being 50% too.

When adding in writes to the mix it gets even more interesting. I first done an update of approximately 40% of the blocks, and did not commit. When tracing a simple count(*) on the entire table (this is on 11.2.0.3, which gives less information) it shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, doing direct path reads, and chkpt is set to 1 (I think indicating the need to checkpoint), which seems logical, if my session wants to do a direct path read of modified blocks.

Now this is how it looks like when I update 50% of the table:
First select count(*) from table:
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd…I first do a direct path read, and the second time I am not doing a no-direct alias buffered read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so should be buffered.

Actually, a hint is visible in the first run. If we were to do a direct path read, how come ckpt: 0? I can not see how it would be possible to do a direct path scan when there are changes on blocks in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait events ‘cell multilbock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ from the nsmtio trace, this is actually doing a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle can and will not guarantee anything, but this way the tracing could tell the wrong story.

Conclusions

The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read for full table scan into the buffercache.
– A segment that is bigger than 500% of the buffercache is always scanned for read for full table scan via direct path.
– A segment that is sized between _small_table_threshold and 500% of the buffer cache is medium and could be full table scanned using direct path reads and using buffered reads.
– The tracing on 12.1.0.2 gives a hint there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and bigger than it. This is because of the function kcbcmt1 showing aging/timing information on the blocks when a segment is smaller than MTT.
– For a medium sized segment, a scan for reading a full table scan is done via direct path, unless there are more than 99% of the blocks in the cache.
– For a medium sized segment, a scan for reading a full table scan is done via the buffercache if the amount of blocks that is “dirty” is 50% or more.

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.

Every DBA working with the Oracle database must have seen memory dumps in tracefiles. It is present in ORA-600 (internal error) ORA-7445 (operating system error), system state dumps, process state dumps and a lot of other dumps.

This is how it looks likes:

Dump of memory from 0x00007F06BF9A9E00 to 0x00007F06BF9ADE00
7F06BF9A9E00 0000C215 0000001F 00000CC1 0401FFFF  [................]
7F06BF9A9E10 000032F3 00010003 00000002 442B0000  [.2............+D]
7F06BF9A9E20 2F415441 31323156 4F2F3230 4E494C4E  [ATA/V12102/ONLIN]
7F06BF9A9E30 474F4C45 6F72672F 315F7075 3735322E  [ELOG/group_1.257]
7F06BF9A9E40 3336382E 36313435 00003338 00000000  [.863541683......]
7F06BF9A9E50 00000000 00000000 00000000 00000000  [................]

The first column is the memory location in hexadecimal.
The second to fifth columns represent the actual memory values in hexadecimal.
The sixth column shows an ASCII representation of the memory contents. If a position does not represent an ASCII character, a dot (“.”) is printed.

Actually, the values in the second to fifth column are grouped in four columns. This is how the values in a column look like:
{hex val}{hex val}{hex val}{hex val}, for example: 00010203 means: 0, 1, 2, 3.

In the ASCII representation (sixth column) the spaces after every four values are not put in.

However, look at the following line:

7F06BF9A9E10 000032F3 00010003 00000002 442B0000  [.2............+D]

And focus on the last four characters:
“..+D” (two non-printables, plus, D)
Now look at the corresponding memory contents from the dump:
“442B0000″ This is: “44 2B 00 00″, which should correspond to “. . + D”.
There is something the matter here: the plus and the D seem to be represented by “00”. That’s not correct.

Let’s see what “442B0000″ actually represents in ASCI:

$ echo -e "\x44\x2B\x00\x00"
D+

Ah! That looks backwards! Let’s take a full line and see what that gives:
(This is the line with memory address 0x7F06BF9A9E20)

$ echo -e "\x2F\x41\x54\x41 \x31\x32\x31\x56 \x4F\x2F\x32\x30 \x4E\x49\x4C\x4E"
/ATA 121V O/20 NILN

So if you want to look at the actual memory contents, you need to start with the column on the left side, read the values from right to left, then go the next column, etc.

Endianness
Actual, I asked my friend Philippe Fierens for a trace file from a SPARC (big endian) platform, to see if the endianness of the platform was causing this. I test my stuff on Linux, which is little endian.

Here’s a little snippet:

Dump of memory from 0xFFFFFFFF7D977E00 to 0xFFFFFFFF7D97BE00
FFFFFFFF7D977E00 15C20000 00000001 00000000 00000104  [................]
FFFFFFFF7D977E10 F4250000 00000000 0B200400 E2EB8A3D  [.%....... .....=]
FFFFFFFF7D977E20 44475445 53540000 32F6D98B 00000590  [DGTEST..2.......]
FFFFFFFF7D977E30 00004000 00000001 00000000 00000000  [..@.............]
FFFFFFFF7D977E40 00000000 00000000 00000000 00000000  [................]

Let’s test the line from address 0xFFFFFFFF7D977E20:

[oracle@bigmachine [v12102] trace]$ echo -e "\x44\x47\x54\x45 \x53\x54\x00\x00 \x32\xF6\xD9\x8B \x00\x00\x05\x90"
DGTE ST 2� �

So, the endianness determines how the raw memory contents should be read.

This is the 4th post in a series of posts on PGA behaviour of Oracle. Earlier posts are: here (PGA limiting for Oracle 12), here (PGA limiting for Oracle 11.2) and the quiz on using PGA with AMM, into which this blogpost dives deeper.

As laid out in the quiz blogpost, I have a database with the following specifics:
-Oracle Linux x86_64 6u6.
-Oracle database 11.2.0.4 PSU 4
-Oracle database (single instance) with the following parameter set: memory_target=1G. No other memory related parameters set.

In this setup, I run the pga_filler script (source code here), which creates a collection until the session statistic ‘session pga memory’ exceeds the grow_until variable, which for this case I set to 2100000000 (approximately 2.1G).

So: the instance is set to have AMM (memory_target) with a size of 1GB, which is supposed to be the total amount memory which this instance uses, and a session runs a PL/SQL procedure which only stops if it has allocated 2.1GB, which is clearly more than configured with the memory_target parameter. Please mind a collection, which the anonymous procedure uses to allocate memory, is outside of the memory areas for which Oracle can move data to the assigned temporary tablespace (sort, hash and bitmap memory areas).

After startup of the instance with only memory_target set to 1G, the memory partitioning looks like this:

SYS@v11204 AS SYSDBA> select component, current_size/power(1024,2), last_oper_type from v$memory_dynamic_components where current_size != 0;

COMPONENT							 CURRENT_SIZE/POWER(1024,2) LAST_OPER_TYP
---------------------------------------------------------------- -------------------------- -------------
shared pool										168 STATIC
large pool										  4 STATIC
java pool										  4 STATIC
SGA Target										612 STATIC
DEFAULT buffer cache									424 INITIALIZING
PGA Target										412 STATIC

This is how v$pgastat looks like:

SYS@v11204 AS SYSDBA> select * from v$pgastat;

NAME								      VALUE UNIT
---------------------------------------------------------------- ---------- ------------
aggregate PGA target parameter					  432013312 bytes
aggregate PGA auto target					  318200832 bytes
global memory bound						   86402048 bytes
total PGA inuse 						   78572544 bytes
total PGA allocated						   90871808 bytes
maximum PGA allocated						   93495296 bytes
total freeable PGA memory					    2818048 bytes
process count								 57
max processes count							 58
PGA memory freed back to OS					    3211264 bytes
total PGA used for auto workareas					  0 bytes
maximum PGA used for auto workareas					  0 bytes
total PGA used for manual workareas					  0 bytes
maximum PGA used for manual workareas					  0 bytes
over allocation count							  0
bytes processed 						    8479744 bytes
extra bytes read/written						  0 bytes
cache hit percentage							100 percent
recompute count (total) 						 18

SYS@v11204 AS SYSDBA> show parameter pga

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target		     big integer 0

Okay, so far so good. v$memory_dynamic_components shows the PGA Target being 412M, and v$pgastat shows the aggregate PGA target setting being 412M too. I haven’t set pga_aggregate_target (as shown with ‘show parameter pga’), because I am using memory_target/AMM for the argument I hear the most in favour of it: one knob to tune.

Next up, I start the pga_filler script, which means the session starts to allocate PGA.

I keep a close watch using v$pgastat:

SYS@v11204 AS SYSDBA> select * from v$pgastat;

NAME								      VALUE UNIT
---------------------------------------------------------------- ---------- ------------
aggregate PGA target parameter					  432013312 bytes
aggregate PGA auto target					  124443648 bytes
global memory bound						   86402048 bytes
total PGA inuse 						  296896512 bytes
total PGA allocated						  313212928 bytes
maximum PGA allocated						  313212928 bytes

This shows the pga_filler script in progress by looking at v$pgastat from another session. The total amount of PGA allocated has grown to 313212928 (298M) here.

A little while later, the amount of PGA taken has grown beyond the PGA target (only relevant rows):

total PGA inuse 						  628974592 bytes
total PGA allocated						  645480448 bytes
maximum PGA allocated						  645480448 bytes

However, when looking at the memory components using v$memory_dynamic_components, it gives the impression PGA memory is still 412M:

SYS@v11204 AS SYSDBA> select component, current_size/power(1024,2), last_oper_type from v$memory_dynamic_components where current_size != 0;

COMPONENT							 CURRENT_SIZE/POWER(1024,2) LAST_OPER_TYP
---------------------------------------------------------------- -------------------------- -------------
shared pool										168 STATIC
large pool										  4 STATIC
java pool										  4 STATIC
SGA Target										612 STATIC
DEFAULT buffer cache									424 INITIALIZING
PGA Target										412 STATIC

You could argue PGA is explicitly mentioned as ‘PGA Target’, but then: the total of the memory area’s (PGA Target+SGA Target) do show a size that roughly sums up to be equal to the memory_target.

A little while later, this is what v$pgastat is showing:

total PGA inuse 						  991568896 bytes
total PGA allocated						 1008303104 bytes
maximum PGA allocated						 1008303104 bytes

Another glimpse at v$memory_dynamic_components shows the same output as above, PGA Target at 412M. This is the point where it get’s a bit weird: the total amount of PGA memory (according to v$pgastat) shows it’s almost 1G, memory_target is set at 1G, and yet v$memory_dynamic_components show no change at all.

Again a little further in time:

total PGA inuse 						 1325501440 bytes
total PGA allocated						 1342077952 bytes
maximum PGA allocated						 1342077952 bytes

Okay, here it get’s really strange: there’s more memory allocated for PGA memory alone than has been set with memory_target for both PGA and SGA memory structures. Also, v$memory_dynamic_components shows no change in SGA memory structures or exchange of memory from SGA to PGA memory.

If v$pgastat is correct, and memory_target actively limits the total amount of both SGA and PGA, then the session must allocate memory out of thin air! But I guess you already came to the conclusion too that either v$pgastat is incorrect, or memory_target does not limit memory allocations (as at least I think it would do).

Let’s dump the PGA heap of the active process to see the real memory allocations of this process:

SYS@v11204 AS SYSDBA> oradebug setospid 9041
Oracle pid: 58, Unix process pid: 9041, image: oracle@bigmachine.local (TNS V1-V3)
SYS@v11204 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11204 AS SYSDBA> oradebug dump heapdump 1
Statement processed.

(9041 is the PID of the process running PL/SQL)

Now look into (the relevant) data of the PGA heap dump:

[oracle@bigmachine [v11204] trace]$ grep Total\ heap\ size v11204_ora_9041.trc
Total heap size    =1494712248
Total heap size    =    65512
Total heap size    =  1638184

Okay, this is clear: the process actually took 1494712248 (=1425M) plus a little more memory. So, memory_target isn’t that much of a hard setting after all.

But where does this memory come from? There ought to be a sort of combined memory effort together with the SGA for memory, right? That was the memory_target promise!

Let’s take a look at the actual memory allocations of a new foreground process in /proc/PID/maps:

[oracle@bigmachine [v11204] trace]$ less /proc/11405/maps
00400000-0bcf3000 r-xp 00000000 fc:02 405855559                          /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/oracle
0bef2000-0c0eb000 rw-p 0b8f2000 fc:02 405855559                          /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/oracle
0c0eb000-0c142000 rw-p 00000000 00:00 0
0c962000-0c9c6000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:10 351997                             /dev/shm/ora_v11204_232652803_0
60001000-60400000 rw-s 00001000 00:10 351997                             /dev/shm/ora_v11204_232652803_0
...
9fc00000-a0000000 rw-s 00000000 00:10 352255                             /dev/shm/ora_v11204_232685572_252
a0000000-a0400000 rw-s 00000000 00:10 354306                             /dev/shm/ora_v11204_232718341_0
3bb3000000-3bb3020000 r-xp 00000000 fc:00 134595                         /lib64/ld-2.12.so
3bb321f000-3bb3220000 r--p 0001f000 fc:00 134595                         /lib64/ld-2.12.so
3bb3220000-3bb3221000 rw-p 00020000 fc:00 134595                         /lib64/ld-2.12.so
3bb3221000-3bb3222000 rw-p 00000000 00:00 0
3bb3400000-3bb3401000 r-xp 00000000 fc:00 146311                         /lib64/libaio.so.1.0.1
...
3bb5e16000-3bb5e17000 rw-p 00016000 fc:00 150740                         /lib64/libnsl-2.12.so
3bb5e17000-3bb5e19000 rw-p 00000000 00:00 0
7f018415a000-7f018416a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018416a000-7f018417a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018417a000-7f018418a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018418a000-7f018419a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018419a000-7f01841aa000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841aa000-7f01841ba000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841ba000-7f01841ca000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841ca000-7f01841da000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841da000-7f01841ea000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841ea000-7f01841fa000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841fa000-7f018420a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018420a000-7f018421a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018421a000-7f018422a000 rw-p 00000000 00:05 1030                       /dev/zero
7f68d497b000-7f68d4985000 r-xp 00000000 fc:02 268585089                  /u01/app/oracle/product/11.2.0.4/dbhome_1/lib/libnque11.so
...

When I run the pga_filler anonymous PL/SQL block, and strace (system call trace) utility, I see (snippet):

mmap(0x7f0194f7a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194f7a000
mmap(0x7f0194f8a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194f8a000
mmap(0x7f0194f9a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194f9a000
mmap(0x7f0194faa000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194faa000
mmap(0x7f0194fba000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194fba000
mmap(0x7f0194fca000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194fca000
mmap(0x7f0194fda000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194fda000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 6, 0xea000) = 0x7f0194e6a000
mmap(0x7f0194e6a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194e6a000
mmap(0x7f0194e7a000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194e7a000
mmap(0x7f0194e9a000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194e9a000
mmap(0x7f0194eba000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194eba000

So, when looking back, it’s very easy to spot the SGA memory, which resides in /dev/shm in my case, and looks like ‘/dev/shm/ora_v11204_232652803_0′ in the above /proc/PID/maps snippet.
This means that the mmap() calls are simply, as anyone would have guessed by now, the PGA memory allocations. In the maps snippet these are visible as being mapped to /dev/zero.
When looking at the mmap() call, at the 5th argument, which is the number 6, we look at a file descriptor. In /proc/PID/fd the file descriptors can be seen, and file descriptor 6 is /dev/zero, as you probably suspected. This way the allocated memory is initial set to zero.

By now, the pga_filler script finishes:

TS@v11204 > @pga_filler
begin pga size : 3908792
last  pga size : 2100012216
begin uga size : 1607440
last  uga size : 2000368
parameter pat  : 0

Taking the entire 2.1G I made the collection to grow to. With memory_target set to 1G.

Conclusion
The first conclusion I made is that PGA memory is very much different than SGA/shared memory. Anyone with a background in Oracle operating-system troubleshooting will find this quite logical. However, the “promise” AMM/memory_target made, in my interpretation, is that the memory would be used seamless. This is simply not the case. Shared memory is in /dev/shm, and PGA is mmaped/allocated as private memory.

Still, this wouldn’t be that much of an issue if memory_target would limit memory in a rigid way, and memory could, and actually would, very easily float between PGA and SGA. It simply doesn’t.

Why don’t we see Oracle trying to reallocate memory? This is the point where I can only guess.

– Probably, Oracle would try to grow the shared pool if it has problems allocating memory for SQL, library cache, etc. This probably hasn’t happened in my test.
– Probably, Oracle would try to grow the buffer cache if it can calculate a certain benefit from enlarging it. This probably hasn’t happened in my test.
– The other SGA area’s (large and java pool) probably are grown if these are used, and need more space for allocations. This probably didn’t happen in my test.
– For the PGA, a wild guess is the memory manager calculates using the workarea sizes (sort, hash and bitmap areas), which are not noticeably used in my test.

Another conclusion and opinion is AMM/memory_target is not a set once and forget option. In fact, it isn’t that much of a difference from using ASMM from a DBA perspective: you carefully need to understand the SGA size, and you carefully need to (try to) manage the PGA memory. Or reasoned the other way around: the only way you can sensibly set memory_target is if you know the correct SGA size and the PGA usage. Also having Oracle manage the memory area’s automatically is not unique to AMM: Oracle will reallocate (inside the SGA) if it finds it necessary, with AMM, ASMM and even manual set memory area’s. But the big dis-advantage of AMM (at least on linux, not sure about other operating systems) is that huge pages can’t be used, which has a severe impact on “real life” databases, in my experience. (Solaris CAN use huge pages with AMM(!)).

A final word: of course I tested a very specific situation. In most real-life cases there will be multiple sessions, and the PGA manageable memory areas will be used. However, the point I try to make is memory_target is simply not a way to very easily make your database be hard limited to the value set. Probably, in real life, the real amount of memory used by the instance will in the area of the value set with memory_target, but this will be subject to what memory areas you are exactly using. Of course it can differ in a spectaculair way if collections or alike structures are used by a large number of sessions.

This is a series of blogposts on how the Oracle database makes use of PGA. Earlier posts can be found here (PGA limiting for Oracle 12) and here (PGA limiting for Oracle 11.2).

Today a little wednesday fun: a quiz.

What do you think will happen in the following situation (leave a response as comment please!):

-Oracle Linux x86_64 6u6.
-Oracle database 11.2.0.4 PSU 4
-Oracle database (single instance) with the following parameter set: memory_target=1G. No other memory related parameters set.

Run the pga_filler script (which can be found here (PGA limiting for Oracle 12)), with grow_until set to 2100000000 (approximately 2.1G).

I’ll try to create a blogpost on the outcome and an explanation on short notice!

This is the second part of a series of blogpost on Oracle database PGA usage. See the first part here. The first part described SGA and PGA usage, their distinction (SGA being static, PGA being variable), the problem (no limitation for PGA allocations outside of sort, hash and bitmap memory), a resolution for Oracle 12 (PGA_AGGREGATE_LIMIT), and some specifics about that (it doesn’t look like a very hard limit).

But this leaves out Oracle version 11.2. In reality, the vast majority of the database that I deal with at the time of writing is at version 11.2, and my guess is that this is not just the databases I deal with, but a general tendency. This could change in the coming time with the desupport of Oracle 11.2, however I suspect the installed base of Oracle version 12 to increase gradually and smoothly instead of in a big bang.

With version 11.2 there’s no PGA_AGGREGATE_LIMIT. This simply means there is no official way to limit the PGA. Full stop. However, there is an undocumented event to limit PGA usage: event 10261. This means that if you want to use this in a production database, you should ask Oracle support to bless the usage of it. On the other hand, Oracle corporation made this event public in an official white paper: Exadata consolidation best practices.

Let’s test event 10261! I’ve got the same table (T2) setup, a description how to set this up, and the anonymous PL/SQL code to allocate PGA using a collection is in the first part. I am using a database version 11.2.0.4 with PSU 4 applied. The reason for choosing this version is that if you run a serious business on Oracle 11.2, THAT should be the version you should be running on!
(disclaimer: everything shown in this blogpost is purely for educational purposes. Do test everything thoroughly before applying this to a production system. Behaviour can or may be different in your specific situation)
The reason for this disclaimer: Bernhard (@bdcbuning_gridit) tweeted that he was warned that when setting it at the instance level, it could crash the instance. I am not sure if this means setting it at runtime, this event is always evaluated at the instance level.

Okay, let’s replicate more or less the test done to Oracle version 12.1.0.2 in the first part. In this database PGA_AGGREGATE_SIZE is set to 500M, now let’s try to set the event to 600M, which means we set the PGA limit to 600M:
This is setting the event on runtime:

SYS@v11204 AS SYSDBA> alter system set events = '10261 trace name context forever, level 600000';

System altered.

This is setting the event in the spfile (which means you need a restart of the instance to activate this event, or the above syntax to set it on runtime):

SYS@v11204 AS SYSDBA> alter system set event = '10261 trace name context forever, level 600000' scope=spfile;

System altered.

The level is the amount of memory to which the PGA must be limited, in kilobytes.

Now start the anonymous PL/SQL block to fill up the PGA with a collection, again set to 900M:

TS@v11204 > @pga_filler
declare
*
ERROR at line 1:
ORA-10260: limit size (600000) of the PGA heap set by event 10261 exceeded
ORA-06512: at line 20

That’s nice! There’s actually a meaningful, describing error message which explains why this PL/SQL block ended!

Let’s look at the actual PGA memory used, as reported by v$pgastat:

SYS@v11204 AS SYSDBA> select value/power(1024,2) from v$pgastat where name = 'maximum PGA allocated';

VALUE/POWER(1024,2)
-------------------
	 676.078125

This is different than setting PGA_AGGREGATE_LIMIT, however there’s still more memory allocated than set as the limit (600000KB), but lesser (676M in 11.2.0.4 versus 1041M in 12.1.0.2). The outside visibility of the limiting happening is different too: there is NO notice of a process hitting the PGA limit set in the alert.log file nor the process’ trace file(!). Another difference is even SYS is limited, a test with the procedure running as SYS gotten me the ORA-10260 too, PGA_AGGREGATE_LIMIT does not limit SYS.

Event 10261 has got the same description to at least as low as version 11.2.0.1. Here’s a test with with the event 10261 set at version 11.2.0.3 to 600M:

TS@v11203 > @pga_filler
declare
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [723], [123552], [top uga heap], [], [], [], [], [], [], [], [], []
ORA-06512: at line 20

As has been detailed in the Oracle white paper, prior to version 11.2.0.4, an ORA-600 [723] is signalled when event 10261 is set, and more PGA memory is allocated as has been specified as limit. The amount of total allocated PGA is 677M, so roughly the same as with version 11.2.0.4.

Because this is a genuine ORA-600 (internal error, ‘OERI’), this gives messages in the alert.log file:

Tue Dec 16 10:40:09 2014
Errors in file /u01/app/oracle/diag/rdbms/v11203/v11203/trace/v11203_ora_8963.trc  (incident=9279):
ORA-00600: internal error code, arguments: [723], [123552], [top uga heap], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/v11203/v11203/incident/incdir_9279/v11203_ora_8963_i9279.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

The process’ trace file in the trace directory only points to the incident file, no further details are available there.
The incident trace file contains a complete diagnostics dump.

The behaviour is identical with Oracle 11.2.0.2.

Summary
The limiting of the total amount of PGA memory used must be done using an undocumented event prior to Oracle version 12. The event is 10261. The event is made known in an official white paper. Still I would open a service request with Oracle to ask blessing for setting this. This does not mean this functionality is not needed, I would deem it highly important in almost any environment, even when running a single database: this setting, when done appropriately, protects your system from over allocating memory, which could mean entering the swapping death-spiral. The protection means a process gets an ORA message, and the PGA allocation aborted and deallocated.

With version 11.2.0.4 hitting the limit as set with event 10261 is not published, outside of the process getting the ORA-10260.

With versions prior to 11.2.0.4 (11.2.0.3 and 11.2.0.2 verified) processes do get an ORA-600 [723], which is also visible in the alert.log, and incidents are created accordingly.

When a limit has been set using event 10261, it still means more memory is allocated than set as limit (approximately 677M when 600M is set), but this is way less than with the PGA_AGGREGATE_LIMIT (1041M when 600M is set) in my specific situation. Test this in your own environment when you start using this.

Important addendum:
A very good comment to emphasise on the behaviour of using/setting event 10261 by Alexander Sidorov: this event sets a limit per process, not for the entire instance!! (tested with 11.2.0.4 and 11.2.0.3)

This post is about memory management on the operating system level of an Oracle database. The first question that might pop in your head is: isn’t this a solved problem? The answer is: yes, if you use Oracle’s AMM (Automatic Memory Management) feature, which let’s you set a limit for the Oracle datababase’s two main memory area’s: SGA and PGA. But in my opinion any serious, real life, usage of an Oracle database on Linux will be (severely) constrained in performance because of the lack of huge pages with AMM, and I personally witnessed very strange behaviour and process deaths with the AMM feature and high demand for memory.

This means that I strongly advise customers to use Oracle’s ASMM (Automatic Shared Memory Management) feature. In the newer versions of 11.2 I found this to be working very well. Earlier versions like 10.2 could suffer from an ever growing shared pool (which also means an ever shrinking buffer cache), especially when bind variables weren’t used. This still could happen, but it seems the SGA memory management feature in 11.2 handles this well in most cases. The ASMM feature means a fixed memory area is allocated for the SGA. SGA allocation has always been fixed outside of the AMM feature, as far as I know.

When ASMM doesn’t work, meaning the memory areas are getting sized wrong and performance is influenced by that, the last option is to size the memory area’s yourself. However, since version 11.2.0.2 Oracle will resize when the memory manager thinks it’s feasible. See Kurt van Meerbeek’s article about that.

That leaves the PGA (Process Global Area) as a memory area on itself. Most databases are using the automatic PGA memory management, which is enabled once the PGA_AGGREGATE_TARGET parameter is set to a non zero value. A common misunderstanding is this setting is actually limiting the overall PGA usage of an instance. The truth is automatic PGA memory management will make attempts to adhere to the PGA_AGGREGATE_TARGET value. These are the actual words in the official Oracle documentation: ‘attempts to adhere’!

This means sort memory, hash memory and bitmap memory will be actively limited in size per process by automatic PGA memory management, any attempt to allocate more than automatic PGA memory management allows will result in moving some contents of these memory areas to the assigned temporary tablespace of the database user, to make room for new data.

However, there are more memory area’s allocatable per process, which are never swapped to disk, thus always will stay in memory, and these could not be limited in an officially supported way prior to Oracle version 12. Two structures which are allocated in PGA and never swapped to disk are PL/SQL collections and PL/SQL tables. Creating and filling these requires the usage of PL/SQL (hence their names); the reason for mentioning this is that if your database is not used by PL/SQL but only SQL, you almost certainly will not run into the problem I describe below.

You might be thinking: wait a minute! Does this mean a developer can just create such a structure, and allocate whatever he/she likes, with all the consequences that it can have, like the operating system starting to swap, and can do that for every single process? Yes, this is what this means. This is why Oracle introduced a parameter called PGA_AGGREGATE_LIMIT with Oracle 12, to effectively limit the overall PGA heap size.

In case you wonder what this means, or even doubting my words, I have written a little program to demonstrate this behaviour.

This is the source code to create my test table T2:

exec dbms_random.seed('abracadabra');
create table t2
as
with generator as (
    select      rownum      id
    from        dual
    connect by
                rownum <= 1000
)
select
    rownum                                                id,
    trunc((rownum-1)/50)                            clustered,
    mod(rownum,20000)                               scattered,
    trunc(dbms_random.value(0,20000))               randomized,
    trunc(sysdate) + dbms_random.value(-180, 180)   random_date,
    dbms_random.string('l',6)                       random_string,
    lpad(rownum,10,0)                               vc_small,
    rpad('x',100,'x')                               vc_padding
from
    generator   g1,
    generator   g2
where
    rownum <= 1000000
;
exec dbms_stats.gather_table_stats(null,'T2');

This is a very smart way to generate a table. I actually borrowed this from Jonathan Lewis.

Next up, I created a small anonymous PL/SQL block to take the contents from the T2 table, and store them in a collection until I hit the limit in the variable ‘grow_until’.

declare
	type sourcetab is table of t2%ROWTYPE;
	c_tmp		sourcetab;
	c_def		sourcetab	:= sourcetab();
	v_b_p		number		:= 0;
	v_c_p		number		:= 0;
	v_b_u		number		:= 0;
	v_c_u		number		:= 0;
	grow_until	number		:= 700000000;
	p_a_t		number;
begin
	select value into v_b_p from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session pga memory max';
	select value into v_b_u from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session uga memory max';
	select value into p_a_t from v$parameter where name = 'pga_aggregate_target';
	select * bulk collect into c_tmp from t2;
	while v_c_p < grow_until loop
		for c in c_tmp.first .. c_tmp.last loop
			c_def.extend(1);
			c_def(c_def.last) := c_tmp(c);
			select value into v_c_p from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session pga memory max';
			select value into v_c_u from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session uga memory max';
			if v_c_p >= grow_until then
				exit;
			end if;
		end loop;
	end loop;
	dbms_output.put_line('vbp : '||v_b_p);
	dbms_output.put_line('vcp : '||v_c_p);
	dbms_output.put_line('vbu : '||v_b_u);
	dbms_output.put_line('vcu : '||v_c_u);
	dbms_output.put_line('pat : '||p_a_t);
end;
/

Please mind the session needs to have create table, create session granted, enough quota in the default tablespace and select on v_$mystat, v_$parameter and v_$statname granted.

This is run on an Oracle 12.1.0.2 database:

TS@v12102 > @pga_filler
vbp : 3535368
vcp : 700051976
vbu : 1103192
vcu : 4755704
pat : 524288000

PL/SQL procedure successfully completed.

The begin sizes of the UGA (vbu) and PGA (vbp) are 1’103’192 and 3’535’368. The PGA_AGGREGATE_TARGET size is set to 524’288’000 (500MB). I did set the grow_until variable to 700’000’000 (roughly 700MB), which is more than PGA_AGGREGATE_TARGET. After running this, it’s easy to spot the values of vcu (UGA allocation) and vcp (PGA allocation). vcu grew to 4’755’704 during the run, however vcp grew to 700’051’976, a little more than 700MB! This shows that the collection is stored in the PGA, and that the collection grew beyond the value set with PGA_AGGREGATE_TARGET.

This behaviour is consistent in versions 12.1.0.1, 11.2.0.4, 11.2.0.3, 11.2.0.2 and 11.2.0.1.

Let me emphasise once again that the above proof of concept code managed to allocate more memory than was set for the overall PGA usage of the entire instance. This can have an enormous, devastating impact on a consolidated database setup (meaning having multiple instances running on a single machine). Typically, once memory consumption of all the processes exceeds physically available memory, the operating system tries to use the swap device, to which it will swap memory pages in and out depending on memory usage of active (=on CPU) processes. Mild swapping shows as severely slowed-down processing (because a number of memory pages for processing need to be read from the swap device and placed in memory, from which the former contents need to be written to the swap device), heavy swapping shows as the machine coming down to a standstill.

Please mind that a diagnosis on the state of memory usage (alias swapping), just by looking at the amount of used swap (as can be seen in the ‘top’ output, or ‘swapon -s’) could be misleading. It’s also important to look at actual swapping in and out, as can be seen with ‘vmstat 1′ (si/so columns) or swap -W. I’ve found several systems which had been running for some time (approximately longer than a month) that had swap usage, sometimes up to 40%, while no ‘active swapping’, so memory pages being transfered to and from the swap device, was happening.

Luckily, starting with Oracle 12 you can actually limit overall PGA usage using the parameter PGA_AGGREGATE_LIMIT. The default value is the greater of (list from Oracle documentation):
a) 2GB
b) 200% of PGA_AGGREGATE_TARGET parameter (or lower if 200% > (90% of physical memory – total SGA size) but not below 100%)
c) 3MB * PROCESSES parameter
The parameter can not set below it’s default value, except when set in a pfile or spfile.

Let’s set the PGA_AGGREGATE_LIMIT to 600MB and see what happens when we start doing a large allocation again:

SQL> alter system set pga_aggregate_limit=600m scope=spfile;

System altered.

SQL> startup force;

Okay, let’s run the pga_filler.sql script again, and try to allocate 900MB. This means the “grow_until” variable must be set to 900000000.
PLEASE MIND this is done as a regular user, the SYS user and background processes other than job queue processes are not subject to the limiting.

TS@v12102 > @pga_filler
declare
*
ERROR at line 1:
ORA-01423: error encountered while checking for extra rows in exact fetch
ORA-00039: error during periodic action
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
ORA-06512: at line 21

Great! Exactly like we expect, right?
Well…yes, but let’s look at the alert.log

Sat Dec 13 15:08:57 2014
Errors in file /u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc  (incident=46599):
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
Incident details in: /u01/app/oracle/diag/rdbms/v12102/v12102/incident/incdir_46599/v12102_ora_4147_i46599.trc
Sat Dec 13 15:09:07 2014
Dumping diagnostic data in directory=[cdmp_20141213150907], requested by (instance=1, osid=4147), summary=[incident=46599].
Sat Dec 13 15:09:09 2014
Sweep [inc][46599]: completed
Sweep [inc2][46599]: completed

Okay, essentially, this tells us nothing interesting, except for the tracefile. Let’s look in/u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc, being the tracefile as indicated in the above alert.log snippet:

*** 2014-12-13 15:08:57.351
Process may have gone over pga_aggregate_limit
Just allocated 65536 bytes
Dumping short stack in preparation for potential ORA-4036
----- Abridged Call Stack Trace -----
ksedsts()+244<-ksm_pga_limit_short_stack()+1016<-ksm_check_over_limit()+469<-ksmarfg()+574<-kghgex()+1376<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghalf()+1003<-klmalf()+103
<-kllcqas()+194<-kcblasm()+108<-kxhfNewBuffer()+607<-qerhjSplitBuild()+632
----- End of Abridged Call Stack Trace -----
=======================================
PRIVATE MEMORY SUMMARY FOR THIS PROCESS
---------------------------------------
******************************************************
PRIVATE HEAP SUMMARY DUMP
781 MB total:
   781 MB commented, 646 KB permanent
   208 KB free (0 KB in empty extents),
     779 MB,   2 heaps:   "koh-kghu call  "            57 KB free held
------------------------------------------------------
Summary of subheaps at depth 1
779 MB total:
   778 MB commented, 110 KB permanent
    63 KB free (0 KB in empty extents),
     667 MB, 42786 chunks:  "pmuccst: adt/record       "
      83 MB, 5333 chunks:  "pl/sql vc2                "

Actually, this is the end of the tracefile. It seems that the pga limit dump (the text in between “Process may have gone over pga_aggregate_limit” to the private memory summary heap dumps) occurs several times before an actual ORA-4036 is triggered. In my private test instance, where I am obviously the only user process doing something, I get a pga limit dump approximately 20 times before the ORA-4036 is actually triggered:

sending 4036 interrupt
Incident 46599 created, dump file: /u01/app/oracle/diag/rdbms/v12102/v12102/incident/incdir_46599/v12102_ora_4147_i46599.trc
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT

Did you actually spot the oddity here?

Remember the PGA_AGGREGATE_LIMIT was set to 600M. Now look at the process’ PGA/Private heap summary dump above: it says 781M. Please mind the 781M is the PGA heap of a SINGLE process! When looking at the total PGA allocated for the entire instance, it’s even more:

SYS@v12102 AS SYSDBA> select value/power(1024,2) "MB" from v$pgastat where name = 'maximum PGA allocated';

	MB
----------
1041.16699

So…despite PGA_AGGREGATE_LIMIT set to 600M, according to the v$pgastat view, there’s 1041MB allocated for PGA. Please mind I haven’t looked into how accurate v$pgastat is, but I tend to believe this.

Summary.
I’ve seen PGA_AGGREGATE_TARGET being used as a calculation value for actual PGA usage of an instance. This is simply wrong. The actual amount of PGA memory allocated by the instance is highly depended on what is done, and can be less than PGA_AGGREGATE_TARGET, or more. Automatic PGA can control three per process memory area’s: the sort, hash and bitmap memory area’s. These are sized based on the setting of PGA_AGGREGATE_TARGET and the actual PGA memory usage instance wide. If more memory is needed for sort, hash or bitmap memory than is made available by the memory manager, excess memory needed is allocated in the temporary tablespace. Any other PGA memory allocation is always done, regardless of the setting of PGA_AGGREGATE_TARGET.

Starting with Oracle 12, it seems the actual PGA allocation now can actually be limited with the new parameter PGA_AGGREGATE_LIMIT. However, during some simple testing it shows that actually more memory is allocated than set with PGA_AGGREGATE_LIMIT as limit. I haven’t tested it in more situations, this post is meant to grow awareness that the actual limit as set by PGA_AGGREGATE_LIMIT might not be that hard as you would expect.

Please mind, PGA_AGGREGATE_LIMIT seems to truly limit PGA usage instance wide, not limit the PGA heap per process, as event 10251 (PGA usage limiting way for Oracle 11.2) does. However, once again: PGA_AGGREGATE_LIMIT seems to try to be smart and actually does not limit at the exact size set, but beyond that.

The next post will introduce a way to limit PGA usage in Oracle 11.2. Stay tuned!

(the details are investigated and specific to Oracle’s database implementation on Linux x86_64)

Exadata IO: This event is not used with Exadata storage, ‘cell single block physical read’ is used instead.
Parameters:
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I haven’t seen a db file sequential read event that read more than one block ever. Of course this could change in a newer release.

Implementation:
One of the important things to realise here is that regardless of asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses a pread() systemcall, so synchronous IO for reading blocks which are covered with this event. If you realise what the purpose of fetching the single block is in most cases: fetching a single database block which contents are necessary in order to continue processing, it should become apparent that issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* view lists both SMALL_READ_REQS, SMALL_SYNC_READ_REQS and SMALL_READ_SERVICETIME, SMALL_SYNC_READ_LATENCY, to make a distinction between SYNC (pread()) reads and non-sync (thus asynchronous) calls, using the io_submit()-io_getevents() call combination.

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait always is recorded, in other words: there is a strict relation between the event and the physical IO. Just to be complete: if a block needed is already in the Oracle database buffer cache, no wait event is triggered and the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used (filesystemio_options is set to ‘none’ or ‘async’ when using a filesystem, ASM (alias “Oracle managed raw devices”) is inherently direct path IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!), then filesystemio_options is obeyed)), the block could very well be coming from the filesystem cache of linux. In fact, without DIO a phenomenon known as ‘double buffering’ takes place, which means the IO doesn’t happen to it’s visible disk devices directly, but it needs to take a mandatory step in between, done at the kernel level, which means the data is put in the filesystem cache of linux too. It should be obvious that this extra work comes at the cost of extra CPU cycles being used, and is in almost any case unnecessary.

If you take a step back you should realise this event should take place for a limited amount of blocks during execution. Because of the inherent single block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the IO latency penalty. Even with solid state disk devices, which have an inherently lower latency time because there are no rotating parts and disk heads to be moved, chopping up an operation in tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficient.

The time spend on ‘db file sequential read’ quite accurately times single block IO. This means a direct relationship between ‘db file sequential read’ timings and amount should exist with operating system measured IO statistics (iostat, sar and more).

Follow

Get every new post delivered to your Inbox.

Join 2,490 other followers

%d bloggers like this: