A lot of blogposts and other internet publications have been written on the full segment scan behaviour of a serial process starting from Oracle version 11gR2. This behaviour is the Oracle engine making a decision between scanning the blocks of a segment into the Oracle buffercache or scanning these blocks into the process’ private process global area (PGA). This decision is even more important on the Exadata platform, because the Oracle engine must have made the decision to read the blocks into the process’ PGA in order to be able to do a smartscan. This means that if you are on Oracle 11gR2 already, and thinking about using the Exadata platform, the wait event ‘direct path read’ gives you an indication on how much potentially could be offloaded on Exadata, if you keep all the settings the same.

This blogpost is about looking into full segment scans, and get an understanding when and why the engine changes from buffered reads to direct path reads. Luckily, Oracle provides a (non documented) event to show the decision between buffered and direct path reads. As with most of the trace facilities Oracle provides, the information which the tracing provides is symbolic and requires interpretation before it can be used.

The event is:

alter session set events 'trace[nsmtio]';

(nsmtio: non smart IO)

1. Table too small for direct path read
TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from smalltable;


TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the relevant part of the generated trace:

NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 4 (blocks), Threshold: MTT(18293 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled

Some of the lines of the nsmtio trace are prefixed with ‘NSMTIO’. When the line is prefixed with NSMTIO, the function about which the line prints information is shown. We see two functions here: kcbism and qertbFetch.

The kcbism (this probably means kernel cache buffers is small) line shows some information (here are some of the things that seem logical):
islarge 0: this probably means this is not considered a large object.
nblks 4: the size of the object is 4 blocks.
type 2: oracle’s internal database type number, 2 means table, 1 means index (OBJ$.TYPE#).
kcbisdbfc 0: probably information about the database flash cache.
kcbnbh 182931: kernel cache buffer number of buffer headers; the size of the cache in blocks.
kcbstt 3658: this is the _small_table_threshold value.
keep_nb: number of blocks in the keep cache.
kcbnwp 1: kernel cache buffer number of writer processes. The number of database writer processes.

Then a line for the qertbFetch function:
NoDirectRead: This is very clear: there is no direct read executed, which means the scan is done buffered.
[- STT < OBJECT_SIZE < MTT]: I am not exactly sure what this means to express. It got a minus sign as first, then STT < OBJECT_SIZE < MTT, which I read as "the object's size is bigger than small table threshold, but smaller than medium table threshold", which is not true, because the size of 4 blocks is much smaller than STT.
Obect's size: 4 (blocks), Threshold: MTT(18293 blocks): The typo here is by Oracle. Also this describes MTT, medium table threshold, which is 5 times small table threshold. It says that the threshold is MTT. This is not true, as we will see.
Next, there are a few lines about properties which probably influence the buffered/direct path decision:
_object_statistics: enabled: This is the object's size being determined via the statistics, rather than from the segment header, which can be changed by the parameter _DIRECT_READ_DECISION_STATISTICS_DRIVEN.
Sage: enabled: Sage means Exadata (Storage Appliance for Grid Environments). Starting from version 11, the exadata code base is always enabled. Probably in version 10 this had to be added by adding code, alike linking in functionality such as RAC and SDO (spatial), by running make ins_rdbms.mk rac_on/rac_off, sdo_on/sdo_off, etc.
Direct Read for serial qry: enabled(::::::): this means that IF the segment was big enough, it would be possible to use the direct read functionality for a serial query. If it would have been impossible, in between the colons the reason would be stated.
flashback_table_scan: FALSE: This is not a flashback table scan, which (quite probably) would disable direct path reads.
Row Versions Query: FALSE: No row versioning.
SqlId: dm0hq1419y734, plan_hash_value: 4110451325, Object#: 21979, Parition#: 0 DW_scan: disabled: Most of this speaks for itself. The typo (Parition) is by Oracle again. I am not entirely sure what DW_scan means, there are a number of undocumented parameters related to dw_scan, which describe cooling and warming.

Basically, if you enable the nsmtio trace and you see the line 'NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]', it means Oracle is doing a buffered read because the segment was smaller than small table threshold.

2. Table big enough for direct path full table scan
NSMTIO: kcbism: islarge 1 next 0 nblks 1467796 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 1467796 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbivlo: nblks 1467796 vlot 500 pnb 182931 kcbisdbfc 0 is_large 1
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=914655
Object# = 21980, Object_Size = 1467796 blocks
SqlId = 5ryf9ahvv4hdq, plan_hash_value = 2414078247, Partition# = 0

First the kcbism line which is described above. Here islarge is set to 1. This means the objects is considered too large for being a small segment.
The next line is the kcbimd function, based on this list, a guess for the function name is kernel cache buffers is medium. is_medium is set to 0, this is not a medium size object.
Then kcbivlo, kernel cache buffers is very large object. is_large is set to 1, this is a large object. vlot is listed as ‘500’. This value is set by the parameter _very_large_object_threshold, and means the threshold being 500 percent of the buffercache.
The qertbFetch line says DirectRead, which indicates this object is going to be read via direct path. The reason for doing this is [OBJECT_SIZE>VLOT].
The next line shows the actual size of VLOT (very large object threshold), which is in my case 914655, which is exactly 5*kcbnbh.

When the line ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’ is in the nsmtio trace, the object is bigger than 5 times the size of the buffercache, and the object will be scanned via direct path without any further considerations.

3. Table considered medium size

First let’s take a look when Oracle switches from considering an object to be small to thinking it is medium sized. We already know when Oracle thinks it is big and always will do a direct path read: 5 times the buffercache, which is often referred to as ‘VLOT’.

I prepared a table to be just bigger than STT (which is set to 3658 in my instance):

TS@v12102 > select segment_name, blocks from user_segments where segment_name = 'TESTTAB';
TESTTAB 			     3712

TS@v12102 > alter session set events 'trace[nsmtio]';

Session altered.

TS@v12102 > select count(*) from testtab;

TS@v12102 > alter session set events 'trace[nsmtio] off';

Session altered.

Here is the nsmtio tracing:

NSMTIO: kcbism: islarge 1 next 0 nblks 3668 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 3658 keep_nb 0 kcbnbh 182931 kcbnwp 1
NSMTIO: kcbimd: nblks 3668 kcbstt 3658 kcbpnb 18293 kcbisdbfc 3 is_medium 0
NSMTIO: kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 51716 0 182931 3668 0 0
NSMTIO: kcbivlo: nblks 3668 vlot 500 pnb 182931 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988
ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0
Direct Path for pdb 0 tsn 4  objd 21988 objn 21988
Direct Path 1 ckpt 1, nblks 3668 ntcache 66 ntdist 0
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=914655
Object# = 21988, Object_Size = 3668 blocks
SqlId = 6h8as97694jk8, plan_hash_value = 269898743, Partition# = 0

First of all, we see the segment size considered by kcbism/kcbimd/kcbivlo (nblks) being different than the total number of blocks from dba_segments. Probably only blocks which are truly in use are considered by the code, instead of all the blocks which are allocated to the segment.
On the kcbism line we see ‘islarge 1’ which probably means it is not considered small (sized up to small table threshold) but is larger.
A few lines down the kcbivlo line says it is not large here too (is_large 0), which means larger than VLOT.
This must mean it is considered larger than small, and smaller than large, thus: medium.
Interestingly, the kcbimd line says ‘is_medium 0’.

An important point is the switch to considering doing a direct path read, alias a segment is considered medium sized, is simply when STT is exceeded.

In between the kcbism/kcbimd/kcbivlo lines there is an additional line: kcbcmt1, which seems to measure additional things which could be used for costing.

What is very interesting, and a bit confusing, is the line: NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp). First of all, this line now does NOT show the decision, unlike the same line with segments smaller than STT and bigger than VLOT. Second, [MTT < OBJECT_SIZE < VLOT] indicates the segment being bigger than MTT (5*STT) and smaller than VLOT, which is not true, the segment size is nblks 3668, STT is kcbstt 3658, which means MTT is 18290.

The decision is shown in the line: NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 21988, objn: 21988. Probably kcbdpc means kernel cache buffers direct path choice. As we can see, the choice in this case is DirectRead. The next line is important: ckpt: 1, nblks: 3668, ntcache: 66, ntdist:0. The ntcache value is the number of blocks in the local buffer cache. When RAC is involved, the ntdist value can be different than 0. Instead of reflecting the number of blocks in remote caches, the ntdist reflects the number of blocks not in the local cache. I am not sure if this means that Oracle assumes when blocks are not in the local cache, they ought to be in the remote cache. It looks like it.

If the decision is a buffered read, the line shows: NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 20480, objn: 20480. ckpt: 0, nblks: 21128, ntcache: 20810, ntdist:0. Of course the values are database depended.

If a segment is bigger than MTT (STT*5), the line with the function kcbcmt1 is not visible.

The last lines that are unique to a medium segment scan are:
Direct Path mndb 66 tdiob 121 txiob 0 tciob 14545
Direct path diomrc 128 dios 2 kcbisdbfc 0
The things that are recognisable for me are diomrc (quite probably direct IO multiblock read count) which is set to the multiblock read count value. The other one is dios (quite probably direct IO slots), which shows the starting value of the direct IO slots, which is the amount of IOs the database will issue asynchronously when starting a full segment scan. Fully automatic Oracle will measure throughput and CPU usage, and determine if more IOs can be issued at the same time. This actually is a bit of parallelism.

Medium sized segments and the direct path/no-direct path decision

During my tests on and, as soon as a segment exceeded STT, the Oracle engine switched to direct path reads, unless there was 99% or more of the blocks in the local cache. This is quite contrary to popular believe that the threshold is 50% of the blocks in cache to switch to reading blocks into the buffer cache. In all honesty, I have presented on the switch point value being 50% too.

When adding in writes to the mix it gets even more interesting. I first done an update of approximately 40% of the blocks, and did not commit. When tracing a simple count(*) on the entire table (this is on, which gives less information) it shows:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 1, nblks: 52791, ntcache: 21091, ntdist:21091

So, doing direct path reads, and chkpt is set to 1 (I think indicating the need to checkpoint), which seems logical, if my session wants to do a direct path read of modified blocks.

Now this is how it looks like when I update 50% of the table:
First select count(*) from table:
First time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326

Second time:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 52513, ntdist:278

That’s odd…I first do a direct path read, and the second time I am not doing a no-direct alias buffered read?
Actually, if you look at the number of blocks in the cache (ntcache), it magically changed between the two runs from 26326 to 52513. And 52513/52791*100=99.5%, which is above the apparent limit of 99%, so should be buffered.

Actually, a hint is visible in the first run. If we were to do a direct path read, how come ckpt: 0? I can not see how it would be possible to do a direct path scan when there are changes on blocks in the cache. The answer comes from combining the nsmtio trace with a SQL trace:

alter session set events 'trace[nsmtio]:sql_trace level 8';
alter session set events 'trace[nsmtio] off:sql_trace off';

Here is the relevant part of the trace:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 7, objd: 16100, objn: 16100
ckpt: 0, nblks: 52791, ntcache: 26326, ntdist:26326
NSMTIO: Additional Info: VLOT=2407385
Object# = 16100, Object_Size = 52791 blocks
SqlId = 6b258jhbcbwbh, plan_hash_value = 3364514158, Partition# = 0

*** 2015-06-29 08:48:18.825
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1484 cellhash#=3176594409 diskhash#=1604910222 bytes=1015808 obj#=16100 tim=1435585698825188
WAIT #140240535473320: nam='cell multiblock physical read' ela= 1421 cellhash#=3176594409 diskhash#=1604910222 bytes=1048576 obj#=16100 tim=1435585698828291

The wait events ‘cell multilbock physical read’ is a buffered read. So, despite ‘kcbdpc:DirectRead’ from the nsmtio trace, this is actually doing a buffered read. I am not really happy the trace is inconsistent. You could argue that it is an Oracle internal tracing function, so Oracle can and will not guarantee anything, but this way the tracing could tell the wrong story.


The nsmtio trace is a way to look into the direct path or non-direct path/buffered decision. Sadly, it can tell a wrong story.

However, there are a few things to conclude based on my research about the direct path decision:
– A segment smaller than _small_table_threshold is read for full table scan into the buffercache.
– A segment that is bigger than 500% of the buffercache is always scanned for read for full table scan via direct path.
– A segment that is sized between _small_table_threshold and 500% of the buffer cache is medium and could be full table scanned using direct path reads and using buffered reads.
– The tracing on gives a hint there is a difference in consideration between a medium segment sized smaller than MTT (medium table threshold, which is 5 times _small_table_threshold) and bigger than it. This is because of the function kcbcmt1 showing aging/timing information on the blocks when a segment is smaller than MTT.
– For a medium sized segment, a scan for reading a full table scan is done via direct path, unless there are more than 99% of the blocks in the cache.
– For a medium sized segment, a scan for reading a full table scan is done via the buffercache if the amount of blocks that is “dirty” is 50% or more.

Final consideration: I have performed these investigations on databases that were not really heavily used. As could be seen with the kcbcmt1 function, there are additional heuristics that could make the behaviour different if there is more going on in the database. I am pretty sure this blogpost is a good outline, but behaviour could be different in specific cases. Hopefully this blogpost provides enough information and pointers to investigate this for yourself.

In my previous article I started exploring the memory usage of a process on a recent linux kernel (2.6.39-400.243.1 (UEK2)), recent means “recent for the Enterprise Linux distributions” in this context, linux kernel developers would point out that the kernel itself is at version 3.19 (“stable version” at the time of writing of this blogpost).

The previous article showed that every process has its own address space, and that different allocations exists for execution. These allocations can be seen in the proc pseudo filesystem, in a process specific file called ‘maps’. The process itself needs some administration area’s which are anonymous allocations which are marked with [heap] and [stack], and some a few others called [vdso] and [vsyscall]. Every process executes something (not all people realise that: a process can wait for something, but essentially is always executing). So there always will be an executable in a (regular) process’ address space. In a lot of cases, the executable uses shared libraries. In that case, the libraries are loaded in the address space of the process too, alongside the executable.

The executable contains (at least) two sections; the code segment, which is readonly and potentially shared, and the data segment, which is read write and gets truly allocated instead of used shared with the parent if the process needs to write. In a lot of cases, the executable uses shared libraries, which means it uses functions which are stored in that library. A library also needs to be loaded, and also contains multiple sections, which can be read only or read write, and are shared unless the process needs to write to that segment.

For completeness, here’s the complete maps output of a process executing the ‘cat’ executable again:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

If you look closely, you will see that I didn’t explain one type of allocation yet: the anonymous allocations. The anonymous allocations are visible in lines 4, 11, 13 and 16. The anonymous mappings directly following the data segment of either an executable or a library mapped into a process address space is called the BSS. The data segment and the BSS store static variables, however the data segment stores initialised variables, the BSS stores uninitialised variables. The anonymous mapping for the BSS section might exist for a library, as seen above, or might not exist; not all Oracle database executable libraries use an anonymous memory mapping for example. Actually, there is one other memory allocation visible, /usr/lib/locale/locale-archive, which is a file for locale (multi-language support) functions in the C library, which is out of scope for this article.

When a process requests memory to store something, the system call malloc() (memory allocation) can be called. This system call inspects the size of the allocation, and will allocate memory from either the process’ heap (the memory mapping with [heap], using the system call brk()) or it will allocate space using a new anonymous memory segment, using the system call mmap(). If you follow the link with malloc(), you can read the source code of the malloc() call. There are different malloc()’s, which fulfil different purposes (embedded devices have different requirements than huge servers), the implementation that Enterprise Linuxes use is one called ptmalloc2, which is based on a version written bij Doug Lea. If you read the comments in the source code, specifically at ‘Why use this malloc?’, you will see that it tries to be smart with requests up to 128KB (for memory re-usability, to avoid fragmentation and memory wastage), which are allocated from the heap. If an allocation is larger than 128KB, it will use the system memory mapping facilities.

Okay, this brings us back at the original question: how much memory does this process take? I hope you recall from the first blogpost that Linux tries to share as much memory as possible, and when a new process is created, the allocations in the address space of this new process are pointers to the memory areas of the parent process. Let’s first use a utility a lot of people are using: top.

top - 10:53:40 up 9 days, 14:11,  2 users,  load average: 1.34, 1.36, 1.37
Tasks: 1124 total,   1 running, 1123 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  0.8%sy,  0.1%ni, 98.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98807316k total, 97411444k used,  1395872k free,   400288k buffers
Swap: 25165820k total,  3560852k used, 21604968k free, 27573200k cached

 16391 oracle    -2   0 2369m 4960 4812 S  0.7  0.0  91:41.37 asm_vktm_+asm1
 16407 oracle    20   0 2388m  20m 8580 S  0.7  0.0  46:16.40 asm_dia0_+asm1

I edited the output a bit to show only two process from an ASM instance.
The columns that show information on memory are VIRT, RES, SHR, %MEM.

VIRT is described in the man-page of top as ‘The total amount of virtual memory used by the task’. This means it’s ALL the memory visible in the addressing space of the process. A useful utility to get the contents of the virtual memory allocations for a process is pmap, let’s use it for process 16391, which is asm_vktm_+asm1:

$ pmap -x 16391
16391:   asm_vktm_+ASM1
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000  246644    2520       0 r-x--  oracle
000000000f6dd000    1952      44      20 rw---  oracle
000000000f8c5000     140       0       0 rw---    [ anon ]
000000001042c000     356       0       0 rw---    [ anon ]
0000000060000000    4096       0       0 rw-s-  SYSV00000000 (deleted)
0000000061000000 2080768       0       0 rw-s-  SYSV00000000 (deleted)
00000000e0000000      12       4       4 rw-s-    [ shmid=0x4e78003 ]
00007fa275589000      72      12       0 r-x--  libnfsodm12.so
00007fa27559b000    2044       0       0 -----  libnfsodm12.so
00007fa27579a000       8       0       0 rw---  libnfsodm12.so
00007fa27579c000    1604       0       0 r-x--  libshpksse4212.so
00007fa27592d000    2044       0       0 -----  libshpksse4212.so
00007fa275b2c000      72       0       0 rw---  libshpksse4212.so
00007fa275b3e000      20       4       0 r-x--  libcxgb3-rdmav2.so
00007fa275b43000    2044       0       0 -----  libcxgb3-rdmav2.so
00007fa275d42000       4       0       0 rw---  libcxgb3-rdmav2.so
00007fa27abd5000       8       0       0 r-x--  libodmd12.so
00007fa27abd7000    2044       0       0 -----  libodmd12.so
00007fa27add6000       4       0       0 rw---  libodmd12.so
00007fa27add7000     128     116       0 r-x--  ld-2.12.so
00007fa27adf8000     512      12      12 rw---    [ anon ]
00007fa27ae78000     212       4       4 r--s-  passwd
00007fa27aead000    1260      68      68 rw---    [ anon ]
00007fa27aff3000       4       4       0 rw-s-  hc_+ASM1.dat
00007fa27aff4000       8       0       0 rw---    [ anon ]
00007fa27aff6000       4       0       0 r----  ld-2.12.so
00007fa27aff7000       4       0       0 rw---  ld-2.12.so
00007fa27aff8000       4       0       0 rw---    [ anon ]
00007fff7b116000     132       8       8 rw---    [ stack ]
00007fff7b1ff000       4       4       0 r-x--    [ anon ]
ffffffffff600000       4       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB         2426668    5056     156

The column ‘Kbytes’ represents the full size of the executable, libraries, shared memory, anonymous mappings and other mappings of this process. For completeness sake: 2426668/1024=2369.79, which matches the 2369 in the top output. This is all the memory this process can see, and could use. Does this tell us anything on what memory the process 16391 actually takes? No. (parts of) the Oracle executable’s allocations are potentially shared, the shared memory (SYSV00000000 (deleted) and [ shmid=0x4e78003 ], which represent the Oracle SGA) is shared, the memory allocations for the libraries are potentially shared. The anonymous memory mappings have been defined, but the actual allocation is not visible in the Kbytes column. What this column in top does for me, is tell the approximate SGA size, especially if the SGA is larger (meaning multiple gigabytes).

The second memory column in top is RES. RES is described as: ‘The non-swapped physical memory a task is using’. RES is sometimes referred to as RSS, and called ‘resident set size’. As we can see from the total, the RSS is way lesser than the virtual memory size. One important thing in the RES description of top is that it described that swapped memory pages are not counted for the RES/RSS value. RES/RSS corresponds to the actual used (“touched”) memory by a process, and is directly usable. If you look back to the RSS column of pmap above, you see the oracle executable’s two mappings, one has a RSS size of 2520, and one has a RSS size of 44. But…if you remember that the code/readonly segment is potentially shared with other process, and then look at the 2520 value (which is of the oracle memory segment with the rights r-x–, which means the code segment), I hope you understand this just means this process (vktm) read a subset of the entire executable, and more importantly: the RSS size does not reflect physical memory uniquely allocated by this process.

If we look at the shared memory segments, it’s interesting to see what happens during normal life of a database session. I think it is needless to say that you should calculate shared memory outside of process memory usage, since it’s a distinct memory set that is truly shared by all the processes that are created for the instance.

This is a session which has just started:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     348     348 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848    2316    2316 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592       0       0 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]

This instance has a SGA set to 4G. Because the session just started, it only touched 2316(KB) of the SGA. Next, I do a big (buffered!) full table scan, requiring the session to put a lot of blocks into the buffercache. After the scan, look at the shared memory segment using pmap again:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     384     384 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848 2279040 2279040 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592   66564   66564 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]

The session has touched half of the SGA shared memory segment (visible in the RSS column of the 6th line). This is logical if you understand what is going on: the process does a buffered table scan, which means the blocks read from disk need to be stored in the buffer cache, which is one of the memory structures in the Linux shared memory segments. However, if you look strictly at the top utility output of a database that has just started up, you see the RSS size of the all the processes that work with the buffercache growing. This phenomenon has lead to a repetitive question on the Oracle Linux forums if Oracle database processes are leaking memory. Of course the correct answer is that the RSS size just grows because the process just touches more of the shared memory (=SGA) that has been mapped into its address space. It will stop increasing once it touched all the memory it could touch.

%MEM is the RES/RSS size expressed as a percentage of the total physical memory.

SHR is the amount of shared memory. The manpage of top says ‘It simply reflects memory that could be potentially shared with other processes’. Do not confuse this with shared memory segments mapped into the process’ address space. Empirical tests show the SHR value always seems to be lower than the RSS size, which means it seems to track the RSS value of memory, and shows RSS (touched) memory that could be shared (which seems to contain both touched memory from the shared memory segments, as well as privately mapped executable and libraries). At least from the perspective of an Oracle performance and troubleshooting perspective I can’t see any benefit from using this value.

The conclusion of this part on memory usage is that both the virtual set size (VIRT/VSZ) and resident set size (RES/RSS) are no figures you can add up to indicate physical memory usage of a single process or a group of processes.

The virtual set size gives you the total amount of virtual memory available to a single process, which includes the executable, and potentially shared libraries, anonymous memory mappings and files mapped into the address space and shared memory. In a lot of cases, you get an indication of the total SGA size of the instance, because the shared memory segments which contain the SGA are entirely mapped into the address space of the process.

The resident set size shows you how much of the memory and files mapped into the address space are actually “touched”, and directly usable. Some of the memory usage result in memory pages private to the process, because of writing to the memory and the Copy On Write mechanism of the Linux memory manager. A lot of other memory mappings can be used, increasing the resident set size, while these are shared with other processes. A third potential component is the actual usage of memory as a result of anonymous memory mappings (versus the total allocation, which can be much more), which are private to the process.

This blogpost is about finding the actual amount of memory a process is taking. In order to do so, this post dives into the memory mechanisms of Linux. The examples in this article are taken from an Oracle Linux version 6.6 server, with kernel 2.6.39-400.243.1 (UEK2). This is written with the Oracle database processes in mind, but actually uses examples of a processes running ‘cat’, which means the contents of this post are absolutely not limited to Oracle database processes.

Let’s start off with a simple example. Let’s look at our own memory map. In order to do so, I use the ‘cat’ executable and the ‘maps’ entry in the proc pseudo-filesystem. This is how that is done, including the result:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Most people know that in order to execute ‘cat’, the shell forks and executes the cat command in a new process.

What we see, is the executable (/bin/cat), the heap ([heap]), two dynamic libraries (/lib64/ld-2.12.so and /lib64/libc-2.12.so), the stack ([stack]), two entries called [vdso] (virtual dynamically linked shared objects) and [vsyscall] (virtual syscall), and anonymous memory allocations (00000000 00:00 0 allocations without a marker to indicate a process function).

In order to understand the libraries, we first need to know something about the executable itself, using the command ‘file’:

$ file /bin/cat
/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped

There actually is a great deal of information to be seen from this one line:
This is an ELF format executable. Another executable format is COFF, which is used on the Windows platform.
The executable is 64 bits (x86-64).
The most important part for this article: the executable is dynamically linked, and uses shared libraries.
The last word ‘stripped’ deserves some explanation too: the executables is stripped, which means a lot of symbolic information (function names for example) are removed from the executable. One reason for doing so is to make the executable smaller. If you look at the oracle executable, you will see it’s not stripped. Except for the Oracle XE executable, which is stripped.

We now established this is a dynamically linked executable. The next step is to see the libraries it is using. This is done with the ‘ldd’ (loader dependencies) executable:

$ ldd /bin/cat
	linux-vdso.so.1 =>  (0x00007fffa9388000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f6ffa3dd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6ffa772000)

Please mind these are the dependencies for executing /bin/cat, which does not necessarily means you see all those in the address space of the process executing the /bin/cat executable. Two you don’t see is linux-vsdo.so.1, which is used for virtual dynamic shared objects, which essentially means from the oracle perspective that some system calls can be executed fully in userspace, most notably for Oracle engineers: gettimeofday(). Fellow Oaky James Morle wrote a nice article explaining this.
The other one is /lib64/ld-linux-x86-64.so.2, which is the dynamic loader needed for executing the executable. The last one is libc.so.6, which “truly” is a library that is dynamically loaded for use during execution. The other ones discussed earlier are necessary for the instantiating the execution, not so much during the execution.

If we now look back to the maps output, we see the libc.so.6 library, and we see a ld-2.12.so library. The ld-2.12.so library is the dynamic loader library, to provide dynamic loading function on runtime.

Okay, we now gained a some understanding on executables and the libraries. Now let’s look a bit more in detail to the executable in maps:

00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat

Why is the executable mentioned twice?
The answer is this is because of the way an ELF object is built up. In order to look deeper into this ELF binary, we can use the ‘readelf’ executable:

$ readelf -l /bin/cat

Elf file type is EXEC (Executable file)
Entry point 0x401850
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000a204 0x000000000000a204  R E    200000
  LOAD           0x000000000000a208 0x000000000060a208 0x000000000060a208
                 0x0000000000000648 0x0000000000001000  RW     200000
  DYNAMIC        0x000000000000a3e8 0x000000000060a3e8 0x000000000060a3e8
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x000000000000021c 0x000000000040021c 0x000000000040021c
                 0x0000000000000044 0x0000000000000044  R      4
  GNU_EH_FRAME   0x0000000000009414 0x0000000000409414 0x0000000000409414
                 0x000000000000027c 0x000000000000027c  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8

 Section to Segment mapping:
  Segment Sections...
   01     .interp
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   03     .ctors .dtors .jcr .data.rel.ro .dynamic .got .got.plt .data .bss
   04     .dynamic
   05     .note.ABI-tag .note.gnu.build-id
   06     .eh_frame_hdr

When ‘readelf’ is invoked with the -l option, readelf displays the information that is contained in the ELF segment headers of the executable. This is a lot of detail into which I don’t go too deep.

The biggest distinction between the two entries of the executable in maps is the first entry is readonly executable (r-xp), and the second entry is read-write (rw-p). If we now look at the readelf -l output, we see the same flags in the FLAGS column, now indicated by R, W and E for Read, Write and Execute. If we look in the segment sections part, we see section 02 containing a lot of entries, and if we look up at the headers, and we count from zero, we see “LOAD”, and the entry being flagged as ‘R E’. This memory area often is referred to as the code segment.

The some of the types of memory allocations in section 02 are:
– Dynamic linking information (.interp, .gnu.hash, .dynsyms, dynstr)
– C runtime code (.init, .fini)
– Relocation information (.rela.dyn, .rela.plt)
– String constants (.rodata)
– Machine instructions (.text)
– Procedure Linkage Table (.plt)

The next section that deserves attention is section 03. If we look in the headers, count from zero, we see this section is a “LOAD” section too, but the entry being flagged as ‘RW ‘. This memory area often is referred to as the data segment.

The some of the types of memory allocations in section 03 are:
– Dynamic linking into (.dynamic)
– Relocated pointer values to external symbols (.got)
– Procedure linkage table of the global offset table (.got.plt)
– C runtime data (.ctors, .dtors)

As you can see with these two, section 02 contains information which can actually be readonly (the executable itself is never changed on runtime), which is why it’s readonly. Section 03 contains information which by the nature of dynamic linking, really can’t be readonly, because we don’t know how the library looks like which the executable needs to dynamically load via the dynamic loader; the only thing that is required from the library is that it contains the functions the executable needs from that library.

At this point I think it should be clear why there are two sections for a single executable. But how about the libraries?

Actually, simplified, a library is an executable with only functions, and not a program to run any of these functions. Libraries also can use other shared libraries for its functions. This means you can use the above mentioned executables (file, ldd, readelf) to examine libraries, and get quite similar results. The programmer can choose how the memory segments look like and how these are divided, which is the reason the dynamic loader library (ld-2.12.so) has 3 entries, and the libc library (libc-2.12.so) has 4. Although 4 or 3 sections seems to be extremely common for libraries, as well as 2 for executables.

Okay, now that we gained some understanding on the memory segments for a process, let’s continue on actual memory usage. When a process is forked, Linux tries to do the least possible and save as much memory as possible. To do so, the process gets its own virtual memory address space which is a duplicate of the process which called fork, including its memory. However, what really happens is the memory areas are the ones from the forking process, linked via pointers. Only when a change is needed, the process starts truly allocating its own memory, of course not using the memory from the forking process. This technique is often referred to as COW (Copy On Write). The ‘pointer trick’ is made possible by the kernel, and the use of virtual memory: every process allocates at the same (virtual) memory addresses, which are kept separate from other processes by translating them to different physical locations, or the same locations for copy on write. Of course the read-only sections as we saw in maps will never change, and as a result only will be in memory at a single place, and potentially referenced/used by a lot of processes. But even the writable sections are only physically allocated as soon as there truly is a write action.

The ‘just in time’ principle of memory allocation is also used for anonymous memory allocations via mmap(), which are most of the Oracle database processes PGA allocations. When a process allocates memory via mmap(), it results in an anonymous memory section, and is visible in maps. However, it isn’t allocated yet, only when the memory truly is getting used it is physically allocated.

I hope you start to see at this point that you can see the memory areas a process is using, but you can not tell how much of that is really, uniquely allocated for that process. Also, and maybe even more importantly, just adding memory figures of Oracle databases processes will do (a lot of) double counting. In a future blogpost or posts I will dive into how bits and pieces of the working can actually be seen from for example the /proc/PID/smaps meta-file, and how this knowledge can be applied for sizing Oracle databases.

I was testing Oracle Goldengate on a non-clustered Oracle database with ASM. With ASM, you need to have the grid infrastructure installed. The cluster ware for the single node install is called ‘oracle restart’.

The most convenient way to have Goldengate running at startup that I could find, was using the Oracle Grid Infrastructure Agents. These agents are not installed by default, you need to download these from the Oracle Technology Network. The download is with the grid infrastructure downloads in the database section.

The installation is very simple: unzip the xagpack_6.zip file, and run the xagsetup.sh script with the arguments ‘–install’ and ‘–directory’ arguments. The directory argument needs to point to a directory outside of the grid infrastructure home. My first choice would be to have it in the infrastructure home, but the install script does not allow that.

$ ./xagsetup.sh --install --directory /u01/app/oracle/product/xag
Installing Oracle Grid Infrastructure Agents on: ogg-dest

The control utility, ‘agctl’ can be run without any environment variables set to point to the directory where the utility is installed, or the grid infrastructure home.

$ /u01/app/oracle/product/xag/bin/agctl query releaseversion
The Oracle Grid Infrastructure Agents release version is 6.1.1

To add a golden gate resource, add it via the agctl utility:

$ /u01/app/oracle/product/xag/bin/agctl add goldengate gg1 --gg_home /u01/app/oracle/product/12.1.2/oggcore_1 --instance_type target --oracle_home /u01/app/oracle/product/ --databases ora.dest.db

This is a very simple example, where a cluster resource called ‘gg1’ is created, for which we point out the golden gate home, the database ORACLE_HOME and the database cluster resource (not the database name as some documentation from Oracle says).

In my case, golden gate was not running. After the golden gate cluster resource was added, it’s offline:

      1        OFFLINE OFFLINE                               STABLE

Now start the resource using the ‘agctl’ utility:

/u01/app/oracle/product/xag/bin/agctl start goldengate gg1

(please mind you could do exactly the same with crsctl: crsctl start res xag.gg1.goldengate)
Starting does not result in any message. Let’s look at the golden gate cluster resource:

      1        ONLINE  ONLINE       ogg-dest                 STABLE

What it does, is start the golden gate manager (mgr) process:

$ /u01/app/oracle/product/12.1.2/oggcore_1/ggsci << H
> info all
> H

Oracle GoldenGate Command Interpreter for Oracle
Version OGGCORE_12.
Linux, x64, 64bit (optimized), Oracle 12c on Aug  7 2014 10:21:34
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2014, Oracle and/or its affiliates. All rights reserved.

GGSCI (ogg-dest.local) 1>
Program     Status      Group       Lag at Chkpt  Time Since Chkpt

REPLICAT    ABENDED     REP1        00:00:00      17:55:53

Let’s add monitoring of the replicat. In real life scenario’s it’s the extracts or replicats that are important, these perform the actual work golden gate is supposed to do!

/u01/app/oracle/product/xag/bin/agctl modify goldengate gg1 --monitor_replicats rep1

The cluster ware checks the golden gate resource type status every 30 seconds. So if you check the cluster immediately after modifying the golden gate resource, it could not have performed the check with the added check for the replicat.

      1        ONLINE  INTERMEDIATE ogg-dest                 ER(s) not running :

It now neatly reports the replicat group rep1 not running! I guess ‘ER’ means Extract and Replicat.
Now start the replicat, and check the status again:

      1        ONLINE  ONLINE       ogg-dest                 STABLE

A few final notes. This is the simplest possible setup. I couldn’t find a lot of information on the infrastructure agents except for some Oracle provided ones, which is why I created this blog. If you apply this on a cluster, I would urge you to look into the oracle provided ones, and create the full setup with golden gate writing its files on a cluster filesystem, using an application VIP, etcetera, so golden gate could be started or failed over to another host.

Goldengate is not the only thing this agents can incorporate into the cluster ware. The current version (6.1.1) supports, besides golden gate: tomcat, apache, JDE, mysql, various types of peoplesoft servers, siebel and web logic.

Deinstalling is very simple, but might put you on your wrong foot. The agents are deinstalled by using the setup script (xagsetup.sh) with the ‘–deinstall’ argument. This of course is quite normal, but it deinstalls the agents from the location where you call xagsetup.sh. This means that if you run xagsetup.sh from the location where you unzipped it for installation, it will ‘deinstall’, which means clean out, from that directory, leaving the true installation alone. The documentation states that upgrades to the agents are done by removing them and installing the new version.

This article is written with examples taken from an (virtualised) Oracle Linux 6u6 X86_64 operating system, and Oracle database version However, I think the same behaviour is true for Oracle 11 and 10 and earlier versions.

Probably most readers of this blog are aware that a “map” of mapped memory for a process exists for every process in /proc, in a pseudo file called “maps”. If I want to look at my current process’ mappings, I can simply issue:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 786125                             /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 786125                             /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0080a000-0080b000 rw-p 0000a000 fc:00 786125                             /bin/cat
01243000-01264000 rw-p 00000000 00:00 0                                  [heap]
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/ld-2.12.so
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/ld-2.12.so
345b220000-345b221000 rw-p 00020000 fc:00 276143                         /lib64/ld-2.12.so
345b221000-345b222000 rw-p 00000000 00:00 0
345b800000-345b98a000 r-xp 00000000 fc:00 276144                         /lib64/libc-2.12.so
345b98a000-345bb8a000 ---p 0018a000 fc:00 276144                         /lib64/libc-2.12.so
345bb8a000-345bb8e000 r--p 0018a000 fc:00 276144                         /lib64/libc-2.12.so
345bb8e000-345bb8f000 rw-p 0018e000 fc:00 276144                         /lib64/libc-2.12.so
345bb8f000-345bb94000 rw-p 00000000 00:00 0
7f8f69686000-7f8f6f517000 r--p 00000000 fc:00 396081                     /usr/lib/locale/locale-archive
7f8f6f517000-7f8f6f51a000 rw-p 00000000 00:00 0
7f8f6f524000-7f8f6f525000 rw-p 00000000 00:00 0
7fff2b5a5000-7fff2b5c6000 rw-p 00000000 00:00 0                          [stack]
7fff2b5fe000-7fff2b600000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

What we see, is the start and end address, the rights (rwx), absence of rights is shown with a ‘-‘, and an indication of the mapped memory region is (p)rivate or (s)hared. In this example, there are no shared memory regions. Then an offset of the mapped file, then the device (major and minor device number). In our case sometimes this is ‘fc:00’. If you wonder what device this might be:

$ echo "ibase=16; FC" | bc
$ ls -l /dev | egrep 252,\ *0
brw-rw---- 1 root disk    252,   0 Mar 23 14:19 dm-0
$ sudo dmsetup info /dev/dm-0
Name:              vg_oggdest-lv_root
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      252, 0
Number of targets: 2
UUID: LVM-q4nr4HQXgotaaJFaGF1nzd4eZPPTohndgz553dw6O5pTlvM0SQGLFsdp170pgHuw

So, this is a logical volume lv_root (in the volume group vg_oggdest).

Then the inode number (if a file was mapped, if anonymous memory was mapped the number 0 is shown), and then the path if a file was mapped. This is empty for anonymous mapped memory (which is memory which is added to a process using the mmap() call). Please mind there are also special regions like: [heap],[stack],[vdso] and [vsyscall].

Okay, so far I’ve shown there is a pseudo file called ‘maps’ which shows mapped memory and told a bit about the fields in the file. Now let’s move on to the actual topic of this blog: the Oracle database SGA memory, and the indicator this is deleted!

In this example I pick the maps file of the PMON process of an Oracle database. Of course the database must use system V shared memory, not shared memory in /dev/shm (which is typically what you see when Oracle’s automatic memory (AMM) feature is used). This is a snippet from the maps file of the pmon process on my server:

 cat /proc/2895/maps
00400000-1093f000 r-xp 00000000 fc:00 1326518                            /u01/app/oracle/product/
10b3e000-10dbf000 rw-p 1053e000 fc:00 1326518                            /u01/app/oracle/product/
10dbf000-10df0000 rw-p 00000000 00:00 0
12844000-1289d000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:04 111902723                          /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:04 111902723                          /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:04 111935492                          /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:04 111968261                          /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:04 112001030                          /SYSV6ce0e164 (deleted)
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/ld-2.12.so
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/ld-2.12.so

If you look closely, you see the oracle executable first, with two entries, one being readonly (r-xp), the other being read-write (rw-p). The first entry is readonly because it is shared with other processes, which means that there is no need for all the processes to load the Oracle database executable in memory, it shares the executable with other process. There’s much to say about that too, which should be done in another blogpost.

After the executable there are two anonymous memory mappings, of which one is the process’ heap memory.

Then we see what this blogpost is about: there are 5 mappings which are shared (r–s and rw-s). These are the shared memory regions of the Oracle database SGA. What is very odd, is that at the end of the lines it says “(deleted)”.

Of course we all know what “deleted” means. But what does it mean in this context? Did somebody delete the memory segments? Which actually can be done with the ‘ipcrm’ command…

If you go look at the maps of other Oracle processes and other databases you will see that every database’s shared memory segment are indicated as ‘(deleted)’.

Word of warning: only execute the steps below on a test environment, do NOT do this in a production situation.

In order to understand this, the best way to see what actually is happening, is starting up the Oracle database with a process which is traced with the ‘strace’ utility with the ‘-f’ option set (follow). Together with the ‘-o’ option this will produce a (long) file with all the system calls and the arguments of the calls which happened during startup:

$ strace -f -o /tmp/oracle_startup.txt sqlplus / as sysdba

Now start up the database. Depending on your system you will notice the instance startup takes longer. This is because for every system call, strace needs to write a line in the file /tmp/oracle_start.txt. Because of this setup, stop the database as soon as it has started, on order to stop the tracing from crippling the database performance.

Now open the resulting trace file (/tmp/oracle_startup.txt) and filter it for the system calls that are relevant (calls with ‘shm’ in their name):

$ grep shm /tmp/oracle_startup.txt | less

Scroll through the output until you see a line alike ‘shmget(IPC_PRIVATE, 4096, 0600) = 130777091’:

4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130777091
4545  shmat(130777091, 0, 0)            = ?
4545  shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130777091, IPC_RMID, 0)    = 0
4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130809859
4545  shmat(130809859, 0, 0)            = ?
4545  shmctl(130809859, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130809859, IPC_RMID, 0)    = 0

What we see here is a (filtered) sequence of systems calls that could explain the status deleted of the shared memory segments. If you look up what process id is in front of these shm system calls, you will see it’s the foreground process starting up the instance. If you look closely, you’ll that there is a sequence which is repeated often:

1. shmget(IPC_PRIVATE, 4096, 0600) = 130777091
The system call shmget allocates a shared memory segment of 4 kilobyte, rights set to 600. The return value is the shared memory identifier of the requested shared memory segment.

2. shmat(130777091, 0, 0) = ?
The system call shmat attaches the a shared memory segment to the process’ address space. The first argument is the shared memory identifier, the second argument is the address to attach the segment to. If the argument is zero, like in the call above, it means the operating system is tasked with finding a suitable (non used) address. The third argument is for flags, the value zero here means no flags are used. The returncode (here indicated with a question mark) is the address at which the segment is attached. This being a question mark means strace is not able to read the address, which is a shame, because we can’t be 100% certain at which memory address this shared memory segment is mapped.

3. shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
The system call shmctl with the argument IPC_STAT has the function to read the (kernel) shared memory information of the shared memory identifier indicated by the first argument, and write it at the memory location in the third argument in a struct called shmid_ds.

4. shmdt(0x7f406f2ba000) = 0
With this system call, the shared memory segment is detached from the process’ address space. For the sake of the investigation, I assumed that the address in this call is the address which is returned by the shmat() call earlier.

5. shmctl(130777091, IPC_RMID, 0) = 0
This is another shared memory control system call, concerning our just created shared memory segment (shared memory identifier 130777091), with the command ‘IPC_RMID’. This is what the manpage says about IPC_RMID:

       IPC_RMID  Mark the segment to be destroyed.  The segment will only  actually  be  destroyed
                 after the last process detaches it (i.e., when the shm_nattch member of the asso-
                 ciated structure shmid_ds is zero).  The caller must be the owner or creator,  or
                 be privileged.  If a segment has been marked for destruction, then the (non-stan-
                 dard) SHM_DEST flag of the shm_perm.mode field in the associated  data  structure
                 retrieved by IPC_STAT will be set.

What I thought this means was:
It looked like to me the database instance starts building up its shared memory segments per 4096 page. Because IPC_RMID only marks the segment to be destroyed, and because it will only be truly destroyed when there are no processes attached to the shared memory segment, it looked like to me the background processes were pointed to the shared memory segment which was marked destroyed (in some way I hadn’t discovered yet), which meant the shared memory segment would actually survive and all database processes can use it. If ALL the database processes would be killed for any reason, for example with a shutdown abort, the processes would stop being connected to the shared memory segment, which would mean the shared memory segment would vanish automatically, because it was marked for destruction.
Sounds compelling, right?

Well…I was wrong! The sequence of creating and destroying small shared memory segments is done, but it turns out these are truly destroyed with the shmctl(…,IPC_RMID,…) call. I don’t know why the sequence of creating shared memory segments is happening.

I started looking for the actual calls that create the final, usable shared memory segments in the /tmp/oracle_startup.txt file. This is actually quite easy to do; first look up the shared memory segment identifiers using the sysresv utility (make sure the database’s ORACLE_HOME and ORACLE_SID are set):

$ sysresv
...a lot of other output...
Shared Memory:
197394436	0x00000000
197427205	0x00000000
197361667	0x00000000
197459974	0x6ce0e164
1015811 	0xd5cdbca4
Oracle Instance alive for sid "dest"

Actually the ‘sysresv’ utility (system remove system V memory I think is what the name means) has the task of removing memory segments if there is no instance left to use them. It will not remove the memory segments if it finds the instance alive. It prints out a lot of information as a bonus.

Now that we got the shared memory identifiers, simply search in the trace file generated by strace, and search for the creation of the memory segment with the identifiers: (please mind searching with ‘less’ is done with the forward slash)

$ less /tmp/oracle_startup.txt
9492  shmget(IPC_PRIVATE, 905969664, IPC_CREAT|IPC_EXCL|0640) = 197394436
9492  shmat(197394436, 0x60400000, 0)   = ?
9492  times(NULL)                       = 430497743
9492  write(4, " Shared memory segment allocated"..., 109) = 109
9492  write(4, "\n", 1)                 = 1

Aha! here we see shmget() again, but now with a size (905969664) that looks much more like a real shared memory segment size used by the database! After the shared memory identifier is created, the process attaches it to its addressing space with shmat() to a specific memory address: 0x60400000.

The next thing to do, is to look for any shmctl() call for this identifier. Oracle could still do the trick of marking the segment for destruction…
…But…there are no shmctl() calls for this identifier, nor for any of the other identifiers shown with the sysresv utility. This is rather odd, because Linux shows them as “(deleted)”. There ARE dozens of shmat() calls, of the other (background) processes forked from the starting process when they attach to the shared memory segments.

So, conclusion at this point is Linux shows the shared memory segments as deleted in ‘maps’, but the Oracle database does not mark the segments for destruction after creation. This means that either Linux is lying, or something mysterious is happening in the Oracle executable which I didn’t discover yet.

I could only think of one way to verify what is truly happening here. That is to create a program myself that uses shared memory, so I have 100% full control over what is happening, and can control every distinct step.

This is what I came up with:

#include <stdio.h>
#include <sys/shm.h>
#include <sys/stat.h>

int main ()
  int segment_id;
  char* shared_memory;
  struct shmid_ds shmbuffer;
  int segment_size;
  const int shared_segment_size = 0x6400;

  /* Allocate a shared memory segment.  */
  segment_id = shmget (IPC_PRIVATE, shared_segment_size,
                     IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
  printf ("1.shmget done\n");
  /* Attach the shared memory segment.  */
  shared_memory = (char*) shmat (segment_id, 0, 0);
  printf ("shared memory attached at address %p\n", shared_memory);
  printf ("2.shmat done\n");
  /* Determine the segment's size. */
  shmctl (segment_id, IPC_STAT, &shmbuffer);
  segment_size  =               shmbuffer.shm_segsz;
  printf ("segment size: %d\n", segment_size);
  printf ("3.shmctl done\n");
  /* Write a string to the shared memory segment.  */
  sprintf (shared_memory, "Hello, world.");
  /* Detach the shared memory segment.  */
  shmdt (shared_memory);
  printf ("4.shmdt done\n");

  /* Deallocate the shared memory segment.  */
  shmctl (segment_id, IPC_RMID, 0);
  printf ("5.shmctl ipc_rmid done\n");

  return 0;

(I took the code from this site, and modified it a bit for my purposes)
If you’ve got a linux system which is setup with the preinstall rpm, you should be able to copy this in a file on your (TEST!) linux database server, in let’s say ‘shm.c’, and compile it using ‘cc shm.c -o smh’. This will create an executable ‘shm’ from this c file.

This program does more or less the same sequence we saw earlier:
1. Create a shared memory identifier.
2. Attach to the shared memory identifier.
3. Get information on the shared memory segment in a shmid_ds struct.
4. Detach the shared memory segment.
5. Destroy it using shmctl(IPC_RMID).

What I did was have two terminals open, one to run the shm program, and one to look for the results of the steps.

Step 1. (shmget)

$ ./shm
1. shmget done

When looking with ipcs, you can see the shared memory segment which is created because of the shmget() call:

$ ipcs -m

------ Shared Memory Segments --------
0x00000000 451608583  oracle     600        25600      0

when looking in the address space of the process running the shm program, the shared memory segment is not found. This is exactly what I expect, because it’s only created, not attached yet.

Step 2. (shmat)

shared memory attached at address 0x7f3c4aa6e000
2.shmat done

Of course the shared memory segment is still visible with ipcs:

0x00000000 451608583  oracle     600        25600      1

And we can see from ipcs in the last column (‘1’) that one process attached to the segment. Of course exactly what we suspected.
But now that we attached the shared memory to the addressing space, it should be visible in maps:

7f3c4aa6e000-7f3c4aa75000 rw-s 00000000 00:04 451608583                  /SYSV00000000 (deleted)

Bingo! The shared memory segment is visible, as it should be, because we just attached it with shmat(). But look: it’s deleted already according to Linux!

However I am pretty sure, as in 100% sure, that I did not do any attempts to mark the shared memory segment destroyed or do anything else to make it appear to be deleted. So, this means maps lies to us.

So, the conclusion is the shared memory Oracle uses is not deleted, it’s something that Linux shows us, and is wrong. When looking at the maps output again, we can see the shared memory identifier is put at the place of the inode number. This is handy, because it allows you to take the identifier, and look with ipcs for shared memory segments and understand which specific shared memory segment a process is using. It probably means that maps tries to look up the identifier number as inode number, which it will not be able to find, and then comes to the conclusion that it’s deleted.

However, this is speculation. Anyone with more or better insight is welcome to react on this article.

Sometimes you need to see the difference between two pieces of console output. When I research, this can be two stacktraces, but also /proc//maps and smaps output; really anything. Of course, there’s diff, but the diff output is not very visual. Also, diff doesn’t do diffing between more than two files.

This can be done reasonably simple in vim. Here’s how to do that:
1. start vi; vi
2. do a vertical split using a new buffer; :vnew
3. open the first (left side) file; :r path/file or goto insert mode (esc i) and paste text.
4. goto the second window: ctrl+w ctrl+w
5. open the second (right side) file; :r path/file or goto insert mode (esc i) and paste text.
6. diff the two windows; :windo diffthis
7. turn diff mode off; :windo diffoff

You can also expand your diffing to three windows:
1. goto the rightside; ctrl+w l
2. change new window placement to the right side; :set splitright
3. do another virtual split; :vnew
4. open another file or paste text
5. diff again; : windo diffthis

I wrote this down for myself, but hopefully this helps other people too.

Every DBA working with the Oracle database must have seen memory dumps in tracefiles. It is present in ORA-600 (internal error) ORA-7445 (operating system error), system state dumps, process state dumps and a lot of other dumps.

This is how it looks likes:

Dump of memory from 0x00007F06BF9A9E00 to 0x00007F06BF9ADE00
7F06BF9A9E00 0000C215 0000001F 00000CC1 0401FFFF  [................]
7F06BF9A9E10 000032F3 00010003 00000002 442B0000  [.2............+D]
7F06BF9A9E20 2F415441 31323156 4F2F3230 4E494C4E  [ATA/V12102/ONLIN]
7F06BF9A9E30 474F4C45 6F72672F 315F7075 3735322E  [ELOG/group_1.257]
7F06BF9A9E40 3336382E 36313435 00003338 00000000  [.863541683......]
7F06BF9A9E50 00000000 00000000 00000000 00000000  [................]

The first column is the memory location in hexadecimal.
The second to fifth columns represent the actual memory values in hexadecimal.
The sixth column shows an ASCII representation of the memory contents. If a position does not represent an ASCII character, a dot (“.”) is printed.

Actually, the values in the second to fifth column are grouped in four columns. This is how the values in a column look like:
{hex val}{hex val}{hex val}{hex val}, for example: 00010203 means: 0, 1, 2, 3.

In the ASCII representation (sixth column) the spaces after every four values are not put in.

However, look at the following line:

7F06BF9A9E10 000032F3 00010003 00000002 442B0000  [.2............+D]

And focus on the last four characters:
“..+D” (two non-printables, plus, D)
Now look at the corresponding memory contents from the dump:
“442B0000” This is: “44 2B 00 00”, which should correspond to “. . + D”.
There is something the matter here: the plus and the D seem to be represented by “00”. That’s not correct.

Let’s see what “442B0000” actually represents in ASCI:

$ echo -e "\x44\x2B\x00\x00"

Ah! That looks backwards! Let’s take a full line and see what that gives:
(This is the line with memory address 0x7F06BF9A9E20)

$ echo -e "\x2F\x41\x54\x41 \x31\x32\x31\x56 \x4F\x2F\x32\x30 \x4E\x49\x4C\x4E"
/ATA 121V O/20 NILN

So if you want to look at the actual memory contents, you need to start with the column on the left side, read the values from right to left, then go the next column, etc.

Actual, I asked my friend Philippe Fierens for a trace file from a SPARC (big endian) platform, to see if the endianness of the platform was causing this. I test my stuff on Linux, which is little endian.

Here’s a little snippet:

Dump of memory from 0xFFFFFFFF7D977E00 to 0xFFFFFFFF7D97BE00
FFFFFFFF7D977E00 15C20000 00000001 00000000 00000104  [................]
FFFFFFFF7D977E10 F4250000 00000000 0B200400 E2EB8A3D  [.%....... .....=]
FFFFFFFF7D977E20 44475445 53540000 32F6D98B 00000590  [DGTEST..2.......]
FFFFFFFF7D977E30 00004000 00000001 00000000 00000000  [..@.............]
FFFFFFFF7D977E40 00000000 00000000 00000000 00000000  [................]

Let’s test the line from address 0xFFFFFFFF7D977E20:

[oracle@bigmachine [v12102] trace]$ echo -e "\x44\x47\x54\x45 \x53\x54\x00\x00 \x32\xF6\xD9\x8B \x00\x00\x05\x90"
DGTE ST 2� �

So, the endianness determines how the raw memory contents should be read.


Get every new post delivered to your Inbox.

Join 2,793 other followers

%d bloggers like this: