Archive

Tag Archives: memory

In the article oracle memory troubleshooting using analysis on heap dumps I introduced heap_analyze.awk.

The reason the tool exists is because I am using it myself. Therefore, I ran into additional things that I wanted the tool to do. I added some stuff, which is that significant, that I decided to make another blogpost to introduce the new features.

1. Percentages
In order to get an idea of the relative size of the summarised topic, I added a percentage. For example:

top 5 allocation by total size per alloc reason per heap
==================================================================================================
heap             alloc_reason            #chunks       total_size   %
--------------------------------------------------------------------------------------------------
sga heap(1,0)    perm                         19        182947456  72
sga heap(1,0)                                 43         14488960   5
sga heap(1,0)    SO private sga               18         14284168   5
sga heap(1,0)    KGLHD                      5787          4318400   1
sga heap(1,0)    KSFD SGA I/O b                1          4190416   1

2. Enhanced perm (permanent) memory descriptions
It seems that for PGA heap dumps, sometimes there is a description for memory area’s that are perm (permanent memory, memory allocated for the lifetime of the process). This is how that’s visible in the dump:

PERMANENT CHUNKS:
  Chunk     7fcbcad6c020 sz=    20632    perm      "perm           "  alo=20632
            7fcbcad6c020 sz=    20632    cpmlst    "callback hsn   "

I must say I don’t know what ‘cpmlst’ means, so if anyone knows or has a good guess, please let me know. However, the two addresses and sizes are an exact match, so I now change the alloc_reason for the cpmlst text.
This is helpful because there is a quite some memory allocated as perm. Sadly, this is not done for SGA dumps.

3. Excel mode
If you want to store and compare different dumps, one way of doing that is pasting the output in Microsoft excel. Once you set the ‘Text To Columns’ option to space as a separator, it will put the information in its own cells. But there are a few problems with that, one of them is that the heap names and alloc_reasons can have spaces in them, so that the placement of the figures can vary. I created excel mode for that.
If you set excel mode (set the variable excel_mode to 1 on line 5):

#!/bin/awk -f
BEGIN {
  printf "Heap dump analyzer v0.2 by Frits Hoogland\n";
  group_sum_nr=5;
  excel_mode=1;
}

In this mode, the horizontal lines (with ‘-‘ and ‘=’) are omitted in the output, and all spaces are changed to underscores, so a table stays consistent when pasted in excel:

heap             alloc_reason            #chunks       total_size   %
sga_heap(1,0)    perm___________              19        182947456  72
sga_heap(1,0)    _______________              43         14488960   5
sga_heap(1,0)    SO_private_sga_              18         14284168   5
sga_heap(1,0)    KGLHD__________            5787          4318400   1
sga_heap(1,0)    KSFD_SGA_I/O_b_               1          4190416   1

4. New table which shows memory by type
Another way of looking at memory in a heap is by grouping it by type. This allows you to very quickly see if a certain type of chunk is dominating a heap:

heap                        type    #chunks         min_size         max_size       total_size   %
--------------------------------------------------------------------------------------------------
top call heap               free          7            16408            65512           390064  99
top call heap           recreate          2              992              992             1984   0
top call heap               perm          2              120              904             1024   0

This is an overview of the top call heap of a session that is not active, and therefore most of it should be empty, which is true for this dump.

Once again, get the awk script here: https://gitlab.com/FritsHoogland/oracle_memory_analyze/blob/master/heap_analyze.awk

This post is an introduction to pinatrace annotate version 2, which is a tool to annotate the output of the Intel Pin tools ‘pinatrace’ tool.

The pinatrace tool generates a file with every single memory access of a process. Please realise what this means: this is every single read from main memory or write to main memory from the CPU. This allows you to get an understanding what happens within a C function. This means you can determine what information or data is accessed in what function. Needless to say this is a tool for internals investigations and research, not something for normal daily database maintenance and support. Also, the performance of the process that you attached to is severely impacted, and it can only be turned off by stopping the process. Do not use this on a production database, use this at your own risk for research and investigational purposes only.

The general way to use the pin tool is to run something together with pin, for example ‘ls’:

$ ./pin -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -- ls

This generates an output file that looks like this:

#
# Memory Access Trace Generated By Pin
#
0x00007f49de60f173: W 0x00007ffd160c46f8  8     0x7f49de60f178
0x00007f49de60f840: W 0x00007ffd160c46f0  8                  0
0x00007f49de60f844: W 0x00007ffd160c46e8  8                  0
0x00007f49de60f846: W 0x00007ffd160c46e0  8                  0
0x00007f49de60f848: W 0x00007ffd160c46d8  8                  0
0x00007f49de60f84a: W 0x00007ffd160c46d0  8                  0
0x00007f49de60f84f: W 0x00007ffd160c46c8  8                  0
0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe
0x00007f49de60f866: W 0x00007f49de82fbd8  8      0x5f25f53b3dc
0x00007f49de60f877: R 0x00007f49de82ff90  8           0x221e00
0x00007f49de60f881: W 0x00007f49de8309a8  8     0x7f49de82fe00
0x00007f49de60f888: W 0x00007f49de830998  8     0x7f49de60e000

This on itself is not very helpful. Let me explain what you are seeing: the first column is the instruction pointer address, in other words: the place in the executable in which the memory transfer was initiated, ‘W’ and ‘R’ are self explanatory, the third column is the memory address to which the read or write is addressed, the fourth column is the size of the memory IO, and the fifth is the value.

Luckily, there is another way to run pin, which is attaching to an already running process. This is what is helpful in using the memory access trace with an Oracle process. The way this is done, is quite simple; first get the linux process number, then run pin specifying the process number:

$ ps -ef | grep [L]OC
oracle    7250  7249  0 09:42 ?        00:00:00 oracletest (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ./pin -pid 7250 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

There is an important caveat here: because pin inserts itself into the running process, the pin library (source/tools/SimpleExamples/obj-intel64/pinatrace.so in this case) must be available to this process. So if you downloaded, extracted and linked pin tools as root, and then as root try to attach to an oracle process, it will not work because the oracle process can not reach the library. So: extract and link the pin tools as oracle.

The output of the pinatrace (a file called pinatrace.out by default) is in $ORACLE_HOME/dbs when you attach to an Oracle database process.

Back to the pinatrace output: what if we could see the C function that was at the instruction pointer address, the name of the Oracle memory area for the memory address and for the fifth column the name of the Oracle memory area again in case the value is actually a pointer? That is exactly what my pinatrace annotate oracle tool does!

How to use it: first clone the pinatrace_annotate repository:

git clone https://gitlab.com/FritsHoogland/pinatrace_annotate.git

Then set the O_BINARY to your oracle binary at line 6 of pinatrace_annotate_oracle.sh.

The next thing is to generate the oracle memory area CSV files. I added a script to generate the memory ranges for the PGA too, however you should be aware this is very fluent, and it requires an execution of the oracle PID you point it to before the v$process_memory_detail gets populated. Sometimes I skip the PGA detail step because it doesn’t add too much value.
To run the scripts for generating the memory area CSV files:

$ sqlplus / as sysdba
...
SQL> @get_memory_ranges
...
	76 14,20,@1	   8645 		    SYS 			   sqlplus@memory-presentation.local (TNS V1-V3)
...
enter pid number to get pga details: 76
old   1: alter session set events 'immediate trace name pga_detail_get level &pid'
new   1: alter session set events 'immediate trace name pga_detail_get level 76'

Session altered.

execute something (eg. select * from dual) in the target process, then press enter

Now head over to the session you want the PGA memory details from, and execute something like ‘select * from dual’, and press enter in the sqlplus session in which you ran get_memory_ranges.sql. This will then run for a while, it generates pipe-delimited files that sqlite can parse.

After the CSV files have been generated, create the sqlite database for use by the tool:

$ ./pinatrace_annotate_oracle.sh -c
generate symbols from /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle.
create and populate table symbol_ranges with symbols.
create index.
create table memory_ranges.
import memory_ranges.csv
import memory_ranges_xtables.csv
import memory_ranges_pga.csv
create index.

Now the pinatrace_annotate_oracle.sh tool is all set for usage!

Example usage
Let’s use an actual example to see where the memory trace provides benefit: latches. Nowadays, latches are not the issue these were in the oracle 8 days. In these days, the latches protecting the database buffers double linked lists, the ‘cache buffers chains’ latches, were not gotten in shared mode. That meant especially index root blocks could get latch contention if there was much concurrency.

Some great work has been done on the internals of latches by Andrey Nikolaev; see his latch internals presentation. His work is done on Solaris with Dtrace, I did some of that on linux (part 1, part 2 and part 3).

Let’s replay a willing to wait latch get and see how that looks like in the annotated pinatrace output! Setup the host with the pin tools (link it to get the pinatrace.so library), and clone pinatrace_annotate and configure it to point to the oracle binary you are using.

1. session 1:

$ sqlplus / as sysdba
SQL> select distinct sid from v$mystat;

2. session 2:

$ sqlplus / as sysdba
SQL> @get_memory_ranges

identify session 1, and fill out the pid.
take note of the O/S pid.
it will ask you to execute something in the session for which the pid was entered.

3. session 1:

SQL> select * from dual;

4. session 2:
press enter, and wait for @get_memory_ranges to finish.

SQL> exit
$ ./pinatrace_annotate_oracle.sh -c
$ sqlplus / as sysdba
SQL> select addr from v$latch where name = 'cache table scan latch';

note the latch address, it’s 0x6004A400 in my case.

SQL> oradebug setmypid
SQL> oradebug call kslgetl 0x6004A400 0 0 2442

result: ‘Function returned 1’ indicating a successful latch get.

5. session 1:

SQL> oradebug setmypid

6. session 3:
go to the directory where pin is extracted and linked, and go to the directory where the pin executable is located. Execute the following and change the pid for the O/S pid obtained in step 2.

$ ./pin -pid 8645 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -o pina_latch.txt

now the memory trace is enabled for pid 8645

7. session 1:

SQL> oradebug call kslgetl 0x6004A400 1 0 2442

this will hang, it is waiting for the latch which is taken by session 2.

8. session 2:

SQL> oradebug call kslfre 0x6004A400

this will return ‘Function returned 0’ indicating it freed the latch.

9. session 1:
the oradebug call will now return ‘Function returned 1’ indicating it has gotten the latch.

10. session 3:

$ cp /u01/app/oracle/product/12.2.0.1/dbhome_1/dbs/pina_latch.txt ~/pinatrace_annotate/pina_latch.txt

as indicated, the pinatrace output file will be in $ORACLE_HOME/dbs, and with this I copy the current state of the trace to the pinatrace_annotate directory where my pinatrace annotate tool is cloned.

$ cd ~/pinatrace_annotate
$ ./pinatrace_annotate_oracle.sh pina_latch.txt > pina_latch_annotated.txt

This will run for a while…

Once pinatrace_annotate_oracle.sh is done, open the pina_latch_annotated.txt file. Short guide on how this looks like; the first two lines in my file are:

function+offset:memory address(annotation+offset):Read/Write:size:value hex/dec(pointer annotation+offset)
??:0x00007fff224e6498():R:8:0x11060785/285607813()

The first line is the header, which is displayed once, and essentially explains it all. The second line starts with ‘??’. As indicated by the header, the first field is the function. Any call to a function (‘symbol’) that is not in the oracle executable, like functions in shared libraries, do not have an address in the executable. In fact, the place at which the shared libraries are loaded, and thus their functions are located is randomised by default, which explains why the executable does not have the address.

The second column is the address memory is read from or written into. In the second line, there is nothing in between the parentheses, meaning this address is not in a known memory range.

The ‘R’ or ‘W’ is obvious, and the number is the amount of memory read or written too.

The fifth field shows the content of the read or write in hexadecimal and in decimal. Here again there are parenthesis which are empty. I look up the value in the memory area’s, and put the memory area, if the value is in a known memory area, as annotation including offset. What you should realise is that I don’t know if a value is in fact a memory area and thus a pointer, or that the value coincidentally the same as a known memory area. I do think understanding if the value is a known memory area is handy and useful, in the C language pointers are heavily used.

Another example:

nsbasic_brc+548:0x00007f33d3026690(pga|Other, Fixed Uga, Fixed UGA heap+88712 ):R:8:0x7f33d3019600/139860560156160(pga|Other, Fixed Uga, Fixed UGA heap+3532
0 )

Here we see an example where the function is a function in the oracle executable, nsbasic_brc, which at offset 548 of the function read 8 bytes from memory address 0x00007f33d3026690, which was a known memory area at the time of running 0_get_pga_detail.sql when ‘select * from dual’ was run, which was in the PGA, in the Fixed UGA heap at offset 88712, and the value it read was 0x7f33d3019600, which is a memory address in the same chunk of memory at offset 3532, and therefore likely to be a pointer.

Now let’s look further into the annotated pinatracefile. If you search for ‘kslgetl’, you will find the actual latch get executed with oradebug. This is how that looks like on my system:

kslgetl+2:0x00007fff224e4ec0():W:8:0x7fff224e4ff0/140733768945648()
kslgetl+13:0x00007fff224e4ea0():W:8:0x7fff224e5870/140733768947824()
kslgetl+17:0x00007f33d8095f40(pga|Other, free memory, top call heap+243776 ):R:8:0x7f33d8095f40/139860644552512(pga|Other, free memory, top call heap+243776 )
kslgetl+26:0x00007fff224e4e98():W:8:0x1/1()
kslgetl+33:0x00007fff224e4e90():W:8:0/0()
kslgetl+37:0x00007fff224e4ea8():W:8:0x4/4()
kslgetl+44:0x00007f33d8087e78(pga|Other, free memory, top call heap+186232 ):R:8:0x7a63d268/2053362280(shared pool|permanent memor,duration 1,cls perm+5702248 )
kslgetl+58:0x00007fff224e4eb0():W:8:0x7fff224e57d0/140733768947664()
kslgetl+65:0x00007fff224e4e60():W:8:0/0()
kslgetl+69:0x000000006004a40f(fixed sga|(parent)latch:cache table scan latch+15 fixed sga|var:kcb_table_scan_latch_+15 ):R:1:0x37/55()
kslgetl+80:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):R:8:0/0()
kslgetl+96:0x000000006004a410(fixed sga|(parent)latch:cache table scan latch+16 fixed sga|var:kcb_table_scan_latch_+16 ):R:4:0x1/1()
kslgetl+103:0x000000007a63d5f0(shared pool|X$KSUPR.KSLLALOW+0 shared pool|permanent memor,duration 1,cls perm+5703152 ):R:4:0/0()
kslgetl+109:0x0000000011141524():R:4:0x1cffe/118782()
kslgetl+122:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0x6004a400/1610916864(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 )
kslgetl+131:0x00007f33d8087ea8(pga|Other, free memory, top call heap+186280 ):R:8:0x4b/75()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):W:8:0x22/34()
kslgetl+1898:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0/0()
kslgetl+1909:0x00000000600119b4(fixed sga|var:kcbccc_+0 ):R:1:0/0()
kslgetl+1938:0x00007fff224e4e60():R:8:0/0()
kslgetl+1942:0x00007fff224e4e38():W:8:0x10a6084b/279316555()

The two highlighted lines show that the function read the beginning of the latch memory and found a number, indicating the oracle process id that has the latch taken already. In other words, this means the process found a latch that has already been taken by another process, oracle pid 34.

If you scroll down further you see kslgetl called kslges, which is how a willing to wait latch works if the latch it wants to take is already taken, and then further down, you will see this pattern:

kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()

This is what is the spinning for the latch looks like from a memory access perspective. So that means that spinning on a latch does not mean the entire latching function (kslgetl or kslges) is called to try to take the latch, but rather when kslgetl finds the latch to be taken, it calls kslges, which does some housekeeping, and then in a really tight loop reads the latch’ memory at offset 0 to detect when it is freed. We can see it is a loop by looking at the offset numbers of the function, these are 930, 937 and 944, which are the offsets we see over and over.

In fact, if I grep for that line and count the number of lines, we can see this version of oracle (12.2.0.1) still does spin 20’000 times for a latch in willing to wait mode:

grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | wc -l
20001

Yes, this shows 20001 executions, remember we freed the latch in the other session?

$ grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | tail -3
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0/0()

The last line shows kslges finding the latch in a free state, which is after it has been woken from sleeping on a semaphore.

A modern version of Oracle doesn’t sleep for an arbitrary time and then spin again, it goes to sleep on a semaphore after it spun 20’000 times and then waits until it gets posted. That can be seen in the pinatrace output too, but if you want to look at the function flow, it might be better to generate a debugtrace trace, the pinatrace trace shows what information is used within a function, and might be too verbose. In fact, the function flow towards the semaphore call is kslges > skgpwwait > sskgpwwait.

Conclusion
pinatrace_annotate_oracle.sh is a tool to annotate a tracefile generated by the pin tool ‘pinatrace’ when running the oracle binary. It shows information about any memory access done by a process, which allows you to understand what information is obtained in what function of Oracle. This is a tool only for research and debug purposes.

Prometheus is an open source systems monitoring and alerting toolkit originally build at Soundcloud. This blogpost shows how to install the needed components to do visualisation of linux system statistics via Grafana.

Addition June 29, 2018: a really quick simple install is provided in this blogpost: very quick install of prometheus, node exporter and grafana This uses an ansible script that does most of the installation and configuration for you.

The setup consists of 3 components:
node_exporter, an exporter of system and hardware metrics.
prometheus, a metric collection and persistence layer.
grafana, the visualisation layer.

1. Preparation
The needed components are installed in the home directory of the user ‘prometheus’. In order for that user exist, it must obviously first be created:

# useradd prometheus
# su - prometheus
$

This installation guide uses Oracle Linux 7.3, but should work for RHEL or Centos too.

2. Node exporter
The next thing to do is install the node exporter. Please mind new version do come out, so you might want to verify the latest release on

$ curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz"
$ mkdir -p Prometheus/node_exporter
$ cd $_
$ tar xzf ../../node_exporter-0.14.0.linux-amd64.tar.gz

Now become root and create a unit file to automatically startup the node exporter using systemd:

# echo "[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

[Install]
WantedBy=default.target" > /etc/systemd/system/node_exporter.service

And make systemd start the node exporter:

# systemctl daemon-reload
# systemctl enable node_exporter.service
# systemctl start node_exporter.service

Next you can verify if the node exporter is running by using ‘systemctl status node_exporter.service:

# systemctl status node_exporter.service
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:20:54 UTC; 7s ago
 Main PID: 3017 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─3017 /home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - hwmon" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - infiniband" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - textfile" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - conntrack" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - diskstats" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - entropy" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - loadavg" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - sockstat" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - wifi" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg="Listening on :9100" source="node_exporter.go:186"

Additionally, you can go to hostname:9100, and look if that page says ‘node exporter’, and has a link called ‘metric’, which has all the metrics.

3. Prometheus
After we installed node_exporter to provide measurements, we must install the software that can fetch that information and store it. That is what prometheus does. First, become the prometheus user again, and install prometheus. Here too is important to realise that newer versions will come out after this article has been written:

# su - prometheus
$ curl -LO "https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz"
$ cd Prometheus
$ tar xzf ../prometheus-1.7.1.linux-amd64.tar.gz
$ cd prometheus-1.7.1.linux-amd64
$ echo "scrape_configs:

  - job_name: 'prometheus'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9100']"> prometheus.yml

This downloaded and unzipped prometheus, and created prometheus scrape config to fetch data from prometheus itself and the node exporter. Now become root, and install the systemd unit file for prometheus:

# echo "[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus.yml -storage.local.path=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/data

[Install]
WantedBy=multi-user.target" > /etc/systemd/system/prometheus.service

And make systemd start prometheus:

# systemctl daemon-reload
# systemctl enable prometheus.service
# systemctl start prometheus.service

And verify prometheus is running:

# systemctl status prometheus.service
● prometheus.service - Prometheus Server
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:36:55 UTC; 9s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 22656 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─22656 /home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1....

Jul 31 15:36:55 test.local systemd[1]: Started Prometheus Server.
Jul 31 15:36:55 test.local systemd[1]: Starting Prometheus Server...
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting prometheus (version=1.7.1, branch=mast...n.go:88"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Build context (go=go1.8.3, user=root@0aa1b7fc43...n.go:89"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Host details (Linux 3.10.0-514.26.2.el7.x86_64 ...n.go:90"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading configuration file /home/prometheus/Pro....go:252"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading series map and head chunks..." source="....go:428"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="0 series loaded." source="storage.go:439"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Listening on :9090" source="web.go:259"
Hint: Some lines were ellipsized, use -l to show in full.

Additionally you can go to hostname:9090/targets and verify both node_exporter and prometheus report state=UP.

At this point, system metrics are fetched and stored. All we need to do, is visualise it. An excellent tool for doing so is grafana. This is how grafana is installed:

4. Grafana
This webpage shows installation instructions and a link to the latest version. During the time of writing of this blogpost, the latest version was 4.1.1. This is how grafana is installed: (please mind installation and systemd require root privileges)

# yum install https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.4.1-1.x86_64.rpm

Next up make systemd handle grafana and start it:

# systemctl daemon-reload
# systemctl enable grafana-server.service
# systemctl start grafana-server.service

And check if grafana is running:

# systemctl status grafana-server.service
● grafana-server.service - Grafana instance
   Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:43:11 UTC; 1min 58s ago
     Docs: http://docs.grafana.org
 Main PID: 22788 (grafana-server)
   CGroup: /system.slice/grafana-server.service
           └─22788 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile= cfg:default.paths.logs=/var/log/grafana cfg:default.path...

Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Starting plugin search" logger=plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=warn msg="Plugin dir does not exist" logger=plugins dir=/...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Plugin dir created" logger=plugins dir=/var/lib...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Alerting" logger=alerting.engine
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing CleanUpService" logger=cleanup
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Stream Manager"
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing HTTP Server" logger=http.server ad...socket=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Hint: Some lines were ellipsized, use -l to show in full.

5. Grafana configuration
Next, we need to hook up grafana with prometheus. First, go to hostname:3000.
– Login with admin/admin
– Click ‘add datasource’
– Name: prometheus, Type: Prometheus
– Http settings: http://localhost:9090, select Access: ‘proxy’.
– Click ‘save and test’. This should result in ‘success’ and ‘datasource updated.’

Now click on the grafana symbol in the left upper corner, dashboards, import. Enter ‘2747’ at ‘grafana.com dashboard’. This will say ‘Linux memory’, select the prometheus datasource which you just defined, and click import.

This should result in a dashboard the shows you the linux memory area’s (click on the picture to get a better view!):

In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin, is profile memory access. Profiling memory access using the pin tool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is a pid of an oracle database foreground process. Now execute something in the session you attached pin to and you find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is how the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the function location, the second field is R or W (reading or writing obviously), the third field is the memory location read or written the fourth field is the amount of bits read and the fifth field is prefetched memory.

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of that is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we can change the functions addresses to function names, and the memory addresses to named memory locations whenever possible?

That’s where I created the pinatrace annotate oracle tool for. This is a little scriptset that contains scripts to generate memory information from the instance, after which the instruction pointer addresses and memory locations of a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (‘0x00007fff6fd2ac60’) are probably PGA related, and the numbers between roughly 0x000000006000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, now you can see the reason I picked a seemingly arbitrary range of lines actually was because that range is the memory accesses of the ksl_get_shared_latch function. This annotated version show a shared latch get for the ‘session idle bit’ latch. It’s also visible the function uses PGA memory, some of it annotated, some of it not, and that most of the shared pool access is for the latch (a latch essentially is a memory range with the function of serialising access to a resource), which is in the shared pool because it’s a child latch. It’s also visible memory belonging to X$KSUPR is read and written (X$KSUPR is the table responsible for V$PROCESS, the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient of nature. Because a PGA memory snapshot is made at a certain point in time, this snapshot represents the memory layout of that moment, which has a high probability of having memory deallocated and freed to the operating system. A lot of the SGA/shared pool allocations on the other hand have the intention of re-usability, and thus are not freed immediately after usage, which gives the SGA memory snapshot a good chance of capturing a lot of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell, it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script outputs to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt) or if you want to look and redirect to a file: | tee file.txt.

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

There are many posts about the amount of memory that is taken by the Oracle database executables and the database SGA and PGA. The reason for adding yet another one on this topic is a question I recently gotten, and the complexities which surrounds memory usage on modern systems. The intention for this blogpost is to show a tiny bit about page sharing of linux for private pages, then move on to shared pages, and discuss how page allocation looks like with Oracle ASMM (sga_target or manual memory).

The version of linux in this blogpost is Oracle Linux 7.2, using kernel: 4.1.12-37.6.3.el7uek.x86_64 (UEK4)
The version of the Oracle database software is 12.1.0.2.160719 (july 2016).

Memory usage of virtual memory systems is complicated. For that reason I see a lot of people getting very confused about this topic. Let me state a very simple rule: the memory actively being used on a system should fit in physical memory. Swap (a file or partition), increases total virtual memory, but really only is a safety net for saving your system from an out of memory situation at the cost of moving pages from and to disk. Because modern linux kernels have swappiness (willingness to swap) to a non-zero value, it’s not uncommon to have some swap being used, despite physical memory not being oversubscribed. A system stops performing as soon as paging in and out starts to occur, and for that reason should not happen.

1. Private pages for linux executables
When an executable is executed on linux from the shell, the shell executes a fork() call to create a new process, which is implemented as a clone() system call on linux. Using the clone() system call, the virtual memory space of the newly created process is shared (readonly) with it’s parent. This includes the private allocations! Once the child process needs to write in it’s memory space, it will page fault and create it’s own version, abandoning the version of its parent.

Can we actually prove this is happening? Yes, the /proc/ filesystem gives an insight to a process’ virtual memory space.
Let’s start off with a very simple example: we execute ‘cat /proc/self/maps’ to see our own address space:

[oracle@oracle-linux ~]$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 fb:00 201666243                          /usr/bin/cat
00e41000-00e62000 rw-p 00000000 00:00 0                                  [heap]
7f69729be000-7f6978ee5000 r--p 00000000 fb:00 576065                     /usr/lib/locale/locale-archive
7f6978ee5000-7f6979099000 r-xp 00000000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979099000-7f6979298000 ---p 001b4000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979298000-7f697929c000 r--p 001b3000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929c000-7f697929e000 rw-p 001b7000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929e000-7f69792a3000 rw-p 00000000 00:00 0
7f69792a3000-7f69792c4000 r-xp 00000000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794b9000-7f69794bc000 rw-p 00000000 00:00 0
7f69794c3000-7f69794c4000 rw-p 00000000 00:00 0
7f69794c4000-7f69794c5000 r--p 00021000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c5000-7f69794c6000 rw-p 00022000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0
7ffdab1c7000-7ffdab1e8000 rw-p 00000000 00:00 0                          [stack]
7ffdab1ea000-7ffdab1ec000 r--p 00000000 00:00 0                          [vvar]
7ffdab1ec000-7ffdab1ee000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Here’s a lot to see, but we see the cat executable at 0x00400000. The reason for three memory allocations are (linux/ELF) executables uses different sections with specific functions. A full overview of these can be obtained using the readelf executable. A simpler overview of an executable, which matches the above three memory allocations for the cat executable can be obtained using ‘size -B’ (the size executable, -B means ‘berkeley style’, which is default):

[oracle@oracle-linux ~]$ size -B /usr/bin/cat
   text	   data	    bss	    dec	    hex	filename
  43905	   1712	   2440	  48057	   bbb9	/usr/bin/cat

This describes the three memory sections an linux executable can have: text (the machine instructions, alias ‘the program’), data (all initialised variables declared in the program) and BSS (uninitialised data).
The first section always is the text allocation (not sure if it’s impossible to have the text section not being the first allocation, I have never seen it different). If you look at the memory flags, ‘r-xp’, this totally makes sense: ‘r-‘ meaning: read(only, followed by a’-‘ instead of a ‘w’), ‘x’: executable and ‘p’: this is a private allocation. The next allocation is the data section. We don’t execute variables, we read them, which is reflected in the flags: ‘r–p’. But what if we change the value of a variable? That is where the third section is for: changed values of initialised variables. This can be seen from the flag of this section: ‘rw-p’, read, write and private. The fourth allocation lists [heap], this is a mandatory allocation in every process’ memory space, which holds (small) memory allocations, this is NOT the BSS section. In this case, the BSS section does not seem to be allocated.

By having memory allocations for /usr/lib64/ld-2.17.so we can see this is a dynamically linked executable. You can also see this by executing ‘file’ on the executable:

[oracle@oracle-linux ~]$ file /usr/bin/cat
/usr/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=3207edc47638918ceaeede21947a20a4a496cf63, stripped

If a linux executable is dynamically linked, you can see the libraries that are loaded by the dynamic linker/loader using the ldd utility:

[oracle@oracle-linux ~]$ ldd /usr/bin/cat
       	linux-vdso.so.1 =>  (0x00007ffceb3e4000)
       	libc.so.6 => /lib64/libc.so.6 (0x00007fd46fb7e000)
       	/lib64/ld-linux-x86-64.so.2 (0x000055d5253c9000)

This output shows the dynamic loader (/lib64/ld-linux-x86-64.so.2), and two libraries the dynamic loader loads: libc.so.6 and linux-vdso.so.1. The first one, libc, is the standard C library. The second one, linux-vdso is for virtual dynamic shared object, which is an optimisation for certain system calls to be executed in user space (notably gettimeofday()).
The other allocations that exist in our example are anonymous mappings (usually done by programs using the mmap() call):

7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0

And some allocations for system purposes, like stack, var, vdso and vsyscall.

Now that you have become familiar with some basic linux memory address space specifics, let’s take it a little further. It’s possible to see more about the memory segments using the proc filesystem smaps file:

[oracle@oracle-linux ~]$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
Size:                 44 kB
Rss:                  44 kB
Pss:                  44 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           44 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
Size:                  4 kB
Rss:                   4 kB
...etc...

Per allocation there are a lot of properties to be seen. ‘Size’ is the full size, ‘Rss’ is the resident set size, alias the amount of data of this segment that is truly resident for this process in it’s address space. ‘Pss’ is fairly unknown, and is the proportional size of this segment. The way it is proportional is that if pages in this allocation are shared with other processes, the size of these pages are divided by the number processes it is shared with. In this case, we have loaded the text segment of the cat executable into the process’ address space, which all is resident (size and rss are the same) and it’s not shared with any process (rss equals pss). There are many more properties, but these are out of scope for this blogpost.

Now let’s move on to Oracle. If you look at the maps output of the pmon process for example, you’ll see:

[oracle@oracle-linux 14153]$ cat maps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8f000-10de8000 rw-p 1058f000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10de8000-10e19000 rw-p 00000000 00:00 0
1190f000-11930000 rw-p 00000000 00:00 0                                  [heap]
...

Here we see the Oracle executable, with a text segment, a readonly data segment and a read/write data segment, and we see an anonymous mapping directly following the data segments. That’s the BSS segment!
However, what is more interesting to see, is the properties of the distinct memory allocations in smaps:

[oracle@oracle-linux 14153]$ cat smaps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:             267704 kB
Rss:               40584 kB
Pss:                 819 kB
Shared_Clean:      40584 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:        40584 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:                136 kB
Rss:                 124 kB
...

If we look at the text segment for the oracle binary, we see the total text size is 267704 kB (size), but resident (truly available for this process in its address space) is only 40584 kB (rss), and because the oracle executable’s text segment is shared with a lot of processes, the proportional size is only 819 kB (pss).

If you want to understand how much memory is taken in the system, the size is telling the total size of the segment, but it doesn’t say anything on true memory usage. The rss size tells the amount of pages for the segment that is paged in to the address space of every process, and can (and is, for oracle) different for every process. The pss size is the proportional size for every process. Probably the only way to tell the true amount of memory taken by executables and libraries is to add up all the pss sizes. Any other value only tells something about the process’ point of view on memory usage, but not overall, true consumed space because that would lead to counting too much.

This is different for anonymous allocations. Since anonymous allocations are created when a process is run, I’ve only seen them initialised purely private. For that reason rss and pss sizes are equal, because every process initialises it strictly for itself. This too works in a lazy allocation way. When memory is allocated, the size is defined, but is only really allocated once it’s truly used, which is expressed by a difference between size and rss.

2. shared pages
The Oracle databases relies on shared caches and data structures, which are put into what is called the SGA, the system global area. The main components of the SGA are the shared pool (shared structures), log buffer (change vectors to be written to disk to persist changes) and the buffer cache, amongst others. With any memory management option (manual management, ASMM (automatic shared memory management, sga_target) and AMM (automatic memory management, memory_target)) there is a SGA. Depending on the memory option, these are visible in a different way.

When manual memory or ASMM is used, shared memory is allocated as system V shared memory. The ‘classic’ way of looking at system V shared memory is using ipcs -m (m is for shared memory, you can also use s for semaphores and q for message queues):

[oracle@oracle-linux ~]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 655360     oracle     600        2932736    124
0x00000000 688129     oracle     600        905969664  62
0x00000000 720898     oracle     600        139673600  62
0x5f921964 753667     oracle     600        20480      62

Please mind that if you have more than one instance active, or an ASM instance active, you will see more shared memory allocations.
Apparently, the oracle database allocates a couple of shared memory segments. If you want to understand what these memory allocations are for, you can use the oradebug ipc command to see what their functions are:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
IPC information written to the trace file

This generates a trace file in the ‘trace’ directory in the diagnostics destination. Here is how this looks like (partial output with content of interest to this blogpost):

 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002cbe70 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2   655360 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cc000 00000000002cc000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000036000000 Minimum Subarea size 00400000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0   688129 0x00000060400000 0x00000060400000 0x00000060400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000036000000 0000000036000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1   720898 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008534000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000005000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3   753667 0x0000009ec00000 0x0000009ec00000 0x0000009ec00000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000005000 0000000000005000 default       readwrite

The first allocation is ‘fixed size’, alias the fixed SGA, the second allocation is the ‘variable size’, which contains the shared pool and the buffercache, the third allocation is the ‘redo buffers’ and the fourth is the ‘skgm overhead’ alias the index into the shared memory structures for this instance.

Because any memory allocation is visible in maps and smaps, this method can be used for shared memory too, to see how the shared memory segments are mapped into the process address space. All oracle database server processes have the shared memory segments for the instance mapped into their address space. The usage is different per process, so the amount of shared memory paged into the address space will be different:

...
12bcd000-12bee000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:05 688129                             /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:05 720898                             /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:05 753667                             /SYSV5f921964 (deleted)
7f473004e000-7f47301d4000 r-xp 00000000 fb:02 212635773                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
...

Shared memory is easily identified by the ‘s’, at which “normal” private memory mappings have ‘p’. If you want to know more about the process’ perspective of the shared memory, we can use smaps, just like with private memory mappings (virtual memory space of pmon):

60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd sh mr mw me ms sd
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
Size:               2860 kB
Rss:                 392 kB
Pss:                  36 kB
Shared_Clean:          0 kB
Shared_Dirty:        372 kB
Private_Clean:         0 kB
Private_Dirty:        20 kB
Referenced:          392 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms sd

These two shared memory segments are belonging to the fixed sga. The reason for two segments is the first page (0x1000 equals 4096, alias a single linux page) is readonly (r–s). The other fixed SGA segment is read write (rw-s). Here we see that from the process’ perspective it really doesn’t matter much if a piece of mapped memory is shared or private; it’s exactly handled the same way, which means the full segment is mapped into the process’ virtual memory space, but only once pages are touched (alias truly used), the process registers the address in its pagetable, and the pages become resident (as can be seen in the difference between the total size and the rss). The sole purpose of shared memory is it is shared between process. That the pages are shared is very well visible with the difference between rss and pss size. Its also easy to spot this shared memory segment is created from small pages; MMUPageSize and KernelPageSize is 4kB.

However, this yields an interesting question: shared memory does not belong to any single process. Does that mean that if a shared memory segment is created, it is truly allocated, or can shared memory be lazy allocated as well? Please mind that above statistics are the process’ perspective, not the kernel’s perspective.

One way to see the state of shared memory system wide, is using the ‘-u’ flag with the ipcs command:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  255684
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This is a really useful view! What we can see from the output from this command, is that nearly all pages allocated as shared memory are resident. By having statistics for shared memory pages allocated and resident we can conclude that shared memory too could be allocated in a lazy, alias on demand. Also, there is a difference between resident and allocated, which indicates lazy allocation too.

Inside the database I am aware of two parameters that could influence shared pages usage; pre_page_sga and _touch_sga_pages_during_allocation, see my article on these. However, what is interesting, is that these parameters are different for the instance I am testing with for this blogpost, which is running on a VM:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: page
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%page%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME   						   VALUE       								  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
olap_page_pool_size    				   0   									  TRUE 	   FALSE      FALSE
pre_page_sga   					   TRUE        								  TRUE 	   FALSE      FALSE
use_large_pages        				   TRUE        								  TRUE 	   FALSE      FALSE
_max_largepage_alloc_time_secs 			   10  									  TRUE 	   FALSE      FALSE
_olap_page_pool_expand_rate    			   20  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hi     				   50  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hit_target     			   100 									  TRUE 	   FALSE      FALSE
_olap_page_pool_low    				   262144      								  TRUE 	   FALSE      FALSE
_olap_page_pool_pressure       			   90  									  TRUE 	   FALSE      FALSE
_olap_page_pool_shrink_rate    			   50  									  TRUE 	   FALSE      FALSE
_realfree_heap_pagesize        			   65536       								  TRUE 	   FALSE      FALSE
_realfree_pq_heap_pagesize     			   65536       								  TRUE 	   FALSE      FALSE
_session_page_extent   				   2048        								  TRUE 	   FALSE      FALSE
_touch_sga_pages_during_allocation     		   FALSE       								  TRUE 	   FALSE      FALSE

14 rows selected.

In the database I created on my VM, pre_page_sga equals to TRUE and _touch_sga_pages_during_allocation to FALSE, which is the exact inverse of the settings of a database (PSU 160419) on a huge machine. Perhaps these parameters are dynamically set based on size of the SGA and logic (if _touch_sga_pages_during_allocation is TRUE, it makes sense to set pre_page_sga to FALSE, as it’s function has been performed by the bequeathing session.

However, having pre_page_sga set to TRUE it makes sense almost all SGA (shared) pages are allocated, because pre_page_sga (at least in Oracle 12, not sure about earlier versions, because the Oracle description of this parameter is different from what happens in Oracle 12) spawns a background process (sa00) that scans SGA pages, which means it pages them, resulting in the actual allocation. Let’s test this by setting pre_page_sga to false, it should lead to way lesser shared memory pages allocated, which will eventually be allocated as database processes are paging them in:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force;

And then look at ipcs -mu again:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  92696
pages swapped   0
Swap performance: 0 attempts   	 0 successes

As expected, only the bare necessary pages are resident after startup force, all the other shared pages will be slowly paged in as foreground and background processes touching SGA pages during execution.

How would that work when we set sga_max_size to a different value than sga_target? If the pages beyond the sga_target are never allocated, you could control the amount of SGA pages used by setting sga_target, but ‘reserve’ extra memory to use by setting sga_max_size higher, which is never allocated, so it is not wasted. Let’s setup the instance:

SQL> alter system set pre_page_sga=true scope=spfile;
SQL> show spparameter sga_target

SID    	 NAME  			       TYPE    	   VALUE
-------- ----------------------------- ----------- ----------------------------
*      	 sga_target    		       big integer 1000M
SQL> ! ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  102512
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This sets the pre_page_sga parameter from the spfile, which means the instance will spawn a process to touch SGA pages on next startup.
Currently, the sga_target for sizing the SGA is set to 1000M in the spfile.
ipcs tells us we got 256005 pages are allocated, which makes sense: 256005*4=1024020k, which is slightly more than the set 1000M, which means essentially sga_target equals pages allocated.

SQL> alter system set sga_max_size=2g scope=spfile;
SQL> startup force;
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

This sets sga_max_size to double the amount of sga_target, and ‘startup force’ bounces the instance.

SQL> show parameter sga_target

NAME   				     TYPE      	 VALUE
------------------------------------ ----------- ------------------------------
sga_target     			     big integer 1008M

Here we see the actual parameter in the database is set to 1008M. Now let’s look at the ipcs -mu values again:

> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  521923
pages swapped   0
Swap performance: 0 attempts   	 0 successes

521923*4=2087692. So (almost) all the memory set for sga_max_size is allocated. In fact, if you look at the values at instance startup values reported above, you see ‘Total System Global Area’ showing the 2G, it’s all SGA, so it’s all touched because of pre_page_sga being set to TRUE. So the next test would be to have pre_page_sga being set to FALSE:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

All memory is still declared SGA, as we can see. However, by having _touch_sga_pages_during_allocation set to FALSE and pre_page_sga set to FALSE, we should see only the actual used SGA pages being allocated:

SQL> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  91692
pages swapped   0
Swap performance: 0 attempts   	 0 successes

The above output shows the shared memory status directly after I restart my instance, so this is not only less than sga_max_size, it is even less than sga_target (91692*4=336768, ~ 336M). This will grow up to sga_target, because these pages will get paged in by the database processes.

How does this look like when we add in huge pages? In Oracle 12.1.0.2.160719 in my instance the parameter to tell oracle to allocate huge pages if there are any (‘use_large_pages’) is set to TRUE. This will make Oracle use large pages if any are available. This is true, even if there are not enough huge pages to satisfy the entire SGA; Oracle will just allocate all that can be allocated, and create a new shared memory segment using small pages for the remainder of the needed shared memory.

Sadly, it seems per memory segment statistics like rss, pss, shared and private clean and dirty, etc. are not implemented for huge pages:

[oracle@oracle-linux [testdb] ~]$ cat /proc/$(pgrep pmon)/smaps
...
61000000-d8000000 rw-s 00000000 00:0e 688129                             /SYSV00000000 (deleted)
Size:            1949696 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms de ht sd
...

This is the main shared memory segment, allocated from huge pages (as can be seen with KernelPageSize and MMUPageSize), which means it’s the segment holding the shared pool and buffercache. This can also be seen by the size: 1949696 kB, which is nearly the 2G of sga_max_size.

However, we can just use the global information on system V shared memory (ipcs -mu) and we can use the huge page information in /proc/meminfo:

[oracle@oracle-linux [testdb] ~]$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:    1100
HugePages_Free:      880
HugePages_Rsvd:      805
HugePages_Surp:        0
Hugepagesize:       2048 kB

The statistics of interest are:
hugepages_total: the total number of huge pages allocated. warning: huge pages memory allocated by the kernel is NOT available for allocation of regular sized pages (which means you can starve your processes and the kernel for normal pages by setting the number of huge pages too high).
hugepages_free: the number of huge pages which are not used currently. warning: this includes allocated but not yet initialised pages, which hugepages_rsvd shows.
hugepages_rsvd: the number of huge pages allocated but not yet initialised.
hugepages_surp: the number of huge pages allocated (truly allocated and not yet initialised) greater than the total number of huge pages set. this value can be greater than zero if the kernel setting vm.nr_overcommit_hugepages is greater than zero. The value of this setting is zero by default, and at least for usage with the Oracle database, this value should remain zero.

The same information can be obtained using ipcs -mu, but with a twist:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524803
pages resident  122881
pages swapped   0
Swap performance: 0 attempts   	 0 successes

Some of you might get the twist on this by looking at the number.
It turns out ipcs has no facility for huge pages, it just reports the number of pages as if these were 4 kB.
524803*4 (kB) / 1024 (to make it MB) = 2050.

Now going back to the goal of looking into this: I told shared memory is allocated and paged at startup time when _touch_sga_pages_during_allocation is set to TRUE (set to false as default value in my current database), and it could be explicitly paged by the background process sa00 after startup of the instance when pre_page_sga is set to TRUE. When both are set to false, shared memory allocated from default sized 4kB pages is allocated only when it’s used. In the above examples with huge pages, the tests were done with pre_page_sga set to false. This shows exactly the same ‘lazy allocation’ behaviour as 4kB pages.

When ‘extra’ memory is reserved from the operating system by setting sga_max_size to a higher value than sga_target, this will all be allocated and paged if either _touch_sga_pages_during_allocation or pre_page_sga is set to TRUE, which doesn’t make sense; if the memory is taken, you might as well use it. However, this is different if both _touch_sga_pages_during_allocation and pre_page_sga are set to false. All memory beyond sga_target up to sga_max_size is allocated, but never touched, and thus never paged in, so never truly allocated. Please mind linux itself understands this perfectly (aiming at huge pages and ‘reserved’ pages), however the system V ipc kernel settings do not; you need to set the shared memory values high enough to facilitate the total sum of sga_max_size values, not the truly used sizes as indicated by the sum of sga_target values.

The inspiration for this investigation came from a question on my blog. However, the question was about memory_target and memory_max_target and AIX. I do not have an AIX system at hand. I did not investigate the implementation of memory_target and memory_max_target on AIX. So I can’t comment on that. What I can say, is that on Linux, you really, really should use automatic shared memory management (ASMM) alias setting sga_target or setting it manually (and set huge pages!). If you are used to these memory management settings on databases not on AIX, it probably makes sense to use that on AIX too, even if the automatic memory management (AMM) alias setting memory_target is implemented brilliantly on AIX, for the sake of predictability and standardisation.

This article is written with examples taken from an (virtualised) Oracle Linux 6u6 X86_64 operating system, and Oracle database version 12.1.0.2.1. However, I think the same behaviour is true for Oracle 11 and 10 and earlier versions.

Probably most readers of this blog are aware that a “map” of mapped memory for a process exists for every process in /proc, in a pseudo file called “maps”. If I want to look at my current process’ mappings, I can simply issue:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 786125                             /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 786125                             /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0080a000-0080b000 rw-p 0000a000 fc:00 786125                             /bin/cat
01243000-01264000 rw-p 00000000 00:00 0                                  [heap]
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/ld-2.12.so
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/ld-2.12.so
345b220000-345b221000 rw-p 00020000 fc:00 276143                         /lib64/ld-2.12.so
345b221000-345b222000 rw-p 00000000 00:00 0
345b800000-345b98a000 r-xp 00000000 fc:00 276144                         /lib64/libc-2.12.so
345b98a000-345bb8a000 ---p 0018a000 fc:00 276144                         /lib64/libc-2.12.so
345bb8a000-345bb8e000 r--p 0018a000 fc:00 276144                         /lib64/libc-2.12.so
345bb8e000-345bb8f000 rw-p 0018e000 fc:00 276144                         /lib64/libc-2.12.so
345bb8f000-345bb94000 rw-p 00000000 00:00 0
7f8f69686000-7f8f6f517000 r--p 00000000 fc:00 396081                     /usr/lib/locale/locale-archive
7f8f6f517000-7f8f6f51a000 rw-p 00000000 00:00 0
7f8f6f524000-7f8f6f525000 rw-p 00000000 00:00 0
7fff2b5a5000-7fff2b5c6000 rw-p 00000000 00:00 0                          [stack]
7fff2b5fe000-7fff2b600000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

What we see, is the start and end address, the rights (rwx), absence of rights is shown with a ‘-‘, and an indication of the mapped memory region is (p)rivate or (s)hared. In this example, there are no shared memory regions. Then an offset of the mapped file, then the device (major and minor device number). In our case sometimes this is ‘fc:00’. If you wonder what device this might be:

$ echo "ibase=16; FC" | bc
252
$ ls -l /dev | egrep 252,\ *0
brw-rw---- 1 root disk    252,   0 Mar 23 14:19 dm-0
$ sudo dmsetup info /dev/dm-0
Name:              vg_oggdest-lv_root
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      252, 0
Number of targets: 2
UUID: LVM-q4nr4HQXgotaaJFaGF1nzd4eZPPTohndgz553dw6O5pTlvM0SQGLFsdp170pgHuw

So, this is a logical volume lv_root (in the volume group vg_oggdest).

Then the inode number (if a file was mapped, if anonymous memory was mapped the number 0 is shown), and then the path if a file was mapped. This is empty for anonymous mapped memory (which is memory which is added to a process using the mmap() call). Please mind there are also special regions like: [heap],[stack],[vdso] and [vsyscall].

Okay, so far I’ve shown there is a pseudo file called ‘maps’ which shows mapped memory and told a bit about the fields in the file. Now let’s move on to the actual topic of this blog: the Oracle database SGA memory, and the indicator this is deleted!

In this example I pick the maps file of the PMON process of an Oracle database. Of course the database must use system V shared memory, not shared memory in /dev/shm (which is typically what you see when Oracle’s automatic memory (AMM) feature is used). This is a snippet from the maps file of the pmon process on my server:

 cat /proc/2895/maps
00400000-1093f000 r-xp 00000000 fc:00 1326518                            /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b3e000-10dbf000 rw-p 1053e000 fc:00 1326518                            /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10dbf000-10df0000 rw-p 00000000 00:00 0
12844000-1289d000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:04 111902723                          /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:04 111902723                          /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:04 111935492                          /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:04 111968261                          /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:04 112001030                          /SYSV6ce0e164 (deleted)
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/ld-2.12.so
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/ld-2.12.so
...

If you look closely, you see the oracle executable first, with two entries, one being readonly (r-xp), the other being read-write (rw-p). The first entry is readonly because it is shared with other processes, which means that there is no need for all the processes to load the Oracle database executable in memory, it shares the executable with other process. There’s much to say about that too, which should be done in another blogpost.

After the executable there are two anonymous memory mappings, of which one is the process’ heap memory.

Then we see what this blogpost is about: there are 5 mappings which are shared (r–s and rw-s). These are the shared memory regions of the Oracle database SGA. What is very odd, is that at the end of the lines it says “(deleted)”.

Of course we all know what “deleted” means. But what does it mean in this context? Did somebody delete the memory segments? Which actually can be done with the ‘ipcrm’ command…

If you go look at the maps of other Oracle processes and other databases you will see that every database’s shared memory segment are indicated as ‘(deleted)’.

Word of warning: only execute the steps below on a test environment, do NOT do this in a production situation.

In order to understand this, the best way to see what actually is happening, is starting up the Oracle database with a process which is traced with the ‘strace’ utility with the ‘-f’ option set (follow). Together with the ‘-o’ option this will produce a (long) file with all the system calls and the arguments of the calls which happened during startup:

$ strace -f -o /tmp/oracle_startup.txt sqlplus / as sysdba

Now start up the database. Depending on your system you will notice the instance startup takes longer. This is because for every system call, strace needs to write a line in the file /tmp/oracle_start.txt. Because of this setup, stop the database as soon as it has started, on order to stop the tracing from crippling the database performance.

Now open the resulting trace file (/tmp/oracle_startup.txt) and filter it for the system calls that are relevant (calls with ‘shm’ in their name):

$ grep shm /tmp/oracle_startup.txt | less

Scroll through the output until you see a line alike ‘shmget(IPC_PRIVATE, 4096, 0600) = 130777091’:

...
4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130777091
4545  shmat(130777091, 0, 0)            = ?
4545  shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130777091, IPC_RMID, 0)    = 0
4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130809859
4545  shmat(130809859, 0, 0)            = ?
4545  shmctl(130809859, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130809859, IPC_RMID, 0)    = 0
...

What we see here is a (filtered) sequence of systems calls that could explain the status deleted of the shared memory segments. If you look up what process id is in front of these shm system calls, you will see it’s the foreground process starting up the instance. If you look closely, you’ll that there is a sequence which is repeated often:

1. shmget(IPC_PRIVATE, 4096, 0600) = 130777091
The system call shmget allocates a shared memory segment of 4 kilobyte, rights set to 600. The return value is the shared memory identifier of the requested shared memory segment.

2. shmat(130777091, 0, 0) = ?
The system call shmat attaches the a shared memory segment to the process’ address space. The first argument is the shared memory identifier, the second argument is the address to attach the segment to. If the argument is zero, like in the call above, it means the operating system is tasked with finding a suitable (non used) address. The third argument is for flags, the value zero here means no flags are used. The returncode (here indicated with a question mark) is the address at which the segment is attached. This being a question mark means strace is not able to read the address, which is a shame, because we can’t be 100% certain at which memory address this shared memory segment is mapped.

3. shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
The system call shmctl with the argument IPC_STAT has the function to read the (kernel) shared memory information of the shared memory identifier indicated by the first argument, and write it at the memory location in the third argument in a struct called shmid_ds.

4. shmdt(0x7f406f2ba000) = 0
With this system call, the shared memory segment is detached from the process’ address space. For the sake of the investigation, I assumed that the address in this call is the address which is returned by the shmat() call earlier.

5. shmctl(130777091, IPC_RMID, 0) = 0
This is another shared memory control system call, concerning our just created shared memory segment (shared memory identifier 130777091), with the command ‘IPC_RMID’. This is what the manpage says about IPC_RMID:

       IPC_RMID  Mark the segment to be destroyed.  The segment will only  actually  be  destroyed
                 after the last process detaches it (i.e., when the shm_nattch member of the asso-
                 ciated structure shmid_ds is zero).  The caller must be the owner or creator,  or
                 be privileged.  If a segment has been marked for destruction, then the (non-stan-
                 dard) SHM_DEST flag of the shm_perm.mode field in the associated  data  structure
                 retrieved by IPC_STAT will be set.

What I thought this means was:
It looked like to me the database instance starts building up its shared memory segments per 4096 page. Because IPC_RMID only marks the segment to be destroyed, and because it will only be truly destroyed when there are no processes attached to the shared memory segment, it looked like to me the background processes were pointed to the shared memory segment which was marked destroyed (in some way I hadn’t discovered yet), which meant the shared memory segment would actually survive and all database processes can use it. If ALL the database processes would be killed for any reason, for example with a shutdown abort, the processes would stop being connected to the shared memory segment, which would mean the shared memory segment would vanish automatically, because it was marked for destruction.
Sounds compelling, right?

Well…I was wrong! The sequence of creating and destroying small shared memory segments is done, but it turns out these are truly destroyed with the shmctl(…,IPC_RMID,…) call. I don’t know why the sequence of creating shared memory segments is happening.

I started looking for the actual calls that create the final, usable shared memory segments in the /tmp/oracle_startup.txt file. This is actually quite easy to do; first look up the shared memory segment identifiers using the sysresv utility (make sure the database’s ORACLE_HOME and ORACLE_SID are set):

$ sysresv
...a lot of other output...
Shared Memory:
ID		KEY
197394436	0x00000000
197427205	0x00000000
197361667	0x00000000
197459974	0x6ce0e164
Semaphores:
ID		KEY
1015811 	0xd5cdbca4
Oracle Instance alive for sid "dest"

Actually the ‘sysresv’ utility (system remove system V memory I think is what the name means) has the task of removing memory segments if there is no instance left to use them. It will not remove the memory segments if it finds the instance alive. It prints out a lot of information as a bonus.

Now that we got the shared memory identifiers, simply search in the trace file generated by strace, and search for the creation of the memory segment with the identifiers: (please mind searching with ‘less’ is done with the forward slash)

$ less /tmp/oracle_startup.txt
9492  shmget(IPC_PRIVATE, 905969664, IPC_CREAT|IPC_EXCL|0640) = 197394436
9492  shmat(197394436, 0x60400000, 0)   = ?
9492  times(NULL)                       = 430497743
9492  write(4, " Shared memory segment allocated"..., 109) = 109
9492  write(4, "\n", 1)                 = 1

Aha! here we see shmget() again, but now with a size (905969664) that looks much more like a real shared memory segment size used by the database! After the shared memory identifier is created, the process attaches it to its addressing space with shmat() to a specific memory address: 0x60400000.

The next thing to do, is to look for any shmctl() call for this identifier. Oracle could still do the trick of marking the segment for destruction…
…But…there are no shmctl() calls for this identifier, nor for any of the other identifiers shown with the sysresv utility. This is rather odd, because Linux shows them as “(deleted)”. There ARE dozens of shmat() calls, of the other (background) processes forked from the starting process when they attach to the shared memory segments.

So, conclusion at this point is Linux shows the shared memory segments as deleted in ‘maps’, but the Oracle database does not mark the segments for destruction after creation. This means that either Linux is lying, or something mysterious is happening in the Oracle executable which I didn’t discover yet.

I could only think of one way to verify what is truly happening here. That is to create a program myself that uses shared memory, so I have 100% full control over what is happening, and can control every distinct step.

This is what I came up with:

#include <stdio.h>
#include <sys/shm.h>
#include <sys/stat.h>

int main ()
{
  int segment_id;
  char* shared_memory;
  struct shmid_ds shmbuffer;
  int segment_size;
  const int shared_segment_size = 0x6400;

  /* Allocate a shared memory segment.  */
  segment_id = shmget (IPC_PRIVATE, shared_segment_size,
                     IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
  printf ("1.shmget done\n");
  getchar();
  /* Attach the shared memory segment.  */
  shared_memory = (char*) shmat (segment_id, 0, 0);
  printf ("shared memory attached at address %p\n", shared_memory);
  printf ("2.shmat done\n");
  getchar();
  /* Determine the segment's size. */
  shmctl (segment_id, IPC_STAT, &shmbuffer);
  segment_size  =               shmbuffer.shm_segsz;
  printf ("segment size: %d\n", segment_size);
  printf ("3.shmctl done\n");
  getchar();
  /* Write a string to the shared memory segment.  */
  sprintf (shared_memory, "Hello, world.");
  /* Detach the shared memory segment.  */
  shmdt (shared_memory);
  printf ("4.shmdt done\n");
  getchar();

  /* Deallocate the shared memory segment.  */
  shmctl (segment_id, IPC_RMID, 0);
  printf ("5.shmctl ipc_rmid done\n");
  getchar();

  return 0;
}

(I took the code from this site, and modified it a bit for my purposes)
If you’ve got a linux system which is setup with the preinstall rpm, you should be able to copy this in a file on your (TEST!) linux database server, in let’s say ‘shm.c’, and compile it using ‘cc shm.c -o smh’. This will create an executable ‘shm’ from this c file.

This program does more or less the same sequence we saw earlier:
1. Create a shared memory identifier.
2. Attach to the shared memory identifier.
3. Get information on the shared memory segment in a shmid_ds struct.
4. Detach the shared memory segment.
5. Destroy it using shmctl(IPC_RMID).

What I did was have two terminals open, one to run the shm program, and one to look for the results of the steps.

Step 1. (shmget)

$ ./shm
1. shmget done

When looking with ipcs, you can see the shared memory segment which is created because of the shmget() call:

$ ipcs -m

------ Shared Memory Segments --------
0x00000000 451608583  oracle     600        25600      0

when looking in the address space of the process running the shm program, the shared memory segment is not found. This is exactly what I expect, because it’s only created, not attached yet.

Step 2. (shmat)

shared memory attached at address 0x7f3c4aa6e000
2.shmat done

Of course the shared memory segment is still visible with ipcs:

0x00000000 451608583  oracle     600        25600      1

And we can see from ipcs in the last column (‘1’) that one process attached to the segment. Of course exactly what we suspected.
But now that we attached the shared memory to the addressing space, it should be visible in maps:

...
7f3c4aa6e000-7f3c4aa75000 rw-s 00000000 00:04 451608583                  /SYSV00000000 (deleted)
...

Bingo! The shared memory segment is visible, as it should be, because we just attached it with shmat(). But look: it’s deleted already according to Linux!

However I am pretty sure, as in 100% sure, that I did not do any attempts to mark the shared memory segment destroyed or do anything else to make it appear to be deleted. So, this means maps lies to us.

So, the conclusion is the shared memory Oracle uses is not deleted, it’s something that Linux shows us, and is wrong. When looking at the maps output again, we can see the shared memory identifier is put at the place of the inode number. This is handy, because it allows you to take the identifier, and look with ipcs for shared memory segments and understand which specific shared memory segment a process is using. It probably means that maps tries to look up the identifier number as inode number, which it will not be able to find, and then comes to the conclusion that it’s deleted.

However, this is speculation. Anyone with more or better insight is welcome to react on this article.

Every DBA working with the Oracle database must have seen memory dumps in tracefiles. It is present in ORA-600 (internal error) ORA-7445 (operating system error), system state dumps, process state dumps and a lot of other dumps.

This is how it looks likes:

Dump of memory from 0x00007F06BF9A9E00 to 0x00007F06BF9ADE00
7F06BF9A9E00 0000C215 0000001F 00000CC1 0401FFFF  [................]
7F06BF9A9E10 000032F3 00010003 00000002 442B0000  [.2............+D]
7F06BF9A9E20 2F415441 31323156 4F2F3230 4E494C4E  [ATA/V12102/ONLIN]
7F06BF9A9E30 474F4C45 6F72672F 315F7075 3735322E  [ELOG/group_1.257]
7F06BF9A9E40 3336382E 36313435 00003338 00000000  [.863541683......]
7F06BF9A9E50 00000000 00000000 00000000 00000000  [................]

The first column is the memory location in hexadecimal.
The second to fifth columns represent the actual memory values in hexadecimal.
The sixth column shows an ASCII representation of the memory contents. If a position does not represent an ASCII character, a dot (“.”) is printed.

Actually, the values in the second to fifth column are grouped in four columns. This is how the values in a column look like:
{hex val}{hex val}{hex val}{hex val}, for example: 00010203 means: 0, 1, 2, 3.

In the ASCII representation (sixth column) the spaces after every four values are not put in.

However, look at the following line:

7F06BF9A9E10 000032F3 00010003 00000002 442B0000  [.2............+D]

And focus on the last four characters:
“..+D” (two non-printables, plus, D)
Now look at the corresponding memory contents from the dump:
“442B0000” This is: “44 2B 00 00”, which should correspond to “. . + D”.
There is something the matter here: the plus and the D seem to be represented by “00”. That’s not correct.

Let’s see what “442B0000” actually represents in ASCI:

$ echo -e "\x44\x2B\x00\x00"
D+

Ah! That looks backwards! Let’s take a full line and see what that gives:
(This is the line with memory address 0x7F06BF9A9E20)

$ echo -e "\x2F\x41\x54\x41 \x31\x32\x31\x56 \x4F\x2F\x32\x30 \x4E\x49\x4C\x4E"
/ATA 121V O/20 NILN

So if you want to look at the actual memory contents, you need to start with the column on the left side, read the values from right to left, then go the next column, etc.

Endianness
Actual, I asked my friend Philippe Fierens for a trace file from a SPARC (big endian) platform, to see if the endianness of the platform was causing this. I test my stuff on Linux, which is little endian.

Here’s a little snippet:

Dump of memory from 0xFFFFFFFF7D977E00 to 0xFFFFFFFF7D97BE00
FFFFFFFF7D977E00 15C20000 00000001 00000000 00000104  [................]
FFFFFFFF7D977E10 F4250000 00000000 0B200400 E2EB8A3D  [.%....... .....=]
FFFFFFFF7D977E20 44475445 53540000 32F6D98B 00000590  [DGTEST..2.......]
FFFFFFFF7D977E30 00004000 00000001 00000000 00000000  [..@.............]
FFFFFFFF7D977E40 00000000 00000000 00000000 00000000  [................]

Let’s test the line from address 0xFFFFFFFF7D977E20:

[oracle@bigmachine [v12102] trace]$ echo -e "\x44\x47\x54\x45 \x53\x54\x00\x00 \x32\xF6\xD9\x8B \x00\x00\x05\x90"
DGTE ST 2� �

So, the endianness determines how the raw memory contents should be read.

This is the 4th post in a series of posts on PGA behaviour of Oracle. Earlier posts are: here (PGA limiting for Oracle 12), here (PGA limiting for Oracle 11.2) and the quiz on using PGA with AMM, into which this blogpost dives deeper.

As laid out in the quiz blogpost, I have a database with the following specifics:
-Oracle Linux x86_64 6u6.
-Oracle database 11.2.0.4 PSU 4
-Oracle database (single instance) with the following parameter set: memory_target=1G. No other memory related parameters set.

In this setup, I run the pga_filler script (source code here), which creates a collection until the session statistic ‘session pga memory’ exceeds the grow_until variable, which for this case I set to 2100000000 (approximately 2.1G).

So: the instance is set to have AMM (memory_target) with a size of 1GB, which is supposed to be the total amount memory which this instance uses, and a session runs a PL/SQL procedure which only stops if it has allocated 2.1GB, which is clearly more than configured with the memory_target parameter. Please mind a collection, which the anonymous procedure uses to allocate memory, is outside of the memory areas for which Oracle can move data to the assigned temporary tablespace (sort, hash and bitmap memory areas).

After startup of the instance with only memory_target set to 1G, the memory partitioning looks like this:

SYS@v11204 AS SYSDBA> select component, current_size/power(1024,2), last_oper_type from v$memory_dynamic_components where current_size != 0;

COMPONENT							 CURRENT_SIZE/POWER(1024,2) LAST_OPER_TYP
---------------------------------------------------------------- -------------------------- -------------
shared pool										168 STATIC
large pool										  4 STATIC
java pool										  4 STATIC
SGA Target										612 STATIC
DEFAULT buffer cache									424 INITIALIZING
PGA Target										412 STATIC

This is how v$pgastat looks like:

SYS@v11204 AS SYSDBA> select * from v$pgastat;

NAME								      VALUE UNIT
---------------------------------------------------------------- ---------- ------------
aggregate PGA target parameter					  432013312 bytes
aggregate PGA auto target					  318200832 bytes
global memory bound						   86402048 bytes
total PGA inuse 						   78572544 bytes
total PGA allocated						   90871808 bytes
maximum PGA allocated						   93495296 bytes
total freeable PGA memory					    2818048 bytes
process count								 57
max processes count							 58
PGA memory freed back to OS					    3211264 bytes
total PGA used for auto workareas					  0 bytes
maximum PGA used for auto workareas					  0 bytes
total PGA used for manual workareas					  0 bytes
maximum PGA used for manual workareas					  0 bytes
over allocation count							  0
bytes processed 						    8479744 bytes
extra bytes read/written						  0 bytes
cache hit percentage							100 percent
recompute count (total) 						 18

SYS@v11204 AS SYSDBA> show parameter pga

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target		     big integer 0

Okay, so far so good. v$memory_dynamic_components shows the PGA Target being 412M, and v$pgastat shows the aggregate PGA target setting being 412M too. I haven’t set pga_aggregate_target (as shown with ‘show parameter pga’), because I am using memory_target/AMM for the argument I hear the most in favour of it: one knob to tune.

Next up, I start the pga_filler script, which means the session starts to allocate PGA.

I keep a close watch using v$pgastat:

SYS@v11204 AS SYSDBA> select * from v$pgastat;

NAME								      VALUE UNIT
---------------------------------------------------------------- ---------- ------------
aggregate PGA target parameter					  432013312 bytes
aggregate PGA auto target					  124443648 bytes
global memory bound						   86402048 bytes
total PGA inuse 						  296896512 bytes
total PGA allocated						  313212928 bytes
maximum PGA allocated						  313212928 bytes

This shows the pga_filler script in progress by looking at v$pgastat from another session. The total amount of PGA allocated has grown to 313212928 (298M) here.

A little while later, the amount of PGA taken has grown beyond the PGA target (only relevant rows):

total PGA inuse 						  628974592 bytes
total PGA allocated						  645480448 bytes
maximum PGA allocated						  645480448 bytes

However, when looking at the memory components using v$memory_dynamic_components, it gives the impression PGA memory is still 412M:

SYS@v11204 AS SYSDBA> select component, current_size/power(1024,2), last_oper_type from v$memory_dynamic_components where current_size != 0;

COMPONENT							 CURRENT_SIZE/POWER(1024,2) LAST_OPER_TYP
---------------------------------------------------------------- -------------------------- -------------
shared pool										168 STATIC
large pool										  4 STATIC
java pool										  4 STATIC
SGA Target										612 STATIC
DEFAULT buffer cache									424 INITIALIZING
PGA Target										412 STATIC

You could argue PGA is explicitly mentioned as ‘PGA Target’, but then: the total of the memory area’s (PGA Target+SGA Target) do show a size that roughly sums up to be equal to the memory_target.

A little while later, this is what v$pgastat is showing:

total PGA inuse 						  991568896 bytes
total PGA allocated						 1008303104 bytes
maximum PGA allocated						 1008303104 bytes

Another glimpse at v$memory_dynamic_components shows the same output as above, PGA Target at 412M. This is the point where it get’s a bit weird: the total amount of PGA memory (according to v$pgastat) shows it’s almost 1G, memory_target is set at 1G, and yet v$memory_dynamic_components show no change at all.

Again a little further in time:

total PGA inuse 						 1325501440 bytes
total PGA allocated						 1342077952 bytes
maximum PGA allocated						 1342077952 bytes

Okay, here it get’s really strange: there’s more memory allocated for PGA memory alone than has been set with memory_target for both PGA and SGA memory structures. Also, v$memory_dynamic_components shows no change in SGA memory structures or exchange of memory from SGA to PGA memory.

If v$pgastat is correct, and memory_target actively limits the total amount of both SGA and PGA, then the session must allocate memory out of thin air! But I guess you already came to the conclusion too that either v$pgastat is incorrect, or memory_target does not limit memory allocations (as at least I think it would do).

Let’s dump the PGA heap of the active process to see the real memory allocations of this process:

SYS@v11204 AS SYSDBA> oradebug setospid 9041
Oracle pid: 58, Unix process pid: 9041, image: oracle@bigmachine.local (TNS V1-V3)
SYS@v11204 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11204 AS SYSDBA> oradebug dump heapdump 1
Statement processed.

(9041 is the PID of the process running PL/SQL)

Now look into (the relevant) data of the PGA heap dump:

[oracle@bigmachine [v11204] trace]$ grep Total\ heap\ size v11204_ora_9041.trc
Total heap size    =1494712248
Total heap size    =    65512
Total heap size    =  1638184

Okay, this is clear: the process actually took 1494712248 (=1425M) plus a little more memory. So, memory_target isn’t that much of a hard setting after all.

But where does this memory come from? There ought to be a sort of combined memory effort together with the SGA for memory, right? That was the memory_target promise!

Let’s take a look at the actual memory allocations of a new foreground process in /proc/PID/maps:

[oracle@bigmachine [v11204] trace]$ less /proc/11405/maps
00400000-0bcf3000 r-xp 00000000 fc:02 405855559                          /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/oracle
0bef2000-0c0eb000 rw-p 0b8f2000 fc:02 405855559                          /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/oracle
0c0eb000-0c142000 rw-p 00000000 00:00 0
0c962000-0c9c6000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:10 351997                             /dev/shm/ora_v11204_232652803_0
60001000-60400000 rw-s 00001000 00:10 351997                             /dev/shm/ora_v11204_232652803_0
...
9fc00000-a0000000 rw-s 00000000 00:10 352255                             /dev/shm/ora_v11204_232685572_252
a0000000-a0400000 rw-s 00000000 00:10 354306                             /dev/shm/ora_v11204_232718341_0
3bb3000000-3bb3020000 r-xp 00000000 fc:00 134595                         /lib64/ld-2.12.so
3bb321f000-3bb3220000 r--p 0001f000 fc:00 134595                         /lib64/ld-2.12.so
3bb3220000-3bb3221000 rw-p 00020000 fc:00 134595                         /lib64/ld-2.12.so
3bb3221000-3bb3222000 rw-p 00000000 00:00 0
3bb3400000-3bb3401000 r-xp 00000000 fc:00 146311                         /lib64/libaio.so.1.0.1
...
3bb5e16000-3bb5e17000 rw-p 00016000 fc:00 150740                         /lib64/libnsl-2.12.so
3bb5e17000-3bb5e19000 rw-p 00000000 00:00 0
7f018415a000-7f018416a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018416a000-7f018417a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018417a000-7f018418a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018418a000-7f018419a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018419a000-7f01841aa000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841aa000-7f01841ba000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841ba000-7f01841ca000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841ca000-7f01841da000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841da000-7f01841ea000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841ea000-7f01841fa000 rw-p 00000000 00:05 1030                       /dev/zero
7f01841fa000-7f018420a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018420a000-7f018421a000 rw-p 00000000 00:05 1030                       /dev/zero
7f018421a000-7f018422a000 rw-p 00000000 00:05 1030                       /dev/zero
7f68d497b000-7f68d4985000 r-xp 00000000 fc:02 268585089                  /u01/app/oracle/product/11.2.0.4/dbhome_1/lib/libnque11.so
...

When I run the pga_filler anonymous PL/SQL block, and strace (system call trace) utility, I see (snippet):

mmap(0x7f0194f7a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194f7a000
mmap(0x7f0194f8a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194f8a000
mmap(0x7f0194f9a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194f9a000
mmap(0x7f0194faa000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194faa000
mmap(0x7f0194fba000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194fba000
mmap(0x7f0194fca000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194fca000
mmap(0x7f0194fda000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194fda000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 6, 0xea000) = 0x7f0194e6a000
mmap(0x7f0194e6a000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194e6a000
mmap(0x7f0194e7a000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194e7a000
mmap(0x7f0194e9a000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194e9a000
mmap(0x7f0194eba000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 6, 0) = 0x7f0194eba000

So, when looking back, it’s very easy to spot the SGA memory, which resides in /dev/shm in my case, and looks like ‘/dev/shm/ora_v11204_232652803_0’ in the above /proc/PID/maps snippet.
This means that the mmap() calls are simply, as anyone would have guessed by now, the PGA memory allocations. In the maps snippet these are visible as being mapped to /dev/zero.
When looking at the mmap() call, at the 5th argument, which is the number 6, we look at a file descriptor. In /proc/PID/fd the file descriptors can be seen, and file descriptor 6 is /dev/zero, as you probably suspected. This way the allocated memory is initial set to zero.

By now, the pga_filler script finishes:

TS@v11204 > @pga_filler
begin pga size : 3908792
last  pga size : 2100012216
begin uga size : 1607440
last  uga size : 2000368
parameter pat  : 0

Taking the entire 2.1G I made the collection to grow to. With memory_target set to 1G.

Conclusion
The first conclusion I made is that PGA memory is very much different than SGA/shared memory. Anyone with a background in Oracle operating-system troubleshooting will find this quite logical. However, the “promise” AMM/memory_target made, in my interpretation, is that the memory would be used seamless. This is simply not the case. Shared memory is in /dev/shm, and PGA is mmaped/allocated as private memory.

Still, this wouldn’t be that much of an issue if memory_target would limit memory in a rigid way, and memory could, and actually would, very easily float between PGA and SGA. It simply doesn’t.

Why don’t we see Oracle trying to reallocate memory? This is the point where I can only guess.

– Probably, Oracle would try to grow the shared pool if it has problems allocating memory for SQL, library cache, etc. This probably hasn’t happened in my test.
– Probably, Oracle would try to grow the buffer cache if it can calculate a certain benefit from enlarging it. This probably hasn’t happened in my test.
– The other SGA area’s (large and java pool) probably are grown if these are used, and need more space for allocations. This probably didn’t happen in my test.
– For the PGA, a wild guess is the memory manager calculates using the workarea sizes (sort, hash and bitmap areas), which are not noticeably used in my test.

Another conclusion and opinion is AMM/memory_target is not a set once and forget option. In fact, it isn’t that much of a difference from using ASMM from a DBA perspective: you carefully need to understand the SGA size, and you carefully need to (try to) manage the PGA memory. Or reasoned the other way around: the only way you can sensibly set memory_target is if you know the correct SGA size and the PGA usage. Also having Oracle manage the memory area’s automatically is not unique to AMM: Oracle will reallocate (inside the SGA) if it finds it necessary, with AMM, ASMM and even manual set memory area’s. But the big dis-advantage of AMM (at least on linux, not sure about other operating systems) is that huge pages can’t be used, which has a severe impact on “real life” databases, in my experience. (Solaris CAN use huge pages with AMM(!)).

A final word: of course I tested a very specific situation. In most real-life cases there will be multiple sessions, and the PGA manageable memory areas will be used. However, the point I try to make is memory_target is simply not a way to very easily make your database be hard limited to the value set. Probably, in real life, the real amount of memory used by the instance will in the area of the value set with memory_target, but this will be subject to what memory areas you are exactly using. Of course it can differ in a spectaculair way if collections or alike structures are used by a large number of sessions.

This is a series of blogposts on how the Oracle database makes use of PGA. Earlier posts can be found here (PGA limiting for Oracle 12) and here (PGA limiting for Oracle 11.2).

Today a little wednesday fun: a quiz.

What do you think will happen in the following situation (leave a response as comment please!):

-Oracle Linux x86_64 6u6.
-Oracle database 11.2.0.4 PSU 4
-Oracle database (single instance) with the following parameter set: memory_target=1G. No other memory related parameters set.

Run the pga_filler script (which can be found here (PGA limiting for Oracle 12)), with grow_until set to 2100000000 (approximately 2.1G).

I’ll try to create a blogpost on the outcome and an explanation on short notice!

%d bloggers like this: