Archive

Tag Archives: tuning

This post is about a change in how the time is measured for the event ‘log file parallel write’. This is important for the performance tuning of any change activity in an Oracle database, because with the default commit settings, a foreground session that commits changes waits in the wait event ‘log file sync’, which is a wait on logwriter activity, for which the wait event ‘log file parallel write’ always has been the indicator of the time spend on IO.

Log file sync
First things first: a foreground session normally waits on the wait event ‘log file sync’ when it commits waiting for its change vectors to be written to the online redologfile(s) by the logwriter. It is wrong to always assume a ‘log file sync’ will be present. If, somehow, the logwriter manages to increase the ON DISK SCN to or beyond the foreground session’s commit SCN, there will be no ‘log file sync’ wait event.

I found the absence of the ‘log file sync’ wait event by artificially slowing the committing session down. However, with the improvements in parallelism in the Oracle code, and even more with the improvements on the hardware layer for lowering write latency (especially with persistent memory), this might be a situation that is actually visible in the near future. This is true for at least version 11.2.0.4 and higher (probably earlier too, but these are versions that I verified).

Log file parallel write
However, this post is about the event that in a lot of cases is the culprit for the ‘log file sync’ time: it’s about the ‘log file parallel write’ wait event, which takes place in the logwriter (when the logwriter itself writes, it could delegate that to the logwriter worker processes, LGnn), which normally would be measured by Oracle database tuners to validate logwriter IO latency in case of foreground sessions showing high ‘log file sync’ waits indicating waiting on the logwriter.

I was looking at the logwriter code path using the ‘debug trace’ tool from intel pin tools and used Tanel Poder’s snapper using begin and end snapshots on the logwriter process on Linux, non ASM, Oracle 19.5. I put the logwriter into single logwriter mode (“_use_single_log_writer”=true) and performed redo, which is bound to be picked up by the logwriter.

There I noticed the statistics indicated a write to have been taking place, whilst there was no ‘log file parallel write’ wait event. Weird… So I dug deeper and looked at the functions taking place. These are the functions that I found: https://gitlab.com/snippets/1926207

The important bits are:
– In the function ksfdaio() the IO request is submitted:
ksfdaio>ksfdgo>ksfd_skgfqio>skgfqio>skgfr_lio_listio64>io_submit
– And in the function ksfdblock() the IO request result is read (“reaped”):
ksfdblock>ksfdpoll>ksfdwtio>skgfrwat>skgfrliopo>io_getevents

But no wait event is called *at all!*

This is a clear change from earlier versions, where the logwriter write would always yield a ‘log file parallel write’ wait event. But when did this change happen? To answer that question, I installed version 12.1.0.2 (no patches) to look at the code path and found: https://gitlab.com/snippets/1926211

The important bits are:
– The wait interface is invoked: kslwtbctx()
– In the function ksfdgo() the IO request is submitted:
ksfd_sbio>ksfdgo>ksfd_skgfqio>skgfqio>skgfr_lio_listio64>io_submit
– And in the function ksfdwtio() to IO request result is read (“reaped”):
ksfd_sbio>ksfdwtio>skgfrwat>skgfospo>skgfrliopo>io_getevents
– And the wait interface is ended: kslwtectx()

Ah! So a clear change! So when did this change happen? I installed 12.2.0.1 and found the same ‘waiting’ codepath, then installed 18.3, and found the waiting codepath. So the change probably happened in Oracle 19! So I installed 19.3, and found the waiting codepath once again!! So was it a change in PSU 19.4 or 19.5? I installed 19.5, and found the waiting codepath too.

O.M.G. Why am I seeing different behaviour between two databases on the exact same version? Well, I set some undocumented parameters, so I replicated these in my newly installed database…and it didn’t change.

What else is different? Well…one is a multi-tenant database, and the other isn’t…but surely that doesn’t change the IO codepath?? Of course this is easy to check, so I dropped the non-CDB database and created a CDB one, and…it now showed the same “non waiting” codepath.

So, apparently, using the oracle database in multi-tenant mode changes the IO codepath behaviour. I checked the codepath going down in versions, and it turns out this change appeared in 12.2.0.1.0, so essentially in Oracle 12.2 (Oracle 18 and 19 are actually Oracle 12.2.0.2 and Oracle 12.2.0.3). To be clear on this: I didn’t find this change in 12.1.0.2.

Does this mean the wait events ‘log file parallel write’ does not show up at all? No, if the call io_getevents after io_submit does not return all submitted IOs, it must wait for it, and at that point invoke the wait interface and then go into a blocking io_getevents call, so the actual waiting is timed. This is shown in this snippet: https://gitlab.com/snippets/1926212

This shows ksfdblock>ksfdpoll>ksfdwtio>skgfrwat>skgfrliopo>io_getevents, which returns 0 (what means that it didn’t found any IOs in the completion queue), which then returns up to the ksfdpoll function, invokes the wait interface, kslwtbctx, and then dives to io_getevents again, but now in blocking mode to wait for the IO.

I also replayed the tests on ASM, which shows the exact same (non-)waiting behaviour.

Conclusion
For Oracle versions 12.1.0.2 up to 19.5, the wait event ‘log file parallel write’ includes both the submission of the IO request (io_submit call) and the waiting for the submitted IOs to finish (io_getevents call). This means the waiting time of the event in this case is the total latency of the IO request, including operating system processing.

Starting from version 12.2, only if the multi-tenant option is used, this is changed, and the wait event ‘log file parallel write’ only is shown if the submitted IOs are not available after they are submitted, and thus the logwriter process has to wait for them. If the wait is shown, the time in the event is the actual waiting time for the IOs to finish from the perspective of the logwriter process, not the IO latency time.

Of course if you have multiple log groups on disks with different latencies, this wait event waits for all IOs to finish, which means it times the slowest IO of all submitted IOs.

I was asked some time ago what the Oracle database event ‘TCP socket (KGAS)’ means. This blogpost is a deep dive into what this event times in Oracle database 12.1.0.2.180717.

This event is not normally seen, only when TCP connections are initiated from the database using packages like UTL_TCP, UTL_SMTP and the one used in this article, UTL_HTTP.

A very basic explanation is this event times the time that a database foreground session spends on TCP connection management and communicating over TCP, excluding client and database link (sqlnet) networking. If you trace the system calls, you see that mostly that is working with a (network) socket. Part of the code in the oracle database that is managing that, sits in the kernel code layer kgas, kernel generic (of which I am quite sure, and then my guess:) asynchronous services, which explains the naming of the event.

This is what the Oracle online manual (https://docs.oracle.com/database/121/REFRN/GUID-203ACA60-9912-4493-9B79-EA4CDE89A78D.htm#REFRN00642 – Oracle is notorious for breaking links) says about ‘TCP socket (KGAS)’:

C.3.157 TCP Socket (KGAS)
A session is waiting for an external host to provide requested data over a network socket. The time that this wait event tracks does not indicate a problem, and even a long wait time is not a reason to contact Oracle Support. It naturally takes time for data to flow between hosts over a network, and for the remote aspect of an application to process any request made to it. An application that communicates with a remote host must wait until the data it will read has arrived. In addition, on Microsoft Windows, a separate thread monitors the arrival of traffic. This thread spends most of its life in waits tracked by the TCP Socket (KGAS) wait event.

Wait Time: The total elapsed time for the network connection to be established or for data to arrive from over the network

Parameter Description
P0 For Oracle internal use only. Values 8, 9, and 10 occur within the special thread present on Microsoft Windows; other P0 values occur in normal user sessions.

P1 For Oracle internal use only

Overall, the basic explanation that Oracle provides is mostly correct. I think the text saying to not contact Oracle support is not relevant, but maybe there is a need to relieve Oracle support. In my tests, I found that the full TCP connection lifecycle (creation, usage and removal) is put under this event, for which the text seems to emphasise on waiting for a remote host, which would be the most logical culprit for wait times, but other issues could lead to wait times additionally. This means the wait event itself is not explaining what it is showing, outside of TCP connection management.

The wait time explanation is nearly complete. If it would say something like ‘all TCP connection management and usage’ it would have fully covered it, it now excludes disconnecting and sending, because it explicitly mentions creating the connecting and receiving (waiting for data).

I do not understand what is meant with P0 and P1. I think it is p1 and p2 of the wait event, but naming it P0 and P1 is weird. When looking at the explanation it reads to me ‘we do not wish to explain anything to you’.

So, that means I am going to find this out myself….

If you are interested in this, or do want to write articles like this too, I urge you to replay this on your own system.

First of all, create a small setup which you can use to actually execute UTL_HTTP. The example essentially is taken from father of code examples, Tim Hall/Oracle base. Thank you Tim!
First setup the database to allow a user (in my case ‘ts’) to create the objects and use the network:

grant create sequence to ts;
grant create procedure to ts;
grant create table to ts;
grant alter session to ts;
begin
  dbms_network_acl_admin.create_acl (
    acl => 'anything.xml',
    description => 'allow anything',
    principal => 'TS',
    is_grant => true,
    privilege => 'connect'
  );
end;
begin
  dbms_network_acl_admin.assign_acl (
    acl => 'anything.xml',
    host => '*'
  );
end;

Then connect as the actual user (ts), and create the objects and the procedure that uses UTL_HTTP:

drop table http_clob_test;
create table http_clob_test (
        id number(10),
        url varchar2(255),
        data clob,
        constraint http_clob_test_pk primary key (id)
);
drop sequence http_clob_test_seq;
create sequence http_clob_test_seq;
CREATE OR REPLACE PROCEDURE load_html_from_url (p_url  IN  VARCHAR2) AS
  -- examples by tim hall
  -- https://oracle-base.com/articles/misc/retrieving-html-and-binaries-into-tables-over-http
  l_http_request   UTL_HTTP.req;
  l_http_response  UTL_HTTP.resp;
  l_clob           CLOB;
  l_text           VARCHAR2(32767);
BEGIN
  DBMS_LOB.createtemporary(l_clob, FALSE);

  -- Make a HTTP request and get the response.
  l_http_request  := UTL_HTTP.begin_request(p_url);
  l_http_response := UTL_HTTP.get_response(l_http_request);

  -- Copy the response into the CLOB.
  BEGIN
    LOOP
      UTL_HTTP.read_text(l_http_response, l_text, 32766);
      DBMS_LOB.writeappend (l_clob, LENGTH(l_text), l_text);
    END LOOP;
  EXCEPTION
    WHEN UTL_HTTP.end_of_body THEN
      UTL_HTTP.end_response(l_http_response);
  END;

  -- Insert the data into the table.
  INSERT INTO http_clob_test (id, url, data)
  VALUES (http_clob_test_seq.NEXTVAL, p_url, l_clob);

  -- Relase the resources associated with the temporary LOB.
  DBMS_LOB.freetemporary(l_clob);
EXCEPTION
  WHEN OTHERS THEN
    UTL_HTTP.end_response(l_http_response);
    DBMS_LOB.freetemporary(l_clob);
    RAISE;
END load_html_from_url;
/

The first thing to do is turn on sql_trace at level 8 to include waits:

set serverout on
alter session set events 'sql_trace level 8';
exec load_html_from_url('http://orafun.info/');
alter session set events 'sql_trace off';

If you look at the relevant piece, which means where it shows the wait events, it looks like this:

WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 128265  =2  =0  =0 obj#=662 tim=86395107497
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 395  =5  =0  =0 obj#=662 tim=86395110191
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 150  =6  =0  =0 obj#=662 tim=86395111115
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 131998  =6  =0  =0 obj#=662 tim=86395243764
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 269  =4  =0  =0 obj#=662 tim=86395245182
WAIT #139864521752120: nam='direct path write temp' ela= 4137 file number=201 first dba=257795 block cnt=1 obj#=662 tim=86395250494
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 352  =3  =2  =0 obj#=662 tim=86395251294

What is shown here is some quite spectacular differences in elapsed time. Also, the only way to understand what is actually done flagged as ‘TCP Socket (KGAS)’ is the value following ‘ela’, which is the event p1 value.
The pattern is:

- 2
- 5
- 6
- 6
- 4
- 3

It’s relatively simple to guess what a few of these are:

- 2 - connect
- 5 - send
- 6 - \
- 6 - |   receiving ?
- 4 - /
- 3 - close

But if you include the timing, there must be more into play:

- 2 - ela= 128265   connect
- 5 - ela= 395      send
- 6 - ela= 150      \
- 6 - ela= 131998   |   receiving ?
- 4 - ela= 14       /
- 3 - ela= 177     close

2/connect: In order to build up a connection, a tcp connection needs to be created and established. That takes some time.
5/send: Sending from the perspective of a userland process is writing into a socket, which will get send by the operating system independently. This means sending from a userland process normally takes relative little time, because it’s not waiting for actually sending it.
6,4/receive: At this time, this doesn’t make sense to me.
3/close: Closing for a userland process is a simple, swift task. The operating system will keep the port open for some time, etc. but this is not visible for the user land application.

Let’s pull an old trick out of the hat: use strace (system call tracing) with an emphasis on writing on an oracle session that has SQL trace with waits enabled set. This will show the system calls executed, and show exactly when the oracle engine ends a wait, so we can reasonably well establish a relation between wait events and system calls. I say “reasonably well”, because we can’t see when Oracle truly started timing the wait event (kslwtbctx), only the output to trace file as part of ending the wait event (kslwtectx).

The way I done it, is using the command ‘strace -e write=all -p 18513 -o utl_http_test.txt’. Obviously 18513 is the process ID of the database foreground process. The results of the strace are in utl_http_test.txt.

Now open utl_http_test.txt and search for KGAS. The full output is way too much text, let me show some of the output which I think is noteworthy. Again: this is selective, partial output.
I do maintain the order in which the calls are visible.

1. TCP Socket (KGAS) p1=2, earlier annotated as ‘connect’

-- try to find a socket that has been created by nscd (name server caching deamon)
-- two times?
--
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
--
-- obtain file status of resolv.conf (hostname resolving configuration file)
--
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
--
-- open and read host.conf (another hostname resolving configuration file)
--
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "multi on\n", 4096)             = 9
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open and read resolv.conf (hostname resolving configuration)
--
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "# Generated by NetworkManager\nse"..., 4096) = 93
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open /etc/hosts (ip address to hostname mapping locally)
--
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=200, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "127.0.0.1   localhost localhost."..., 4096) = 200
read(9, "", 4096)                       = 0
close(9)
--
-- at this point two dynamic loadable libraries are read: libnss_dns.so.2 and libresolv.so.2
--
-- this is the DNS lookup of orafun.info
-- again, this is done twice, just like the use of /var/run/nscd/socket above?
--
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 0)    = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "=#\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  3d 23 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  =#...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 01 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 5000)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [45])                = 0
recvfrom(9, "=#\201\200\0\1\0\1\0\0\0\0\6orafun\4info\0\0\1\0\1\300\f\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 45
close(9)                                = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 4971) = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "o=\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  6f 3d 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  o=...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 1c 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 4970)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [109])               = 0
recvfrom(9, "o=\201\200\0\1\0\0\0\1\0\0\6orafun\4info\0\0\34\0\1\300\f\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 109
close(9)                                = 0
--
-- an epoll is created at file descriptor 9 (epoll: I/O event notification facility)
--
epoll_create(82)                        = 9
fcntl(9, F_SETFD, FD_CLOEXEC)           = 0
--
-- an IPV6 socket is created at file descriptor 11, 
-- bound to the IPV6 equivalent of localhost (::1),
-- destination port 0, source port 63257,
-- and is NOT connected.
--
socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 11
bind(11, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(11, {sa_family=AF_INET6, sin6_port=htons(63257), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
getpeername(11, 0x7ffdea6ba0f8, 0x7ffdea6ba1c8) = -1 ENOTCONN (Transport endpoint is not connected)
getsockopt(11, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
getsockopt(11, SOL_SOCKET, SO_RCVBUF, [262144], [4]) = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
--
-- File descriptor 11 is added to the epoll at file descriptor 9.
--
epoll_ctl(9, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=3110993336, u64=139864426020280}}) = 0
--
-- A connection is created to the true destination (orafun.info/18.218.92.122).
-- This connection gets file descriptor 12.
-- Destination port 80 (http), source port 11751.
--
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 12
fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(12, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("18.218.92.122")}, 16) = -1 EINPROGRESS (Operation now in progress)
times(NULL)                             = 438106227
mmap(NULL, 786432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34b959b000
poll([{fd=12, events=POLLOUT}], 1, 60000) = 1 ([{fd=12, revents=POLLOUT}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(12, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(12, F_SETFL, O_RDWR)              = 0
getsockname(12, {sa_family=AF_INET, sin_port=htons(11751), sin_addr=inet_addr("10.0.2.15")}, [16]) = 0
getsockopt(12, SOL_SOCKET, SO_SNDBUF, [87040], [4]) = 0
getsockopt(12, SOL_SOCKET, SO_RCVBUF, [374400], [4]) = 0
setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(12, F_SETFD, FD_CLOEXEC)          = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=2
--
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 32 38 32  36 35 20 20 3d 32 20 20  la= 128265  =2   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 31 30 37 34 39 37     tim=86395107497  |

So yes, I am not sure if all of this is in the wait event, but there is a lot of stuff happening to build a connection to the remote server.

In order to find out why the lookup which is tried via the NSCD socket at the beginning, and later via DNS, is done twice, I ran the same procedure again and used tcpdump to look at the actual network traffic. This explained a lot:

# tcpdump -n host 10.0.2.3 and port 53
09:14:02.923389 IP 10.0.2.15.16819 > 10.0.2.3.domain: 15651+ A? orafun.info. (29)
09:14:02.948791 IP 10.0.2.3.domain > 10.0.2.15.16819: 15651 1/0/0 A 18.218.92.122 (45)
09:14:02.952304 IP 10.0.2.15.54590 > 10.0.2.3.domain: 28477+ AAAA? orafun.info. (29)
09:14:02.979534 IP 10.0.2.3.domain > 10.0.2.15.54590: 28477 0/1/0 (109)

In other words: first a DNS A record is requested (TCPv4 DNS name lookup), which results in the IPv4 ip address, then a DNS AAAA record is requested (TCPv6 DNS name lookup), which resulted in no ip address. In other words: orafun.info only has an IPv4 ip address. So the two lookups actually do have a function.

2. TCP Socket (KGAS) p1=5, earlier annotated as ‘send’
These are the systemcalls that are visible and quite probably related to the send wait event:

--
-- file descriptor 12 holding the connection to the destination server is added to the epoll at file descriptor 9
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- Then the http get request is sent to the destination server at its normal file descriptor, 12.
--
write(12, "GET / HTTP/1.1\r\nHost: orafun.inf"..., 56) = 56
 | 00000  47 45 54 20 2f 20 48 54  54 50 2f 31 2e 31 0d 0a  GET / HTTP/1.1.. |
 | 00010  48 6f 73 74 3a 20 6f 72  61 66 75 6e 2e 69 6e 66  Host: orafun.inf |
 | 00020  6f 0d 0a 43 6f 6e 6e 65  63 74 69 6f 6e 3a 20 63  o..Connection: c |
 | 00030  6c 6f 73 65 0d 0a 0d 0a                           lose....         |
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=5
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 39 35 20  20 3d 35 20 20 3d 30 20  la= 395  =5  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  30 31 39 31              =86395110191     |

3. TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 0, so it doesn't block.
--
epoll_wait(9, [], 82, 0)                = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 35 30 20  20 3d 36 20 20 3d 30 20  la= 150  =6  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  31 31 31 35              =86395111115     |

I have been thinking a lot about this seemingly weird call. It calls epoll_wait, but indicates it doesn’t want to wait (timeout=0), and even if epol_wait would have returned anything, indicated by a return code > 0, the epoll_event pointer is not set (indicated by []). The epoll file descriptor is used, but the only working file descriptor in the epoll is file descriptor 12, which has just been sent a http GET command, so the functionality of epoll is used.

This doesn’t make sense, unless you think about the asynchronous IO implementation of Oracle (see a lot of my earlier investigations), for which (in the case of asynchronous IO) io_getevents was called in a similar matter, timeout set to 0, to be able to do more requests while earlier IO requests are executed by the kernel. So my current theory here is that if multiple requests are happening, this mechanism provides a way to handle them.

If you have a simple single request, like in my case, this systemcall seems redundant. And because it queries the epoll file descriptor right after the request, it returns zero events, because there hardly has been any time after sending the http GET request.

4. Second TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 30000 (milliseconds).
--
epoll_wait(9, [{EPOLLIN, {u32=3110998864, u64=139864426025808}}], 82, 30000) = 1
--
-- And this is the second wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 33 31 39  39 38 20 20 3d 36 20 20  la= 131998  =6   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 32 34 33 37 36 34     tim=86395243764  |

This is the second time epoll_wait is called, and this one is blocking, because timeout has been set to 30000 milliseconds. If you look at the ela time, this took some time, and this now makes perfect sense: this system calls waits for an event to become available in the epoll, so it waits for the response of the remote http server. Please mind this call just notifies the userland process that the response is ready, the received data yet has to be read.

5. TCP Socket (KGAS) p1=4, earlier annotated as ‘receive’

--
-- At this point we know there is a response. First the original file descriptor is removed from the epoll:
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6b9710) = 0
--
-- The the response is read from file descriptor 12:
--
read(12, "HTTP/1.1 200 OK\r\nServer: nginx/1"..., 4096) = 2687
--
-- Then file descriptor 12 is added to the epoll again.
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=4
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 32 36 39 20  20 3d 34 20 20 3d 30 20  la= 269  =4  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 34  35 31 38 32              =86395245182     |

So, what p1 set to 4 actually means, is that once the connection did return data, which is checked using epoll, and visible with p1 set to 6, it is read into the process. This is also the reason this takes very little time, this is the time to read data from kernelspace to user space, and to manage the connection’s file descriptor. It is taken off the epoll in order not to disturb it, and it is added again because there could be another request.

6. TCP Socket (KGAS) p1=3, earlier annotated as ‘close’

--
-- file descriptor 12 removed from the epoll
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6bac20) = 0
--
-- file descriptor 12 is closed, closing the network connection
--
close(12)                               = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=3
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 35 32 20  20 3d 33 20 20 3d 32 20  la= 352  =3  =2  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 35  31 32 39 34              =86395251294     |

I don’t think this part holds any surprises. The network file descriptor is first removed from the epoll, and then it is closed, ending the TCP connection that was setup to perform a http request (in my case, I didn’t test, but I believe you will see the same with for example a SMTP connection, or any other type of TCP connection).

Summary

The basic message of this article is not surprising, nor does it conflict with current knowledge. Whenever you see a wait event ‘TCP Socket (KGAS)’, it means a foreground process is performing TCP networking via PLSQL. This wait event is a single event for creating, sending, receiving and closing a connection.

The true information of this article is how you can use the p1 value of the event to learn what actually the foreground is doing, and thus should give you more information to troubleshoot in the case of long waiting times.

TCP Socket (KGAS) p1 values:
1 - ?
2 - perform DNS lookup and create connection
3 - close connection
4 - copy TCP response into process space
5 - send request
6 - wait for TCP response to become available
7 - ?
8 - ? \
9 - ? | According to documentation, windows only in a 'special thread'.
10- ? /

There are many posts about the amount of memory that is taken by the Oracle database executables and the database SGA and PGA. The reason for adding yet another one on this topic is a question I recently gotten, and the complexities which surrounds memory usage on modern systems. The intention for this blogpost is to show a tiny bit about page sharing of linux for private pages, then move on to shared pages, and discuss how page allocation looks like with Oracle ASMM (sga_target or manual memory).

The version of linux in this blogpost is Oracle Linux 7.2, using kernel: 4.1.12-37.6.3.el7uek.x86_64 (UEK4)
The version of the Oracle database software is 12.1.0.2.160719 (july 2016).

Memory usage of virtual memory systems is complicated. For that reason I see a lot of people getting very confused about this topic. Let me state a very simple rule: the memory actively being used on a system should fit in physical memory. Swap (a file or partition), increases total virtual memory, but really only is a safety net for saving your system from an out of memory situation at the cost of moving pages from and to disk. Because modern linux kernels have swappiness (willingness to swap) to a non-zero value, it’s not uncommon to have some swap being used, despite physical memory not being oversubscribed. A system stops performing as soon as paging in and out starts to occur, and for that reason should not happen.

1. Private pages for linux executables
When an executable is executed on linux from the shell, the shell executes a fork() call to create a new process, which is implemented as a clone() system call on linux. Using the clone() system call, the virtual memory space of the newly created process is shared (readonly) with it’s parent. This includes the private allocations! Once the child process needs to write in it’s memory space, it will page fault and create it’s own version, abandoning the version of its parent.

Can we actually prove this is happening? Yes, the /proc/ filesystem gives an insight to a process’ virtual memory space.
Let’s start off with a very simple example: we execute ‘cat /proc/self/maps’ to see our own address space:

[oracle@oracle-linux ~]$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 fb:00 201666243                          /usr/bin/cat
00e41000-00e62000 rw-p 00000000 00:00 0                                  [heap]
7f69729be000-7f6978ee5000 r--p 00000000 fb:00 576065                     /usr/lib/locale/locale-archive
7f6978ee5000-7f6979099000 r-xp 00000000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979099000-7f6979298000 ---p 001b4000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979298000-7f697929c000 r--p 001b3000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929c000-7f697929e000 rw-p 001b7000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929e000-7f69792a3000 rw-p 00000000 00:00 0
7f69792a3000-7f69792c4000 r-xp 00000000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794b9000-7f69794bc000 rw-p 00000000 00:00 0
7f69794c3000-7f69794c4000 rw-p 00000000 00:00 0
7f69794c4000-7f69794c5000 r--p 00021000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c5000-7f69794c6000 rw-p 00022000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0
7ffdab1c7000-7ffdab1e8000 rw-p 00000000 00:00 0                          [stack]
7ffdab1ea000-7ffdab1ec000 r--p 00000000 00:00 0                          [vvar]
7ffdab1ec000-7ffdab1ee000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Here’s a lot to see, but we see the cat executable at 0x00400000. The reason for three memory allocations are (linux/ELF) executables uses different sections with specific functions. A full overview of these can be obtained using the readelf executable. A simpler overview of an executable, which matches the above three memory allocations for the cat executable can be obtained using ‘size -B’ (the size executable, -B means ‘berkeley style’, which is default):

[oracle@oracle-linux ~]$ size -B /usr/bin/cat
   text	   data	    bss	    dec	    hex	filename
  43905	   1712	   2440	  48057	   bbb9	/usr/bin/cat

This describes the three memory sections an linux executable can have: text (the machine instructions, alias ‘the program’), data (all initialised variables declared in the program) and BSS (uninitialised data).
The first section always is the text allocation (not sure if it’s impossible to have the text section not being the first allocation, I have never seen it different). If you look at the memory flags, ‘r-xp’, this totally makes sense: ‘r-‘ meaning: read(only, followed by a’-‘ instead of a ‘w’), ‘x’: executable and ‘p’: this is a private allocation. The next allocation is the data section. We don’t execute variables, we read them, which is reflected in the flags: ‘r–p’. But what if we change the value of a variable? That is where the third section is for: changed values of initialised variables. This can be seen from the flag of this section: ‘rw-p’, read, write and private. The fourth allocation lists [heap], this is a mandatory allocation in every process’ memory space, which holds (small) memory allocations, this is NOT the BSS section. In this case, the BSS section does not seem to be allocated.

By having memory allocations for /usr/lib64/ld-2.17.so we can see this is a dynamically linked executable. You can also see this by executing ‘file’ on the executable:

[oracle@oracle-linux ~]$ file /usr/bin/cat
/usr/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=3207edc47638918ceaeede21947a20a4a496cf63, stripped

If a linux executable is dynamically linked, you can see the libraries that are loaded by the dynamic linker/loader using the ldd utility:

[oracle@oracle-linux ~]$ ldd /usr/bin/cat
       	linux-vdso.so.1 =>  (0x00007ffceb3e4000)
       	libc.so.6 => /lib64/libc.so.6 (0x00007fd46fb7e000)
       	/lib64/ld-linux-x86-64.so.2 (0x000055d5253c9000)

This output shows the dynamic loader (/lib64/ld-linux-x86-64.so.2), and two libraries the dynamic loader loads: libc.so.6 and linux-vdso.so.1. The first one, libc, is the standard C library. The second one, linux-vdso is for virtual dynamic shared object, which is an optimisation for certain system calls to be executed in user space (notably gettimeofday()).
The other allocations that exist in our example are anonymous mappings (usually done by programs using the mmap() call):

7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0

And some allocations for system purposes, like stack, var, vdso and vsyscall.

Now that you have become familiar with some basic linux memory address space specifics, let’s take it a little further. It’s possible to see more about the memory segments using the proc filesystem smaps file:

[oracle@oracle-linux ~]$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
Size:                 44 kB
Rss:                  44 kB
Pss:                  44 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           44 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
Size:                  4 kB
Rss:                   4 kB
...etc...

Per allocation there are a lot of properties to be seen. ‘Size’ is the full size, ‘Rss’ is the resident set size, alias the amount of data of this segment that is truly resident for this process in it’s address space. ‘Pss’ is fairly unknown, and is the proportional size of this segment. The way it is proportional is that if pages in this allocation are shared with other processes, the size of these pages are divided by the number processes it is shared with. In this case, we have loaded the text segment of the cat executable into the process’ address space, which all is resident (size and rss are the same) and it’s not shared with any process (rss equals pss). There are many more properties, but these are out of scope for this blogpost.

Now let’s move on to Oracle. If you look at the maps output of the pmon process for example, you’ll see:

[oracle@oracle-linux 14153]$ cat maps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8f000-10de8000 rw-p 1058f000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10de8000-10e19000 rw-p 00000000 00:00 0
1190f000-11930000 rw-p 00000000 00:00 0                                  [heap]
...

Here we see the Oracle executable, with a text segment, a readonly data segment and a read/write data segment, and we see an anonymous mapping directly following the data segments. That’s the BSS segment!
However, what is more interesting to see, is the properties of the distinct memory allocations in smaps:

[oracle@oracle-linux 14153]$ cat smaps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:             267704 kB
Rss:               40584 kB
Pss:                 819 kB
Shared_Clean:      40584 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:        40584 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:                136 kB
Rss:                 124 kB
...

If we look at the text segment for the oracle binary, we see the total text size is 267704 kB (size), but resident (truly available for this process in its address space) is only 40584 kB (rss), and because the oracle executable’s text segment is shared with a lot of processes, the proportional size is only 819 kB (pss).

If you want to understand how much memory is taken in the system, the size is telling the total size of the segment, but it doesn’t say anything on true memory usage. The rss size tells the amount of pages for the segment that is paged in to the address space of every process, and can (and is, for oracle) different for every process. The pss size is the proportional size for every process. Probably the only way to tell the true amount of memory taken by executables and libraries is to add up all the pss sizes. Any other value only tells something about the process’ point of view on memory usage, but not overall, true consumed space because that would lead to counting too much.

This is different for anonymous allocations. Since anonymous allocations are created when a process is run, I’ve only seen them initialised purely private. For that reason rss and pss sizes are equal, because every process initialises it strictly for itself. This too works in a lazy allocation way. When memory is allocated, the size is defined, but is only really allocated once it’s truly used, which is expressed by a difference between size and rss.

2. shared pages
The Oracle databases relies on shared caches and data structures, which are put into what is called the SGA, the system global area. The main components of the SGA are the shared pool (shared structures), log buffer (change vectors to be written to disk to persist changes) and the buffer cache, amongst others. With any memory management option (manual management, ASMM (automatic shared memory management, sga_target) and AMM (automatic memory management, memory_target)) there is a SGA. Depending on the memory option, these are visible in a different way.

When manual memory or ASMM is used, shared memory is allocated as system V shared memory. The ‘classic’ way of looking at system V shared memory is using ipcs -m (m is for shared memory, you can also use s for semaphores and q for message queues):

[oracle@oracle-linux ~]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 655360     oracle     600        2932736    124
0x00000000 688129     oracle     600        905969664  62
0x00000000 720898     oracle     600        139673600  62
0x5f921964 753667     oracle     600        20480      62

Please mind that if you have more than one instance active, or an ASM instance active, you will see more shared memory allocations.
Apparently, the oracle database allocates a couple of shared memory segments. If you want to understand what these memory allocations are for, you can use the oradebug ipc command to see what their functions are:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
IPC information written to the trace file

This generates a trace file in the ‘trace’ directory in the diagnostics destination. Here is how this looks like (partial output with content of interest to this blogpost):

 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002cbe70 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2   655360 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cc000 00000000002cc000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000036000000 Minimum Subarea size 00400000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0   688129 0x00000060400000 0x00000060400000 0x00000060400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000036000000 0000000036000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1   720898 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008534000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000005000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3   753667 0x0000009ec00000 0x0000009ec00000 0x0000009ec00000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000005000 0000000000005000 default       readwrite

The first allocation is ‘fixed size’, alias the fixed SGA, the second allocation is the ‘variable size’, which contains the shared pool and the buffercache, the third allocation is the ‘redo buffers’ and the fourth is the ‘skgm overhead’ alias the index into the shared memory structures for this instance.

Because any memory allocation is visible in maps and smaps, this method can be used for shared memory too, to see how the shared memory segments are mapped into the process address space. All oracle database server processes have the shared memory segments for the instance mapped into their address space. The usage is different per process, so the amount of shared memory paged into the address space will be different:

...
12bcd000-12bee000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:05 688129                             /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:05 720898                             /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:05 753667                             /SYSV5f921964 (deleted)
7f473004e000-7f47301d4000 r-xp 00000000 fb:02 212635773                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
...

Shared memory is easily identified by the ‘s’, at which “normal” private memory mappings have ‘p’. If you want to know more about the process’ perspective of the shared memory, we can use smaps, just like with private memory mappings (virtual memory space of pmon):

60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd sh mr mw me ms sd
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
Size:               2860 kB
Rss:                 392 kB
Pss:                  36 kB
Shared_Clean:          0 kB
Shared_Dirty:        372 kB
Private_Clean:         0 kB
Private_Dirty:        20 kB
Referenced:          392 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms sd

These two shared memory segments are belonging to the fixed sga. The reason for two segments is the first page (0x1000 equals 4096, alias a single linux page) is readonly (r–s). The other fixed SGA segment is read write (rw-s). Here we see that from the process’ perspective it really doesn’t matter much if a piece of mapped memory is shared or private; it’s exactly handled the same way, which means the full segment is mapped into the process’ virtual memory space, but only once pages are touched (alias truly used), the process registers the address in its pagetable, and the pages become resident (as can be seen in the difference between the total size and the rss). The sole purpose of shared memory is it is shared between process. That the pages are shared is very well visible with the difference between rss and pss size. Its also easy to spot this shared memory segment is created from small pages; MMUPageSize and KernelPageSize is 4kB.

However, this yields an interesting question: shared memory does not belong to any single process. Does that mean that if a shared memory segment is created, it is truly allocated, or can shared memory be lazy allocated as well? Please mind that above statistics are the process’ perspective, not the kernel’s perspective.

One way to see the state of shared memory system wide, is using the ‘-u’ flag with the ipcs command:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  255684
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This is a really useful view! What we can see from the output from this command, is that nearly all pages allocated as shared memory are resident. By having statistics for shared memory pages allocated and resident we can conclude that shared memory too could be allocated in a lazy, alias on demand. Also, there is a difference between resident and allocated, which indicates lazy allocation too.

Inside the database I am aware of two parameters that could influence shared pages usage; pre_page_sga and _touch_sga_pages_during_allocation, see my article on these. However, what is interesting, is that these parameters are different for the instance I am testing with for this blogpost, which is running on a VM:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: page
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%page%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME   						   VALUE       								  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
olap_page_pool_size    				   0   									  TRUE 	   FALSE      FALSE
pre_page_sga   					   TRUE        								  TRUE 	   FALSE      FALSE
use_large_pages        				   TRUE        								  TRUE 	   FALSE      FALSE
_max_largepage_alloc_time_secs 			   10  									  TRUE 	   FALSE      FALSE
_olap_page_pool_expand_rate    			   20  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hi     				   50  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hit_target     			   100 									  TRUE 	   FALSE      FALSE
_olap_page_pool_low    				   262144      								  TRUE 	   FALSE      FALSE
_olap_page_pool_pressure       			   90  									  TRUE 	   FALSE      FALSE
_olap_page_pool_shrink_rate    			   50  									  TRUE 	   FALSE      FALSE
_realfree_heap_pagesize        			   65536       								  TRUE 	   FALSE      FALSE
_realfree_pq_heap_pagesize     			   65536       								  TRUE 	   FALSE      FALSE
_session_page_extent   				   2048        								  TRUE 	   FALSE      FALSE
_touch_sga_pages_during_allocation     		   FALSE       								  TRUE 	   FALSE      FALSE

14 rows selected.

In the database I created on my VM, pre_page_sga equals to TRUE and _touch_sga_pages_during_allocation to FALSE, which is the exact inverse of the settings of a database (PSU 160419) on a huge machine. Perhaps these parameters are dynamically set based on size of the SGA and logic (if _touch_sga_pages_during_allocation is TRUE, it makes sense to set pre_page_sga to FALSE, as it’s function has been performed by the bequeathing session.

However, having pre_page_sga set to TRUE it makes sense almost all SGA (shared) pages are allocated, because pre_page_sga (at least in Oracle 12, not sure about earlier versions, because the Oracle description of this parameter is different from what happens in Oracle 12) spawns a background process (sa00) that scans SGA pages, which means it pages them, resulting in the actual allocation. Let’s test this by setting pre_page_sga to false, it should lead to way lesser shared memory pages allocated, which will eventually be allocated as database processes are paging them in:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force;

And then look at ipcs -mu again:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  92696
pages swapped   0
Swap performance: 0 attempts   	 0 successes

As expected, only the bare necessary pages are resident after startup force, all the other shared pages will be slowly paged in as foreground and background processes touching SGA pages during execution.

How would that work when we set sga_max_size to a different value than sga_target? If the pages beyond the sga_target are never allocated, you could control the amount of SGA pages used by setting sga_target, but ‘reserve’ extra memory to use by setting sga_max_size higher, which is never allocated, so it is not wasted. Let’s setup the instance:

SQL> alter system set pre_page_sga=true scope=spfile;
SQL> show spparameter sga_target

SID    	 NAME  			       TYPE    	   VALUE
-------- ----------------------------- ----------- ----------------------------
*      	 sga_target    		       big integer 1000M
SQL> ! ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  102512
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This sets the pre_page_sga parameter from the spfile, which means the instance will spawn a process to touch SGA pages on next startup.
Currently, the sga_target for sizing the SGA is set to 1000M in the spfile.
ipcs tells us we got 256005 pages are allocated, which makes sense: 256005*4=1024020k, which is slightly more than the set 1000M, which means essentially sga_target equals pages allocated.

SQL> alter system set sga_max_size=2g scope=spfile;
SQL> startup force;
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

This sets sga_max_size to double the amount of sga_target, and ‘startup force’ bounces the instance.

SQL> show parameter sga_target

NAME   				     TYPE      	 VALUE
------------------------------------ ----------- ------------------------------
sga_target     			     big integer 1008M

Here we see the actual parameter in the database is set to 1008M. Now let’s look at the ipcs -mu values again:

> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  521923
pages swapped   0
Swap performance: 0 attempts   	 0 successes

521923*4=2087692. So (almost) all the memory set for sga_max_size is allocated. In fact, if you look at the values at instance startup values reported above, you see ‘Total System Global Area’ showing the 2G, it’s all SGA, so it’s all touched because of pre_page_sga being set to TRUE. So the next test would be to have pre_page_sga being set to FALSE:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

All memory is still declared SGA, as we can see. However, by having _touch_sga_pages_during_allocation set to FALSE and pre_page_sga set to FALSE, we should see only the actual used SGA pages being allocated:

SQL> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  91692
pages swapped   0
Swap performance: 0 attempts   	 0 successes

The above output shows the shared memory status directly after I restart my instance, so this is not only less than sga_max_size, it is even less than sga_target (91692*4=336768, ~ 336M). This will grow up to sga_target, because these pages will get paged in by the database processes.

How does this look like when we add in huge pages? In Oracle 12.1.0.2.160719 in my instance the parameter to tell oracle to allocate huge pages if there are any (‘use_large_pages’) is set to TRUE. This will make Oracle use large pages if any are available. This is true, even if there are not enough huge pages to satisfy the entire SGA; Oracle will just allocate all that can be allocated, and create a new shared memory segment using small pages for the remainder of the needed shared memory.

Sadly, it seems per memory segment statistics like rss, pss, shared and private clean and dirty, etc. are not implemented for huge pages:

[oracle@oracle-linux [testdb] ~]$ cat /proc/$(pgrep pmon)/smaps
...
61000000-d8000000 rw-s 00000000 00:0e 688129                             /SYSV00000000 (deleted)
Size:            1949696 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms de ht sd
...

This is the main shared memory segment, allocated from huge pages (as can be seen with KernelPageSize and MMUPageSize), which means it’s the segment holding the shared pool and buffercache. This can also be seen by the size: 1949696 kB, which is nearly the 2G of sga_max_size.

However, we can just use the global information on system V shared memory (ipcs -mu) and we can use the huge page information in /proc/meminfo:

[oracle@oracle-linux [testdb] ~]$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:    1100
HugePages_Free:      880
HugePages_Rsvd:      805
HugePages_Surp:        0
Hugepagesize:       2048 kB

The statistics of interest are:
hugepages_total: the total number of huge pages allocated. warning: huge pages memory allocated by the kernel is NOT available for allocation of regular sized pages (which means you can starve your processes and the kernel for normal pages by setting the number of huge pages too high).
hugepages_free: the number of huge pages which are not used currently. warning: this includes allocated but not yet initialised pages, which hugepages_rsvd shows.
hugepages_rsvd: the number of huge pages allocated but not yet initialised.
hugepages_surp: the number of huge pages allocated (truly allocated and not yet initialised) greater than the total number of huge pages set. this value can be greater than zero if the kernel setting vm.nr_overcommit_hugepages is greater than zero. The value of this setting is zero by default, and at least for usage with the Oracle database, this value should remain zero.

The same information can be obtained using ipcs -mu, but with a twist:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524803
pages resident  122881
pages swapped   0
Swap performance: 0 attempts   	 0 successes

Some of you might get the twist on this by looking at the number.
It turns out ipcs has no facility for huge pages, it just reports the number of pages as if these were 4 kB.
524803*4 (kB) / 1024 (to make it MB) = 2050.

Now going back to the goal of looking into this: I told shared memory is allocated and paged at startup time when _touch_sga_pages_during_allocation is set to TRUE (set to false as default value in my current database), and it could be explicitly paged by the background process sa00 after startup of the instance when pre_page_sga is set to TRUE. When both are set to false, shared memory allocated from default sized 4kB pages is allocated only when it’s used. In the above examples with huge pages, the tests were done with pre_page_sga set to false. This shows exactly the same ‘lazy allocation’ behaviour as 4kB pages.

When ‘extra’ memory is reserved from the operating system by setting sga_max_size to a higher value than sga_target, this will all be allocated and paged if either _touch_sga_pages_during_allocation or pre_page_sga is set to TRUE, which doesn’t make sense; if the memory is taken, you might as well use it. However, this is different if both _touch_sga_pages_during_allocation and pre_page_sga are set to false. All memory beyond sga_target up to sga_max_size is allocated, but never touched, and thus never paged in, so never truly allocated. Please mind linux itself understands this perfectly (aiming at huge pages and ‘reserved’ pages), however the system V ipc kernel settings do not; you need to set the shared memory values high enough to facilitate the total sum of sga_max_size values, not the truly used sizes as indicated by the sum of sga_target values.

The inspiration for this investigation came from a question on my blog. However, the question was about memory_target and memory_max_target and AIX. I do not have an AIX system at hand. I did not investigate the implementation of memory_target and memory_max_target on AIX. So I can’t comment on that. What I can say, is that on Linux, you really, really should use automatic shared memory management (ASMM) alias setting sga_target or setting it manually (and set huge pages!). If you are used to these memory management settings on databases not on AIX, it probably makes sense to use that on AIX too, even if the automatic memory management (AMM) alias setting memory_target is implemented brilliantly on AIX, for the sake of predictability and standardisation.

This is the second part of a series of blogpost on Oracle database PGA usage. See the first part here. The first part described SGA and PGA usage, their distinction (SGA being static, PGA being variable), the problem (no limitation for PGA allocations outside of sort, hash and bitmap memory), a resolution for Oracle 12 (PGA_AGGREGATE_LIMIT), and some specifics about that (it doesn’t look like a very hard limit).

But this leaves out Oracle version 11.2. In reality, the vast majority of the database that I deal with at the time of writing is at version 11.2, and my guess is that this is not just the databases I deal with, but a general tendency. This could change in the coming time with the desupport of Oracle 11.2, however I suspect the installed base of Oracle version 12 to increase gradually and smoothly instead of in a big bang.

With version 11.2 there’s no PGA_AGGREGATE_LIMIT. This simply means there is no official way to limit the PGA. Full stop. However, there is an undocumented event to limit PGA usage: event 10261. This means that if you want to use this in a production database, you should ask Oracle support to bless the usage of it. On the other hand, Oracle corporation made this event public in an official white paper: Exadata consolidation best practices.

Let’s test event 10261! I’ve got the same table (T2) setup, a description how to set this up, and the anonymous PL/SQL code to allocate PGA using a collection is in the first part. I am using a database version 11.2.0.4 with PSU 4 applied. The reason for choosing this version is that if you run a serious business on Oracle 11.2, THAT should be the version you should be running on!
(disclaimer: everything shown in this blogpost is purely for educational purposes. Do test everything thoroughly before applying this to a production system. Behaviour can or may be different in your specific situation)
The reason for this disclaimer: Bernhard (@bdcbuning_gridit) tweeted that he was warned that when setting it at the instance level, it could crash the instance. I am not sure if this means setting it at runtime, this event is always evaluated at the instance level.

Okay, let’s replicate more or less the test done to Oracle version 12.1.0.2 in the first part. In this database PGA_AGGREGATE_SIZE is set to 500M, now let’s try to set the event to 600M, which means we set the PGA limit to 600M:
This is setting the event on runtime:

SYS@v11204 AS SYSDBA> alter system set events = '10261 trace name context forever, level 600000';

System altered.

This is setting the event in the spfile (which means you need a restart of the instance to activate this event, or the above syntax to set it on runtime):

SYS@v11204 AS SYSDBA> alter system set event = '10261 trace name context forever, level 600000' scope=spfile;

System altered.

The level is the amount of memory to which the PGA must be limited, in kilobytes.

Now start the anonymous PL/SQL block to fill up the PGA with a collection, again set to 900M:

TS@v11204 > @pga_filler
declare
*
ERROR at line 1:
ORA-10260: limit size (600000) of the PGA heap set by event 10261 exceeded
ORA-06512: at line 20

That’s nice! There’s actually a meaningful, describing error message which explains why this PL/SQL block ended!

Let’s look at the actual PGA memory used, as reported by v$pgastat:

SYS@v11204 AS SYSDBA> select value/power(1024,2) from v$pgastat where name = 'maximum PGA allocated';

VALUE/POWER(1024,2)
-------------------
	 676.078125

This is different than setting PGA_AGGREGATE_LIMIT, however there’s still more memory allocated than set as the limit (600000KB), but lesser (676M in 11.2.0.4 versus 1041M in 12.1.0.2). The outside visibility of the limiting happening is different too: there is NO notice of a process hitting the PGA limit set in the alert.log file nor the process’ trace file(!). Another difference is even SYS is limited, a test with the procedure running as SYS gotten me the ORA-10260 too, PGA_AGGREGATE_LIMIT does not limit SYS.

Event 10261 has got the same description to at least as low as version 11.2.0.1. Here’s a test with with the event 10261 set at version 11.2.0.3 to 600M:

TS@v11203 > @pga_filler
declare
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [723], [123552], [top uga heap], [], [], [], [], [], [], [], [], []
ORA-06512: at line 20

As has been detailed in the Oracle white paper, prior to version 11.2.0.4, an ORA-600 [723] is signalled when event 10261 is set, and more PGA memory is allocated as has been specified as limit. The amount of total allocated PGA is 677M, so roughly the same as with version 11.2.0.4.

Because this is a genuine ORA-600 (internal error, ‘OERI’), this gives messages in the alert.log file:

Tue Dec 16 10:40:09 2014
Errors in file /u01/app/oracle/diag/rdbms/v11203/v11203/trace/v11203_ora_8963.trc  (incident=9279):
ORA-00600: internal error code, arguments: [723], [123552], [top uga heap], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/v11203/v11203/incident/incdir_9279/v11203_ora_8963_i9279.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

The process’ trace file in the trace directory only points to the incident file, no further details are available there.
The incident trace file contains a complete diagnostics dump.

The behaviour is identical with Oracle 11.2.0.2.

Summary
The limiting of the total amount of PGA memory used must be done using an undocumented event prior to Oracle version 12. The event is 10261. The event is made known in an official white paper. Still I would open a service request with Oracle to ask blessing for setting this. This does not mean this functionality is not needed, I would deem it highly important in almost any environment, even when running a single database: this setting, when done appropriately, protects your system from over allocating memory, which could mean entering the swapping death-spiral. The protection means a process gets an ORA message, and the PGA allocation aborted and deallocated.

With version 11.2.0.4 hitting the limit as set with event 10261 is not published, outside of the process getting the ORA-10260.

With versions prior to 11.2.0.4 (11.2.0.3 and 11.2.0.2 verified) processes do get an ORA-600 [723], which is also visible in the alert.log, and incidents are created accordingly.

When a limit has been set using event 10261, it still means more memory is allocated than set as limit (approximately 677M when 600M is set), but this is way less than with the PGA_AGGREGATE_LIMIT (1041M when 600M is set) in my specific situation. Test this in your own environment when you start using this.

Important addendum:
A very good comment to emphasise on the behaviour of using/setting event 10261 by Alexander Sidorov: this event sets a limit per process, not for the entire instance!! (tested with 11.2.0.4 and 11.2.0.3)