I was asked some time ago what the Oracle database event ‘TCP socket (KGAS)’ means. This blogpost is a deep dive into what this event times in Oracle database 12.1.0.2.180717.

This event is not normally seen; it only shows up when TCP connections are initiated from the database using packages like UTL_TCP, UTL_SMTP and the one used in this article, UTL_HTTP.

A very basic explanation is that this event times the time a database foreground session spends on TCP connection management and on communicating over TCP, excluding client and database link (sqlnet) networking. If you trace the system calls, you see that this mostly means working with a (network) socket. Part of the Oracle database code that manages this sits in the kernel code layer kgas: kernel generic (of which I am quite sure, and then my guess:) asynchronous services, which explains the naming of the event.

This is what the Oracle online manual (https://docs.oracle.com/database/121/REFRN/GUID-203ACA60-9912-4493-9B79-EA4CDE89A78D.htm#REFRN00642 – Oracle is notorious for breaking links) says about ‘TCP socket (KGAS)’:

C.3.157 TCP Socket (KGAS)
A session is waiting for an external host to provide requested data over a network socket. The time that this wait event tracks does not indicate a problem, and even a long wait time is not a reason to contact Oracle Support. It naturally takes time for data to flow between hosts over a network, and for the remote aspect of an application to process any request made to it. An application that communicates with a remote host must wait until the data it will read has arrived. In addition, on Microsoft Windows, a separate thread monitors the arrival of traffic. This thread spends most of its life in waits tracked by the TCP Socket (KGAS) wait event.

Wait Time: The total elapsed time for the network connection to be established or for data to arrive from over the network

Parameter Description
P0 For Oracle internal use only. Values 8, 9, and 10 occur within the special thread present on Microsoft Windows; other P0 values occur in normal user sessions.

P1 For Oracle internal use only

Overall, the basic explanation that Oracle provides is mostly correct. I think the text saying not to contact Oracle Support is not relevant, but maybe there is a need to relieve Oracle Support. In my tests, I found that the full TCP connection lifecycle (creation, usage and removal) is put under this event. The text emphasises waiting for a remote host, which would be the most logical culprit for wait times, but other issues could lead to wait times as well. This means the wait event by itself does not explain what it is showing, beyond the fact that it is TCP connection management.

The wait time explanation is nearly complete. If it said something like ‘all TCP connection management and usage’ it would have fully covered it. As written it excludes disconnecting and sending, because it explicitly mentions only creating the connection and receiving (waiting for data).

I do not understand what is meant by P0 and P1. I think they are p1 and p2 of the wait event, but naming them P0 and P1 is weird. The explanation reads to me as ‘we do not wish to explain anything to you’.

So, that means I am going to find this out myself….

If you are interested in this, or want to write articles like this too, I urge you to replay this on your own system.

First of all, create a small setup which you can use to actually execute UTL_HTTP. The example is essentially taken from the father of code examples, Tim Hall/Oracle-Base. Thank you Tim!
First set up the database to allow a user (in my case ‘ts’) to create the objects and use the network:

grant create sequence to ts;
grant create procedure to ts;
grant create table to ts;
grant alter session to ts;
begin
  dbms_network_acl_admin.create_acl (
    acl => 'anything.xml',
    description => 'allow anything',
    principal => 'TS',
    is_grant => true,
    privilege => 'connect'
  );
end;
/
begin
  dbms_network_acl_admin.assign_acl (
    acl => 'anything.xml',
    host => '*'
  );
end;
/

Then connect as the actual user (ts), and create the objects and the procedure that uses UTL_HTTP:

drop table http_clob_test;
create table http_clob_test (
        id number(10),
        url varchar2(255),
        data clob,
        constraint http_clob_test_pk primary key (id)
);
drop sequence http_clob_test_seq;
create sequence http_clob_test_seq;
CREATE OR REPLACE PROCEDURE load_html_from_url (p_url  IN  VARCHAR2) AS
  -- examples by tim hall
  -- https://oracle-base.com/articles/misc/retrieving-html-and-binaries-into-tables-over-http
  l_http_request   UTL_HTTP.req;
  l_http_response  UTL_HTTP.resp;
  l_clob           CLOB;
  l_text           VARCHAR2(32767);
BEGIN
  DBMS_LOB.createtemporary(l_clob, FALSE);

  -- Make a HTTP request and get the response.
  l_http_request  := UTL_HTTP.begin_request(p_url);
  l_http_response := UTL_HTTP.get_response(l_http_request);

  -- Copy the response into the CLOB.
  BEGIN
    LOOP
      UTL_HTTP.read_text(l_http_response, l_text, 32766);
      DBMS_LOB.writeappend (l_clob, LENGTH(l_text), l_text);
    END LOOP;
  EXCEPTION
    WHEN UTL_HTTP.end_of_body THEN
      UTL_HTTP.end_response(l_http_response);
  END;

  -- Insert the data into the table.
  INSERT INTO http_clob_test (id, url, data)
  VALUES (http_clob_test_seq.NEXTVAL, p_url, l_clob);

  -- Release the resources associated with the temporary LOB.
  DBMS_LOB.freetemporary(l_clob);
EXCEPTION
  WHEN OTHERS THEN
    UTL_HTTP.end_response(l_http_response);
    DBMS_LOB.freetemporary(l_clob);
    RAISE;
END load_html_from_url;
/

The first thing to do is turn on sql_trace at level 8 to include waits:

set serverout on
alter session set events 'sql_trace level 8';
exec load_html_from_url('http://orafun.info/');
alter session set events 'sql_trace off';

If you look at the relevant piece, which means where it shows the wait events, it looks like this:

WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 128265  =2  =0  =0 obj#=662 tim=86395107497
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 395  =5  =0  =0 obj#=662 tim=86395110191
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 150  =6  =0  =0 obj#=662 tim=86395111115
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 131998  =6  =0  =0 obj#=662 tim=86395243764
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 269  =4  =0  =0 obj#=662 tim=86395245182
WAIT #139864521752120: nam='direct path write temp' ela= 4137 file number=201 first dba=257795 block cnt=1 obj#=662 tim=86395250494
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 352  =3  =2  =0 obj#=662 tim=86395251294

What this shows is some quite spectacular differences in elapsed time. Also, the only way to understand what is actually being done under ‘TCP Socket (KGAS)’ is the first value following the ‘ela’ value, which is the event p1 value (the parameter names are empty in the trace, hence the bare ‘=2’).
The pattern is:

- 2
- 5
- 6
- 6
- 4
- 3

It’s relatively simple to guess what a few of these are:

- 2 - connect
- 5 - send
- 6 - \
- 6 - |   receiving ?
- 4 - /
- 3 - close

But if you include the timing, there must be more in play:

- 2 - ela= 128265   connect
- 5 - ela= 395      send
- 6 - ela= 150      \
- 6 - ela= 131998   |   receiving ?
- 4 - ela= 269      /
- 3 - ela= 352      close

2/connect: To build up a connection, a TCP connection needs to be created and established. That takes some time.
5/send: Sending, from the perspective of a userland process, is writing into a socket; the operating system then sends the data independently. This means sending normally takes relatively little time for a userland process, because it does not wait for the data to actually be sent.
6,4/receive: At this point, this doesn’t make sense to me.
3/close: Closing is a simple, swift task for a userland process. The operating system will keep the port open for some time, etc., but this is not visible to the userland application.
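
To illustrate the point about sending, here is a minimal Python sketch (not Oracle's code). It uses a local socket pair as a stand-in for the database process and the remote server; the names sender and receiver are my own. The send call returns as soon as the kernel has buffered the bytes, before the other side has read anything.

```python
import socket
import time

# A connected pair of local sockets stands in for the database process
# (sender) and the remote server (receiver); both names are illustrative.
sender, receiver = socket.socketpair()

# The same 56 bytes that show up in the strace output of the HTTP request.
payload = b"GET / HTTP/1.1\r\nHost: orafun.info\r\nConnection: close\r\n\r\n"

start = time.perf_counter()
sent = sender.send(payload)   # returns once the kernel has buffered the bytes
elapsed_us = (time.perf_counter() - start) * 1_000_000

# The receiver has not read anything yet, but send() has already returned.
print(f"send() accepted {sent} bytes in about {elapsed_us:.0f} microseconds")

# Only now does the receiving side pick the data up from its socket buffer.
received = receiver.recv(4096)
print("payload arrived intact:", received == payload)

sender.close()
receiver.close()
```

The elapsed time will differ per system, which is why no number is shown here; the point is only that the sender does not wait for the receiver.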

Let’s pull an old trick out of the hat: use strace (system call tracing), with an emphasis on writes, on an Oracle session that has SQL trace with waits enabled. This will show the system calls executed, and show exactly when the Oracle engine ends a wait, so we can reasonably well establish a relation between wait events and system calls. I say “reasonably well”, because we can’t see when Oracle truly started timing the wait event (kslwtbctx), only the output to the trace file as part of ending the wait event (kslwtectx).

The way I did it is by using the command ‘strace -e write=all -p 18513 -o utl_http_test.txt’. Obviously 18513 is the process ID of the database foreground process. The results of the strace are in utl_http_test.txt.

Now open utl_http_test.txt and search for KGAS. The full output is way too much text, so let me show some of the output which I think is noteworthy. Again: this is selective, partial output.
I do maintain the order in which the calls are visible.

1. TCP Socket (KGAS) p1=2, earlier annotated as ‘connect’

-- try to find a socket that has been created by nscd (name service cache daemon)
-- two times?
--
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
--
-- obtain file status of resolv.conf (hostname resolving configuration file)
--
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
--
-- open and read host.conf (another hostname resolving configuration file)
--
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "multi on\n", 4096)             = 9
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open and read resolv.conf (hostname resolving configuration)
--
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "# Generated by NetworkManager\nse"..., 4096) = 93
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open /etc/hosts (ip address to hostname mapping locally)
--
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=200, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "127.0.0.1   localhost localhost."..., 4096) = 200
read(9, "", 4096)                       = 0
close(9)
--
-- at this point two dynamic loadable libraries are read: libnss_dns.so.2 and libresolv.so.2
--
-- this is the DNS lookup of orafun.info
-- again, this is done twice, just like the use of /var/run/nscd/socket above?
--
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 0)    = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "=#\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  3d 23 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  =#...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 01 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 5000)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [45])                = 0
recvfrom(9, "=#\201\200\0\1\0\1\0\0\0\0\6orafun\4info\0\0\1\0\1\300\f\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 45
close(9)                                = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 4971) = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "o=\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  6f 3d 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  o=...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 1c 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 4970)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [109])               = 0
recvfrom(9, "o=\201\200\0\1\0\0\0\1\0\0\6orafun\4info\0\0\34\0\1\300\f\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 109
close(9)                                = 0
--
-- an epoll is created at file descriptor 9 (epoll: I/O event notification facility)
--
epoll_create(82)                        = 9
fcntl(9, F_SETFD, FD_CLOEXEC)           = 0
--
-- an IPv6 datagram socket is created at file descriptor 11,
-- bound to the IPv6 equivalent of localhost (::1) with port 0,
-- so the kernel picks the local port (63257),
-- and it is NOT connected.
--
socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 11
bind(11, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(11, {sa_family=AF_INET6, sin6_port=htons(63257), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
getpeername(11, 0x7ffdea6ba0f8, 0x7ffdea6ba1c8) = -1 ENOTCONN (Transport endpoint is not connected)
getsockopt(11, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
getsockopt(11, SOL_SOCKET, SO_RCVBUF, [262144], [4]) = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
--
-- File descriptor 11 is added to the epoll at file descriptor 9.
--
epoll_ctl(9, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=3110993336, u64=139864426020280}}) = 0
--
-- A connection is created to the true destination (orafun.info/18.218.92.122).
-- This connection gets file descriptor 12.
-- Destination port 80 (http), source port 11751.
--
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 12
fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(12, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("18.218.92.122")}, 16) = -1 EINPROGRESS (Operation now in progress)
times(NULL)                             = 438106227
mmap(NULL, 786432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34b959b000
poll([{fd=12, events=POLLOUT}], 1, 60000) = 1 ([{fd=12, revents=POLLOUT}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(12, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(12, F_SETFL, O_RDWR)              = 0
getsockname(12, {sa_family=AF_INET, sin_port=htons(11751), sin_addr=inet_addr("10.0.2.15")}, [16]) = 0
getsockopt(12, SOL_SOCKET, SO_SNDBUF, [87040], [4]) = 0
getsockopt(12, SOL_SOCKET, SO_RCVBUF, [374400], [4]) = 0
setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(12, F_SETFD, FD_CLOEXEC)          = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=2
--
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 32 38 32  36 35 20 20 3d 32 20 20  la= 128265  =2   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 31 30 37 34 39 37     tim=86395107497  |

So yes, I am not sure if all of this is in the wait event, but there is a lot of stuff happening to build a connection to the remote server.
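
The last part of the strace output above (socket, non-blocking connect returning EINPROGRESS, poll for POLLOUT, getsockopt SO_ERROR, back to blocking, TCP_NODELAY) is a common pattern for a connect with a timeout. Below is a minimal Python sketch of that pattern, not Oracle's implementation. It connects to a local listener that I create purely for illustration, so it does not need network access.

```python
import errno
import select
import socket

# Local listener standing in for the remote web server (illustrative only).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the kernel pick a free port
server.listen(1)
address = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.setblocking(False)            # comparable to fcntl(fd, F_SETFL, O_NONBLOCK)

# A non-blocking connect normally returns EINPROGRESS straight away.
rc = client.connect_ex(address)
print("connect_ex returned:", errno.errorcode.get(rc, rc))

# Wait until the socket becomes writable, with a 60 second timeout (in ms).
poller = select.poll()
poller.register(client, select.POLLOUT)
events = poller.poll(60000)

# SO_ERROR tells whether the connection attempt succeeded (0) or failed.
so_error = client.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print("SO_ERROR after poll:", so_error)

client.setblocking(True)             # back to blocking mode, as in the trace
client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

client.close()
server.close()
```

With a real remote host, the time spent in poll is roughly the TCP handshake round trip, which fits the relatively large elapsed time of the p1=2 wait.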

To find out why the lookup, which is first tried via the NSCD socket and later via DNS, is done twice, I ran the same procedure again and used tcpdump to look at the actual network traffic. This explained a lot:

# tcpdump -n host 10.0.2.3 and port 53
09:14:02.923389 IP 10.0.2.15.16819 > 10.0.2.3.domain: 15651+ A? orafun.info. (29)
09:14:02.948791 IP 10.0.2.3.domain > 10.0.2.15.16819: 15651 1/0/0 A 18.218.92.122 (45)
09:14:02.952304 IP 10.0.2.15.54590 > 10.0.2.3.domain: 28477+ AAAA? orafun.info. (29)
09:14:02.979534 IP 10.0.2.3.domain > 10.0.2.15.54590: 28477 0/1/0 (109)

In other words: first a DNS A record is requested (IPv4 name lookup), which returns the IPv4 address, then a DNS AAAA record is requested (IPv6 name lookup), which returns no address. So orafun.info only has an IPv4 address, and the two lookups actually do have a function.
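
The same conclusion can be drawn from the strace output itself, because the two sendto calls dump the raw DNS queries. The following Python sketch decodes those bytes using the standard DNS message layout; the function name parse_dns_query is my own, and it only handles a simple query with one question.

```python
import struct

def parse_dns_query(packet: bytes):
    """Return (transaction id, query name, query type) of a one-question DNS query."""
    # Header: id, flags and four 16-bit counters (12 bytes in total).
    txid, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    labels = []
    pos = 12
    while packet[pos] != 0:              # a zero length byte ends the name
        length = packet[pos]
        labels.append(packet[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    qtype, qclass = struct.unpack("!2H", packet[pos + 1:pos + 5])
    return txid, ".".join(labels), qtype

QTYPE_NAMES = {1: "A", 28: "AAAA"}

# The two query payloads, copied from the hex dumps in the strace output.
first = bytes.fromhex(
    "3d 23 01 00 00 01 00 00 00 00 00 00 06 6f 72 61"
    "66 75 6e 04 69 6e 66 6f 00 00 01 00 01"
)
second = bytes.fromhex(
    "6f 3d 01 00 00 01 00 00 00 00 00 00 06 6f 72 61"
    "66 75 6e 04 69 6e 66 6f 00 00 1c 00 01"
)

for packet in (first, second):
    txid, name, qtype = parse_dns_query(packet)
    print(txid, name, QTYPE_NAMES.get(qtype, qtype))
```

The transaction ids (15651 and 28477) are the same ids tcpdump shows, and the query types are 1 (A) and 28 (AAAA).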

2. TCP Socket (KGAS) p1=5, earlier annotated as ‘send’
These are the system calls that are visible and quite probably related to the send wait event:

--
-- file descriptor 12 holding the connection to the destination server is added to the epoll at file descriptor 9
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- Then the http get request is sent to the destination server at its normal file descriptor, 12.
--
write(12, "GET / HTTP/1.1\r\nHost: orafun.inf"..., 56) = 56
 | 00000  47 45 54 20 2f 20 48 54  54 50 2f 31 2e 31 0d 0a  GET / HTTP/1.1.. |
 | 00010  48 6f 73 74 3a 20 6f 72  61 66 75 6e 2e 69 6e 66  Host: orafun.inf |
 | 00020  6f 0d 0a 43 6f 6e 6e 65  63 74 69 6f 6e 3a 20 63  o..Connection: c |
 | 00030  6c 6f 73 65 0d 0a 0d 0a                           lose....         |
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=5
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 39 35 20  20 3d 35 20 20 3d 30 20  la= 395  =5  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  30 31 39 31              =86395110191     |

3. TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 0, so it doesn't block.
--
epoll_wait(9, [], 82, 0)                = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 35 30 20  20 3d 36 20 20 3d 30 20  la= 150  =6  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  31 31 31 35              =86395111115     |

I have been thinking a lot about this seemingly weird call. It calls epoll_wait, but indicates it doesn’t want to wait (timeout=0), and even if epoll_wait had returned anything, indicated by a return code > 0, the epoll_event pointer is not set (indicated by []). The epoll file descriptor is used, and the only file descriptor in the epoll with an active connection is file descriptor 12, which has just been used to send the HTTP GET request, so the epoll functionality is genuinely in use.

This doesn’t make sense, unless you think about the asynchronous IO implementation of Oracle (see a lot of my earlier investigations), in which io_getevents was called in a similar manner, with timeout set to 0, to be able to submit more requests while earlier IO requests are executed by the kernel. So my current theory here is that if multiple requests are happening, this mechanism provides a way to handle them.

If you have a simple single request, like in my case, this system call seems redundant. And because it queries the epoll file descriptor right after the request, it returns zero events, because hardly any time has passed since sending the HTTP GET request.

4. Second TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 30000 (milliseconds).
--
epoll_wait(9, [{EPOLLIN, {u32=3110998864, u64=139864426025808}}], 82, 30000) = 1
--
-- And this is the second wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 33 31 39  39 38 20 20 3d 36 20 20  la= 131998  =6   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 32 34 33 37 36 34     tim=86395243764  |

This is the second time epoll_wait is called, and this one is blocking, because timeout has been set to 30000 milliseconds. If you look at the ela time, this took some time, and this now makes perfect sense: this system call waits for an event to become available in the epoll, so it waits for the response of the remote HTTP server. Please mind this call just notifies the userland process that the response is ready; the received data still has to be read.
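
The difference between the two epoll_wait calls can be reproduced with a few lines of Python on Linux. This is a sketch under my own assumptions, not what the Oracle executable does internally: a local socket pair plays the role of the connection to the web server, and the variable names are mine. Note that Python's epoll.poll takes its timeout in seconds, where the system call takes milliseconds.

```python
import select
import socket

# A local socket pair stands in for the connection to the web server.
local, remote = socket.socketpair()
fd = local.fileno()

epoll = select.epoll()
epoll.register(fd, select.EPOLLIN)

# Timeout 0: return immediately, even when nothing is ready.
first = epoll.poll(0)
print("non-blocking check:", first)

# The "remote" side answers; from now on the local socket is readable.
remote.send(b"HTTP/1.1 200 OK\r\n\r\n")

# Timeout of 30 seconds: block until an event arrives or the timeout expires.
second = epoll.poll(30)
print("blocking wait:", second)

# epoll only reported readiness; the data still has to be read explicitly.
# The descriptor is taken off the epoll, read, and added again.
epoll.unregister(fd)
data = local.recv(4096)
epoll.register(fd, select.EPOLLIN)
print("bytes read:", len(data))

epoll.close()
local.close()
remote.close()
```

The first poll comes back empty, the second one reports the descriptor as readable, and only the recv call actually moves the data into the process.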

5. TCP Socket (KGAS) p1=4, earlier annotated as ‘receive’

--
-- At this point we know there is a response. First the original file descriptor is removed from the epoll:
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6b9710) = 0
--
-- Then the response is read from file descriptor 12:
--
read(12, "HTTP/1.1 200 OK\r\nServer: nginx/1"..., 4096) = 2687
--
-- Then file descriptor 12 is added to the epoll again.
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=4
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 32 36 39 20  20 3d 34 20 20 3d 30 20  la= 269  =4  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 34  35 31 38 32              =86395245182     |

So what p1 set to 4 actually means is that once the connection has returned data, which is checked using epoll and visible as p1 set to 6, the data is read into the process. This is also the reason it takes very little time: it is the time to copy data from kernel space to user space, and to manage the connection’s file descriptor. The file descriptor is taken off the epoll in order not to disturb it, and it is added again because there could be another request.

6. TCP Socket (KGAS) p1=3, earlier annotated as ‘close’

--
-- file descriptor 12 removed from the epoll
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6bac20) = 0
--
-- file descriptor 12 is closed, closing the network connection
--
close(12)                               = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=3
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 35 32 20  20 3d 33 20 20 3d 32 20  la= 352  =3  =2  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 35  31 32 39 34              =86395251294     |

I don’t think this part holds any surprises. The network file descriptor is first removed from the epoll, and then it is closed, ending the TCP connection that was set up to perform an HTTP request. (That is my case; I didn’t test it, but I believe you will see the same with, for example, an SMTP connection, or any other type of TCP connection.)

Summary

The basic message of this article is not surprising, nor does it conflict with current knowledge. Whenever you see a wait event ‘TCP Socket (KGAS)’, it means a foreground process is performing TCP networking via PL/SQL. This wait event is a single event for creating, sending, receiving and closing a connection.

The true information of this article is how you can use the p1 value of the event to learn what the foreground is actually doing, which should give you more information to troubleshoot long wait times.

TCP Socket (KGAS) p1 values:
1 - ?
2 - perform DNS lookup and create connection
3 - close connection
4 - copy TCP response into process space
5 - send request
6 - wait for TCP response to become available
7 - ?
8 - ? \
9 - ? | According to documentation, windows only in a 'special thread'.
10- ? /
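
To put the p1 values to use, here is a small Python sketch that sums the elapsed time per p1 meaning from SQL trace lines. The regular expression assumes the line layout of the trace shown in this article (empty parameter names, so p1 appears as a bare ‘=2’ right after the ela value); the names KGAS_P1 and summarize are my own, and the labels only cover the values I observed on Linux.

```python
import re
from collections import defaultdict

# p1 meanings as observed in this article; other values are unknown to me.
KGAS_P1 = {
    2: "dns lookup + connect",
    3: "close",
    4: "read response",
    5: "send request",
    6: "wait for response",
}

# Matches e.g.: nam='TCP Socket (KGAS)' ela= 128265  =2  =0  =0
WAIT_RE = re.compile(r"nam='TCP Socket \(KGAS\)' ela= (\d+)\s+=(\d+)")

def summarize(trace_lines):
    """Sum the elapsed microseconds of KGAS waits per p1 meaning."""
    totals = defaultdict(int)
    for line in trace_lines:
        match = WAIT_RE.search(line)
        if match:
            ela, p1 = int(match.group(1)), int(match.group(2))
            totals[KGAS_P1.get(p1, f"unknown ({p1})")] += ela
    return dict(totals)

# The wait lines from the trace earlier in this article.
TRACE = """\
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 128265  =2  =0  =0 obj#=662 tim=86395107497
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 395  =5  =0  =0 obj#=662 tim=86395110191
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 150  =6  =0  =0 obj#=662 tim=86395111115
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 131998  =6  =0  =0 obj#=662 tim=86395243764
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 269  =4  =0  =0 obj#=662 tim=86395245182
WAIT #139864521752120: nam='direct path write temp' ela= 4137 file number=201 first dba=257795 block cnt=1 obj#=662 tim=86395250494
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 352  =3  =2  =0 obj#=662 tim=86395251294
""".splitlines()

totals = summarize(TRACE)
for meaning, microseconds in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{meaning:22s} {microseconds:>8d} us")
```

For this trace, nearly all the time is in waiting for the response and in the DNS lookup plus connect, which is exactly the kind of split the event name alone does not give you.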

This post was triggered by rereading a blogpost by Mike Dietrich called databases need patched minimum april 2019. Mike’s blogpost makes it clear this is about databases that are connected using database links, and that:
– Newer databases do not need additional patching for this issue (11.2.0.4, 12.1.0.2, 12.2 and newer).
– Recent PSU patches contain a fix for certain older versions (11.1.0.7, 11.2.0.3 and 12.1.0.1).
– This means versions 11.2.0.2 and earlier 11.2 versions, 11.1.0.6 and earlier and anything at version 10 or earlier can not be fixed and thus are affected.

But what is the actual issue?

The first link in the article: Recommended patches and actions for Oracle databases versions 12.1.0.1, 11.2.0.3 and earlier – before June 2019 (Doc ID 2361478.1) provides essentially the same information as Mike’s blogpost; however, it additionally mentions that the interoperability of database clients with database servers is not affected.

Mike’s article mentions the following:
The alert refers to an SCN issue which came up a while ago. The system change number (SCN) is a logical, internal timestamp used by the Oracle Database. SCNs order events that occur within the database, which is necessary to satisfy the ACID properties of a transaction. The database uses SCNs to query and track changes.

So I guess it has something to do with SCNs. Most of the links are about SCNs. The MOS article that is most descriptive is: System Change Number (SCN), Headroom, Security and Patch Information (Doc ID 1376995.1).

This article describes a lot of details:
– SCNs are necessary for the database to keep changes organised. I got an article that explains that SCNs are not unique to a transaction, but are “just” granular enough to keep changes organised.
– SCNs are an ever increasing number. SCNs are never decreased!
– SCNs have a hard limit, which is version specific, and is based on the number of bits for the number:
– – The general limit is 2^48.
– – From version 12.2 on, with compatibility set to 12.2 or higher, the limit is 2^63.
– SCNs have a per second increasing soft limit, called ‘RSL’ or ‘reasonable SCN limit’, which is version specific:
– – The general soft limit is 16384 (16k) SCNs per second, calculated as the number of seconds since January 1st, 1988 times 16384.
– – From version 12.2 on, with compatibility set to 12.2 or higher, the limit is 98304 (96k) SCNs per second, calculated as the number of seconds since January 1st, 2008 times 98304.
– The RSL cannot be exceeded: if a database session tries to go beyond the soft limit, an ORA 600 [2252] is raised and the transaction is rolled back. This will not corrupt data (but obviously the data in the transaction is not applied).
– The difference between the current SCN and the RSL SCN is known as ‘SCN headroom’.
– There have been bugs that can increase SCNs at a higher rate, decreasing the SCN headroom or even reaching the soft limit, but these have all been solved in the January 2012 CPU/PSU/patch bundles.
– When databases communicate with each other via a database link, the SCNs of both databases are synchronised by picking the highest of the two.
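
The soft limit arithmetic from the list above can be sketched in a few lines of Python. Please mind the assumption: I use plain calendar seconds since the start date, while Oracle's internal calculation may count time differently, so treat the outcome as an approximation. The names SOFT_LIMITS and reasonable_scn_limit are my own.

```python
from datetime import datetime

# Start date and SCNs per second for the two soft limit calculations.
SOFT_LIMITS = {
    "16k": (datetime(1988, 1, 1), 16384),
    "96k": (datetime(2008, 1, 1), 98304),
}

def reasonable_scn_limit(rate_name, now):
    """Seconds since the start date times the per second rate.

    Assumption: plain calendar seconds; Oracle may count differently.
    """
    start, scn_per_second = SOFT_LIMITS[rate_name]
    return int((now - start).total_seconds()) * scn_per_second

moment = datetime(2019, 6, 23)
limit_16k = reasonable_scn_limit("16k", moment)
limit_96k = reasonable_scn_limit("96k", moment)

print(f"16k soft limit: {limit_16k:,}")
print(f"96k soft limit: {limit_96k:,}")
print(f"ratio         : {limit_96k / limit_16k:.2f}")
```

Under this assumption the 96k calculation already yields a higher soft limit than the 16k calculation, despite starting 20 years later, and the gap grows every second. A database using the higher limit can therefore reach SCN values that a database using the lower limit is not allowed to accept.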

But it doesn’t really explain why patches must be applied before June 2019. However, another article is more concrete on that: Recommended patching and actions for Oracle database versions 12.1.0.1, 11.2.0.3 and earlier – before June 2019 (Doc ID 2335265.1). The first interesting thing mentioned is:

3. What is the change introduced by the patches listed above?
These patches increase the database’s current maximum SCN (system change number) limit.
At any point in time, the Oracle Database calculates a “not to exceed” limit for the number of SCNs a database can have used, based on the number of seconds elapsed since 1988.

So, this means the patched databases have a changed (increased) RSL.

These recommended patches enable the databases to allow for a higher current maximum SCN limit. The rate at which this limit is calculated can be referred to as the “SCN rate” and these patches help allow higher SCN rates to enable databases to support many times higher transaction rates than earlier releases.

And this means the RSL is increased from the 16k per second since January 1988 to the 96k per second since January 2008.

Please note that the patches only increase the max limit but the current SCN is not impacted. So, if all your databases don’t have any major change in transaction rate, the current SCN would still remain below the current maximum SCN limit and database links between newer (or patched) and unpatched databases would continue to work. The patches provide the safety measure to ensure that you don’t have any issue with dblinks independent of any possible future change in your transaction rate.

With the patches applied, this change in current maximum SCN limit will happen automatically starting 23rd June 2019.

That is important information! So with the patch applied to some databases and not to others, and with database links between patched and non-patched databases, everything should remain working. This is true for any database at this moment, because the change of the limit will happen on the 23rd of June 2019, which at the time of writing is in the future.

Once the limit has changed on the 23rd of June 2019, database links between older, non-patched databases and newer or patched databases could be affected. This happens if, after that date, the SCN rate increases in a newer or patched database and its SCN runs beyond the headroom available in databases with the 16k rate. The earlier mentioned SCN synchronisation will then fail, because the SCN is beyond the headroom of the 16k rate database, meaning it cannot synchronise with the newer database.

So the problem we are talking about here is two databases talking to each other over a database link while having a different RSL. This could lead to a situation where one database is at an SCN that is too high for the other, older or non-patched database, meaning the communication will fail. That will persist until the older or non-patched database is able to reach the SCN of the newer database over time, if that is possible at all.

One thing is not clear to me at this point: the documentation seems to suggest that Oracle version 12.2 and higher, with compatibility set to 12.2 or higher, already allows the higher number of SCNs per second (the point above: ‘From version 12.2 on, with compatibility set to 12.2 or higher, the limit is 98304 (96k) SCNs per second, calculated as the number of seconds since January 1st, 2008’). If that is true, the issue that is warned about could potentially already happen!

Luckily, there are ways to investigate this:

The reported “newer” versions and the versions that are patched for the rate change have an SGA variable “kcmscnc_” that lists the current SCN rate of the database. There are 3 values that I have seen for “kcmscnc_”:
1: This is the rate of 16k per second since January 1st 1988.
2: ?
3: This is the rate of 96k per second since January 1st 2008.

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv where ksmfsnam like 'kcmscnc_';
KSMFSNAM							 KSMFSADR	    KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcmscnc_							 000000006001579C	   4
SQL> oradebug peek 0x6001579C 4
[06001579C, 0600157A0) = 00000001

So, this database is capable of switching (because kcmscnc_ exists), and is currently running with the 16k per second threshold.

In fact, I tested this on 11.2.0.4, 12.1.0.2 and 18.3.0.0; all these versions currently (currently is before June 2019) report “1” or “scheme 1”. So the above mentioned rate of 96k for 12.2 and above does currently not apply to the soft limit or SCN headroom calculation for any database, including 12.2 and 18.3.
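If you want to script this check, the oradebug peek output shown above can be decoded with a few lines of Python. This is only a sketch: the function name and the dictionary are made up for illustration, and the descriptions are simply the three values listed above (the meaning of value 2 is not known).

```python
import re

# Values of kcmscnc_ as listed in the text; the meaning of 2 is not known.
SCN_SCHEMES = {
    1: "16k SCNs per second since January 1st 1988",
    2: "unknown",
    3: "96k SCNs per second since January 1st 2008",
}

# Matches a line like: [06001579C, 0600157A0) = 00000001
PEEK_RE = re.compile(r"\[([0-9A-Fa-f]+), ([0-9A-Fa-f]+)\) = ([0-9A-Fa-f]+)")


def decode_peek(line):
    """Return (size_in_bytes, value) from one line of oradebug peek output."""
    m = PEEK_RE.search(line)
    if m is None:
        raise ValueError("not an oradebug peek line: %r" % line)
    start, end, value = (int(group, 16) for group in m.groups())
    return end - start, value


size, scheme = decode_peek("[06001579C, 0600157A0) = 00000001")
print(size, scheme, SCN_SCHEMES.get(scheme, "unknown"))
```

The size is derived from the two addresses in the output, which makes it easy to verify that the 4 bytes reported by x$ksmfsv were actually read.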

There is a less “hardcore” way to obtain this information, via the DBMS_SCN package. Just like the “kcmscnc_” variable, this package only exists if the database is of a version or patch level high enough to switch:

declare
  v_rsl number;
  v_headroom_in_scn number;
  v_headroom_in_sec number;
  v_cur_scn_compat number;
  v_max_scn_compat number;
begin
  dbms_scn.getcurrentscnparams(v_rsl, v_headroom_in_scn, v_headroom_in_sec, v_cur_scn_compat, v_max_scn_compat);
  dbms_output.put_line('reasonable scn limit (soft limit): '||to_char(v_rsl,'999,999,999,999,999,999'));
  dbms_output.put_line('headroom in scn                  : '||to_char(v_headroom_in_scn,'999,999,999,999,999,999'));
  dbms_output.put_line('headroom in sec                  : '||v_headroom_in_sec);
  dbms_output.put_line('current scn compatibility scheme : '||v_cur_scn_compat);
  dbms_output.put_line('max scn compatibility scheme     : '||v_max_scn_compat);
end;
/

SQL> /
reasonable scn limit (soft limit):	 16,104,861,483,008
headroom in scn 		 :	 16,104,860,802,471
headroom in sec 		 : 982962695
current scn compatibility scheme : 1
max scn compatibility scheme	 : 3

PL/SQL procedure successfully completed.

This is executed in a version 18.3.0.0 database. So the conclusion here is that currently all versions up to 18.3.0.0 are still compatible, because they all use the same SCN limit per second, which is referred to as ‘scheme 1’. However, on June 23, 2019 newer versions of the database will switch to a new scheme, which is referred to as ‘scheme 3’, which allows more headroom.

First of all, I hope a lot of databases out there have enough headroom in the first place, and use a modest number of SCNs per second, which means they don’t advance towards the RSL. In such a case, when you have older versions that can not switch to scheme 3 and newer versions that will, I can see no reason to worry at all.

The second case is when your database is currently close to running out of headroom in scheme 1, and the SCN rate in the database is also close to the limit. In that case you truly should worry: when your database switches to scheme 3, it might actually run over the 16k per second limit, and when it does that long enough to run over the RSL of scheme 1, communication over a database link between the scheme 3 database and a scheme 1 database will be disrupted until the scheme 3 database SCN is low enough again to fit the scheme 1 RSL. If the SCN rate persists in the scheme 3 database, communication between the scheme 1 and the scheme 3 database will be impossible.

So, at this point it should be clear that the most important statistic for determining issues between scheme 1 and after June 2019 scheme 3 databases is the current headroom in your databases. For all of the databases involved that will have a database link between a scheme 1 and a scheme 3 database, you should investigate their SCN number and SCN rate. If some of these databases have run into the soft limit error, ora-600 [2252], these are potential candidates for running over the soft limit when they switch to scheme 3.

However, the most important statistic is to see how close the current SCN is to the scheme 1 soft limit. This can be done using the following SQL (this SQL does not need a newer version of the database, and is tested with version 11.2.0.2 and higher):

col "RSL scheme 1" format 9,999,999,999,999,999
col "current value" format 9,999,999,999,999,999
select dbms_flashback.get_system_change_number "current value",
       ((((to_number(to_char(sysdate,'YYYY'))-1988)*12*31*24*60*60) +
       ((to_number(to_char(sysdate,'MM'))-1)*31*24*60*60) +
       (((to_number(to_char(sysdate,'DD'))-1))*24*60*60) +
       (to_number(to_char(sysdate,'HH24'))*60*60) +
       (to_number(to_char(sysdate,'MI'))*60) +
       (to_number(to_char(sysdate,'SS')))) * (16*1024)) "RSL scheme 1",
       round(dbms_flashback.get_system_change_number/((((to_number(to_char(sysdate,'YYYY'))-1988)*12*31*24*60*60) +
       ((to_number(to_char(sysdate,'MM'))-1)*31*24*60*60) +
       (((to_number(to_char(sysdate,'DD'))-1))*24*60*60) +
       (to_number(to_char(sysdate,'HH24'))*60*60) +
       (to_number(to_char(sysdate,'MI'))*60) +
       (to_number(to_char(sysdate,'SS')))) * (16*1024))*100,5) "% to RSL scheme 1"
from dual;
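The arithmetic in the query above can also be checked outside the database. The following Python is a minimal sketch that mirrors the SQL: it counts every month as 31 days, exactly like the query does, and multiplies the number of seconds since 1988 by 16*1024. The function names and the sample SCN are made up for illustration.

```python
from datetime import datetime

SCHEME1_RATE = 16 * 1024      # SCNs per second in scheme 1
SCHEME1_EPOCH_YEAR = 1988


def scheme1_seconds(ts):
    """Seconds since 1988 using the same convention as the SQL (31-day months)."""
    return ((ts.year - SCHEME1_EPOCH_YEAR) * 12 * 31 * 24 * 60 * 60
            + (ts.month - 1) * 31 * 24 * 60 * 60
            + (ts.day - 1) * 24 * 60 * 60
            + ts.hour * 60 * 60
            + ts.minute * 60
            + ts.second)


def scheme1_rsl(ts):
    """Reasonable SCN limit (soft limit) of scheme 1 at timestamp ts."""
    return scheme1_seconds(ts) * SCHEME1_RATE


def pct_to_rsl(current_scn, ts):
    """How far the current SCN has advanced towards the scheme 1 RSL, in percent."""
    return round(current_scn / scheme1_rsl(ts) * 100, 5)


# Hypothetical SCN value, only to show the call.
now = datetime(2019, 1, 1)
print(f"{scheme1_rsl(now):,}", pct_to_rsl(15_000_000_000_000, now))
```

Passing the timestamp in, instead of reading the clock inside the functions, makes it possible to evaluate the limit for a date in the future as well.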

If a database is close to the RSL (roughly speaking beyond 90%-95%), the next thing to do is measure if the database keeps on using SCNs and keeps on being close to the RSL. If that is true, an additional increase in SCN usage could in the current situation using scheme 1 lead to an ora-600 [2252], but if that database has switched to scheme 3 after June 2019, there will not be anything keeping that database from going beyond an SCN number that will exceed the RSL of scheme 1, which will then cause issues if that database has a database link with a scheme 1 database.
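Measuring whether a database keeps on using SCNs comes down to sampling the current SCN twice and comparing the resulting rate with the 16k per second at which the scheme 1 RSL itself grows. The sketch below shows that reasoning in Python; the function names and the sample numbers are hypothetical, and it assumes the measured rate stays constant.

```python
SCHEME1_RATE = 16 * 1024  # the scheme 1 RSL grows by this many SCNs per second


def scn_rate(scn_start, scn_end, elapsed_seconds):
    """Average number of SCNs used per second between two samples."""
    return (scn_end - scn_start) / elapsed_seconds


def seconds_until_rsl(headroom_scn, rate):
    """Seconds until the SCN reaches the scheme 1 RSL at a constant rate.

    Returns None when the rate is at or below 16k per second, because the
    limit then grows at least as fast as the SCN and the headroom does not
    shrink.
    """
    if rate <= SCHEME1_RATE:
        return None
    return headroom_scn / (rate - SCHEME1_RATE)


# Hypothetical samples taken 60 seconds apart.
rate = scn_rate(1_000_000, 2_200_000, 60)
print(rate, seconds_until_rsl(361_600, rate))
```

Only the part of the rate above 16384 eats into the headroom, which is why a database can use a lot of SCNs for a long time without ever reaching the soft limit.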

Is there anything you can do if you suspect or know a database will go over the scheme 1 RSL? Purely for this issue, the obvious solution would be to make sure you are on a version that will switch to scheme 3 in June 2019, so at least after June 2019 it will not run into ora-600 [2252].

However, if such a scheme 3 database needs to connect to an older scheme 1 database, you have two choices:
1. Potentially run over the scheme 1 limit and disrupt the database link communication.
2. Stop the newer database from switching to scheme 3, which potentially disrupts changes in that database, but guarantees the database link will always work.

To look into the switch to scheme 3, which oracle calls ‘auto rollover’, the following SQL can be used:

declare
  v_autorollover_date date;
  v_target_compat number;
  v_is_enabled boolean;
begin
  dbms_scn.getscnautorolloverparams(v_autorollover_date, v_target_compat, v_is_enabled);
  dbms_output.put_line('auto rollover date      : '||to_char(v_autorollover_date,'YYYY-MM-DD'));
  dbms_output.put_line('target scheme	        : '||v_target_compat);
  dbms_output.put_line('rollover enabled (1=yes): '||sys.diutil.bool_to_int(v_is_enabled));
end;
/
SQL> /
auto rollover date	: 2019-06-23
target scheme		: 3
rollover enabled (1=yes): 1

PL/SQL procedure successfully completed.

If you want to prevent a database from rolling over to scheme 3, the procedure dbms_scn.disableautorollover can be used:

exec dbms_scn.disableautorollover;

Obviously, the procedure enableautorollover does the opposite. Please contact Oracle support before doing this with your production database; this is an undocumented procedure at this time.

Also mind that if you create a new database after June 23, 2019, with a new or patched version that can switch to scheme 3, it will probably be running scheme 3 by default. If you want to be absolutely sure it will not exceed the scheme 1 limit, you can revert it to scheme 1 manually using the alter database set scn compatibility N command in mount mode:

SQL> startup mount;
ORACLE instance started.

Total System Global Area 1048574496 bytes
Fixed Size		    8665632 bytes
Variable Size		  281018368 bytes
Database Buffers	  616562688 bytes
Redo Buffers		  142327808 bytes
Database mounted.
SQL> alter database set scn compatibility 1;

Database altered.

SQL> alter database open;

Database altered.

For this too I would strongly advise to contact Oracle support first. The purpose of this blogpost is to define the problem, show all the technical details that have to do with it, and show all the tools that are part of it. There is in no way anything in this article to tell you what you should do, it just shows everything that surrounds the switch to scheme 3 in June 2019.

Another view that might be beneficial is x$kcmscn. This view seems to be created to help determine whether a scheme 3 database can connect to a scheme 1 database:

col cur_max_scn format 999,999,999,999,999
col pre_11_2_0_2_cur_max_scn format 999,999,999,999,999
select * from x$kcmscn;
ADDR		       INDX    INST_ID	   CON_ID CUR_MAX_RATE	  CUR_SCN
---------------- ---------- ---------- ---------- ------------ ----------
	 CUR_MAX_SCN DIFF_IN_DAYS DIFF_IN_HOURS PRE_11_2_0_2_CUR_MAX_SCN
-------------------- ------------ ------------- ------------------------
PRE_11_2_0_2_DIFF_IN_DAYS
-------------------------
00007F773DEDAE10	  0	     1		0	     0	   800918
  16,108,830,064,640	    11379	 273112       16,108,830,064,640
		186444783

Because currently (before June 2019) every database will be in scheme 1 by default, cur_max_scn and pre_11_2_0_2_cur_max_scn are identical. I even believe the column naming is wrong: the first version that can switch, if it is patched to a high enough PSU version, is 11.2.0.3, while the column name suggests scheme 1 databases are databases of a version lower than 11.2.0.2, not including 11.2.0.2.

Conclusion.
I think there’s been a lot of fuss about something that in most cases is not an issue. This article is supposed to give you all the knowledge and the tools to determine how it looks in your situation.

This might be an issue if you happen to have one or more databases that are high on SCN numbering, continue to take a lot of SCN numbers, will be converted to scheme 3 on June 23, 2019, are suspected to increase their SCN usage for whatever reason AND have a database link to a scheme 1 database that remains scheme 1. That’s a lot of ifs.

On the other hand you only need one database to be high in SCN numbering which continues to take a lot of SCNs keeping it close to the soft limit, which will propagate its SCN to other databases if it is linked, or the required properties of the problem spread out over multiple linked databases.

Again, I do not advise anything in this article, the purpose here is to provide all the details that surround it so you can make the best decision for yourself.

This blogpost is a look into a bug in the wait interface that has been reported by me to Oracle a few times. I verified all versions from Oracle 11.2 up to 18.2.0.0.180417 on Linux x86_64; in all these versions this bug is present. The bug is that the wait event ‘db file async I/O submit’ does not time anything when using ASM, only when using a filesystem, where this wait event essentially times the time the system call io_submit takes. All tests are done on Linux x86_64, Oracle Linux 7.4 with database and grid version 18.2.0.0.180417.

So what?
You might have not seen this wait event before; that’s perfectly possible, because this wait event is unique to the database writer. So does this wait event matter?

When the Oracle database engine is set to using asynchronous I/O, and when it makes sense to use asynchronous I/O (!), the engine will use io_submit() to issue I/O requests to the operating system and, when it needs to, fetch the results of the I/O requests using io_getevents(). In general (so not consistently), the engine does not time io_submit, which is a non-blocking call; it only times when it needs to wait for I/O requests using io_getevents(), which is reported as a wait event in an IO wait event class. A lot of ‘%parallel%’ IO related wait events can time asynchronous IO calls.

So why would the engine then time io_submit() for the database writer?
Well, io_submit() is not a blocking call, UNLESS the device queue to which the requests are submitted is full. This means that the developers of the database writer code decided to implement a wait event for io_submit, which is not the case for any other process.

To understand why this makes sense, a little knowledge about database writer internals is necessary. When blocks are dirtied in the cache and these blocks are checkpointed later on, these must be written to disk. The amount of blocks to be written and therefore the number of writes can get high very quickly. The way this is processed is quite interesting (simplified obviously; and when using a filesystem):

a) the database writer picks up a batch of blocks needing writing, for up to 128 IO requests.
b) that batch is submitted, timed by ‘db file async I/O submit’
c) a blocking io_getevents call is issued, timed by ‘db file parallel write’, to wait for the IOs to finish. The interesting thing specifically for the database writer is that the minimal number of IOs ready to wait for is very low (a few IOs to 25-75% of the IOs if the amount gets bigger). Any finished IO will be picked up here, however it’s perfectly possible IOs are still active after this step. In fact, I think it’s deliberately made that way.
d) if any IO requests are still pending, a nonblocking, non-wait event timed io_getevents call is issued to pick up any finished IOs.
e) if any blocks still need writing for which no IO requests have been submitted, go to a).
f) if at this point IO requests are still pending, go to c).

This means that the database writer can submit huge amounts of IO requests, and keep on doing that, much more than any other process, because it doesn’t need to wait for all IOs to finish. So, this means that if there is a process that is likely to run into a blocking io_submit call, it’s the database writer.
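To make the loop of steps a) to f) a bit more tangible, here is a toy model in Python. It is explicitly not Oracle's implementation: the reap function is a hypothetical stand-in for io_getevents, and the assumption that a blocking call waits for roughly a quarter of the pending IOs is only chosen to match the "25-75%" remark in step c).

```python
BATCH_LIMIT = 128  # step a): a batch holds up to 128 IO requests


def reap(in_flight, blocking):
    """Hypothetical stand-in for io_getevents: how many IOs are picked up.

    Assumption: a blocking call waits for roughly a quarter of the pending
    IOs (at least one); a non-blocking call finds nothing finished.
    """
    return max(1, in_flight // 4) if blocking else 0


def write_loop(blocks_to_write):
    """Replay steps a) to f) and return a list of (step, count) events."""
    events = []
    in_flight = 0
    while blocks_to_write > 0 or in_flight > 0:
        if blocks_to_write > 0:
            batch = min(blocks_to_write, BATCH_LIMIT)   # a) pick up a batch
            blocks_to_write -= batch
            in_flight += batch                          # b) io_submit
            events.append(("submit", batch))
        done = reap(in_flight, blocking=True)           # c) blocking io_getevents
        in_flight -= done
        events.append(("blocking reap", done))
        if in_flight > 0:
            done = reap(in_flight, blocking=False)      # d) non-blocking io_getevents
            in_flight -= done
            events.append(("non-blocking reap", done))
        # e) and f): the while condition loops back to a) or c)
    return events


for step, count in write_loop(300)[:6]:
    print(step, count)
```

The model shows the point made above: a new batch is submitted while most of the previous batch is still pending, so the number of outstanding IO requests keeps growing for as long as there are blocks left to write.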

When using a database without ASM, the above wait timing is exactly what happens. A function call graph of io_submit for the database writer when the database uses a filesystem looks like this:

 | | | | | > kslwtbctx(0x7ffc55eb3e60, 0x8b4, ...)
 | | | | | | > sltrgftime64(0x6c2f4288, 0x6bbe3ca0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eb3400, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffc55eb3400, ...)
 | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwait_timeout_centi_to_micro(0x7fffffff, 0x19183e92, ...)
 | | | | | |  kskthbwt(0x19c37b0d8, 0xb3, ...)
 | | | | | |  kslwt_start_snapshot(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | < kslwt_start_snapshot+0x0000000000d0 returns: 0x6c2f4ae8
 | | | | |  ksfdgo(0x800, 0, ...)
 | | | | | | > ksfd_skgfqio(0x7fc304483f78, 0x9, ...)
 | | | | | | | > skgfqio(0x7fc3091fddc0, 0x7fc304483f78, ...)
 | | | | | | | | > skgfrvldtrq(0x7fc304483f78, 0x9, ...)
 | | | | | | | |  sltrgftime64(0x2000, 0x7fc3043772b0, ...)
 | | | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eae3b0, ...)
 | | | | | | | | | | > clock_gettime(0x1, 0x7ffc55eae3b0, ...)
 | | | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | | |  skgfr_lio_listio64(0x7fc3091fddc0, 0x1, ...)
 | | | | | | | | | > io_submit@plt(0x7fc302992000, 0x115, ...)
 | | | | | | | | | < io_submit+0x000000000007 returns: 0x115
 | | | | | | | | < skgfr_lio_listio64+0x000000000131 returns: 0
 | | | | | | | < skgfqio+0x00000000035e returns: 0
 | | | | | | < ksfd_skgfqio+0x0000000001f5 returns: 0
 | | | | |  kslwtectx(0x7ffc55eb3e60, 0x7fc304483f78, ...)
 | | | | | | > sltrgftime64(0x7ffc55eb3e60, 0x7fc304483f78, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eb33e0, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffc55eb33e0, ...)
 | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwt_end_snapshot(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | > kslwh_enter_waithist_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | |  kslwtrk_enter_wait_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x6bcaa180
 | | | | | |  kslwt_update_stats_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | > kews_update_wait_time(0x9, 0x8f54, ...)
 | | | | | | |  ksucpu_wait_update(0x9, 0x8f54, ...)
 | | | | | | | < ksucpu_wait_update+0x000000000036 returns: 0x6bd658b0
 | | | | | |  kskthewt(0x19c38402c, 0xb3, ...)
 | | | | | | < kskthewt+0x0000000005b1 returns: 0x30
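...

The value 0xb3, the second argument of kskthbwt and kskthewt in the call graph above, looks like the wait event number. Converted to decimal it can be looked up in v$event_name:

SQL> select event#, name from v$event_name where event# = to_number('b3','xx');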
    EVENT# NAME
---------- ----------------------------------------------------------------
       179 db file async I/O submit

Now on to the actual purpose of this blog post, the same situation, but now when ASM is used. When ASM is used, there is a significant increase in the call stack. This means more code is executed. This may sound strange at first, but it’s very logical if you give it some thought: when using ASM, the Oracle database is talking to raw devices. This means that any of the functionality a filesystem performs, and which is implemented in ASM, must in some way be performed. This is done in several additional layers in the database code.

Let’s look at a backtrace of io_submit of the database writer when using a filesystem:

#0  0x00007f22bdb36690 in io_submit () from /lib64/libaio.so.1
#1  0x0000000004832ef0 in skgfr_lio_listio64 ()
#2  0x000000001238b7ce in skgfqio ()
#3  0x0000000011d5c3ad in ksfd_skgfqio ()
#4  0x0000000011d57fce in ksfdgo ()
#5  0x0000000000d9f21c in ksfdaio ()
#6  0x00000000039c4a5e in kcfisio ()
#7  0x0000000001d836ec in kcbbdrv ()
#8  0x000000001222fac5 in ksb_act_run_int ()
#9  0x000000001222e792 in ksb_act_run ()
#10 0x0000000003b8b9ce in ksbabs ()
#11 0x0000000003baa161 in ksbrdp ()
#12 0x0000000003fbaed7 in opirip ()
#13 0x00000000026ecaa0 in opidrv ()
#14 0x00000000032904cf in sou2o ()
#15 0x0000000000d681cd in opimai_real ()
#16 0x000000000329d2a1 in ssthrdmain ()
#17 0x0000000000d680d3 in main ()

If you want to follow the call sequence, a backtrace/stacktrace must be read from the bottom up.
ksb = kernel service background processes
kcf = kernel cache file management
ksfd = kernel service functions disk IO
skgf = o/s dependent kernel generic file
I hope you recognise the logical layers that are necessary for doing the I/O.

Now look at a backtrace of io_submit of the database writer when using ASM:

#0  0x00007f22bdb36690 in io_submit () from /lib64/libaio.so.1
#1  0x0000000004832ef0 in skgfr_lio_listio64 ()
#2  0x000000001238b7ce in skgfqio ()
#3  0x0000000011d5c3ad in ksfd_skgfqio ()
#4  0x0000000011d57fce in ksfdgo ()
#5  0x0000000000d9f21c in ksfdaio ()
#6  0x000000000755c1a8 in kfk_ufs_async_io ()
#7  0x0000000001455fb2 in kfk_submit_io ()
#8  0x00000000014551a8 in kfk_io1 ()
#9  0x0000000001450b3e in kfk_transitIO ()
#10 0x000000000143c450 in kfioSubmitIO ()
#11 0x000000000143bbaa in kfioRequestPriv ()
#12 0x000000000143b160 in kfioRequest ()
#13 0x000000000136f6bd in ksfdafRequest ()
#14 0x000000000137311a in ksfdafGo ()
#15 0x0000000011d58179 in ksfdgo ()
#16 0x0000000000d9f269 in ksfdaio ()
#17 0x00000000039c4a5e in kcfisio ()
#18 0x0000000001d836ec in kcbbdrv ()
#19 0x000000001222fac5 in ksb_act_run_int ()
#20 0x000000001222e792 in ksb_act_run ()
#21 0x0000000003b8b9ce in ksbabs ()
#22 0x0000000003baa161 in ksbrdp ()
#23 0x0000000003fbaed7 in opirip ()
#24 0x00000000026ecaa0 in opidrv ()
#25 0x00000000032904cf in sou2o ()
#26 0x0000000000d681cd in opimai_real ()
#27 0x000000000329d2a1 in ssthrdmain ()
#28 0x0000000000d680d3 in main ()

Essentially, a couple of layers are added to facilitate ASM; ksfdaf, kfio, kfk.
So the logical sequence becomes:
ksb = kernel service background processes
kcf = kernel cache file management
ksfd = kernel service functions disk IO
ksfdaf = kernel service functions disk IO ASM files
kfio = kernel automatic storage management translation I/O layer
kfk = kernel automatic storage management KFK

ksfd = kernel service functions disk IO
skgf = o/s dependent kernel generic file
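The difference between the two backtraces can also be worked out mechanically. The Python below is a small sketch that takes the function names from both backtraces (innermost frame first), prints the ASM call sequence in calling order, and counts the extra frames per code layer. The layer prefixes are the ones listed above; the helper name layer_of is made up for illustration.

```python
from collections import Counter

# Function names copied from the two backtraces above, innermost frame first.
FILESYSTEM = """io_submit skgfr_lio_listio64 skgfqio ksfd_skgfqio ksfdgo ksfdaio
kcfisio kcbbdrv ksb_act_run_int ksb_act_run ksbabs ksbrdp opirip opidrv sou2o
opimai_real ssthrdmain main""".split()

ASM = """io_submit skgfr_lio_listio64 skgfqio ksfd_skgfqio ksfdgo ksfdaio
kfk_ufs_async_io kfk_submit_io kfk_io1 kfk_transitIO kfioSubmitIO
kfioRequestPriv kfioRequest ksfdafRequest ksfdafGo ksfdgo ksfdaio
kcfisio kcbbdrv ksb_act_run_int ksb_act_run ksbabs ksbrdp opirip opidrv sou2o
opimai_real ssthrdmain main""".split()

# Layer prefixes from the list above; longer prefixes come first so that
# ksfdaf is not mistaken for ksfd.
LAYERS = ["ksfdaf", "kfio", "kfk", "ksfd", "skgf", "kcf", "ksb"]


def layer_of(function):
    """Return the code layer prefix a function name starts with."""
    for prefix in LAYERS:
        if function.startswith(prefix):
            return prefix
    return "other"


# Frames that only exist (or exist more often) in the ASM backtrace.
extra = Counter(ASM) - Counter(FILESYSTEM)
per_layer = Counter()
for function, count in extra.items():
    per_layer[layer_of(function)] += count

# A backtrace is read bottom up, so reverse it to get the calling order.
print(" > ".join(reversed(ASM)))
print(dict(per_layer))
```

Using a Counter instead of a set matters here, because ksfdgo and ksfdaio each appear twice in the ASM backtrace: once above and once below the ASM layers.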

Now to give an overview of the function call sequence, I simply need to cut out a lot of functions because otherwise it would be unreadable.

 | | | | | > ksfdgo(0x806, 0x35b4, ...)
 | | | | | | > ksfdafGo(0x806, 0x35b4, ...)
 | | | | | | | > ksfdafRequest(0x7ffcc7d845a0, 0x10f, ...)
 | | | | | | | | > kfioRequest(0x7ffcc7d845a0, 0x10f, ...)
 | | | | | | | | | > _setjmp@plt(0x7ffcc7d821d8, 0x10f, ...)
 | | | | | | | | |  __sigsetjmp(0x7ffcc7d821d8, 0, ...)
 | | | | | | | | |  __sigjmp_save(0x7ffcc7d821d8, 0, ...)
 | | | | | | | | |  kfioRequestPriv(0x7ffcc7d845a0, 0x10f, ...)
...
 | | | | | | | | | | | | | | | | > ksfdgo(0x188, 0x35c5, ...)
 | | | | | | | | | | | | | | | | | > ksfd_skgfqio(0x7f4232709f78, 0x9, ...)
 | | | | | | | | | | | | | | | | | | > skgfqio(0x7f4237483dc0, 0x7f4232709f78, ...)
 | | | | | | | | | | | | | | | | | | | > skgfrvldtrq(0x7f4232709f78, 0x9, ...)
 | | | | | | | | | | | | | | | | | | |  sltrgftime64(0x2000, 0x7f4230b61c98, ...)
 | | | | | | | | | | | | | | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d7bb10, ...)
 | | | | | | | | | | | | | | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d7bb10, ...)
 | | | | | | | | | | | | | | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | | | | | | | | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | | | | | | | | | | | | | |  skgfr_lio_listio64(0x7f4237483dc0, 0x1, ...)
 | | | | | | | | | | | | | | | | | | | | > io_submit@plt(0x7f4230ad8000, 0x112, ...)
 | | | | | | | | | | | | | | | | | | | | < io_submit+0x000000000007 returns: 0x112
 | | | | | | | | | | | | | | | | | | | < skgfr_lio_listio64+0x000000000131 returns: 0
 | | | | | | | | | | | | | | | | | | < skgfqio+0x00000000035e returns: 0
 | | | | | | | | | | | | | | | | | < ksfd_skgfqio+0x0000000001f5 returns: 0
 | | | | | | | | | | | | | | | | < ksfdgo+0x000000000135 returns: 0
...
 | | | | | | | | | < kfioRequestPriv+0x000000000224 returns: 0
 | | | | | | | | < kfioRequest+0x000000000251 returns: 0
 | | | | | | | < ksfdafRequest+0x0000000003c8 returns: 0
 | | | | | | < ksfdafGo+0x000000000081 returns: 0x1
 | | | | |  kslwtbctx(0x7ffcc7d86f60, 0x7f4232709f38, ...)
 | | | | | | > sltrgftime64(0x6da39e68, 0x6d2f5bc0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d86500, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d86500, ...)
 | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwait_timeout_centi_to_micro(0x7fffffff, 0x14cb3fcb, ...)
 | | | | | |  kskthbwt(0x2b0f3fe06, 0xb3, ...)
 | | | | | |  kslwt_start_snapshot(0x6da3b118, 0x6da3b118, ...)
 | | | | | | < kslwt_start_snapshot+0x0000000000d0 returns: 0x6da3a6c8
 | | | | |  ksfdgo(0x808, 0, ...)
 | | | | |  kslwtectx(0x7ffcc7d86f60, 0x9, ...)
 | | | | | | > sltrgftime64(0x7ffcc7d86f60, 0x9, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d864e0, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d864e0, ...)
 | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwt_end_snapshot(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | > kslwh_enter_waithist_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | |  kslwtrk_enter_wait_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x6dacf1e8
 | | | | | |  kslwt_update_stats_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | > kews_update_wait_time(0x9, 0xd02, ...)
 | | | | | | |  ksucpu_wait_update(0x9, 0xd02, ...)
 | | | | | | | < ksucpu_wait_update+0x000000000036 returns: 0x6db40f70
 | | | | | |  kskthewt(0x2b0f40b08, 0xb3, ...)
 | | | | | | < kskthewt+0x0000000005b1 returns: 0x30
 | | | | |  ksfdafCopyWaitCtx(0x7ffcc7d86f60, 0xb3, ...)
 | | | | | | > _intel_fast_memcpy(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | |  _intel_fast_memcpy.P(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | |  __intel_ssse3_rep_memcpy(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x00000000242e returns: 0x7ffcc7d86f60
 | | | | | < ksfdafCopyWaitCtx+0x000000000038 returns: 0x7ffcc7d86f60
 | | | | < ksfdaio+0x00000000055f returns: 0x7ffcc7d86f60
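...

In this call graph io_submit is called inside the first ksfdgo call and has already returned when kslwtbctx starts the wait, so the wait event does not cover the io_submit call. This can be tested by enabling SQL trace including waits for the database writer:

SQL> oradebug setorapname dbw0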
Oracle pid: 18, Unix process pid: 3617, image: oracle@o182-fs.local (DBW0)
SQL> oradebug event sql_trace wait=true
Statement processed.

Then go to the trace directory, and tail the database writer trace file.
Next, attach to the database writer with gdb, and break on the io_submit call and perform a sleep 1 (sleep for 1 second). This should add 1000000 microseconds to the waiting time, if the wait event includes the function we put the break on.

(gdb) break io_submit
Breakpoint 1 at 0x7f336b986690
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end

Now continue the database writer, and execute a checkpoint (alter system checkpoint), and look at the wait events:

WAIT #0: nam='db file async I/O submit' ela= 2 requests=11 interrupt=0 timeout=0 obj#=-1 tim=15801301770
WAIT #0: nam='db file parallel write' ela= 5077 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=15801306930

Well, it’s clear nothing has timed the one second we added, right? (the time in the wait event is at ‘ela’, which is in microseconds)

For the sake of completeness, and to validate this test method, let’s add the sleep to io_getevents (io_getevents_0_4) to see if ‘db file parallel write’ does show the extra time we added in the system call, because ‘db file parallel write’ is supposed to time io_getevents():

(gdb) dis 1
(gdb) break io_getevents_0_4
Breakpoint 2 at 0x7f336b986650
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end

Continue the database writer again, and execute a checkpoint:

WAIT #0: nam='db file async I/O submit' ela= 1 requests=22 interrupt=0 timeout=0 obj#=-1 tim=15983030322
WAIT #0: nam='db file parallel write' ela= 1003978 requests=2 interrupt=0 timeout=2147483647 obj#=-1 tim=15984034336

Yay! There we got the artificial waiting time!
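Comparing the ‘ela’ values by eye works for four lines, but for a longer trace it helps to extract them. The following Python is a minimal sketch that parses the WAIT lines shown above and flags which waits are long enough to contain the one second sleep; the regular expression only covers the fields used here and is an assumption based on these four lines.

```python
import re

# The four WAIT lines from the two test runs above.
TRACE = """\
WAIT #0: nam='db file async I/O submit' ela= 2 requests=11 interrupt=0 timeout=0 obj#=-1 tim=15801301770
WAIT #0: nam='db file parallel write' ela= 5077 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=15801306930
WAIT #0: nam='db file async I/O submit' ela= 1 requests=22 interrupt=0 timeout=0 obj#=-1 tim=15983030322
WAIT #0: nam='db file parallel write' ela= 1003978 requests=2 interrupt=0 timeout=2147483647 obj#=-1 tim=15984034336
"""

# Event name between quotes, elapsed time in microseconds after 'ela='.
WAIT_RE = re.compile(r"^WAIT #\d+: nam='([^']+)' ela= *(\d+) ")


def parse_waits(text):
    """Return a list of (event name, elapsed microseconds) tuples."""
    waits = []
    for line in text.splitlines():
        m = WAIT_RE.match(line)
        if m:
            waits.append((m.group(1), int(m.group(2))))
    return waits


SLEEP_US = 1_000_000  # the 'shell sleep 1' added in gdb
for event, ela in parse_waits(TRACE):
    print(f"{event:26} ela={ela:>8} includes sleep: {ela >= SLEEP_US}")
```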

Based on this, I can only come to the conclusion that the wait event ‘db file async I/O submit’ does not perform any actual timing of the io_submit system call when ASM is used with the Oracle database.

In a blogpost introducing the vagrant builder suite I explained what the suite could do, and the principal use, to automate the installation of the Oracle database software and the creation of a database on a virtual machine using vagrant together with ansible and virtual box.

This blogpost shows how to use that suite for automating the installation of the Oracle database software and the creation of a database on a linux server directly, with only the use of ansible without vagrant and virtualbox.

The suite has been updated with all the PSU’s up to current (180417; april 2018), for 11.2.0.4, 12.1.0.2 and 12.2.0.1, and now includes Oracle 18. Please mind for Oracle version 12.2 and 18 you need to provide the installation media, because these are not downloadable as patch. I really hate that Oracle is not providing the installation media for these downloadable in an automated way. This is a good opportunity for oracle to show it’s listening and provide a solution, especially because it wants to be a player in the developer/devops field.

The regular use of this suite with vagrant/ansible/virtualbox would be to clone the vagrant-builder repository, set the variables in the Vagrantfile, and issue ‘vagrant up’, which will fetch an O/S image from the vagrant cloud, create an extra disk for Oracle, create a VM specification in virtualbox, boot up the VM, and then run ansible to do the full configuration of Linux for installing the Oracle database, install and patch the Oracle database software and create a database. Without any human intervention.

Currently, the local/direct install scripts support redhat (actually, a redhat version that manifests itself as ‘Red Hat Enterprise Linux’ or ‘RedHat’ in the ansible fact ‘ansible_distribution’) or oracle linux (‘OracleLinux’), version 7 only. You can get all the facts that ansible gathers on the current host by executing ‘ansible localhost -m setup’.

To use it natively on a linux system, you must first make sure the operating system provides a /u01 directory with enough space for the Oracle software install and the database. A rough indication would be more than 20G. I regularly use 40G, which allows me to use the database and create tablespaces. It doesn’t care if it’s only a directory, or if it’s a mount point. Actually the only thing that local_install.yml is doing with it, is changing the ownership of the /u01 directory to oracle once the user is created.
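Whether /u01 is there and has enough room can be checked before starting the install. The following Python is a minimal sketch of such a pre-flight check; it is not part of the vagrant-builder suite, the function name is made up, and the 20G threshold is simply the rough indication mentioned above.

```python
import os
import shutil


def has_enough_space(path, required_gib=20):
    """True if path is an existing directory with more than required_gib GiB free."""
    if not os.path.isdir(path):
        return False
    free_bytes = shutil.disk_usage(path).free
    return free_bytes > required_gib * 1024 ** 3


print(has_enough_space("/u01"))
```

Because shutil.disk_usage reports on the filesystem that contains the path, the check works the same whether /u01 is a plain directory or a mount point.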

The first thing to do is install ansible and git. Ansible is using python, and python comes installed with all recent RHEL compatible distributions (as an example, ‘yum’ is using python too). In my opinion, installing additional python packages should not be a problem. If your environment is highly standardised and these kinds of installs are not allowed, you should use the scripts in the “regular” ansible way, which I will detail at the end.

1. Install git
Oracle linux 7 has git in the base repository. I assume this means this is the same for RHEL 7. That means that when the server has the base repository or the latest repository or a company repository (for version controlling the packages), it can be simply done using yum install:

# yum install git

2. Install ansible
One way of installing ansible is using easy_install and pip. This will get you a recent ansible version. The simplest way would be using an EPEL repository, however, packages in the linux distribution repositories are not updated very frequently in my experience.

# yum install python-setuptools
# easy_install pip
# pip install ansible

3. Clone the vagrant-builder suite
This CLI example clones the vagrant-builder suite into the builder directory. You can name the directory any way you want. The best way of doing this is using a normal (meaning non-root) user, that has password-less unlimited sudo rights. Cloud environments come with such a user by default, for the oracle cloud this is ‘opc’, for amazon this is ‘ec2-user’ (and for vagrant boxes this is ‘vagrant’). The ansible script uses sudo to execute as root, creates the oracle user and then uses sudo to execute as oracle.

$ git clone https://gitlab.com/FritsHoogland/vagrant-builder.git builder

4. Run local_install.yml
Now the automatic installation components are all setup. The next thing to do is go into the builder/ansible directory, and edit the local_install.yml file:

---
- hosts: localhost
  become: true
  vars:
  - mosuser:
  - mospass:
  - oracle_base: /u01/app/oracle
  - database_name:
  - global_password: oracle
  - db_create_file_dest: /u01/app/oracle/oradata
  - database_version:
  - asm_version: ""
  - stage_directory: /u01/stage
...

Line 5/6: fill out MOS details. This is needed to download patches, or installation media that are delivered as patches.
Line 7: this is the default value for the ORACLE_BASE.
Line 8: you need a database name if you want to create a database. If you set it to empty (“”) no database will be created.
Line 9: this sets the password for all oracle database accounts.
Line 10: this variable sets the place for all the database files, db_create_file_dest.
Line 11: this sets the database software version to install. Ansible will determine what files to download. The Oracle 12.2 installation media must be placed in the builder/ansible/files directory. Look in the Vagrantfile for specifications and versions available.
Line 12: the grid software is currently not installed. The variable needs to be specified in order to have the facts setup. Facts are ansible variables.
Line 13: this is the directory in which all files are staged. It will be created at the beginning of a role, and removed at the end.
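
As a quick sanity check before running the playbook, the variables that were left without any value can be listed. The following is a minimal Python sketch and not part of the vagrant-builder suite: it scans an inline copy of the playbook text with a regular expression instead of a real YAML parser, and the function name unset_variables is made up for this illustration. Whether an empty value is actually a problem depends on the variable.

```python
import re

# Inline copy of the playbook shown above, used only for this sketch.
PLAYBOOK = """\
---
- hosts: localhost
  become: true
  vars:
  - mosuser:
  - mospass:
  - oracle_base: /u01/app/oracle
  - database_name:
  - global_password: oracle
  - db_create_file_dest: /u01/app/oracle/oradata
  - database_version:
  - asm_version: ""
  - stage_directory: /u01/stage
"""

# Matches list entries of the form "- name: value" (value may be absent).
VAR_LINE = re.compile(r"^\s*-\s*(\w+):\s*(.*)$")


def unset_variables(text):
    """Return the names of entries that have no value at all (not even "")."""
    missing = []
    for line in text.splitlines():
        match = VAR_LINE.match(line)
        if match and match.group(2).strip() == "":
            missing.append(match.group(1))
    return missing


print(unset_variables(PLAYBOOK))
# → ['mosuser', 'mospass', 'database_name', 'database_version']
```

An explicitly empty string such as asm_version: "" is not reported, because it is a deliberate value.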

After this has been filled out, run it with a user that is allowed to execute sudo without specifying a password:

$ ansible-playbook local_install.yml

Ansible will read the playbook, and see that no inventory (list of hosts) is specified, and the hosts specifier is set to localhost, and then run locally.

Just like with its original use with vagrant, this will download all the necessary software directly from Oracle using the MOS credentials, with the exception of the installation media of Oracle 12.2.0.1 and Oracle 18.0.0.0, because these cannot be downloaded via the CLI as far as I know, and therefore have to be provided in the ansible/files directory.

If you want to speed up the build, or do not have a good enough network, you can put the installation media in the files directory anyway; the ansible script will look in the files directory for it. The patches for performing patching (all MOS downloads are patches, of which some are actually the installation media) are always downloaded from MOS.

Using ansible non-local.
If you are not allowed to install additional software on the oracle server, there is another way. However, if you are in that situation, it means you are essentially being asked to install the oracle software manually. Regardless of what is allowed, you have to make changes to the operating system to facilitate running oracle anyway, and apparently the environment is not that automated, so such a restriction would be a bit silly.

The non-local way is running ansible in the way it is normally used, which is having a server with ansible installed, from which it uses ssh to run the ansible playbooks on (one or more) remote servers (which then only needs python installed on the remote servers, which EL6 and EL7 have by default). The local_install.yml script can be modified very easily to be used in this way: the hosts specification must be changed to ‘all’ or to a name given to a group of servers in the inventory file. This however is beyond the scope of the article.

However, if you need to perform database software installations and database creations regularly in an environment, it would make sense to use a centralised server to perform these actions, instead of setting it up on every distinct server.

Bonus material.
I added another script in the ansible directory, db_management.yml, which performs database creation and removal (and the install of slob). It follows the same pattern as local_install.yml: fill out the variables and, unique to this script, uncomment the action you want it to perform, then run it.

Starting from Oracle 12, in a default configured database, there are more log writer processes than the well known ‘LGWR’ process itself, which are the ‘LGnn’ processes:

$ ps -ef | grep test | grep lg
oracle   18048     1  0 12:50 ?        00:00:13 ora_lgwr_test
oracle   18052     1  0 12:50 ?        00:00:06 ora_lg00_test
oracle   18056     1  0 12:50 ?        00:00:00 ora_lg01_test

These are the log writer worker processes, for which the minimal number is equal to the number of public redo strands. Worker processes are assigned to a group, and the group is assigned to a public redo strand. The number of worker processes in the group depends on the undocumented parameter “_max_log_write_parallelism”, which is one by default.

The actual usage of the worker processes is dependent in the first place on the value of the undocumented parameter “_use_single_log_writer”, for which the default value is ‘ADAPTIVE’, which means it’s switching automatically between ‘single log writer mode’, which is the traditional way of the LGWR process handling everything that the log writer functionality needs to do, and the ‘scalable log writer mode’, which means the log writer functionality is presumably using the log writer worker processes.

Other values for “_use_single_log_writer” are ‘TRUE’ to set ‘single log writer mode’, or ‘FALSE’ to set ‘scalable log writer mode’ fixed.

I assume most readers of this blog will know that the master log writer idle work cycle is sleeping on a semaphore (semtimedop()) under the wait event ‘rdbms ipc message’ for 3 seconds, then performs some “housekeeping”, after which it’ll sleep again repeating the small cycle of sleeping and housekeeping. For the log writer worker processes, this looks different if you look at the wait event information of the log writer worker processes:

135,59779,@1    14346                    DEDICATED oracle@memory-presentation.local (LGWR)	    time:1909.44ms,event:rdbms ipc message,seq#:292
48,34282,@1     14350                    DEDICATED oracle@memory-presentation.local (LG00)	    time:57561.85ms,event:LGWR worker group idle,seq#:150
136,24935,@1    14354                    DEDICATED oracle@memory-presentation.local (LG01)	    time:112785.66ms,event:LGWR worker group idle,seq#:74

The master log writer process (LGWR) has been sleeping for 1.9s when I queried the database, and it will sleep for 3 seconds, and then do some work and sleep again. However, the log writer worker processes have been sleeping for much longer: LG00 for 57.6s and LG01 for 112.8s, and the event is different: ‘LGWR worker group idle’. How is this implemented? Let’s look!

$ strace -p $(pgrep lg01)
strace: Process 14354 attached
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

I used strace on the LG01 process, and it’s still doing the same as most idle background processes: sleeping on a semaphore for 3 seconds. But it does not end its wait like LGWR does; the event the log writer worker processes are waiting in keeps on being timed.

A pin tools debugtrace shows the following:

 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | > __errno_location(0x38000, 0x7ffce278c328, ...)
 | | | > fthread_self(0x38000, 0x7ffce278c328, ...)
 | | | < fthread_self+0x000000000024 returns: 0
 | | < __errno_location+0x000000000010 returns: 0x7f7e930a26a0
 | < sskgpwwait+0x00000000014e returns: 0
 < skgpwwait+0x0000000000e0 returns: 0
 > ksuSdiInProgress(0x19e80, 0x19e80, ...)
 < ksuSdiInProgress+0x000000000035 returns: 0
 > sltrgftime64(0x19e80, 0x19e80, ...)
 | > clock_gettime@plt(0x1, 0x7ffce278c3a0, ...)
 | | > clock_gettime(0x1, 0x7ffce278c3a0, ...)
 | | < clock_gettime+0x000000000069 returns: 0
 | < clock_gettime+0x00000000003a returns: 0
 < sltrgftime64+0x00000000004c returns: 0x19c253f3ff
 > kslwo_getcbk(0xa2, 0xd80fa62, ...)
 < kslwo_getcbk+0x000000000017 returns: 0
 > kgslwait_last_waitctx_time_waited_usecs(0x7f7e930a29a0, 0x6dfd01c0, ...)
 < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x25e5e80
 > kskiorm(0x6d1854a8, 0, ...)
 < kskiorm+0x00000000001e returns: 0
 > kfias_iswtgon_ksfd(0x6d1854a8, 0, ...)
 < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 > kxdbio_has_work(0x7ffce278c3c4, 0x6003d010, ...)
 < kxdbio_has_work+0x000000000027 returns: 0
 > skgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | > kslwait_conv_wait_time(0x2dc6c0, 0x7f7e930a7ca0, ...)
 | < kslwait_conv_wait_time+0x000000000027 returns: 0x2dc6c0
 | > sskgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | | > semtimedop(0x38000, 0x7ffce278c328, ...)
 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff

And a full stack trace of a log writer worker looks like this:

$ pstack $(pgrep lg01)
#0  0x00007feda8eaebda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000031fb4d0 in kcrfw_slave_queue_remove ()
#6  0x00000000031fad2a in kcrfw_slave_group_main ()
#7  0x00000000012160fa in ksvrdp_int ()
#8  0x000000000370d99a in opirip ()
#9  0x0000000001eb034a in opidrv ()
#10 0x0000000002afedf1 in sou2o ()
#11 0x0000000000d0547a in opimai_real ()
#12 0x0000000002b09b31 in ssthrdmain ()
#13 0x0000000000d05386 in main ()

If you combine the pstack backtrace and the debugtrace information, you see that the idle cycle does not leave the ‘ksliwat’ function, so the wait event is not finished. Quickly looking at the other functions, it’s easy to spot it reads the system clock (sltrgftime64), updates some information (kgslwait_last_waitctx_time_waited_usecs) and then performs some proactive IO checks (kskiorm, kfias_iswtgon_ksfd, kxdbio_has_work) after which it calls the post/wait based functions to setup the semaphore again.
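
Two of the hexadecimal values in the debugtrace become readable once converted to decimal. The Python sketch below does only that arithmetic. It assumes both values are microseconds: the function name kgslwait_last_waitctx_time_waited_usecs suggests this for its return value, and for the argument of kslwait_conv_wait_time it is my assumption, based on the 3 second semaphore timeout seen with strace.

```python
# Values copied from the debugtrace above.
timeout_arg = 0x2dc6c0    # first argument of kslwait_conv_wait_time
time_waited = 0x25e5e80   # return value of kgslwait_last_waitctx_time_waited_usecs


def usecs_to_seconds(usecs):
    """Convert microseconds to seconds (assumes the input really is microseconds)."""
    return usecs / 1_000_000


print(timeout_arg, usecs_to_seconds(timeout_arg))   # → 3000000 3.0
print(time_waited, usecs_to_seconds(time_waited))   # → 39739008 39.739008
```

Under that assumption the timeout matches the 3 seconds of the semtimedop call, while the time waited is already far beyond a single timeout, which is consistent with the wait event not being ended between the semaphore sleeps.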

Conclusion so far is the log writer workers do perform a 3 second sleep just like the master log writer, however the wait event ‘LGWR worker group idle’ is not interrupted like ‘rdbms ipc message’ is for the master log writer. This means the wait time for the event for each worker process indicates the last time the worker process actually performed something. A next logical question then is: but what do the log writer worker processes perform? Do they entirely take over the master log writer functionality, or do they work together with the master log writer?
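
The difference in wait time accounting can be illustrated with a small model. This is a sketch in Python that simulates the bookkeeping only, without sleeping and without any Oracle code involved; the function name and its parameters are made up for the illustration.

```python
TIMEOUT = 3.0  # seconds, the semaphore timeout seen in the strace output


def reported_wait_times(idle_seconds, ends_wait_on_timeout):
    """Return the wait time that would be visible at each semaphore timeout.

    ends_wait_on_timeout=True models the master log writer: every timeout
    ends the wait event and a new one starts.
    ends_wait_on_timeout=False models a worker: the same wait event keeps
    being timed across timeouts.
    """
    reported = []
    wait_started = 0.0
    now = 0.0
    while now + TIMEOUT <= idle_seconds:
        now += TIMEOUT
        reported.append(now - wait_started)
        if ends_wait_on_timeout:
            wait_started = now
    return reported


print(reported_wait_times(12, ends_wait_on_timeout=True))   # → [3.0, 3.0, 3.0, 3.0]
print(reported_wait_times(12, ends_wait_on_timeout=False))  # → [3.0, 6.0, 9.0, 12.0]
```

The first list corresponds to ‘rdbms ipc message’ never exceeding 3 seconds, the second to ‘LGWR worker group idle’ growing for as long as the worker has nothing to do.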

In order to fully understand the next part, it is very beneficial to read up on how the log writer works in ‘single log writer’ mode, where the master log writer handles the idle and work cycle itself:
https://fritshoogland.wordpress.com/2018/02/20/a-look-into-into-oracle-redo-part-4-the-log-writer-null-write/
https://fritshoogland.wordpress.com/2018/02/27/a-look-into-oracle-redo-part-5-the-log-writer-writing/

If you want to perform this investigation yourself, make sure the database is in ‘scalable log writer’ mode, by setting “_use_single_log_writer” to FALSE. This is exactly what I did in order to make sure a log write is done in ‘scalable log writer’ mode.

Now let’s first apply some logic. The idle cycle of a log writer worker process is shown above. Based on the ‘log writer null write’ blog post, we know that the log writer does advance the LWN and On-disk SCN every 3 seconds. Clearly, the log writer worker process does not do that. So that must mean the master log writer is still performing that function. It would also make a lot of sense, because it doesn’t matter for scalability whether the master log writer or a worker process advances the LWN and On-disk SCN: nothing is waiting on it. Plus, if the master log writer performs most of its functions just like in ‘single log writer’ mode, the change to scalable mode would mean no change for client processes: any committing process must semop() the log writer to start writing.

Let’s look at the relevant debugtrace output of the master log writer in scalable log writer mode:

 | > kcrfw_redo_write_driver(0, 0, ...)
 | | > kcrfw_handle_member_write_errors(0, 0, ...)
 | | < kcrfw_handle_member_write_errors+0x000000000020 returns: 0x600161a0
 | | > kcmgtsf(0, 0, ...)
 | | | > sltrgatime64(0, 0, ...)
 | | | | > sltrgftime64(0, 0, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7fff1fe13010, ...)
 | | | | | | > clock_gettime(0x1, 0x7fff1fe13010, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x53747fe42
 | | | < sltrgatime64+0x00000000003e returns: 0x155d4fd
 | | < kcmgtsf+0x00000000032f returns: 0x3a182314
 | | > kcrfw_slave_adaptive_updatemode(0, 0x600161a0, ...)
 | | < kcrfw_slave_adaptive_updatemode+0x000000000080 returns: 0x7efe34d1f760
 | | > kcrfw_defer_write(0, 0x600161a0, ...)
 | | < kcrfw_defer_write+0x000000000038 returns: 0x7efe34d1f760
 | | > kcrfw_slave_queue_find(0, 0x600161a0, ...)
 | | < kcrfw_slave_queue_find+0x0000000000f1 returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0x1, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_group_switchpic(0, 0x1, ...)
 | | < kcrfw_slave_group_switchpic+0x000000000050 returns: 0x699b4508
 | | > skgstmGetEpochTs(0, 0x1, ...)
 | | | > gettimeofday@plt(0x7fff1fe13070, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x20debfd6192e5
 | | > kcsnew3(0x600113b8, 0x7fff1fe13228, ...)
 | | | > kcsnew8(0x600113b8, 0x7fff1fe13070, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0
 | | | > ub8_to_kscn_impl(0x66c3c7, 0x7fff1fe13228, ...)
 | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | < kcsnew3+0x00000000006f returns: 0x8000
 | | > ktfwtsm(0x3a182314, 0x7fff1fe13228, ...)
 | | | > kcmgtsf(0x2, 0x7fff1fe13228, ...)
 | | | | > sltrgatime64(0x2, 0x7fff1fe13228, ...)
 | | | | | > sltrgftime64(0x2, 0x7fff1fe13228, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | > clock_gettime(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x537484a6d
 | | | | < sltrgatime64+0x00000000003e returns: 0x155d511
 | | | < kcmgtsf+0x0000000001b2 returns: 0x3a182314
 | | | > kcmtdif(0x3a182314, 0x3a182314, ...)
 | | | < kcmtdif+0x00000000001b returns: 0
 | | | > ksl_get_shared_latch_int(0x60050340, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x60050340, 0x66c3c7, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcn_stm_write(0x7fff1fe13228, 0x66c3c7, ...)
 | | | > kstmgetsectick(0x7fff1fe13228, 0x66c3c7, ...)
 | | | < kstmgetsectick+0x00000000003a returns: 0x5ae4c494
 | | | > ksl_get_shared_latch_int(0x6004ee40, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x6004ee40, 0x2244, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcrfw_redo_write_initpic(0x699b4508, 0x7fff1fe13228, ...)
 | | | > kscn_to_ub8_impl(0x7fff1fe13228, 0x7fff1fe13228, ...)
 | | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | < kcrfw_redo_write_initpic+0x0000000000dc returns: 0x3a182314
 | | > kscn_to_ub8_impl(0x7fff1fe13228, 0, ...)
 | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | > kcrfw_gather_lwn(0x7fff1fe13268, 0x699b4508, ...)
 | | | > kslgetl(0x6abe4538, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe4538, 0x17d5f, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x6abe45d8, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe45d8, 0x137, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff
 | | > krsh_trace(0x1000, 0x200, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | | > kspgip(0x71e, 0x1, ...)
 | | < kspgip+0x00000000023f returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_queue_flush_internal(0x1, 0, ...)
 | | < kcrfw_slave_queue_flush_internal+0x0000000000d7 returns: 0x1
 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x9b, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x6a9a0718
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x9b
 | | | | > kcscu8(0x60016290, 0x7fff1fe12f98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7fff1fe12f38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7fff1fe0ac30, 0x7fff1fe12c50, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x6a9a0720, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3
 | < kcrfw_redo_write_driver+0x000000000535 returns: 0xd3

The highlighted functions are extra functions executed when the instance is set to scalable log writer mode, or when adaptive mode has set the instance to scalable log writer mode. This means that the difference between the modes is minimal when there are no writes, and outside of a few extra functions, the log writer does exactly the same.

The absence of any spectacular changes in the behaviour of the log writer in scalable log writer mode when there are no writes does hint at what the actual change of the scalable mode will be, which is how writing is handled. In single log writer mode, what the log writer process is likely to spend most of its time on is writing the change vectors into the online redologfiles, and maybe, if you have a bad application (!), semop()-ing foreground sessions will be second if there are a large number of processes committing, because every process needs to be semop()-ed individually. These two functions, along with some other functionality, are exactly what the log writer worker processes are doing.

This means that foreground processes do nothing different in scalable log writer mode: they signal (semop) the master log writer, which will investigate the public redo strands. If the master log writer finds change vectors to write, it will assign log writer worker processes to perform the write, and the log writer worker process will semop() the foreground sessions to indicate the redo has been written when the instance is in post/wait mode, or will not semop() them when the instance is in polling mode.

This is the entire function flow of a write when the instance is in scalable log writer mode:

 | > kcrfw_slave_queue_insert(0, 0xd3, ...)
 | | > kcrfw_slave_group_setcurrsize(0, 0, ...)
 | | < kcrfw_slave_group_setcurrsize+0x0000000001d1 returns: 0x1
 | | > _intel_fast_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> _intel_fast_memcpy.P(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> __intel_ssse3_rep_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x6a9a05f8
 | | > kcrfw_slave_group_postall(0, 0xf0, ...)
 | | | > ksvgcls(0, 0xf0, ...)
 | | | < ksvgcls+0x000000000021 returns: 0
 | | | > ksl_post_proc(0x6ddb32f0, 0, ...)
 | | | <> kskpthr(0x6ddb32f0, 0, ...)
 | | | <> kslpsprns(0x6ddb32f0, 0, ...)
 | | | | > ksl_update_post_stats(0x6ddb32f0, 0, ...)
 | | | | | > dbgtTrcData_int(0x7f464c0676c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f464c0676c0, 0x7ffdae335338, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffdae334e40, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x20e067375b55d
 | | | | | | > dbgtrRecAllocate(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f464c0676c0, 0x65accba0, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x4
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x65acda30
 | | | | | | > dbgtrRecEndSegment(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > skgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | <> sskgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | | > semop@plt(0xc0000, 0x7ffdae335410, ...)
 | | | | | < semop+0x00000000000f returns: 0
 | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | < kslpsprns+0x0000000001c3 returns: 0
 | | < kcrfw_slave_group_postall+0x0000000000a8 returns: 0
 | < kcrfw_slave_queue_insert+0x0000000001b6 returns: 0x667bc540

After the instance has established there are change vectors in kcrfw_gather_lwn, in single log writer mode, the function kcrfw_redo_write is called, which will call kcrfw_do_write which handles the writing, and kslpslf to semop any waiting processes among other things. Now in scalable log writer mode, kcrfw_slave_queue_insert is called which assigns work to worker processes, and then kcrfw_slave_group_postall is called to semop one or more worker processes.

The worker processes are sleeping on a semaphore, and if a process gets signalled, it exits the kcrfw_slave_queue_remove function, ends the wait event, and calls kcrfw_redo_write, just like the master log writer process would call in single log writer mode, which includes doing the write (kcrfw_do_write) and posting the foregrounds (kslpslf): exactly the same functions.
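
The division of labour described above resembles a classic producer/consumer setup. The following Python sketch models it with threads and a queue; it is an analogy only and not how Oracle implements it. All names are invented for the illustration, a single shared queue stands in for the worker groups, and an empty list of waiting sessions stands in for a strand without redo to write.

```python
import queue
import threading


def run_demo(strands, worker_count=2):
    """Model one write cycle.

    strands maps a strand name to a list of threading.Event objects, one per
    foreground session waiting for redo in that strand to be written.
    Returns the sorted names of the strands that were handed to a worker.
    """
    work_queue = queue.Queue()   # stand-in for the queue the workers sleep on
    written = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work_queue.get()      # idle: block until the master posts work
            if item is None:             # shutdown marker, only needed for this demo
                return
            strand, waiters = item
            with lock:
                written.append(strand)   # stand-in for doing the write
            for waiter in waiters:
                waiter.set()             # stand-in for posting the foregrounds

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in workers:
        thread.start()

    # Master: inspect every strand and queue work only where there is something to write.
    for strand, waiters in strands.items():
        if waiters:
            work_queue.put((strand, waiters))

    for _ in workers:
        work_queue.put(None)
    for thread in workers:
        thread.join()
    return sorted(written)


sessions = [threading.Event() for _ in range(3)]
result = run_demo({"strand 0": sessions[:2], "strand 1": sessions[2:], "strand 2": []})
print(result, all(s.is_set() for s in sessions))
# → ['strand 0', 'strand 1'] True
```

The point of the model is that the master only decides and dispatches, while writing and notifying the committing sessions happen in the workers and can therefore run in parallel.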

Conclusion.
The adaptive scalable log writer processes function has been silently introduced with Oracle 12, although a lot of the used functionality has been available more or less in earlier versions. It is a fully automatic feature which will turn itself on and off based on heuristics. The purpose of this article is to explain how it works and what it is doing. Essentially, all the functionality that surrounds a log writer write has been moved to a worker process, which means the work can be done in parallel with multiple processes, whilst all the work outside of the work around the write, which is not performance critical, is left with the master log writer.

I have gotten some requests to provide an overview of the redo series of blogposts I am currently running. Here it is:

https://fritshoogland.wordpress.com/2018/01/29/a-look-into-oracle-redo-part-1-redo-allocation-latches/
https://fritshoogland.wordpress.com/2018/02/05/a-look-into-oracle-redo-part-2-the-discovery-of-the-kcrfa-structure/
https://fritshoogland.wordpress.com/2018/02/12/a-look-into-oracle-redo-part-3-the-log-writer-work-cycle-overview/
https://fritshoogland.wordpress.com/2018/02/20/a-look-into-into-oracle-redo-part-4-the-log-writer-null-write/
https://fritshoogland.wordpress.com/2018/02/27/a-look-into-oracle-redo-part-5-the-log-writer-writing/
https://fritshoogland.wordpress.com/2018/03/05/a-look-into-oracle-redo-part-6-oracle-post-wait-commit-and-the-on-disk-scn/
https://fritshoogland.wordpress.com/2018/03/19/a-look-into-oracle-redo-part-7-adaptive-log-file-sync/
https://fritshoogland.wordpress.com/2018/03/26/a-look-into-oracle-redo-part-8-generate-redo/
https://fritshoogland.wordpress.com/2018/04/03/a-look-into-oracle-redo-part-9-commit/
https://fritshoogland.wordpress.com/2018/04/09/a-look-into-oracle-redo-part-9a-commit-concurrency-considerations/
https://fritshoogland.wordpress.com/2018/04/16/a-look-into-oracle-redo-part-10-commit_wait-and-commit_logging/

Private redo strands, In memory undo and throw away undo: https://fritshoogland.wordpress.com/2016/11/15/redo-a-blogpost/

The redo series would not be complete without writing about changing the behaviour of commit. There are two ways to change commit behaviour:

1. Changing waiting for the logwriter to get notified that the generated redo is persisted. The default is ‘wait’. This can be set to ‘nowait’.
2. Changing the way the logwriter handles generated redo. The default is ‘immediate’. This can be set to ‘batch’.

There are actually three ways these changes can be made:
1. As argument of the commit statement: ‘commit’ can be written as ‘commit write wait immediate’ (statement level).
2. As a system level setting. When a commit is executed without an explicit commit mode, the settings of the parameters commit_wait (default: wait) and commit_logging (default: immediate) are used.
3. As a session level setting. When a commit is executed without an explicit commit mode, setting either commit_wait or commit_logging at the session level overrides the settings at the system level.
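
The precedence of these three levels can be written down as a small function. This is a Python sketch of the rule as described in the list above (statement over session over system); the function and dictionary names are made up, and it models only the precedence, not any interaction between the wait and logging settings.

```python
# Documented defaults of the two parameters.
DEFAULTS = {"commit_wait": "wait", "commit_logging": "immediate"}


def effective_commit_mode(system=None, session=None, statement=None):
    """Merge the three levels; a more specific level overrides a less specific one."""
    mode = dict(DEFAULTS)
    for level in (system, session, statement):
        mode.update(level or {})
    return mode


# Nothing set anywhere: the defaults apply.
print(effective_commit_mode())
# → {'commit_wait': 'wait', 'commit_logging': 'immediate'}

# The session overrides the system, an explicit commit clause overrides both.
print(effective_commit_mode(
    system={"commit_logging": "batch"},
    session={"commit_wait": "nowait"},
    statement={"commit_wait": "wait"},
))
# → {'commit_wait': 'wait', 'commit_logging': 'batch'}
```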

At this point I should say that in my personal opinion, if you need to change this, there is something very wrong with how the database is used in the first place. This can enhance performance a bit (totally depending on what you are doing and what your hardware looks like), but it does nothing magical, as you will see.

a) commit wait/nowait
I ran a pin tools debugtrace on a session that commits explicitly with the write mode explicitly set to wait (the default), and a session that commits explicitly with the write mode set to nowait. If you took the time to read the other redo related articles you know that a commit generates change vectors that are written in the public redo strand, changes the transaction table in the undo segment header and then signals the logwriter to write in kcrf_commit_force_int, releases all transactional control on the rows in the transaction that are committed, after which kcrf_commit_force_int is called again in order to wait for the logwriter to get notified that the change vectors have been persisted.

When commit is set to nowait, actually what happens is very simple: everything that is executed in ‘wait mode’ commit is executed in ‘nowait mode’ too, except for calling the kcrf_commit_force_int a second time, which is the functionality to wait for the notification from the logwriter.

commit wait:

 | | < kpoal8+0x000000000f8c returns: 0x2
 | | > ksupop(0x1, 0x7a87a9a0, ...)
 | | | > ksugit_i(0x11526940, 0x7a87a9a0, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7ffda5959c50, 0x7a87a9a0, ...)
 | | | <> __sigsetjmp(0x7ffda5959c50, 0, ...)
 | | | <> __sigjmp_save(0x7ffda5959c50, 0, ...)
 | | | < __sigjmp_save+0x000000000025 returns: 0
 | | | > kcbdsy(0x7ffda5959c50, 0x7f3011cbc028, ...)
 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
...
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | > kslws_check_waitstack(0x3, 0x7f3011d82f40, ...)
 | | | < kslws_check_waitstack+0x000000000065 returns: 0
 | | | > kssdel(0x7a87a9a0, 0x1, ...)
 | | | | > kpdbUidToId(0, 0x1, ...)
 | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | > kss_del_cb(0x7ffda5959b50, 0x7f3011d82f40, ...)
 | | | | | > kpdbUidToId(0, 0x7f3011d82f40, ...)
 | | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | | > ksudlc(0x7a87a9a0, 0x1, ...)

commit nowait:

 | | < kpoal8+0x000000000f8c returns: 0x2
 | | > ksupop(0x1, 0x63c82a38, ...)
 | | | > ksugit_i(0x11526940, 0x63c82a38, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7fff43332a50, 0x63c82a38, ...)
 | | | <> __sigsetjmp(0x7fff43332a50, 0, ...)
 | | | <> __sigjmp_save(0x7fff43332a50, 0, ...)
 | | | < __sigjmp_save+0x000000000025 returns: 0
 | | | > kslws_check_waitstack(0x3, 0x7fd1cea22028, ...)
 | | | < kslws_check_waitstack+0x000000000065 returns: 0
 | | | > kssdel(0x63c82a38, 0x1, ...)
 | | | | > kpdbUidToId(0, 0x1, ...)
 | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | > kss_del_cb(0x7fff43332950, 0x7fd1ceae8f40, ...)
 | | | | | > kpdbUidToId(0, 0x7fd1ceae8f40, ...)
 | | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | | > ksudlc(0x63c82a38, 0x1, ...)

Yes, it’s that simple. In normal commit mode, commit wait, in ksupop (kernel service user pop (restore) user or recursive call) a call to kcbdsy is executed, which performs a tailcall to kcrf_commit_force_int. In nowait commit mode, kcbdsy is simply not called in ksupop, which is exactly what nowait means: the waiting for the logwriter notification is not done.
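
Comparing two debugtraces by eye gets tedious quickly, so the comparison can be scripted. Below is a minimal Python sketch that extracts the called function names from excerpts of the two traces above and lists the ones that only appear in the commit wait trace. The regular expression is my own guess at the debugtrace line format based on the output shown in this article, not an official specification of it.

```python
import re

# Excerpts copied from the two traces above (call lines plus a few return lines).
WAIT_TRACE = """\
 | | > ksupop(0x1, 0x7a87a9a0, ...)
 | | | > ksugit_i(0x11526940, 0x7a87a9a0, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7ffda5959c50, 0x7a87a9a0, ...)
 | | | <> __sigsetjmp(0x7ffda5959c50, 0, ...)
 | | | <> __sigjmp_save(0x7ffda5959c50, 0, ...)
 | | | > kcbdsy(0x7ffda5959c50, 0x7f3011cbc028, ...)
 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | > kslws_check_waitstack(0x3, 0x7f3011d82f40, ...)
 | | | > kssdel(0x7a87a9a0, 0x1, ...)
"""

NOWAIT_TRACE = """\
 | | > ksupop(0x1, 0x63c82a38, ...)
 | | | > ksugit_i(0x11526940, 0x63c82a38, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7fff43332a50, 0x63c82a38, ...)
 | | | <> __sigsetjmp(0x7fff43332a50, 0, ...)
 | | | <> __sigjmp_save(0x7fff43332a50, 0, ...)
 | | | > kslws_check_waitstack(0x3, 0x7fd1cea22028, ...)
 | | | > kssdel(0x63c82a38, 0x1, ...)
"""

# A call line has '>' or '<>' followed by "name(", a return line has "name+0x...".
CALL = re.compile(r"[<>]+ ([\w.@]+)\(")


def called_functions(trace):
    """Return the function names of all call lines, in trace order."""
    names = []
    for line in trace.splitlines():
        match = CALL.search(line)
        if match:
            names.append(match.group(1))
    return names


def only_in_first(first, second):
    """Return the calls that appear in the first trace but not in the second."""
    seen = set(called_functions(second))
    return [name for name in called_functions(first) if name not in seen]


print(only_in_first(WAIT_TRACE, NOWAIT_TRACE))
# → ['kcbdsy', 'kcrf_commit_force_int']
```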

b) commit immediate/batch
I ran a pin tools debugtrace on a session that commits explicitly with the write mode explicitly set to immediate, and a session that commits explicitly with the write mode set to batch. If you read the other redo related articles you know that a commit generates change vectors that are written in the public redo strand, changes the transaction table in the undo segment header and then signals the logwriter to write in kcrf_commit_force_int, then releases all transactional control on the rows in the transaction that are committed, after which kcrf_commit_force_int is called again in order to wait for the logwriter to get notified that the change vectors have been persisted.

When commit is set to batch, actually what happens is very simple: everything is done exactly the same in ‘immediate mode’ commit, except for calling the kcrf_commit_force_int the first time, which is the functionality that triggers the logwriter to write. So it looks like ‘batch mode’ is not explicitly batching writes for the logwriter, but rather the disablement of the signal to the logwriter to write right after the change vectors have been copied and the blocks are changed. But that is not all…

I noticed something weird when analysing the calls in the debugtrace of ‘commit write batch’: not only was the first invocation of kcrf_commit_force_int gone, the second invocation of kcrf_commit_force_int was gone too! That is weird, because the Oracle documentation says:

WAIT | NOWAIT

Use these clauses to specify when control returns to the user.

The WAIT parameter ensures that the commit will return only after the corresponding redo is persistent in the online redo log. Whether in BATCH or IMMEDIATE mode, when the client receives a successful return from this COMMIT statement, the transaction has been committed to durable media. A crash occurring after a successful write to the log can prevent the success message from returning to the client. In this case the client cannot tell whether or not the transaction committed.

The NOWAIT parameter causes the commit to return to the client whether or not the write to the redo log has completed. This behavior can increase transaction throughput. With the WAIT parameter, if the commit message is received, then you can be sure that no data has been lost.

If you omit this clause, then the transaction commits with the WAIT behavior.

The important, and WRONG thing, is in the last line: ‘if you omit this clause, then the transaction commits with the WAIT behavior’. Actually, if the commit mode is set to batch, the commit wait mode flips to nowait with it. It does perform the ultimate batching, which is not sending a signal to the logwriter at all, so what happens is that change vectors in the public redo strands are written to disk by the logwriter only every 3 seconds, because that is the timeout for the logwriter sleeping on a semaphore, after which it obtains any potential redo to write via information in kcrfsg_ and KCRFA structures. This is important, because with NOWAIT behaviour, there is no guarantee changes have been persisted for the committing session.

I was surprised to find this: I searched for ‘kcrf_commit_force_int’ in the debugtrace of a commit with the ‘write batch’ arguments, and did not find a single call. Actually, this has been reported by Marcin Przepiorowski in a comment on an article by Christian Antognini on this topic.

Can this commit batching be changed to include waiting for the logwriter? Yes, actually it can, if you explicitly include ‘wait’ with the commit write batch. Interestingly, the kcrf_commit_force_int function then comes back at a totally different place:

 | | | | | | | | | | | | | < ktuulc+0x000000000119 returns: 0
 | | | | | | | | | | | | | > ktudnx(0x69fc8eb0, 0, ...)
 | | | | | | | | | | | | | | > ktuIMTabCacheCommittedTxn(0x69fc8eb0, 0x7ffe9eb79e74, ...)
 | | | | | | | | | | | | | | < ktuIMTabCacheCommittedTxn+0x000000000071 returns: 0
 | | | | | | | | | | | | | | > kslgetl(0x6ab9d6e8, 0x1, ...)
 | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | > kslfre(0x6ab9d6e8, 0x6ab9ce00, ...)
 | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | | | | | | | | | | < ktudnx+0x0000000005e4 returns: 0
 | | | | | | | | | | | | | > ktuTempdnx(0x69fc8eb0, 0, ...)
 | | | | | | | | | | | | | < ktuTempdnx+0x000000000083 returns: 0
 | | | | | | | | | | | | | > kcb_sync_last_change(0x69fc8eb0, 0x6df64df8, ...)
 | | | | | | | | | | | | | <> kcrf_commit_force_int(0x7f525ba19c00, 0x1, ...)
...
 | | | | | | | | | | | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | | | | | | | | | | | > kghstack_free(0x7f525bb359a0, 0x7f525690ead8, ...)
 | | | | | | | | | | | | | < kghstack_free+0x00000000005a returns: 0
 | | | | | | | | | | | | < ktucmt+0x000000000e0c returns: 0

With commit wait/nowait, as described above, there is a separate call after the transaction in the ksupop function: kcrf_commit_force_int with the second argument set to 1, which means it notifies the logwriter and also waits for the logwriter’s notification of the write. With ‘commit write batch wait’ that separate call is not simply kept. Instead, kcrf_commit_force_int is now called at the end of the ktucmt function, after the function that clears the TX enqueue (ktuulc) and after the undo transaction count has been lowered (ktudnx), as a tail call of kcb_sync_last_change, which wasn’t called before. Of course this limits the IO batching opportunities.

Conclusion
Do not change your database or even your session to make your commit faster. If you must, read this article carefully and understand the trade-offs. One trade-off which hasn’t been highlighted is that this behaviour might change in a different version, and it requires some effort to investigate. And again: if you are still considering this, you probably have a different problem that you should look at. Do not take this option in desperation, hoping for a magical restoration of performance.

The commit_write option nowait does trigger the logwriter to write (the first invocation of the kcrf_commit_force_int function), but it does not wait for write confirmation. The commit_logging option batch does something different from what the documentation says: it does not issue a signal to the logwriter, nor does it wait for it. This way the logwriter can wait the full three seconds before it times out on its semaphore and writes what is in the public redo strands. But there is no way to tell if the redo for your change has been persisted yet, because that wait is gone too (that wait is the infamous ‘log file sync’ wait). If you want batching but still want a write notification, you must set commit_write to wait explicitly. By doing that you do not get the optimal batching, because the session then both signals the logwriter to write and waits for it. I suspect this to be in the same ballpark as regular committing, but I haven’t checked that.
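
The combinations described in this article can be summarised as a small decision table. The Python sketch below is only a model of what I observed in the traces, not anything Oracle provides: the names effective_commit and CommitBehaviour are made up for illustration, and the explicit ‘batch nowait’ case is an assumption that it behaves the same as batch without a wait clause.

```python
from typing import NamedTuple, Optional


class CommitBehaviour(NamedTuple):
    signals_logwriter: bool  # session asks the logwriter to write
    waits_for_write: bool    # session waits for confirmation ('log file sync')


def effective_commit(logging: str = "immediate",
                     wait: Optional[str] = None) -> CommitBehaviour:
    """Hypothetical model of the commit behaviour traced in this article."""
    logging = logging.lower()
    if logging not in ("immediate", "batch"):
        raise ValueError("logging must be 'immediate' or 'batch'")
    if wait is not None:
        wait = wait.lower()
        if wait not in ("wait", "nowait"):
            raise ValueError("wait must be 'wait', 'nowait' or None")

    if wait is None:
        # Observed: with batch and no wait clause, the wait mode flips
        # to nowait, contrary to what the documentation says.
        wait = "nowait" if logging == "batch" else "wait"

    if wait == "wait":
        # Both immediate and batch notify the logwriter and wait for it;
        # with batch this happens in a single call at the end of ktucmt.
        return CommitBehaviour(signals_logwriter=True, waits_for_write=True)

    # nowait: immediate still signals the logwriter, batch does not.
    # Explicit 'batch nowait' is assumed to equal batch without a clause.
    return CommitBehaviour(signals_logwriter=(logging == "immediate"),
                           waits_for_write=False)


if __name__ == "__main__":
    for mode in (("immediate", None), ("immediate", "nowait"),
                 ("batch", None), ("batch", "wait")):
        print(mode, effective_commit(*mode))
```

The only row that leaves redo persistence entirely to the logwriter timeout is the one where both fields are False, which is what ‘commit write batch’ without an explicit ‘wait’ gives you.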
