Archive

Networking

I was asked some time ago what the Oracle database event ‘TCP socket (KGAS)’ means. This blogpost is a deep dive into what this event times in Oracle database 12.1.0.2.180717.

This event is not normally seen, only when TCP connections are initiated from the database using packages like UTL_TCP, UTL_SMTP and the one used in this article, UTL_HTTP.

A very basic explanation is this event times the time that a database foreground session spends on TCP connection management and communicating over TCP, excluding client and database link (sqlnet) networking. If you trace the system calls, you see that mostly that is working with a (network) socket. Part of the code in the oracle database that is managing that, sits in the kernel code layer kgas, kernel generic (of which I am quite sure, and then my guess:) asynchronous services, which explains the naming of the event.

This is what the Oracle online manual (https://docs.oracle.com/database/121/REFRN/GUID-203ACA60-9912-4493-9B79-EA4CDE89A78D.htm#REFRN00642 – Oracle is notorious for breaking links) says about ‘TCP socket (KGAS)’:

C.3.157 TCP Socket (KGAS)
A session is waiting for an external host to provide requested data over a network socket. The time that this wait event tracks does not indicate a problem, and even a long wait time is not a reason to contact Oracle Support. It naturally takes time for data to flow between hosts over a network, and for the remote aspect of an application to process any request made to it. An application that communicates with a remote host must wait until the data it will read has arrived. In addition, on Microsoft Windows, a separate thread monitors the arrival of traffic. This thread spends most of its life in waits tracked by the TCP Socket (KGAS) wait event.

Wait Time: The total elapsed time for the network connection to be established or for data to arrive from over the network

Parameter Description
P0 For Oracle internal use only. Values 8, 9, and 10 occur within the special thread present on Microsoft Windows; other P0 values occur in normal user sessions.

P1 For Oracle internal use only

Overall, the basic explanation that Oracle provides is mostly correct. I think the text saying to not contact Oracle support is not relevant, but maybe there is a need to relieve Oracle support. In my tests, I found that the full TCP connection lifecycle (creation, usage and removal) is put under this event, for which the text seems to emphasise on waiting for a remote host, which would be the most logical culprit for wait times, but other issues could lead to wait times additionally. This means the wait event itself is not explaining what it is showing, outside of TCP connection management.

The wait time explanation is nearly complete. If it would say something like ‘all TCP connection management and usage’ it would have fully covered it, it now excludes disconnecting and sending, because it explicitly mentions creating the connecting and receiving (waiting for data).

I do not understand what is meant with P0 and P1. I think it is p1 and p2 of the wait event, but naming it P0 and P1 is weird. When looking at the explanation it reads to me ‘we do not wish to explain anything to you’.

So, that means I am going to find this out myself….

If you are interested in this, or do want to write articles like this too, I urge you to replay this on your own system.

First of all, create a small setup which you can use to actually execute UTL_HTTP. The example essentially is taken from father of code examples, Tim Hall/Oracle base. Thank you Tim!
First setup the database to allow a user (in my case ‘ts’) to create the objects and use the network:

grant create sequence to ts;
grant create procedure to ts;
grant create table to ts;
grant alter session to ts;
begin
  dbms_network_acl_admin.create_acl (
    acl => 'anything.xml',
    description => 'allow anything',
    principal => 'TS',
    is_grant => true,
    privilege => 'connect'
  );
end;
begin
  dbms_network_acl_admin.assign_acl (
    acl => 'anything.xml',
    host => '*'
  );
end;

Then connect as the actual user (ts), and create the objects and the procedure that uses UTL_HTTP:

drop table http_clob_test;
create table http_clob_test (
        id number(10),
        url varchar2(255),
        data clob,
        constraint http_clob_test_pk primary key (id)
);
drop sequence http_clob_test_seq;
create sequence http_clob_test_seq;
CREATE OR REPLACE PROCEDURE load_html_from_url (p_url  IN  VARCHAR2) AS
  -- examples by tim hall
  -- https://oracle-base.com/articles/misc/retrieving-html-and-binaries-into-tables-over-http
  l_http_request   UTL_HTTP.req;
  l_http_response  UTL_HTTP.resp;
  l_clob           CLOB;
  l_text           VARCHAR2(32767);
BEGIN
  DBMS_LOB.createtemporary(l_clob, FALSE);

  -- Make a HTTP request and get the response.
  l_http_request  := UTL_HTTP.begin_request(p_url);
  l_http_response := UTL_HTTP.get_response(l_http_request);

  -- Copy the response into the CLOB.
  BEGIN
    LOOP
      UTL_HTTP.read_text(l_http_response, l_text, 32766);
      DBMS_LOB.writeappend (l_clob, LENGTH(l_text), l_text);
    END LOOP;
  EXCEPTION
    WHEN UTL_HTTP.end_of_body THEN
      UTL_HTTP.end_response(l_http_response);
  END;

  -- Insert the data into the table.
  INSERT INTO http_clob_test (id, url, data)
  VALUES (http_clob_test_seq.NEXTVAL, p_url, l_clob);

  -- Relase the resources associated with the temporary LOB.
  DBMS_LOB.freetemporary(l_clob);
EXCEPTION
  WHEN OTHERS THEN
    UTL_HTTP.end_response(l_http_response);
    DBMS_LOB.freetemporary(l_clob);
    RAISE;
END load_html_from_url;
/

The first thing to do is turn on sql_trace at level 8 to include waits:

set serverout on
alter session set events 'sql_trace level 8';
exec load_html_from_url('http://orafun.info/');
alter session set events 'sql_trace off';

If you look at the relevant piece, which means where it shows the wait events, it looks like this:

WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 128265  =2  =0  =0 obj#=662 tim=86395107497
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 395  =5  =0  =0 obj#=662 tim=86395110191
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 150  =6  =0  =0 obj#=662 tim=86395111115
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 131998  =6  =0  =0 obj#=662 tim=86395243764
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 269  =4  =0  =0 obj#=662 tim=86395245182
WAIT #139864521752120: nam='direct path write temp' ela= 4137 file number=201 first dba=257795 block cnt=1 obj#=662 tim=86395250494
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 352  =3  =2  =0 obj#=662 tim=86395251294

What is shown here is some quite spectacular differences in elapsed time. Also, the only way to understand what is actually done flagged as ‘TCP Socket (KGAS)’ is the value following ‘ela’, which is the event p1 value.
The pattern is:

- 2
- 5
- 6
- 6
- 4
- 3

It’s relatively simple to guess what a few of these are:

- 2 - connect
- 5 - send
- 6 - \
- 6 - |   receiving ?
- 4 - /
- 3 - close

But if you include the timing, there must be more into play:

- 2 - ela= 128265   connect
- 5 - ela= 395      send
- 6 - ela= 150      \
- 6 - ela= 131998   |   receiving ?
- 4 - ela= 14       /
- 3 - ela= 177     close

2/connect: In order to build up a connection, a tcp connection needs to be created and established. That takes some time.
5/send: Sending from the perspective of a userland process is writing into a socket, which will get send by the operating system independently. This means sending from a userland process normally takes relative little time, because it’s not waiting for actually sending it.
6,4/receive: At this time, this doesn’t make sense to me.
3/close: Closing for a userland process is a simple, swift task. The operating system will keep the port open for some time, etc. but this is not visible for the user land application.

Let’s pull an old trick out of the hat: use strace (system call tracing) with an emphasis on writing on an oracle session that has SQL trace with waits enabled set. This will show the system calls executed, and show exactly when the oracle engine ends a wait, so we can reasonably well establish a relation between wait events and system calls. I say “reasonably well”, because we can’t see when Oracle truly started timing the wait event (kslwtbctx), only the output to trace file as part of ending the wait event (kslwtectx).

The way I done it, is using the command ‘strace -e write=all -p 18513 -o utl_http_test.txt’. Obviously 18513 is the process ID of the database foreground process. The results of the strace are in utl_http_test.txt.

Now open utl_http_test.txt and search for KGAS. The full output is way too much text, let me show some of the output which I think is noteworthy. Again: this is selective, partial output.
I do maintain the order in which the calls are visible.

1. TCP Socket (KGAS) p1=2, earlier annotated as ‘connect’

-- try to find a socket that has been created by nscd (name server caching deamon)
-- two times?
--
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
--
-- obtain file status of resolv.conf (hostname resolving configuration file)
--
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
--
-- open and read host.conf (another hostname resolving configuration file)
--
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "multi on\n", 4096)             = 9
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open and read resolv.conf (hostname resolving configuration)
--
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "# Generated by NetworkManager\nse"..., 4096) = 93
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open /etc/hosts (ip address to hostname mapping locally)
--
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=200, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "127.0.0.1   localhost localhost."..., 4096) = 200
read(9, "", 4096)                       = 0
close(9)
--
-- at this point two dynamic loadable libraries are read: libnss_dns.so.2 and libresolv.so.2
--
-- this is the DNS lookup of orafun.info
-- again, this is done twice, just like the use of /var/run/nscd/socket above?
--
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 0)    = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "=#\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  3d 23 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  =#...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 01 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 5000)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [45])                = 0
recvfrom(9, "=#\201\200\0\1\0\1\0\0\0\0\6orafun\4info\0\0\1\0\1\300\f\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 45
close(9)                                = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 4971) = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "o=\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  6f 3d 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  o=...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 1c 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 4970)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [109])               = 0
recvfrom(9, "o=\201\200\0\1\0\0\0\1\0\0\6orafun\4info\0\0\34\0\1\300\f\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 109
close(9)                                = 0
--
-- an epoll is created at file descriptor 9 (epoll: I/O event notification facility)
--
epoll_create(82)                        = 9
fcntl(9, F_SETFD, FD_CLOEXEC)           = 0
--
-- an IPV6 socket is created at file descriptor 11, 
-- bound to the IPV6 equivalent of localhost (::1),
-- destination port 0, source port 63257,
-- and is NOT connected.
--
socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 11
bind(11, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(11, {sa_family=AF_INET6, sin6_port=htons(63257), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
getpeername(11, 0x7ffdea6ba0f8, 0x7ffdea6ba1c8) = -1 ENOTCONN (Transport endpoint is not connected)
getsockopt(11, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
getsockopt(11, SOL_SOCKET, SO_RCVBUF, [262144], [4]) = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
--
-- File descriptor 11 is added to the epoll at file descriptor 9.
--
epoll_ctl(9, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=3110993336, u64=139864426020280}}) = 0
--
-- A connection is created to the true destination (orafun.info/18.218.92.122).
-- This connection gets file descriptor 12.
-- Destination port 80 (http), source port 11751.
--
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 12
fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(12, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("18.218.92.122")}, 16) = -1 EINPROGRESS (Operation now in progress)
times(NULL)                             = 438106227
mmap(NULL, 786432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34b959b000
poll([{fd=12, events=POLLOUT}], 1, 60000) = 1 ([{fd=12, revents=POLLOUT}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(12, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(12, F_SETFL, O_RDWR)              = 0
getsockname(12, {sa_family=AF_INET, sin_port=htons(11751), sin_addr=inet_addr("10.0.2.15")}, [16]) = 0
getsockopt(12, SOL_SOCKET, SO_SNDBUF, [87040], [4]) = 0
getsockopt(12, SOL_SOCKET, SO_RCVBUF, [374400], [4]) = 0
setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(12, F_SETFD, FD_CLOEXEC)          = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=2
--
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 32 38 32  36 35 20 20 3d 32 20 20  la= 128265  =2   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 31 30 37 34 39 37     tim=86395107497  |

So yes, I am not sure if all of this is in the wait event, but there is a lot of stuff happening to build a connection to the remote server.

In order to find out why the lookup which is tried via the NSCD socket at the beginning, and later via DNS, is done twice, I ran the same procedure again and used tcpdump to look at the actual network traffic. This explained a lot:

# tcpdump -n host 10.0.2.3 and port 53
09:14:02.923389 IP 10.0.2.15.16819 > 10.0.2.3.domain: 15651+ A? orafun.info. (29)
09:14:02.948791 IP 10.0.2.3.domain > 10.0.2.15.16819: 15651 1/0/0 A 18.218.92.122 (45)
09:14:02.952304 IP 10.0.2.15.54590 > 10.0.2.3.domain: 28477+ AAAA? orafun.info. (29)
09:14:02.979534 IP 10.0.2.3.domain > 10.0.2.15.54590: 28477 0/1/0 (109)

In other words: first a DNS A record is requested (TCPv4 DNS name lookup), which results in the IPv4 ip address, then a DNS AAAA record is requested (TCPv6 DNS name lookup), which resulted in no ip address. In other words: orafun.info only has an IPv4 ip address. So the two lookups actually do have a function.

2. TCP Socket (KGAS) p1=5, earlier annotated as ‘send’
These are the systemcalls that are visible and quite probably related to the send wait event:

--
-- file descriptor 12 holding the connection to the destination server is added to the epoll at file descriptor 9
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- Then the http get request is sent to the destination server at its normal file descriptor, 12.
--
write(12, "GET / HTTP/1.1\r\nHost: orafun.inf"..., 56) = 56
 | 00000  47 45 54 20 2f 20 48 54  54 50 2f 31 2e 31 0d 0a  GET / HTTP/1.1.. |
 | 00010  48 6f 73 74 3a 20 6f 72  61 66 75 6e 2e 69 6e 66  Host: orafun.inf |
 | 00020  6f 0d 0a 43 6f 6e 6e 65  63 74 69 6f 6e 3a 20 63  o..Connection: c |
 | 00030  6c 6f 73 65 0d 0a 0d 0a                           lose....         |
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=5
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 39 35 20  20 3d 35 20 20 3d 30 20  la= 395  =5  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  30 31 39 31              =86395110191     |

3. TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 0, so it doesn't block.
--
epoll_wait(9, [], 82, 0)                = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 35 30 20  20 3d 36 20 20 3d 30 20  la= 150  =6  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  31 31 31 35              =86395111115     |

I have been thinking a lot about this seemingly weird call. It calls epoll_wait, but indicates it doesn’t want to wait (timeout=0), and even if epol_wait would have returned anything, indicated by a return code > 0, the epoll_event pointer is not set (indicated by []). The epoll file descriptor is used, but the only working file descriptor in the epoll is file descriptor 12, which has just been sent a http GET command, so the functionality of epoll is used.

This doesn’t make sense, unless you think about the asynchronous IO implementation of Oracle (see a lot of my earlier investigations), for which (in the case of asynchronous IO) io_getevents was called in a similar matter, timeout set to 0, to be able to do more requests while earlier IO requests are executed by the kernel. So my current theory here is that if multiple requests are happening, this mechanism provides a way to handle them.

If you have a simple single request, like in my case, this systemcall seems redundant. And because it queries the epoll file descriptor right after the request, it returns zero events, because there hardly has been any time after sending the http GET request.

4. Second TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 30000 (milliseconds).
--
epoll_wait(9, [{EPOLLIN, {u32=3110998864, u64=139864426025808}}], 82, 30000) = 1
--
-- And this is the second wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 33 31 39  39 38 20 20 3d 36 20 20  la= 131998  =6   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 32 34 33 37 36 34     tim=86395243764  |

This is the second time epoll_wait is called, and this one is blocking, because timeout has been set to 30000 milliseconds. If you look at the ela time, this took some time, and this now makes perfect sense: this system calls waits for an event to become available in the epoll, so it waits for the response of the remote http server. Please mind this call just notifies the userland process that the response is ready, the received data yet has to be read.

5. TCP Socket (KGAS) p1=4, earlier annotated as ‘receive’

--
-- At this point we know there is a response. First the original file descriptor is removed from the epoll:
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6b9710) = 0
--
-- The the response is read from file descriptor 12:
--
read(12, "HTTP/1.1 200 OK\r\nServer: nginx/1"..., 4096) = 2687
--
-- Then file descriptor 12 is added to the epoll again.
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=4
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 32 36 39 20  20 3d 34 20 20 3d 30 20  la= 269  =4  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 34  35 31 38 32              =86395245182     |

So, what p1 set to 4 actually means, is that once the connection did return data, which is checked using epoll, and visible with p1 set to 6, it is read into the process. This is also the reason this takes very little time, this is the time to read data from kernelspace to user space, and to manage the connection’s file descriptor. It is taken off the epoll in order not to disturb it, and it is added again because there could be another request.

6. TCP Socket (KGAS) p1=3, earlier annotated as ‘close’

--
-- file descriptor 12 removed from the epoll
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6bac20) = 0
--
-- file descriptor 12 is closed, closing the network connection
--
close(12)                               = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=3
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 35 32 20  20 3d 33 20 20 3d 32 20  la= 352  =3  =2  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 35  31 32 39 34              =86395251294     |

I don’t think this part holds any surprises. The network file descriptor is first removed from the epoll, and then it is closed, ending the TCP connection that was setup to perform a http request (in my case, I didn’t test, but I believe you will see the same with for example a SMTP connection, or any other type of TCP connection).

Summary

The basic message of this article is not surprising, nor does it conflict with current knowledge. Whenever you see a wait event ‘TCP Socket (KGAS)’, it means a foreground process is performing TCP networking via PLSQL. This wait event is a single event for creating, sending, receiving and closing a connection.

The true information of this article is how you can use the p1 value of the event to learn what actually the foreground is doing, and thus should give you more information to troubleshoot in the case of long waiting times.

TCP Socket (KGAS) p1 values:
1 - ?
2 - perform DNS lookup and create connection
3 - close connection
4 - copy TCP response into process space
5 - send request
6 - wait for TCP response to become available
7 - ?
8 - ? \
9 - ? | According to documentation, windows only in a 'special thread'.
10- ? /
Advertisement

Starting from Oracle 12, in a default configured database, there are more log writer processes than the well known ‘LGWR’ process itself, which are the ‘LGnn’ processes:

$ ps -ef | grep test | grep lg
oracle   18048     1  0 12:50 ?        00:00:13 ora_lgwr_test
oracle   18052     1  0 12:50 ?        00:00:06 ora_lg00_test
oracle   18056     1  0 12:50 ?        00:00:00 ora_lg01_test

These are the log writer worker processes, for which the minimal amount is equal to the amount public redo strands. Worker processes are assigned to a group, and the group is assigned to a public redo strand. The amount of worker processes in the group is dependent on the undocumented parameter “_max_log_write_parallelism”, which is one by default.

The actual usage of the worker processes is dependent in the first place on the value of the undocumented parameter “_use_single_log_writer”, for which the default value is ‘ADAPTIVE’, which means it’s switching automatically between ‘single log writer mode’, which is the traditional way of the LGWR process handling everything that the log writer functionality needs to do, and the ‘scalable log writer mode’, which means the log writer functionality is presumably using the log writer worker processes.

Other values for “_use_single_log_writer” are ‘TRUE’ to set ‘single log writer mode’, or ‘FALSE’ to set ‘scalable log writer mode’ fixed.

I assume most readers of this blog will know that the master log writer idle work cycle is sleeping on a semaphore (semtimedop()) under the wait event ‘rdbms ipc message’ for 3 seconds, then performs some “housekeeping”, after which it’ll sleep again repeating the small cycle of sleeping and housekeeping. For the log writer worker processes, this looks different if you look at the wait event information of the log writer worker processes:

135,59779,@1    14346                    DEDICATED oracle@memory-presentation.local (LGWR)	    time:1909.44ms,event:rdbms ipc message,seq#:292
48,34282,@1     14350                    DEDICATED oracle@memory-presentation.local (LG00)	    time:57561.85ms,event:LGWR worker group idle,seq#:150
136,24935,@1    14354                    DEDICATED oracle@memory-presentation.local (LG01)	    time:112785.66ms,event:LGWR worker group idle,seq#:74

The master log writer process (LGWR) has been sleeping for 1.9s when I queried the database, and it will sleep for 3 seconds, and then do some work and sleep again. However, the log writer worker processes have been sleeping for much longer: LG00 for 57.6s and LG01 for 112.8s, and the event is different: ‘LGWR worker group idle’. How is this implemented? Let’s look!

$ strace -p $(pgrep lg01)
strace: Process 14354 attached
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

I used strace on the LG01 process, and it’s still doing the same as most idle background processes are doing: sleeping on a semaphore for 3 seconds. But, it does not end its wait like LGWR does, the event the log writer worker processes are waiting in keeps on being timed.

Using a pin tools debugtrace shows the following:

 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | > __errno_location(0x38000, 0x7ffce278c328, ...)
 | | | > fthread_self(0x38000, 0x7ffce278c328, ...)
 | | | < fthread_self+0x000000000024 returns: 0
 | | < __errno_location+0x000000000010 returns: 0x7f7e930a26a0
 | < sskgpwwait+0x00000000014e returns: 0
 < skgpwwait+0x0000000000e0 returns: 0
 > ksuSdiInProgress(0x19e80, 0x19e80, ...)
 < ksuSdiInProgress+0x000000000035 returns: 0
 > sltrgftime64(0x19e80, 0x19e80, ...)
 | > clock_gettime@plt(0x1, 0x7ffce278c3a0, ...)
 | | > clock_gettime(0x1, 0x7ffce278c3a0, ...)
 | | < clock_gettime+0x000000000069 returns: 0
 | < clock_gettime+0x00000000003a returns: 0
 < sltrgftime64+0x00000000004c returns: 0x19c253f3ff
 > kslwo_getcbk(0xa2, 0xd80fa62, ...)
 < kslwo_getcbk+0x000000000017 returns: 0
 > kgslwait_last_waitctx_time_waited_usecs(0x7f7e930a29a0, 0x6dfd01c0, ...)
 < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x25e5e80
 > kskiorm(0x6d1854a8, 0, ...)
 < kskiorm+0x00000000001e returns: 0
 > kfias_iswtgon_ksfd(0x6d1854a8, 0, ...)
 < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 > kxdbio_has_work(0x7ffce278c3c4, 0x6003d010, ...)
 < kxdbio_has_work+0x000000000027 returns: 0
 > skgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | > kslwait_conv_wait_time(0x2dc6c0, 0x7f7e930a7ca0, ...)
 | < kslwait_conv_wait_time+0x000000000027 returns: 0x2dc6c0
 | > sskgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | | > semtimedop(0x38000, 0x7ffce278c328, ...)
 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff

And a full stack trace of a log writer worker look like this:

$ pstack $(pgrep lg01)
#0  0x00007feda8eaebda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000031fb4d0 in kcrfw_slave_queue_remove ()
#6  0x00000000031fad2a in kcrfw_slave_group_main ()
#7  0x00000000012160fa in ksvrdp_int ()
#8  0x000000000370d99a in opirip ()
#9  0x0000000001eb034a in opidrv ()
#10 0x0000000002afedf1 in sou2o ()
#11 0x0000000000d0547a in opimai_real ()
#12 0x0000000002b09b31 in ssthrdmain ()
#13 0x0000000000d05386 in main ()

If you combine the pstack backtrace and the debugtrace information, you see that the idle cycle does not leave the ‘ksliwat’ function, so the wait event is not finished. Quickly looking at the other functions, it’s easy to spot it reads the system clock (sltrgftime64), updates some information (kgslwait_last_waitctx_time_waited_usecs) and then performs some proactive IO checks (kskiorm, kfias_iswtgon_ksfd, kxdbio_has_work) after which it calls the post/wait based functions to setup the semaphore again.

Conclusion so far is the log writer workers do perform a 3 second sleep just like the master log writer, however the wait event ‘LGWR worker group idle’ is not interrupted like ‘rdbms ipc message’ is for the master log writer. This means the wait time for the event for each worker process indicates the last time the worker process actually performed something. A next logical question then is: but what do the log writer worker processes perform? Do they entirely take over the master log writer functionality, or do they work together with the master log writer?

In order to fully understand the next part, it is very beneficial to read up on how the log writer works in ‘single log writer’ mode, where the master log writer handling the idle and work cycle itself:
https://fritshoogland.wordpress.com/2018/02/20/a-look-into-into-oracle-redo-part-4-the-log-writer-null-write/
https://fritshoogland.wordpress.com/2018/02/27/a-look-into-oracle-redo-part-5-the-log-writer-writing/

If you want to perform this investigation yourself, make sure the database is in ‘scalable log writer’ mode, by setting “_use_single_log_writer” to FALSE. This is exactly what I did in order to make sure a log write is done in ‘scalable log writer’ mode.

Now let’s first apply some logic. Above the idle cycle of a log writer worker process is shown. Based on the ‘log writer null write’ blog post, we know that the log writer does advance the LWN and On-disk SCN every 3 seconds. Clearly, the log writer worker process does not do that. So that must mean the master log writer is still performing that function. It would also make very much sense, because it doesn’t matter for scalability if the master log writer performs the function of advancing the LWN and On-disk SCN or a worker process, nothing is waiting on it. Plus, if the master log writer performs most of its functions just like in ‘single log writer’ mode, the change to scalable mode would mean no change for client processes, any committing process must semop() the log writer to start writing.

Let’s look at the relevant debugtrace output of the master log writer in scalable log writer mode:

 | > kcrfw_redo_write_driver(0, 0, ...)
 | | > kcrfw_handle_member_write_errors(0, 0, ...)
 | | < kcrfw_handle_member_write_errors+0x000000000020 returns: 0x600161a0
 | | > kcmgtsf(0, 0, ...)
 | | | > sltrgatime64(0, 0, ...)
 | | | | > sltrgftime64(0, 0, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7fff1fe13010, ...)
 | | | | | | > clock_gettime(0x1, 0x7fff1fe13010, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x53747fe42
 | | | < sltrgatime64+0x00000000003e returns: 0x155d4fd
 | | < kcmgtsf+0x00000000032f returns: 0x3a182314
 | | > kcrfw_slave_adaptive_updatemode(0, 0x600161a0, ...)
 | | < kcrfw_slave_adaptive_updatemode+0x000000000080 returns: 0x7efe34d1f760
 | | > kcrfw_defer_write(0, 0x600161a0, ...)
 | | < kcrfw_defer_write+0x000000000038 returns: 0x7efe34d1f760
 | | > kcrfw_slave_queue_find(0, 0x600161a0, ...)
 | | < kcrfw_slave_queue_find+0x0000000000f1 returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0x1, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_group_switchpic(0, 0x1, ...)
 | | < kcrfw_slave_group_switchpic+0x000000000050 returns: 0x699b4508
 | | > skgstmGetEpochTs(0, 0x1, ...)
 | | | > gettimeofday@plt(0x7fff1fe13070, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x20debfd6192e5
 | | > kcsnew3(0x600113b8, 0x7fff1fe13228, ...)
 | | | > kcsnew8(0x600113b8, 0x7fff1fe13070, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0
 | | | > ub8_to_kscn_impl(0x66c3c7, 0x7fff1fe13228, ...)
 | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | < kcsnew3+0x00000000006f returns: 0x8000
 | | > ktfwtsm(0x3a182314, 0x7fff1fe13228, ...)
 | | | > kcmgtsf(0x2, 0x7fff1fe13228, ...)
 | | | | > sltrgatime64(0x2, 0x7fff1fe13228, ...)
 | | | | | > sltrgftime64(0x2, 0x7fff1fe13228, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | > clock_gettime(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x537484a6d
 | | | | < sltrgatime64+0x00000000003e returns: 0x155d511
 | | | < kcmgtsf+0x0000000001b2 returns: 0x3a182314
 | | | > kcmtdif(0x3a182314, 0x3a182314, ...)
 | | | < kcmtdif+0x00000000001b returns: 0
 | | | > ksl_get_shared_latch_int(0x60050340, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x60050340, 0x66c3c7, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcn_stm_write(0x7fff1fe13228, 0x66c3c7, ...)
 | | | > kstmgetsectick(0x7fff1fe13228, 0x66c3c7, ...)
 | | | < kstmgetsectick+0x00000000003a returns: 0x5ae4c494
 | | | > ksl_get_shared_latch_int(0x6004ee40, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x6004ee40, 0x2244, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcrfw_redo_write_initpic(0x699b4508, 0x7fff1fe13228, ...)
 | | | > kscn_to_ub8_impl(0x7fff1fe13228, 0x7fff1fe13228, ...)
 | | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | < kcrfw_redo_write_initpic+0x0000000000dc returns: 0x3a182314
 | | > kscn_to_ub8_impl(0x7fff1fe13228, 0, ...)
 | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | > kcrfw_gather_lwn(0x7fff1fe13268, 0x699b4508, ...)
 | | | > kslgetl(0x6abe4538, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe4538, 0x17d5f, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x6abe45d8, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe45d8, 0x137, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff
 | | > krsh_trace(0x1000, 0x200, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | | > kspgip(0x71e, 0x1, ...)
 | | < kspgip+0x00000000023f returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_queue_flush_internal(0x1, 0, ...)
 | | < kcrfw_slave_queue_flush_internal+0x0000000000d7 returns: 0x1
 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x9b, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x6a9a0718
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x9b
 | | | | > kcscu8(0x60016290, 0x7fff1fe12f98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7fff1fe12f38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7fff1fe0ac30, 0x7fff1fe12c50, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x6a9a0720, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3
 | < kcrfw_redo_write_driver+0x000000000535 returns: 0xd3

The highlighted functions are extra functions executed when the instance is set to scalable log writer mode, or when adaptive mode has set the instance to scalable log writer mode. This means that the changes between the modes is minimal when there’s no writes, and outside of a few extra functions, the log writer does exactly the same.

The absence of any spectacular changes in the behaviour of the log writer when in scalable log writer mode when there are no writes does hint what the actual changes will be of the scalable mode, which is how writing is handled. In single log writer mode, the most time the log writer is process is likely to spend on is writing the change vectors into the online redologfiles, and maybe, if you have a bad application (!) semop()-ing foreground sessions will be second, if there are a large number of processes committing, because every process needs to be semop()-ed individually. These two functions, along with some other functionality are exactly what the log writer worker processes are doing.

This means that foreground processes do nothing different in scalable log writer mode, they signal (semop) the master log writer, which will investigate the public redo strands, and if the master log writer finds change vectors to write, it will assign log writer worker processes to perform the write, and the log writer worker process will semop() the foreground sessions to indicate the redo has been written when the instance is in post/wait mode, or do not semop() when the instance is in polling mode.

This is the entire function flow of a write when the instance is in scalable log writer mode:

 | > kcrfw_slave_queue_insert(0, 0xd3, ...)
 | | > kcrfw_slave_group_setcurrsize(0, 0, ...)
 | | < kcrfw_slave_group_setcurrsize+0x0000000001d1 returns: 0x1
 | | > _intel_fast_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> _intel_fast_memcpy.P(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> __intel_ssse3_rep_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x6a9a05f8
 | | > kcrfw_slave_group_postall(0, 0xf0, ...)
 | | | > ksvgcls(0, 0xf0, ...)
 | | | < ksvgcls+0x000000000021 returns: 0
 | | | > ksl_post_proc(0x6ddb32f0, 0, ...)
 | | | <> kskpthr(0x6ddb32f0, 0, ...)
 | | | <> kslpsprns(0x6ddb32f0, 0, ...)
 | | | | > ksl_update_post_stats(0x6ddb32f0, 0, ...)
 | | | | | > dbgtTrcData_int(0x7f464c0676c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f464c0676c0, 0x7ffdae335338, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffdae334e40, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x20e067375b55d
 | | | | | | > dbgtrRecAllocate(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f464c0676c0, 0x65accba0, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x4
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x65acda30
 | | | | | | > dbgtrRecEndSegment(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > skgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | <> sskgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | | > semop@plt(0xc0000, 0x7ffdae335410, ...)
 | | | | | < semop+0x00000000000f returns: 0
 | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | < kslpsprns+0x0000000001c3 returns: 0
 | | < kcrfw_slave_group_postall+0x0000000000a8 returns: 0
 | < kcrfw_slave_queue_insert+0x0000000001b6 returns: 0x667bc540

After the instance has established there are change vectors in kcrfw_gather_lwn, in single log writer mode, the function kcrfw_redo_write is called, which will call kcrfw_do_write which handles the writing, and kslpslf to semop any waiting processes among other things. Now in scalable log writer mode, kcrfw_slave_queue_insert is called which assigns work to worker processes, and then kcrfw_slave_group_postall is called to semop one or more worker processes.

The worker processes are sleeping on a semaphore, and if a process gets signalled, it exits the kcrfw_slave_queue_remove function, ends the wait event, and calls kcrfw_redo_write, just like the master log writer process would call in single log writer mode, which includes doing the write (kcrfw_do_write) and posting the foregrounds (kslpslf), exactly all the functions.

Conclusion.
The adaptive scalable log writer processes function has been silently introduced with Oracle 12, although a lot of the used functionality has been available more or less in earlier versions. It is a fully automatic feature which will turn itself on and off based on heuristics. The purpose of this article is to explain how it works and what it is doing. Essentially, all the functionality that surrounds a log writer write has been moved to a worker process, which means the work can be done in parallel with multiple processes, whilst all the work outside of the work around the write, which is not performance critical, is left with the master log writer.

To me, ‘cloud computing’ is renting a compute resource to perform a task. In order to use that compute resource, you need to instruct it to do something, which is typically done via the network. If the task the compute resource needs to fulfil is being an application server or being a client or both in the case of an application server that uses an Oracle database, the network latency between the client of the database and the database server is a critical property.

I think so far everybody is with me. If we zoom in to the network, it becomes more difficult, and *very* easy to make wrong assumptions. Let me explain. A network, but really any connection between processing and a resource, has two DIFFERENT properties that I see getting mixed up consistently. These are:
* Latency: the time it takes for a signal or (network) packet to travel from the client to the server, or the time it takes to travel from the client to the server and back.
* Bandwidth: the amount of data that can be transported from the client to the server in a certain time.

How do you determine the latency of a network? Probably the most people respond with ‘use ping’. This is how that looks like:

[user@oid1 ~]$ ping -c 3 lsh1
PING lsh1 (x.x.x.x) 56(84) bytes of data.
64 bytes from lsh1: icmp_seq=1 ttl=62 time=680 ms
64 bytes from lsh1: icmp_seq=2 ttl=62 time=0.304 ms
64 bytes from lsh1: icmp_seq=3 ttl=62 time=0.286 ms

The question I often ask myself is: what is that we see actually? How does this work?
In order to answer that question, the tcpdump tool can answer that question. Using tcpdump, you can capture the network packets on which the ping utility based the above outcome. The ‘-ttt’ option calculates the time between each arrived packet:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host lsh1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1 > lsh1: ICMP echo request, id 35879, seq 1, length 64
00:00:00.680289 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 1, length 64
00:00:00.319614 IP oid1 > lsh1: ICMP echo request, id 35879, seq 2, length 64
00:00:00.000287 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 2, length 64
00:00:01.000180 IP oid1 > lsh1: ICMP echo request, id 35879, seq 3, length 64
00:00:00.000269 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 3, length 64

So, ping works by sending a packet (ICMP echo request) requesting a reply (ICMP echo reply) from the remote server, and measure the time it takes to get that reply. Great, quite simple, isn’t it? However, the biggest issue I see this is using a protocol that is not used for sending regular data (!). Most application servers I encounter send data using TCP (transmission control protocol), the traffic ping sends are sent using a protocol called ICMP (internet control message protocol). Especially in the cloud, which means (probably) a lot of the infrastructure is shared, ICMP might be given different priority than TCP traffic, which you quite probably are using when the application on your cloud virtual machine is running. For those of you who haven’t looked into the network side of the IT landscape, you can priorise protocols and even specific ports, throttle traffic and you can even terminate it. In fact, a sensible protected (virtual) machine in the cloud will not respond to ICMP echo requests in order to protected it from attacks.

So, what would be a more sensible approach then? A better way would be to use the same protocol and port number that your application is going to use. This can be done using a tool called hping. Using that tool, you can craft your own packet with the protocol and flags you want. In the case of Oracle database traffic that would be the TCP protocol, port 1521 (it can be any port number, 1521 is the default port). This is how you can do that. In order to mimic starting a connection, the S (SYN) flag is set (-S), one packet is send (-c 1) to port 1521 (-p 1521).

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip

What this does is best investigated with tcpdump once again. The server this is executed against can respond in two ways (three actually). When you send this to TCP port 1521 where a listener (or any other daemon that listens on that port) is listening, this is the response:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [S], seq 1436552830, win 512, length 0
00:00:00.001229 IP db01-vip.ncube-lm > oid1.kjtsiteserver: Flags [S.], seq 2397022511, ack 1436552831, win 14600, options [mss 1460], length 0
00:00:00.000023 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [R], seq 1436552831, win 0, length 0

This is a variation of the classic TCP three way handshake:
1. A TCP packet is sent with the SYN flag set to indicate starting a (client to server) connection.
2. A TCP packet is sent back with SYN flag set to indicate starting a (server to client) connection, and the first packet is acknowledged.
3. This is where the variation is, normally an acknowledgement would be sent of the second packet to establish a two way connection, but in order to stop the communication a packet is sent with the RST (reset) flag set.

However, this is if a process is listening on the port. This is how that looks like when there is no process listening on port 1521:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.vsamredirector > db01.ncube-lm: Flags [S], seq 1975471906, win 512, length 0
00:00:00.001118 IP db01.ncube-lm > oid1.vsamredirector: Flags [R.], seq 0, ack 1975471907, win 0, length 0

This means that if a connection is initiated to a port on which no process is listening (port status ‘closed’), there is communication between the client and the server. This is why firewalls were invented!
1. A TCP packet is sent with the SYN flag set to indicate starting a connection.
2. A TCP packet is sent back to with the RST (reset) flag set to indicate no connection is possible.

The third option, when port 1521 is firewalled on the server, simply means only the first packet (from client to server with the SYN flag set) is sent and no response is coming back.

Okay, let’s pick up the performance aspect again. This hping command:

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip
HPING db01-vip (eth0 x.x.x.x): S set, 40 headers + 0 data bytes
len=44 ip=db01-vip ttl=57 DF id=0 sport=1521 flags=SA seq=0 win=14600 rtt=1.2 ms

Says the roundtrip time is 1.2ms. If we look at the network packets and timing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.mmcal > db01-vip.ncube-lm: Flags [S], seq 1289836562, win 512, length 0
00:00:00.001113 IP db01-vip.ncube-lm > oid1.mmcal: Flags [S.], seq 2504750542, ack 1289836563, win 14600, options [mss 1460], length 0
00:00:00.000016 IP oid1.mmcal > db01-vip.ncube-lm: Flags [R], seq 1289836563, win 0, length 0

It becomes apparent that the 1.2ms time hping reports is the time it takes for the remote server to send back the SYN+ACK package in the TCP three way handshake.

So does that mean that if we take a number of measurements (let’s say 100, or 1000) to have a statistically significant number of measurements we can establish my TCP roundtrip time and then know how fast my connection will be (outside of all the other variables inherent to the internet and potential noisy neighbours to name a few)?

Oracle provides a way to generate and measure SQL-Net traffic in My Oracle Support note: Measuring Network Capacity using oratcptest (Doc ID 2064368.1). This note provides a jar file which contains server and client software, and is aimed at dataguard, but is useful to measure SQL-Net network latency. I have looked at the packets oratcptest generates, and they mimic SQL-Net quite well.

Let’s see if we can redo the test above to measure pure network latency. First on the database server side, setup the server:

[user@db01m ~]$ java -jar oratcptest.jar -server db01 -port=1521

And then on the client side run the client using the same oratcptest jar file:

java -jar oratcptest.jar db01 -mode=sync -length=0 -duration=1s -interval=1s -port=1521

The important bits are -mode=sync (client packet must be acknowledged before sending another packet) and -length=0 (network traffic contains no payload). This is the result:

[Requesting a test]
	Message payload        = 0 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(07:34:42) The server is ready.
                        Throughput                 Latency
(07:34:43)          0.017 Mbytes/s                0.670 ms
(07:34:43) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.017 Mbytes/s
	             Avg. latency = 0.670 ms

If you look at the hping roundtrip time (1.2ms) and the oratcptest roundtrip time (0.7ms) clearly this is different! If you just look at the numbers (1.2 versus 0.7) it might seem like the oratcptest time is only measuring client to server traffic instead of the whole roundtrip? For this too it’s good to use tcpdump once again and look what oratcptest actually is doing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.63602 > db01.ncube-lm: Flags [S], seq 2408800085, win 17920, options [mss 8960,sackOK,TS val 3861246405 ecr 0,nop,wscale 7], length 0
00:00:00.001160 IP db01.ncube-lm > oid1.63602: Flags [S.], seq 2178995555, ack 2408800086, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:00:00.000015 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 1, win 140, length 0
00:00:00.023175 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 1:145, ack 1, win 140, length 144
00:00:00.000520 IP db01.ncube-lm > oid1.63602: Flags [.], ack 145, win 123, length 0
00:00:00.000951 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 1:145, ack 145, win 123, length 144
00:00:00.000008 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 145, win 149, length 0
00:00:00.018839 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 145:157, ack 145, win 149, length 12
00:00:00.000563 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 145:149, ack 157, win 123, length 4
00:00:00.000358 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 157:169, ack 149, win 149, length 12
00:00:00.000486 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 149:153, ack 169, win 123, length 4
00:00:00.000100 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 169:181, ack 153, win 149, length 12
00:00:00.000494 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 153:157, ack 181, win 123, length 4
...
00:00:00.000192 IP oid1.63586 > db01.ncube-lm: Flags [P.], seq 18181:18193, ack 6157, win 149, length 12
00:00:00.000447 IP db01.ncube-lm > oid1.63586: Flags [P.], seq 6157:6161, ack 18193, win 123, length 4
00:00:00.006696 IP oid1.63586 > db01.ncube-lm: Flags [F.], seq 18193, ack 6161, win 149, length 0
00:00:00.000995 IP db01.ncube-lm > oid1.63586: Flags [F.], seq 6161, ack 18194, win 123, length 0
00:00:00.000012 IP oid1.63586 > db01.ncube-lm: Flags [.], ack 6162, win 149, length 0

If you look at rows 4, 5 and 6 you see the typical TCP three-way handshake. What is nice to see, is that the actual response or roundtrip time for the packet from the server on line 5 actually took 1.1ms, which is what we have measured with hping! At lines 7-10 we see there is a packet send from the client to the server which is ACK’ed and a packet send from the server to the client which is ACK’ed. If you add the ‘-A’ flag to tcpdump you can get the values in the packet printed as characters, which shows the client telling the server how it wants to perform the test and the server responding with the requested settings. This is all a preparation for the test.

Starting from line 11, there is a strict repeating sequence of the client sending a packet of length 12, ACK’ing the previous received packet, and then the server responding with a packet of length 4 ACK’ing its previous received packet. This is the actual performance test! This means that the setting ‘-duration=1s -interval=1s’ does not mean it sends one packet, it actually means it’s continuously sending packets for the duration of 1 second. Also another flag is showing: the P or PSH (push) flag. This flag means the kernel/tcpip-stack understands all data to transmit is provided from ‘userland’, and now must be sent immediately, and instructs the receiving side to process it immediately in order to bring it to the receiving userland application as soon as possible too.

Lines 20-22 show how the connection is closed by sending a packet with a FIN flag, which is done for both the client to the server and the server to the client, and because it’s TCP, these need to be ACK’ed, which is why you see a trailing packet without a flag set, only ACK’ing the FIN packet.

The conclusion so far is that for real usable latency calculations you should not use a different protocol (so whilst ICMP (ping) does give an latency indication it should really only be used as an indicator), and that you should measure doing the actual work, not meta-transactions like the TCP three way handshake. Probably because of the PSH flag, the actual minimal latency for SQL-Net traffic is lower than ping and hping showed.

Wait a minute…did you notice the ‘actual minimal latency’? So far we only have been sending empty packets, which means we measured how fast a packet can travel from client to server and back. In reality, you probably want to send actual data back and forth, don’t you? That is something that we actually have not measured yet!

Let’s do actual Oracle transactions. For the sake of testing network latency, we can use Swingbench to execute SQL. This is how that is done:

[user@oid1 bin]$ cd ~/sw/swingbench/bin
[user@oid1 bin]$ ./charbench -c ../configs/stresstest.xml -u soe -p soe -uc 1 -rt 00:01
Author  :	 Dominic Giles
Version :	 2.5.0.971

Results will be written to results.xml.
Hit Return to Terminate Run...

Time		Users	TPM	TPS

8:22:56 AM      1       14450   775

Please mind I am using 1 user (-uc 1) and a testing time of 1 minute (-rt 00:01), which should be longer when you are doing real testing. As a reminder, I am using 1 session because I want to understand the latency, not the bandwidth! In order to understand if the network traffic looks the same as oratcptest.jar, I can use tcpdump once again. Here is a snippet of the traffic:

...
00:00:00.000106 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5839:5852, ack 5986, win 272, length 13
00:00:00.000491 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 5986:6001, ack 5852, win 330, length 15
00:00:00.000234 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5852:6003, ack 6001, win 272, length 151
00:00:00.000562 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6001:6077, ack 6003, win 330, length 76
00:00:00.000098 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6003:6016, ack 6077, win 272, length 13
00:00:00.000484 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6077:6092, ack 6016, win 330, length 15
00:00:00.000238 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6016:6159, ack 6092, win 272, length 143
00:00:00.000591 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6092:6425, ack 6159, win 330, length 333
...

The important bit is this shows the same single packet traffic client to server and back as we saw oratcptest generated, however now with varying packet size (which is logical, different SQL statements are sent to the database), the PSH bit is set, which also is the same as oratcptest generated.

Let’s assume this is a real-life workload. In order to measure and calculate differences in performance between different networks, we need the average packet length. This can be done with a tool called tcpstat (this link provides the EL6 version). In my case I have only one application using a database on this server, so I can just filter on port 1521 to measure my SQL-Net traffic:

[user@oid1 ~]$ sudo tcpstat -i eth0 -o "Packet/s=%p\tmin size: %m\tavg size: %a\tmax size: %M\tstddev: %d\n" -f 'port 1521'
Packet/s=2526.40	min size: 53	avg size: 227.76	max size: 1436	stddev: 289.21
Packet/s=2531.40	min size: 53	avg size: 229.79	max size: 1432	stddev: 291.22
Packet/s=2634.20	min size: 53	avg size: 229.59	max size: 1432	stddev: 293.38
Packet/s=2550.00	min size: 53	avg size: 234.11	max size: 1435	stddev: 296.77
Packet/s=2486.80	min size: 53	avg size: 232.24	max size: 1436	stddev: 293.16

In case you wondered why tcpstat reports a minimum length of 53 and tcpdump (a little up in the article) of 13; tcpstat reports full packet length including packet, protocol and frame headers, tcpdump in this case reports the payload length.

Now we can execute oratcptest.jar again, but with a payload size set that matches the average size that we measured, I have taken 250 as payload size:

[user@oid1 ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(09:39:47) The server is ready.
                        Throughput                 Latency
(09:39:48)          0.365 Mbytes/s                0.685 ms
(09:39:48) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.365 Mbytes/s
	             Avg. latency = 0.685 ms

As you can see, there is a real modest increase in average latency going from 0.670ms to 0.685ms.

In order to test the impact of network latency let’s move the oratcptest client to the server, to get the lowest possible latency. Actually, this is very easy, because the oratcptest.jar file contains both the client and the server, so all I need to do is logon to the server where I started the oratcptest.jar file in server mode, and run it in client mode:

[user@db01m ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(14:49:29) The server is ready.
                        Throughput                 Latency
(14:49:30)         12.221 Mbytes/s                0.020 ms
(14:49:30) Test finished.
	       Socket send buffer = 26010 bytes
	          Avg. throughput = 11.970 Mbytes/s
	             Avg. latency = 0.021 ms

Wow! The roundtrip latency dropped from 0.685ms to 0.021ms! Another test using oratcptest.jar using a true local network connection (with Linux being virtualised using Xen/OVM) shows a latency of 0.161ms.

These are the different network latency figures measured with oratcptest using a payload size that equals my average network payload size:
– Local only RTT: 0.021
– Local network RTT: 0.161
– Different networks RTT: 0.685

If I take swingbench and execute the ‘stresstest’ run local, on a machine directly connected via the local network and across different networks (think cloud), and now measure TPS (transactions per second), I get the following figures:
– Local only TPS: 2356
– Local network TPS: 1567
– Different networks TPS: 854

Do these figures make sense?
– Local only: Time not in network transit per second: 1000-(0.021*2356)=950.524; approximate average time spend on query: 950.523/2356=0.40ms
– Local network: 1000-(0.161*1567)=747.713/1567=0.48ms
– Different networks: 1000-(0.685*854)=415.010/854=0.49ms
It seems that this swingbench test spends roughly 0.40-0.50ms on processing, the difference in transactions per second seem to be mainly caused by the difference in network latency.

Every now and then I am asked about the availability of the presentations I have delivered. Recently somebody asked about a presentation I delivered at the OUG Scotland about multiblock reads, and I promised to make it available. I’ve now uploaded a PDF version of all my old presentations them and put them in the ‘Whitepapers and presentation’ section.

Enjoy!

In the first post of the hardening serie, I described how to scan a website with Nikto. Another scanner, which not only scans HTTP servers, but does FTP, SMB/CIFS, telnet, ssh, Oracle listener and operating systems too is Nessus. Let me say right away: Nessus was started as an open source effort, but has been made closed source some time ago. A fork of the open source version is called OpenVAS.

Nessus is free for personal and non-commercial use. This means you can download it and use it to look how it works. If you want to scan your corporate webserver or websites, or use it as a tool for doing consultancy, you must purchase a ‘ProfessionalFeed’.

Nessus is a very advanced scanner, with a repository of checks (called plugins), which is kept up-to-date by the company which provides nessus, called ‘Tenable Network Security’. The updates can be downloaded manually (via a script), but are updated automatically every 24 hours by default.

These are the plugin families (from the nessus website):

  • CGI abuses – This plugin family checks for anything that is ‘CGI’ related, unless it is XSS (and only a XSS vulnerability), in which case it falls into the “CGI abuses : XSS” family. These checks use a combination of detection techniques, including checking version of the application and testing for the actual vulnerability. The attacks include software detection, information disclosure, XSS, SQLi, LFI, RFI, overflows and more.
  • CGI abuses : XSS – Specific CGI checks for reflective and persistent XSS vulnerabilities in common web applications.
  • Database – Typically a web server will run a database that is used by various web applications.
  • FTP – Web pages need to be updated, and FTP is a popular protocol used to allow your web developers to send files to the server.
  • Gain a Shell Remotely – If you can obtain a shell on the remote web server, testing the application is somewhat moot.
  • Gain root remotely – Same thing as above, if you gain root, resolve this problem before the application is tested.
  • General – Contains the operating system fingerprinting plugins, including ones that will identify the OS over HTTP. Identifying the underlying operating system is very important for web application testing, as it will determine the syntax of commands sent via injection (command and SQL) attacks.
  • Remote file access- Includes checks for specific web server/application vulnerabilities that lead to remote file disclosure.
  • Service detection – Contains checks for several different services, including detecting Apache running HTTPS, HTTP CONNECT proxy settings and other services that may host web applications.
  • Web servers – Plugins in this family detect approximately 300 specific vulnerabilities in popular web servers, such as Apache, IIS and generic vulnerabilities associated with the HTTP protocol itself.

Also, operating system updates are checked if Nessus is able to log on (either by credentials specified or if it found a username/password combination from an internal list: for example username ‘oracle’ with password ‘oracle’ is tested).

With all these tests it’s still very important to note that nothing beats proper design and validation of applications. Nessus (and most vulnerability checking software I know) checks for known issues. Customizations and self build applications are probably not thoroughly checked for vulnerabilities. This means Nessus could easily lead to a false sense of safety.

The Oracle HTTP Server (OHS) is a version of the Apache HTTP daemon, modified by Oracle to work with it’s Application Server suite. The Application Server suite is called ‘Fusion Middleware’ with version 11. The OHS acts/can act as a frontend to application servers like OC4J and Weblogic. In my opinion an application server should always have an http server in front of it, to act as:
– Logger
– SSL offloader
– Firewall
In my opinion it would also be very good to let the OHS be a frontend for APEX.

The OHS is only a decent firewall if you made sure it can only do what you have intended it to do. Sadly, this isn’t the case by default. Whilst the most recent (11.1) version is quite safe (stripped of functionality), it only is because it’s better than its predecessors, which where quite ‘open’.

Hardening is essentially done the same way as firewalling is: disable everything (disable all functionality) and enable only the needed functionality as limited as possible.

A decent way of checking enabled functionality, is using Nikto 2. Nikto is a web server scanner.

Here is an example:

$ ./nikto.pl -host example.com
- Nikto v2.1.1
---------------------------------------------------------------------------
+ Target IP: xxx.xxx.xxx.xxx
+ Target Hostname: example.com
+ Target Port: 80
+ Start Time: 2010-05-26 10:33:35
---------------------------------------------------------------------------
+ Server: Oracle-Application-Server-10g/10.1.2.0.2 Oracle-HTTP-Server OracleAS-Web-Cache-10g/10.1.2.0.2 (G;max-age=0+0;age=0;ecid=201564454590,0)
+ Uncommon header 'tcn' found, with contents: choice
+ ETag header found on server, inode: 3425418, size: 20042, mtime: 0x43e7685d;493cf3fe
+ Number of sections in the version string differ from those in the database, the server reports: oracle-application-server-10g/10.1.2.0.2oracle-http-serveroracleas-web-cache-10g/10.1.2.0.2(g;max-age=0+0;age=0;ecid=201564454590,0) while the database has: 10.1.3.1.0. This may cause false positives.
+ Oracle-Application-Server-10g/10.1.2.0.2Oracle-HTTP-ServerOracleAS-Web-Cache-10g/10.1.2.0.2(G;max-age=0+0;age=0;ecid=201564454590,0) appears to be outdated (current is at least 10.1.3.1.0)
+ Allowed HTTP Methods: GET, HEAD, OPTIONS, TRACE, POST, PUT, DELETE, CONNECT, PATCH, PROPFIND, PROPPATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK
+ OSVDB-397: HTTP method ('Allow' Header): 'PUT' method could allow clients to save files on the web server.
+ OSVDB-5646: HTTP method ('Allow' Header): 'DELETE' may allow clients to remove files on the web server.
+ HTTP method ('Allow' Header): 'CONNECT' may allow server to proxy client requests.
+ HTTP method ('Allow' Header): 'PROPFIND' may indicate DAV/WebDAV is installed. This may be used to get directory listings if indexing is allow but a default page exists.
+ HTTP method ('Allow' Header): 'PROPPATCH' indicates WebDAV is installed.
+ OSVDB-5647: HTTP method ('Allow' Header): 'MOVE' may allow clients to change file locations on the web server.
+ OSVDB-877: HTTP TRACE method is active, suggesting the host is vulnerable to XST
+ OSVDB-27487: Apache is vulnerable to XSS via the Expect header
+ OSVDB-700: /fcgi-bin/echo?foo=alert('Vulnerable'): Fast-CGI has two default CGI programs (echo.exe/echo2.exe) vulnerable to Cross Site Scripting (XSS). http://www.cert.org/advisories/CA-2000-02.html.
+ OSVDB-3954: /fcgi-bin/echo2?foo=alert('Vulnerable'): Fast-CGI has two default CGI programs (echo.exe/echo2.exe) vulnerable to Cross Site Scripting (XSS). http://www.cert.org/advisories/CA-2000-02.html.
+ OSVDB-561: /server-status: This reveals Apache information. Comment out appropriate line in httpd.conf or restrict access to allowed hosts.
+ OSVDB-3233: /index.html.de: Apache default foreign language file found. All default files should be removed from the web server as they may give an attacker additional system information.
+ OSVDB-3233: /index.html.en: Apache default foreign language file found. All default files should be removed from the web server as they may give an attacker additional system information.
+ OSVDB-3233: /index.html.es: Apache default foreign language file found. All default files should be removed from the web server as they may give an attacker additional system information.
+ OSVDB-3233: /index.html.fr: Apache default foreign language file found. All default files should be removed from the web server as they may give an attacker additional system information.
+ OSVDB-3233: /index.html.it: Apache default foreign language file found. All default files should be removed from the web server as they may give an attacker additional system information.
+ OSVDB-3092: /fcgi-bin/echo: The FastCGI echo program may reveal system info or lead to other attacks.
+ OSVDB-3092: /fcgi-bin/echo2: The FastCGI echo2 program may reveal system info or lead to other attacks.
+ OSVDB-3233: /j2ee/: j2ee directory found--possibly an Oracle app server directory.
+ OSVDB-3233: /icons/README: Apache default file found.
+ 3823 items checked: 25 item(s) reported on remote host
+ End Time: 2010-05-26 10:44:18 (643 seconds)
---------------------------------------------------------------------------
+ 1 host(s) tested

There are all kinds of things to see here (some highlights):
-The server describes into detail what software is used. Not a problem on itself, but it gives very much information which can perfectly be used to attack the server, or which can be used to determine if the server is vulnerable:

+ Server: Oracle-Application-Server-10g/10.1.2.0.2 Oracle-HTTP-Server OracleAS-Web-Cache-10g/10.1.2.0.2

-Nikto tells there are newer versions of the OHS:

+ Oracle-Application-Server-10g/10.1.2.0.2Oracle-HTTP-ServerOracleAS-Web-Cache-10g/10.1.2.0.2(G;max-age=0+0;age=0;ecid=201564454590,0) appears to be outdated (current is at least 10.1.3.1.0)

-Nikto has determined this host is vulnerable to Cross Site Scripting (XSS):

+ OSVDB-27487: Apache is vulnerable to XSS via the Expect header

-Nikto found some files which gives away even more detailed information:

+ OSVDB-3092: /fcgi-bin/echo: The FastCGI echo program may reveal system info or lead to other attacks.

This is very much and detailed information about the configuration of this server.

If you’ve ever wondered about IPv6, and if you should bother, listen to this youtube video; it will tell you all about it:

I hope my provider (ziggo) is listening too, and finally get some idea about IPv6. Up to this time they seem to think it’s a chinese takeway dinner number.

Thanks to madunix for mentioning!

%d bloggers like this: