Archive

Tag Archives: Networking

I was asked some time ago what the Oracle database event ‘TCP socket (KGAS)’ means. This blogpost is a deep dive into what this event times in Oracle database 12.1.0.2.180717.

This event is not normally seen, only when TCP connections are initiated from the database using packages like UTL_TCP, UTL_SMTP and the one used in this article, UTL_HTTP.

A very basic explanation is this event times the time that a database foreground session spends on TCP connection management and communicating over TCP, excluding client and database link (sqlnet) networking. If you trace the system calls, you see that mostly that is working with a (network) socket. Part of the code in the oracle database that is managing that, sits in the kernel code layer kgas, kernel generic (of which I am quite sure, and then my guess:) asynchronous services, which explains the naming of the event.

This is what the Oracle online manual (https://docs.oracle.com/database/121/REFRN/GUID-203ACA60-9912-4493-9B79-EA4CDE89A78D.htm#REFRN00642 – Oracle is notorious for breaking links) says about ‘TCP socket (KGAS)’:

C.3.157 TCP Socket (KGAS)
A session is waiting for an external host to provide requested data over a network socket. The time that this wait event tracks does not indicate a problem, and even a long wait time is not a reason to contact Oracle Support. It naturally takes time for data to flow between hosts over a network, and for the remote aspect of an application to process any request made to it. An application that communicates with a remote host must wait until the data it will read has arrived. In addition, on Microsoft Windows, a separate thread monitors the arrival of traffic. This thread spends most of its life in waits tracked by the TCP Socket (KGAS) wait event.

Wait Time: The total elapsed time for the network connection to be established or for data to arrive from over the network

Parameter Description
P0 For Oracle internal use only. Values 8, 9, and 10 occur within the special thread present on Microsoft Windows; other P0 values occur in normal user sessions.

P1 For Oracle internal use only

Overall, the basic explanation that Oracle provides is mostly correct. I think the text saying to not contact Oracle support is not relevant, but maybe there is a need to relieve Oracle support. In my tests, I found that the full TCP connection lifecycle (creation, usage and removal) is put under this event, for which the text seems to emphasise on waiting for a remote host, which would be the most logical culprit for wait times, but other issues could lead to wait times additionally. This means the wait event itself is not explaining what it is showing, outside of TCP connection management.

The wait time explanation is nearly complete. If it would say something like ‘all TCP connection management and usage’ it would have fully covered it, it now excludes disconnecting and sending, because it explicitly mentions creating the connecting and receiving (waiting for data).

I do not understand what is meant with P0 and P1. I think it is p1 and p2 of the wait event, but naming it P0 and P1 is weird. When looking at the explanation it reads to me ‘we do not wish to explain anything to you’.

So, that means I am going to find this out myself….

If you are interested in this, or do want to write articles like this too, I urge you to replay this on your own system.

First of all, create a small setup which you can use to actually execute UTL_HTTP. The example essentially is taken from father of code examples, Tim Hall/Oracle base. Thank you Tim!
First setup the database to allow a user (in my case ‘ts’) to create the objects and use the network:

grant create sequence to ts;
grant create procedure to ts;
grant create table to ts;
grant alter session to ts;
begin
  dbms_network_acl_admin.create_acl (
    acl => 'anything.xml',
    description => 'allow anything',
    principal => 'TS',
    is_grant => true,
    privilege => 'connect'
  );
end;
begin
  dbms_network_acl_admin.assign_acl (
    acl => 'anything.xml',
    host => '*'
  );
end;

Then connect as the actual user (ts), and create the objects and the procedure that uses UTL_HTTP:

drop table http_clob_test;
create table http_clob_test (
        id number(10),
        url varchar2(255),
        data clob,
        constraint http_clob_test_pk primary key (id)
);
drop sequence http_clob_test_seq;
create sequence http_clob_test_seq;
CREATE OR REPLACE PROCEDURE load_html_from_url (p_url  IN  VARCHAR2) AS
  -- examples by tim hall
  -- https://oracle-base.com/articles/misc/retrieving-html-and-binaries-into-tables-over-http
  l_http_request   UTL_HTTP.req;
  l_http_response  UTL_HTTP.resp;
  l_clob           CLOB;
  l_text           VARCHAR2(32767);
BEGIN
  DBMS_LOB.createtemporary(l_clob, FALSE);

  -- Make a HTTP request and get the response.
  l_http_request  := UTL_HTTP.begin_request(p_url);
  l_http_response := UTL_HTTP.get_response(l_http_request);

  -- Copy the response into the CLOB.
  BEGIN
    LOOP
      UTL_HTTP.read_text(l_http_response, l_text, 32766);
      DBMS_LOB.writeappend (l_clob, LENGTH(l_text), l_text);
    END LOOP;
  EXCEPTION
    WHEN UTL_HTTP.end_of_body THEN
      UTL_HTTP.end_response(l_http_response);
  END;

  -- Insert the data into the table.
  INSERT INTO http_clob_test (id, url, data)
  VALUES (http_clob_test_seq.NEXTVAL, p_url, l_clob);

  -- Relase the resources associated with the temporary LOB.
  DBMS_LOB.freetemporary(l_clob);
EXCEPTION
  WHEN OTHERS THEN
    UTL_HTTP.end_response(l_http_response);
    DBMS_LOB.freetemporary(l_clob);
    RAISE;
END load_html_from_url;
/

The first thing to do is turn on sql_trace at level 8 to include waits:

set serverout on
alter session set events 'sql_trace level 8';
exec load_html_from_url('http://orafun.info/');
alter session set events 'sql_trace off';

If you look at the relevant piece, which means where it shows the wait events, it looks like this:

WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 128265  =2  =0  =0 obj#=662 tim=86395107497
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 395  =5  =0  =0 obj#=662 tim=86395110191
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 150  =6  =0  =0 obj#=662 tim=86395111115
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 131998  =6  =0  =0 obj#=662 tim=86395243764
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 269  =4  =0  =0 obj#=662 tim=86395245182
WAIT #139864521752120: nam='direct path write temp' ela= 4137 file number=201 first dba=257795 block cnt=1 obj#=662 tim=86395250494
WAIT #139864521752120: nam='TCP Socket (KGAS)' ela= 352  =3  =2  =0 obj#=662 tim=86395251294

What is shown here is some quite spectacular differences in elapsed time. Also, the only way to understand what is actually done flagged as ‘TCP Socket (KGAS)’ is the value following ‘ela’, which is the event p1 value.
The pattern is:

- 2
- 5
- 6
- 6
- 4
- 3

It’s relatively simple to guess what a few of these are:

- 2 - connect
- 5 - send
- 6 - \
- 6 - |   receiving ?
- 4 - /
- 3 - close

But if you include the timing, there must be more into play:

- 2 - ela= 128265   connect
- 5 - ela= 395      send
- 6 - ela= 150      \
- 6 - ela= 131998   |   receiving ?
- 4 - ela= 14       /
- 3 - ela= 177     close

2/connect: In order to build up a connection, a tcp connection needs to be created and established. That takes some time.
5/send: Sending from the perspective of a userland process is writing into a socket, which will get send by the operating system independently. This means sending from a userland process normally takes relative little time, because it’s not waiting for actually sending it.
6,4/receive: At this time, this doesn’t make sense to me.
3/close: Closing for a userland process is a simple, swift task. The operating system will keep the port open for some time, etc. but this is not visible for the user land application.

Let’s pull an old trick out of the hat: use strace (system call tracing) with an emphasis on writing on an oracle session that has SQL trace with waits enabled set. This will show the system calls executed, and show exactly when the oracle engine ends a wait, so we can reasonably well establish a relation between wait events and system calls. I say “reasonably well”, because we can’t see when Oracle truly started timing the wait event (kslwtbctx), only the output to trace file as part of ending the wait event (kslwtectx).

The way I done it, is using the command ‘strace -e write=all -p 18513 -o utl_http_test.txt’. Obviously 18513 is the process ID of the database foreground process. The results of the strace are in utl_http_test.txt.

Now open utl_http_test.txt and search for KGAS. The full output is way too much text, let me show some of the output which I think is noteworthy. Again: this is selective, partial output.
I do maintain the order in which the calls are visible.

1. TCP Socket (KGAS) p1=2, earlier annotated as ‘connect’

-- try to find a socket that has been created by nscd (name server caching deamon)
-- two times?
--
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9
connect(9, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(9)                                = 0
--
-- obtain file status of resolv.conf (hostname resolving configuration file)
--
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
--
-- open and read host.conf (another hostname resolving configuration file)
--
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "multi on\n", 4096)             = 9
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open and read resolv.conf (hostname resolving configuration)
--
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=93, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "# Generated by NetworkManager\nse"..., 4096) = 93
read(9, "", 4096)                       = 0
close(9)                                = 0
--
-- open /etc/hosts (ip address to hostname mapping locally)
--
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 9
fstat(9, {st_mode=S_IFREG|0644, st_size=200, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34bf377000
read(9, "127.0.0.1   localhost localhost."..., 4096) = 200
read(9, "", 4096)                       = 0
close(9)
--
-- at this point two dynamic loadable libraries are read: libnss_dns.so.2 and libresolv.so.2
--
-- this is the DNS lookup of orafun.info
-- again, this is done twice, just like the use of /var/run/nscd/socket above?
--
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 0)    = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "=#\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  3d 23 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  =#...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 01 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 5000)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [45])                = 0
recvfrom(9, "=#\201\200\0\1\0\1\0\0\0\0\6orafun\4info\0\0\1\0\1\300\f\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 45
close(9)                                = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, 16) = 0
poll([{fd=9, events=POLLOUT}], 1, 4971) = 1 ([{fd=9, revents=POLLOUT}])
sendto(9, "o=\1\0\0\1\0\0\0\0\0\0\6orafun\4info\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
 | 00000  6f 3d 01 00 00 01 00 00  00 00 00 00 06 6f 72 61  o=...........ora |
 | 00010  66 75 6e 04 69 6e 66 6f  00 00 1c 00 01           fun.info.....    |
poll([{fd=9, events=POLLIN}], 1, 4970)  = 1 ([{fd=9, revents=POLLIN}])
ioctl(9, FIONREAD, [109])               = 0
recvfrom(9, "o=\201\200\0\1\0\0\0\1\0\0\6orafun\4info\0\0\34\0\1\300\f\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.3")}, [16]) = 109
close(9)                                = 0
--
-- an epoll is created at file descriptor 9 (epoll: I/O event notification facility)
--
epoll_create(82)                        = 9
fcntl(9, F_SETFD, FD_CLOEXEC)           = 0
--
-- an IPV6 socket is created at file descriptor 11, 
-- bound to the IPV6 equivalent of localhost (::1),
-- destination port 0, source port 63257,
-- and is NOT connected.
--
socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 11
bind(11, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(11, {sa_family=AF_INET6, sin6_port=htons(63257), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
getpeername(11, 0x7ffdea6ba0f8, 0x7ffdea6ba1c8) = -1 ENOTCONN (Transport endpoint is not connected)
getsockopt(11, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
getsockopt(11, SOL_SOCKET, SO_RCVBUF, [262144], [4]) = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
--
-- File descriptor 11 is added to the epoll at file descriptor 9.
--
epoll_ctl(9, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=3110993336, u64=139864426020280}}) = 0
--
-- A connection is created to the true destination (orafun.info/18.218.92.122).
-- This connection gets file descriptor 12.
-- Destination port 80 (http), source port 11751.
--
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 12
fcntl(12, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(12, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("18.218.92.122")}, 16) = -1 EINPROGRESS (Operation now in progress)
times(NULL)                             = 438106227
mmap(NULL, 786432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f34b959b000
poll([{fd=12, events=POLLOUT}], 1, 60000) = 1 ([{fd=12, revents=POLLOUT}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(12, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(12, F_SETFL, O_RDWR)              = 0
getsockname(12, {sa_family=AF_INET, sin_port=htons(11751), sin_addr=inet_addr("10.0.2.15")}, [16]) = 0
getsockopt(12, SOL_SOCKET, SO_SNDBUF, [87040], [4]) = 0
getsockopt(12, SOL_SOCKET, SO_RCVBUF, [374400], [4]) = 0
setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(12, F_SETFD, FD_CLOEXEC)          = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=2
--
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 32 38 32  36 35 20 20 3d 32 20 20  la= 128265  =2   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 31 30 37 34 39 37     tim=86395107497  |

So yes, I am not sure if all of this is in the wait event, but there is a lot of stuff happening to build a connection to the remote server.

In order to find out why the lookup which is tried via the NSCD socket at the beginning, and later via DNS, is done twice, I ran the same procedure again and used tcpdump to look at the actual network traffic. This explained a lot:

# tcpdump -n host 10.0.2.3 and port 53
09:14:02.923389 IP 10.0.2.15.16819 > 10.0.2.3.domain: 15651+ A? orafun.info. (29)
09:14:02.948791 IP 10.0.2.3.domain > 10.0.2.15.16819: 15651 1/0/0 A 18.218.92.122 (45)
09:14:02.952304 IP 10.0.2.15.54590 > 10.0.2.3.domain: 28477+ AAAA? orafun.info. (29)
09:14:02.979534 IP 10.0.2.3.domain > 10.0.2.15.54590: 28477 0/1/0 (109)

In other words: first a DNS A record is requested (TCPv4 DNS name lookup), which results in the IPv4 ip address, then a DNS AAAA record is requested (TCPv6 DNS name lookup), which resulted in no ip address. In other words: orafun.info only has an IPv4 ip address. So the two lookups actually do have a function.

2. TCP Socket (KGAS) p1=5, earlier annotated as ‘send’
These are the systemcalls that are visible and quite probably related to the send wait event:

--
-- file descriptor 12 holding the connection to the destination server is added to the epoll at file descriptor 9
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- Then the http get request is sent to the destination server at its normal file descriptor, 12.
--
write(12, "GET / HTTP/1.1\r\nHost: orafun.inf"..., 56) = 56
 | 00000  47 45 54 20 2f 20 48 54  54 50 2f 31 2e 31 0d 0a  GET / HTTP/1.1.. |
 | 00010  48 6f 73 74 3a 20 6f 72  61 66 75 6e 2e 69 6e 66  Host: orafun.inf |
 | 00020  6f 0d 0a 43 6f 6e 6e 65  63 74 69 6f 6e 3a 20 63  o..Connection: c |
 | 00030  6c 6f 73 65 0d 0a 0d 0a                           lose....         |
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=5
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 39 35 20  20 3d 35 20 20 3d 30 20  la= 395  =5  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  30 31 39 31              =86395110191     |

3. TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 0, so it doesn't block.
--
epoll_wait(9, [], 82, 0)                = 0
--
-- And this is the wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 35 30 20  20 3d 36 20 20 3d 30 20  la= 150  =6  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 31 31  31 31 31 35              =86395111115     |

I have been thinking a lot about this seemingly weird call. It calls epoll_wait, but indicates it doesn’t want to wait (timeout=0), and even if epol_wait would have returned anything, indicated by a return code > 0, the epoll_event pointer is not set (indicated by []). The epoll file descriptor is used, but the only working file descriptor in the epoll is file descriptor 12, which has just been sent a http GET command, so the functionality of epoll is used.

This doesn’t make sense, unless you think about the asynchronous IO implementation of Oracle (see a lot of my earlier investigations), for which (in the case of asynchronous IO) io_getevents was called in a similar matter, timeout set to 0, to be able to do more requests while earlier IO requests are executed by the kernel. So my current theory here is that if multiple requests are happening, this mechanism provides a way to handle them.

If you have a simple single request, like in my case, this systemcall seems redundant. And because it queries the epoll file descriptor right after the request, it returns zero events, because there hardly has been any time after sending the http GET request.

4. Second TCP Socket (KGAS) p1=6, earlier annotated as ‘receive’

--
-- Calling epoll_wait with timeout set to 30000 (milliseconds).
--
epoll_wait(9, [{EPOLLIN, {u32=3110998864, u64=139864426025808}}], 82, 30000) = 1
--
-- And this is the second wait event written by the process: TCP Socket (KGAS), p1=6
-- 
write(7, "WAIT #139864521752120: nam='TCP "..., 95) = 95
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 31 33 31 39  39 38 20 20 3d 36 20 20  la= 131998  =6   |
 | 00040  3d 30 20 20 3d 30 20 6f  62 6a 23 3d 36 36 32 20  =0  =0 obj#=662  |
 | 00050  74 69 6d 3d 38 36 33 39  35 32 34 33 37 36 34     tim=86395243764  |

This is the second time epoll_wait is called, and this one is blocking, because timeout has been set to 30000 milliseconds. If you look at the ela time, this took some time, and this now makes perfect sense: this system calls waits for an event to become available in the epoll, so it waits for the response of the remote http server. Please mind this call just notifies the userland process that the response is ready, the received data yet has to be read.

5. TCP Socket (KGAS) p1=4, earlier annotated as ‘receive’

--
-- At this point we know there is a response. First the original file descriptor is removed from the epoll:
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6b9710) = 0
--
-- The the response is read from file descriptor 12:
--
read(12, "HTTP/1.1 200 OK\r\nServer: nginx/1"..., 4096) = 2687
--
-- Then file descriptor 12 is added to the epoll again.
--
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=3110998864, u64=139864426025808}}) = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=4
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 32 36 39 20  20 3d 34 20 20 3d 30 20  la= 269  =4  =0  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 34  35 31 38 32              =86395245182     |

So, what p1 set to 4 actually means, is that once the connection did return data, which is checked using epoll, and visible with p1 set to 6, it is read into the process. This is also the reason this takes very little time, this is the time to read data from kernelspace to user space, and to manage the connection’s file descriptor. It is taken off the epoll in order not to disturb it, and it is added again because there could be another request.

6. TCP Socket (KGAS) p1=3, earlier annotated as ‘close’

--
-- file descriptor 12 removed from the epoll
--
epoll_ctl(9, EPOLL_CTL_DEL, 12, 0x7ffdea6bac20) = 0
--
-- file descriptor 12 is closed, closing the network connection
--
close(12)                               = 0
--
-- And a wait event written by the process: TCP Socket (KGAS), p1=3
--
write(7, "WAIT #139864521752120: nam='TCP "..., 92) = 92
 | 00000  57 41 49 54 20 23 31 33  39 38 36 34 35 32 31 37  WAIT #1398645217 |
 | 00010  35 32 31 32 30 3a 20 6e  61 6d 3d 27 54 43 50 20  52120: nam='TCP  |
 | 00020  53 6f 63 6b 65 74 20 28  4b 47 41 53 29 27 20 65  Socket (KGAS)' e |
 | 00030  6c 61 3d 20 33 35 32 20  20 3d 33 20 20 3d 32 20  la= 352  =3  =2  |
 | 00040  20 3d 30 20 6f 62 6a 23  3d 36 36 32 20 74 69 6d   =0 obj#=662 tim |
 | 00050  3d 38 36 33 39 35 32 35  31 32 39 34              =86395251294     |

I don’t think this part holds any surprises. The network file descriptor is first removed from the epoll, and then it is closed, ending the TCP connection that was setup to perform a http request (in my case, I didn’t test, but I believe you will see the same with for example a SMTP connection, or any other type of TCP connection).

Summary

The basic message of this article is not surprising, nor does it conflict with current knowledge. Whenever you see a wait event ‘TCP Socket (KGAS)’, it means a foreground process is performing TCP networking via PLSQL. This wait event is a single event for creating, sending, receiving and closing a connection.

The true information of this article is how you can use the p1 value of the event to learn what actually the foreground is doing, and thus should give you more information to troubleshoot in the case of long waiting times.

TCP Socket (KGAS) p1 values:
1 - ?
2 - perform DNS lookup and create connection
3 - close connection
4 - copy TCP response into process space
5 - send request
6 - wait for TCP response to become available
7 - ?
8 - ? \
9 - ? | According to documentation, windows only in a 'special thread'.
10- ? /
Advertisements

To me, ‘cloud computing’ is renting a compute resource to perform a task. In order to use that compute resource, you need to instruct it to do something, which is typically done via the network. If the task the compute resource needs to fulfil is being an application server or being a client or both in the case of an application server that uses an Oracle database, the network latency between the client of the database and the database server is a critical property.

I think so far everybody is with me. If we zoom in to the network, it becomes more difficult, and *very* easy to make wrong assumptions. Let me explain. A network, but really any connection between processing and a resource, has two DIFFERENT properties that I see getting mixed up consistently. These are:
* Latency: the time it takes for a signal or (network) packet to travel from the client to the server, or the time it takes to travel from the client to the server and back.
* Bandwidth: the amount of data that can be transported from the client to the server in a certain time.

How do you determine the latency of a network? Probably the most people respond with ‘use ping’. This is how that looks like:

[user@oid1 ~]$ ping -c 3 lsh1
PING lsh1 (x.x.x.x) 56(84) bytes of data.
64 bytes from lsh1: icmp_seq=1 ttl=62 time=680 ms
64 bytes from lsh1: icmp_seq=2 ttl=62 time=0.304 ms
64 bytes from lsh1: icmp_seq=3 ttl=62 time=0.286 ms

The question I often ask myself is: what is that we see actually? How does this work?
In order to answer that question, the tcpdump tool can answer that question. Using tcpdump, you can capture the network packets on which the ping utility based the above outcome. The ‘-ttt’ option calculates the time between each arrived packet:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host lsh1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1 > lsh1: ICMP echo request, id 35879, seq 1, length 64
00:00:00.680289 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 1, length 64
00:00:00.319614 IP oid1 > lsh1: ICMP echo request, id 35879, seq 2, length 64
00:00:00.000287 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 2, length 64
00:00:01.000180 IP oid1 > lsh1: ICMP echo request, id 35879, seq 3, length 64
00:00:00.000269 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 3, length 64

So, ping works by sending a packet (ICMP echo request) requesting a reply (ICMP echo reply) from the remote server, and measure the time it takes to get that reply. Great, quite simple, isn’t it? However, the biggest issue I see this is using a protocol that is not used for sending regular data (!). Most application servers I encounter send data using TCP (transmission control protocol), the traffic ping sends are sent using a protocol called ICMP (internet control message protocol). Especially in the cloud, which means (probably) a lot of the infrastructure is shared, ICMP might be given different priority than TCP traffic, which you quite probably are using when the application on your cloud virtual machine is running. For those of you who haven’t looked into the network side of the IT landscape, you can priorise protocols and even specific ports, throttle traffic and you can even terminate it. In fact, a sensible protected (virtual) machine in the cloud will not respond to ICMP echo requests in order to protected it from attacks.

So, what would be a more sensible approach then? A better way would be to use the same protocol and port number that your application is going to use. This can be done using a tool called hping. Using that tool, you can craft your own packet with the protocol and flags you want. In the case of Oracle database traffic that would be the TCP protocol, port 1521 (it can be any port number, 1521 is the default port). This is how you can do that. In order to mimic starting a connection, the S (SYN) flag is set (-S), one packet is send (-c 1) to port 1521 (-p 1521).

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip

What this does is best investigated with tcpdump once again. The server this is executed against can respond in two ways (three actually). When you send this to TCP port 1521 where a listener (or any other daemon that listens on that port) is listening, this is the response:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [S], seq 1436552830, win 512, length 0
00:00:00.001229 IP db01-vip.ncube-lm > oid1.kjtsiteserver: Flags [S.], seq 2397022511, ack 1436552831, win 14600, options [mss 1460], length 0
00:00:00.000023 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [R], seq 1436552831, win 0, length 0

This is a variation of the classic TCP three way handshake:
1. A TCP packet is sent with the SYN flag set to indicate starting a (client to server) connection.
2. A TCP packet is sent back with SYN flag set to indicate starting a (server to client) connection, and the first packet is acknowledged.
3. This is where the variation is, normally an acknowledgement would be sent of the second packet to establish a two way connection, but in order to stop the communication a packet is sent with the RST (reset) flag set.

However, this is if a process is listening on the port. This is how that looks like when there is no process listening on port 1521:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.vsamredirector > db01.ncube-lm: Flags [S], seq 1975471906, win 512, length 0
00:00:00.001118 IP db01.ncube-lm > oid1.vsamredirector: Flags [R.], seq 0, ack 1975471907, win 0, length 0

This means that if a connection is initiated to a port on which no process is listening (port status ‘closed’), there is communication between the client and the server. This is why firewalls were invented!
1. A TCP packet is sent with the SYN flag set to indicate starting a connection.
2. A TCP packet is sent back to with the RST (reset) flag set to indicate no connection is possible.

The third option, when port 1521 is firewalled on the server, simply means only the first packet (from client to server with the SYN flag set) is sent and no response is coming back.

Okay, let’s pick up the performance aspect again. This hping command:

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip
HPING db01-vip (eth0 x.x.x.x): S set, 40 headers + 0 data bytes
len=44 ip=db01-vip ttl=57 DF id=0 sport=1521 flags=SA seq=0 win=14600 rtt=1.2 ms

Says the roundtrip time is 1.2ms. If we look at the network packets and timing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.mmcal > db01-vip.ncube-lm: Flags [S], seq 1289836562, win 512, length 0
00:00:00.001113 IP db01-vip.ncube-lm > oid1.mmcal: Flags [S.], seq 2504750542, ack 1289836563, win 14600, options [mss 1460], length 0
00:00:00.000016 IP oid1.mmcal > db01-vip.ncube-lm: Flags [R], seq 1289836563, win 0, length 0

It becomes apparent that the 1.2ms time hping reports is the time it takes for the remote server to send back the SYN+ACK package in the TCP three way handshake.

So does that mean that if we take a number of measurements (let’s say 100, or 1000) to have a statistically significant number of measurements we can establish my TCP roundtrip time and then know how fast my connection will be (outside of all the other variables inherent to the internet and potential noisy neighbours to name a few)?

Oracle provides a way to generate and measure SQL-Net traffic in My Oracle Support note: Measuring Network Capacity using oratcptest (Doc ID 2064368.1). This note provides a jar file which contains server and client software, and is aimed at dataguard, but is useful to measure SQL-Net network latency. I have looked at the packets oratcptest generates, and they mimic SQL-Net quite well.

Let’s see if we can redo the test above to measure pure network latency. First on the database server side, setup the server:

[user@db01m ~]$ java -jar oratcptest.jar -server db01 -port=1521

And then on the client side run the client using the same oratcptest jar file:

java -jar oratcptest.jar db01 -mode=sync -length=0 -duration=1s -interval=1s -port=1521

The important bits are -mode=sync (client packet must be acknowledged before sending another packet) and -length=0 (network traffic contains no payload). This is the result:

[Requesting a test]
	Message payload        = 0 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(07:34:42) The server is ready.
                        Throughput                 Latency
(07:34:43)          0.017 Mbytes/s                0.670 ms
(07:34:43) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.017 Mbytes/s
	             Avg. latency = 0.670 ms

If you look at the hping roundtrip time (1.2ms) and the oratcptest roundtrip time (0.7ms) clearly this is different! If you just look at the numbers (1.2 versus 0.7) it might seem like the oratcptest time is only measuring client to server traffic instead of the whole roundtrip? For this too it’s good to use tcpdump once again and look what oratcptest actually is doing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.63602 > db01.ncube-lm: Flags [S], seq 2408800085, win 17920, options [mss 8960,sackOK,TS val 3861246405 ecr 0,nop,wscale 7], length 0
00:00:00.001160 IP db01.ncube-lm > oid1.63602: Flags [S.], seq 2178995555, ack 2408800086, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:00:00.000015 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 1, win 140, length 0
00:00:00.023175 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 1:145, ack 1, win 140, length 144
00:00:00.000520 IP db01.ncube-lm > oid1.63602: Flags [.], ack 145, win 123, length 0
00:00:00.000951 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 1:145, ack 145, win 123, length 144
00:00:00.000008 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 145, win 149, length 0
00:00:00.018839 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 145:157, ack 145, win 149, length 12
00:00:00.000563 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 145:149, ack 157, win 123, length 4
00:00:00.000358 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 157:169, ack 149, win 149, length 12
00:00:00.000486 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 149:153, ack 169, win 123, length 4
00:00:00.000100 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 169:181, ack 153, win 149, length 12
00:00:00.000494 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 153:157, ack 181, win 123, length 4
...
00:00:00.000192 IP oid1.63586 > db01.ncube-lm: Flags [P.], seq 18181:18193, ack 6157, win 149, length 12
00:00:00.000447 IP db01.ncube-lm > oid1.63586: Flags [P.], seq 6157:6161, ack 18193, win 123, length 4
00:00:00.006696 IP oid1.63586 > db01.ncube-lm: Flags [F.], seq 18193, ack 6161, win 149, length 0
00:00:00.000995 IP db01.ncube-lm > oid1.63586: Flags [F.], seq 6161, ack 18194, win 123, length 0
00:00:00.000012 IP oid1.63586 > db01.ncube-lm: Flags [.], ack 6162, win 149, length 0

If you look at rows 4, 5 and 6 you see the typical TCP three-way handshake. What is nice to see, is that the actual response or roundtrip time for the packet from the server on line 5 actually took 1.1ms, which is what we have measured with hping! At lines 7-10 we see there is a packet send from the client to the server which is ACK’ed and a packet send from the server to the client which is ACK’ed. If you add the ‘-A’ flag to tcpdump you can get the values in the packet printed as characters, which shows the client telling the server how it wants to perform the test and the server responding with the requested settings. This is all a preparation for the test.

Starting from line 11, there is a strict repeating sequence of the client sending a packet of length 12, ACK’ing the previous received packet, and then the server responding with a packet of length 4 ACK’ing its previous received packet. This is the actual performance test! This means that the setting ‘-duration=1s -interval=1s’ does not mean it sends one packet, it actually means it’s continuously sending packets for the duration of 1 second. Also another flag is showing: the P or PSH (push) flag. This flag means the kernel/tcpip-stack understands all data to transmit is provided from ‘userland’, and now must be sent immediately, and instructs the receiving side to process it immediately in order to bring it to the receiving userland application as soon as possible too.

Lines 20-22 show how the connection is closed by sending a packet with a FIN flag, which is done for both the client to the server and the server to the client, and because it’s TCP, these need to be ACK’ed, which is why you see a trailing packet without a flag set, only ACK’ing the FIN packet.

The conclusion so far is that for real usable latency calculations you should not use a different protocol (so whilst ICMP (ping) does give an latency indication it should really only be used as an indicator), and that you should measure doing the actual work, not meta-transactions like the TCP three way handshake. Probably because of the PSH flag, the actual minimal latency for SQL-Net traffic is lower than ping and hping showed.

Wait a minute…did you notice the ‘actual minimal latency’? So far we only have been sending empty packets, which means we measured how fast a packet can travel from client to server and back. In reality, you probably want to send actual data back and forth, don’t you? That is something that we actually have not measured yet!

Let’s do actual Oracle transactions. For the sake of testing network latency, we can use Swingbench to execute SQL. This is how that is done:

[user@oid1 bin]$ cd ~/sw/swingbench/bin
[user@oid1 bin]$ ./charbench -c ../configs/stresstest.xml -u soe -p soe -uc 1 -rt 00:01
Author  :	 Dominic Giles
Version :	 2.5.0.971

Results will be written to results.xml.
Hit Return to Terminate Run...

Time		Users	TPM	TPS

8:22:56 AM      1       14450   775

Please mind I am using 1 user (-uc 1) and a testing time of 1 minute (-rt 00:01), which should be longer when you are doing real testing. As a reminder, I am using 1 session because I want to understand the latency, not the bandwidth! In order to understand if the network traffic looks the same as oratcptest.jar, I can use tcpdump once again. Here is a snippet of the traffic:

...
00:00:00.000106 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5839:5852, ack 5986, win 272, length 13
00:00:00.000491 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 5986:6001, ack 5852, win 330, length 15
00:00:00.000234 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5852:6003, ack 6001, win 272, length 151
00:00:00.000562 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6001:6077, ack 6003, win 330, length 76
00:00:00.000098 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6003:6016, ack 6077, win 272, length 13
00:00:00.000484 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6077:6092, ack 6016, win 330, length 15
00:00:00.000238 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6016:6159, ack 6092, win 272, length 143
00:00:00.000591 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6092:6425, ack 6159, win 330, length 333
...

The important bit is this shows the same single packet traffic client to server and back as we saw oratcptest generated, however now with varying packet size (which is logical, different SQL statements are sent to the database), the PSH bit is set, which also is the same as oratcptest generated.

Let’s assume this is a real-life workload. In order to measure and calculate differences in performance between different networks, we need the average packet length. This can be done with a tool called tcpstat (this link provides the EL6 version). In my case I have only one application using a database on this server, so I can just filter on port 1521 to measure my SQL-Net traffic:

[user@oid1 ~]$ sudo tcpstat -i eth0 -o "Packet/s=%p\tmin size: %m\tavg size: %a\tmax size: %M\tstddev: %d\n" -f 'port 1521'
Packet/s=2526.40	min size: 53	avg size: 227.76	max size: 1436	stddev: 289.21
Packet/s=2531.40	min size: 53	avg size: 229.79	max size: 1432	stddev: 291.22
Packet/s=2634.20	min size: 53	avg size: 229.59	max size: 1432	stddev: 293.38
Packet/s=2550.00	min size: 53	avg size: 234.11	max size: 1435	stddev: 296.77
Packet/s=2486.80	min size: 53	avg size: 232.24	max size: 1436	stddev: 293.16

In case you wondered why tcpstat reports a minimum length of 53 and tcpdump (a little up in the article) of 13; tcpstat reports full packet length including packet, protocol and frame headers, tcpdump in this case reports the payload length.

Now we can execute oratcptest.jar again, but with a payload size set that matches the average size that we measured, I have taken 250 as payload size:

[user@oid1 ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(09:39:47) The server is ready.
                        Throughput                 Latency
(09:39:48)          0.365 Mbytes/s                0.685 ms
(09:39:48) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.365 Mbytes/s
	             Avg. latency = 0.685 ms

As you can see, there is a real modest increase in average latency going from 0.670ms to 0.685ms.

In order to test the impact of network latency let’s move the oratcptest client to the server, to get the lowest possible latency. Actually, this is very easy, because the oratcptest.jar file contains both the client and the server, so all I need to do is logon to the server where I started the oratcptest.jar file in server mode, and run it in client mode:

[user@db01m ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(14:49:29) The server is ready.
                        Throughput                 Latency
(14:49:30)         12.221 Mbytes/s                0.020 ms
(14:49:30) Test finished.
	       Socket send buffer = 26010 bytes
	          Avg. throughput = 11.970 Mbytes/s
	             Avg. latency = 0.021 ms

Wow! The roundtrip latency dropped from 0.685ms to 0.021ms! Another test using oratcptest.jar using a true local network connection (with Linux being virtualised using Xen/OVM) shows a latency of 0.161ms.

These are the different network latency figures measured with oratcptest using a payload size that equals my average network payload size:
– Local only RTT: 0.021
– Local network RTT: 0.161
– Different networks RTT: 0.685

If I take swingbench and execute the ‘stresstest’ run local, on a machine directly connected via the local network and across different networks (think cloud), and now measure TPS (transactions per second), I get the following figures:
– Local only TPS: 2356
– Local network TPS: 1567
– Different networks TPS: 854

Do these figures make sense?
– Local only: Time not in network transit per second: 1000-(0.021*2356)=950.524; approximate average time spend on query: 950.523/2356=0.40ms
– Local network: 1000-(0.161*1567)=747.713/1567=0.48ms
– Different networks: 1000-(0.685*854)=415.010/854=0.49ms
It seems that this swingbench test spends roughly 0.40-0.50ms on processing, the difference in transactions per second seem to be mainly caused by the difference in network latency.

Recently I was asked to analyse the security impact of the snmp daemon on a recent Exadata. This system was running Exadata image version 12.1.2.1.3. This blog article gives you an overview of a lot of the things that surround snmp and security.

First of all what packages are installed doing something with snmp? A list can be obtained the following way:

# rpm -qa | grep snmp
net-snmp-utils-5.5-54.0.1.el6_7.1.x86_64
net-snmp-libs-5.5-54.0.1.el6_7.1.x86_64
net-snmp-5.5-54.0.1.el6_7.1.x86_64
sas_snmp-14.02-0103.x86_64

Essentially the usual net-snmp packages and a package called ‘sas_snmp’.

A next important thing is how the firewall is configured. However, the default setting of the firewall on the compute nodes with exadata is the firewall turned off:

# iptables -L -v
Chain INPUT (policy ACCEPT 437M packets, 216G bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 343M packets, 748G bytes)
 pkts bytes target     prot opt in     out     source               destination

So if there is something running that listens to a network port that can benefit ‘attackers’, there is no firewall to stop them.

Next obvious question is what snmp processes are actually running:

# ps -ef |grep snmp
root       7088      1  0 Aug16 ?        00:51:32 /usr/sbin/snmpd -LS0-6d -Lf /dev/null -p /var/run/snmpd.pid
root      33443      1  0 03:14 ?        00:00:49 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf
root      33454  33443  0 03:14 ?        00:00:00 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf

The snmpd process is the net-snmp snmp daemon. However, there are two additional processes running with the name ‘snmp’ in them, one is owned by init, and a processes that this process has spawned. The name ‘lsi_mrdsnmpagent’ probably means LSI MegaRaid SNMP agent. That gives a fair hint this processes is doing something snmp related specifically for the LSI MegaRaid adapter, which is the disk controller.

Are there any open ports related to snmp processes?

# netstat -anp | grep snmp
tcp        0      0 127.0.0.1:199               0.0.0.0:*                   LISTEN      7088/snmpd
udp        0      0 0.0.0.0:161                 0.0.0.0:*                               7088/snmpd
udp        0      0 0.0.0.0:22917               0.0.0.0:*                               7088/snmpd

1. tcp port 199
This is support for the SMUX protocol (RFC 1227) to communicate with SMUX-based subagents. This is a deprecated feature in favour of AgentX. It is considered a bug (https://bugzilla.redhat.com/show_bug.cgi?id=110931) the daemon is still using this port. However the port is opened on localhost (127.0.0.1) and as such not reachable from outside of the machine, which means it is not a direct security problem.

2. udp port 161
This is the default snmpd port. This port is open to the outside world on the compute node, which can be seen from the address 0.0.0.0 in the above ‘source’ column. The port being open can be verified using another machine and the ‘nmap’ tool:

$ sudo nmap -Pn -sU -p 161 311.1.1.1
Password:

Starting Nmap 6.47 ( http://nmap.org ) at 2016-10-26 15:00 CEST
Nmap scan report for 311.1.1.1
Host is up (0.087s latency).
PORT    STATE SERVICE
161/udp open  snmp

The status ‘open’ shows this udp port does respond to requests.

3. udp port 22917 (in this case; this port number is random)
This is a random port that gets set for trapsink directive set in the /etc/snmp/snmpd.conf. A trap sink is the destination for snmp traps that get triggered. Although this udp port is in use, it does not respond to network traffic:

$ sudo nmap -Pn -sU -p 22917 311.1.1.1
Password:

Starting Nmap 6.47 ( http://nmap.org ) at 2016-10-26 15:22 CEST
Nmap scan report for 311.1.1.1
Host is up.
PORT      STATE         SERVICE
22917/udp open|filtered unknown

The status ‘open|filtered’ does mean the udp port does not respond to requests.

Now let’s look how the actual configuration file of the snmp daemon looks like on exadata. The configuration file is /etc/snmp/snmpd.conf:

snmp daemon configuration file:
trapcommunity public
trapsink 127.0.0.1 public
rocommunity public 127.0.0.1
rwcommunity public 127.0.0.1

access  RWGroup         ""      any       noauth    exact all all all
com2sec snmpclient      127.0.0.1               public
group   RWGroup                 v1                              snmpclient

pass .1.3.6.1.4.1.4413.4.1 /usr/bin/ucd5820stat
pass .1.3.6.1.4.1.3582 /usr/sbin/lsi_mrdsnmpmain

syscontact Root <root@localhost> (configure /etc/snmp/snmp.local.conf)
syslocation Unknown (edit /etc/snmp/snmpd.conf)

view    all             included      .1                80

The snmpd.conf file shows:
– trapsink destination (127.0.0.1, localhost) and community string (public).
– the ro and rw communities are set to ‘public 127.0.0.1’
In general it is advised to change the community strings to something unique to avoid being easy guessable. However, in this case there’s also a network description following the ro and rw community, which is: 127.0.0.1. This means snmp access is restricted to localhost.
This can be verified by running snmpwalk from another machine:

$ snmpwalk -v 2c -c public 311.1.1.1
Timeout: No Response from 311.1.1.1

This means there is no way to communicate to the snmp daemon from outside of the machine. We can see from the snmp daemon configuration file that access to the snmp deamon is limited to localhost.

It seems the LSI megaraid snmp agent works together with snmpd:

root      33443      1  0 03:14 ?        00:01:01 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf
root      33454  33443  0 03:14 ?        00:00:00 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf

Obviously it reads the snmpd.conf (-c /etc/snmp/snmpd.conf as seen above), but it has got a configuration file of its own. This configuration file of ‘lsi_mrdsnmpagent’ process can be seen when looking at the files inside the sas_snmp rpm package (rpm -ql sas_snmp), but the main evidence it is using the file can be derived from looking at the open file descriptors of the lsi_mrdsnmpagent process:

# ls -ls /proc/$(pgrep -f lsi_mrdsnmpagent | head -1)/fd
total 0
0 lr-x------. 1 root root 64 Oct 26 03:14 0 -> /dev/null
0 lr-x------. 1 root root 64 Oct 26 03:14 1 -> /etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf
0 l-wx------. 1 root root 64 Oct 26 03:14 2 -> /var/log/cellos/cron_daily_cellos.stderr (deleted)
0 lrwx------. 1 root root 64 Oct 26 03:14 3 -> socket:[2923149143]
0 l-wx------. 1 root root 64 Oct 26 03:14 4 -> /var/log/cellos/cellos.log (deleted)
0 l-wx------. 1 root root 64 Oct 26 03:14 5 -> /var/log/cellos/cellos.trc (deleted)
0 lr-x------. 1 root root 64 Oct 26 03:14 6 -> /etc/snmp/snmpd.conf
0 lr-x------. 1 root root 64 Oct 26 03:14 7 -> /etc/redhat-release
0 lr-x------. 1 root root 64 Oct 26 03:14 8 -> /dev/megaraid_sas_ioctl_node
0 lr-x------. 1 root root 64 Oct 26 03:14 9 -> pipe:[2919419375]

Line 4 shows ‘/etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf’! Let’s look inside that configuration file:

# cat /etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf
#################################################
# Agent Service needs the IP addresses to sent trap
# The trap destination may be specified in this file or
# using snmpd.conf file. Following indicators can be set
# on "TrapDestInd" to instruct the agent to pick the IPs
# as the destination.
# 1 - IPs only from snmpd.conf
# 2 - IPs from this file only
# 3 - IPs from both the files
#################################################
TrapDestInd 3
#############Trap Destination IP##################
# Add port no after IP address with no space after
# colon to send the SNMP trap message to custom port.
# Community is to be mentioned after IP. If no community
# is mentioned, default SNMP community 'public' shall be
# used. 'trapcommunity' token is also used in snmpd.conf.
# Alternatively, you can also use trapsink command
# in snmpd.conf to send the SNMP trap message to
# custom port, else default SNMP trap port '162' shall
# be used.
127.0.0.1	public
# 145.147.201.88:1234	public
# 145.146.180.20:3061	testComm
127.0.0.1:8162 public

It is a configuration file that works alongside the snmpd.conf configuration. What is important to see, is ‘TrapDestInd’, which is set at ‘3’, which means that traps are send to trap destinations set in the snmpd.conf file AND set in the sas_TrapDestionation.conf file. Two traps are defined in the file, 127.0.0.1 with community string public, which means it sends a trap to udp port 161 (at which the snmpd process is listening, as we saw earlier in the open ports list), but the most interesting thing here is there’s also a trap send to 127.0.0.1 at port 8162. That is a port number I do not know from the top of my head!

However, it’s simple to find out. The first thing to check is to see what process is running at port 8162:

# netstat -anp | grep 8162
udp        0      0 :::8162                     :::*                                    15233/java

That’s a java process! Let’s grep the process number to see if the full command line gives more clues what this java process is:

# ps -ef | grep 15233
dbmsvc    15233  15136  0 Aug16 ?        05:32:25 /usr/java/jdk1.7.0_80/bin/java -client -Xms256m -Xmx512m -XX:CompileThreshold=8000 -XX:PermSize=128m -XX:MaxPermSize=256m -Dweblogic.Name=msServer -Djava.security.policy=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server/lib/weblogic.policy -XX:-UseLargePages -XX:ParallelGCThreads=8 -Dweblogic.ListenPort=7878 -Djava.security.egd=file:/dev/./urandom -Xverify:none -da -Dplatform.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3 -Dwls.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server -Dweblogic.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server -Dweblogic.management.discover=true -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/patch_wls1036/profiles/default/sysext_manifest_classpath weblogic.Server

That’s java running weblogic, with the name ‘msServer’. That is something that is part of the daemons that serve dbmcli (alike the daemons that service cellcli on the cells)!

This actually makes sense. The daemons that manage the database server fetch hardware status information and hardware failures from the BMC using the IPMI device (/dev/ipmi0). However the LSI MegaRaid adapter can not provide its status in that way. So in order for the management daemons to keep track of events on the LSI MegaRaid adapter (hardware issues), a daemon that works together with the snmp daemon is setup, which sends snmp traps if something occurs. The management daemon has setup a port that listens for these traps.

I do not know if the community strings is processed by the management deamon processing the trap. However, the port number on which the daemon is listening for traps is defined in ‘/opt/oracle/dbserver/dbms/deploy/config/cellinit.ora’ with the directive BMC_SNMP_PORT.

Conclusion
As far as I can see, the reason the snmp deamon is running is to be able to run the LSI MegaRaid SNMP agent process, so it can send traps to the compute node’s management daemons. Since most Exadata compute nodes do not have the firewall enabled, udp port 161 is exposed. The settings of the snmp daemon itself limits access to localhost.

%d bloggers like this: