When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:
WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957
The current documentation has no description for it. Let’s see what V$EVENT_NAME says:
SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 2 from v$event_name where name = 'PGA memory operation'; EVENT# NAME PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS ------ ------------------------------------- ---------- ---------- ---------- --------------- 524 PGA memory operation Other
Well, that doesn’t help…
Let’s look a bit deeper then, if Oracle provides no clue. Let’s start with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffercache, table contents are scanned to the PGA and processed from there).
TS@fv122b2 > alter session set events 'sql_trace level 8'; Session altered.
Now use strace to look at the system calls in another session:
# strace -e write=all -e all -p 9426 Process 9426 attached read(9,
Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:
io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000 mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000 lseek(7, 0, SEEK_CUR) = 164639 write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112 | 00000 57 41 49 54 20 23 30 78 37 66 35 62 61 39 35 39 WAIT #0x7f5ba959 | | 00010 36 33 31 30 3a 20 6e 61 6d 3d 27 50 47 41 20 6d 6310: nam='PGA m | | 00020 65 6d 6f 72 79 20 6f 70 65 72 61 74 69 6f 6e 27 emory operation' | | 00030 20 65 6c 61 3d 20 37 38 30 20 70 31 3d 32 30 39 ela= 780 p1=209 | | 00040 37 31 35 32 20 70 32 3d 31 31 31 34 31 31 32 20 7152 p2=1114112 | | 00050 70 33 3d 30 20 6f 62 6a 23 3d 32 32 38 33 33 20 p3=0 obj#=22833 | | 00060 74 69 6d 3d 31 39 35 31 37 30 32 30 35 36 36 0a tim=19517020566. | ... munmap(0x7f5ba8fbd000, 2097152) = 0 munmap(0x7f5ba91bd000, 2097152) = 0 mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000 lseek(7, 0, SEEK_CUR) = 183409 write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100 | 00000 57 41 49 54 20 23 30 78 37 66 35 62 61 39 35 39 WAIT #0x7f5ba959 | | 00010 36 33 31 30 3a 20 6e 61 6d 3d 27 50 47 41 20 6d 6310: nam='PGA m | | 00020 65 6d 6f 72 79 20 6f 70 65 72 61 74 69 6f 6e 27 emory operation' | | 00030 20 65 6c 61 3d 20 35 39 32 20 70 31 3d 30 20 70 ela= 592 p1=0 p | | 00040 32 3d 30 20 70 33 3d 30 20 6f 62 6a 23 3d 32 32 2=0 p3=0 obj#=22 | | 00050 38 33 33 20 74 69 6d 3d 31 39 35 32 30 36 33 33 833 tim=19520633 | | 00060 36 37 34 0a 674. |
Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look a the name of the wait event. Let’s look a bit more specific using a systemtap script:
global wait_event_nr=524 probe begin { printf("begin.\n") } probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthbwt") { if ( pid() == target() && register("rdx") == wait_event_nr ) printf("kskthbwt - %d\n", register("rdx")) } probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") { if ( pid() == target() && register("rsi") == wait_event_nr ) printf("kskthewt - %d\n", register("rsi")) } probe syscall.mmap2 { if ( pid() == target() ) printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6)) } probe syscall.mmap2.return { if ( pid() == target() ) printf("return value: %x\n", $return) } probe syscall.munmap { if ( pid() == target() ) printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2)) }
Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if function on line 7 checks if the process specified with -x with the systemtap executable is the process calling this function, and if the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to the wait event number 524, which is ‘PGA memory operation’, the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when the ending of a wait event is triggered.
Line 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call. Any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Line 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Line 22-25: This is a probe on munmap() system call, which frees mmap’ed memory to the operating system.
Please mind there are no accolades following the if statements, which means the code executed when the if is true is one line following the if. Systemtap and C are not indention sensitive (like python), I indented for the sake of clarity.
I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:
# stap -x 9426 mmap.stp begin. kskthbwt - 524 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000 kskthewt - 524 kskthbwt - 524 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000 kskthewt - 524 kskthbwt - 524 munmap, addr 7f5ba8fbd000, size 2097152 munmap, addr 7f5ba91bd000, size 2097152 kskthewt - 524
This makes it quite clear! The event ‘PGA memory operation’ is called when mmap() and munmap() are called. Which are calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped, but anonymous memory.
Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?
To answer that, we need to look at the flags of the two calls. Here is an example:
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000 mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.
The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy to make it easier for the Oracle database to allocate the memory allocations inside a contiguous chunk of (virtual) memory.
The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the allocated range. There is no MAP_NORESERVE which means this allocation is catered for for swapping in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any memory mapping done to the address range is silently unmapped from the first (“throw away”) mapping.
Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:
probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") { if ( pid() == target() && register("rsi") == wait_event_nr ) { ksuse = register("r13")-4672 ksuseopc = user_uint16(ksuse + 2098) ksusep1 = user_uint64(ksuse + 2104) ksusep2 = user_uint64(ksuse + 2112) ksusep3 = user_uint64(ksuse + 2120) ksusetim = user_uint32(ksuse + 2128) printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3) } }
When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is how the output looks like:
kskthbwt - 524 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000 kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0 kskthbwt - 524 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000 kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0
This looks like the size of memory allocated with the first mmap call for the PGA memory reservation is put in p1, and the size of the allocation of the second “real” memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird, is the memory is not unmapped/deallocated (this is a full execution of a SQL, allocated buffers must be deallocated?
Let’s look what happens when I execute the same SQL again:
kskthbwt - 524 munmap, addr 7f1562130000, size 2097152 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000 kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0 kskthbwt - 524 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000 kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0 kskthbwt - 524 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000 kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0
Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution, the execution starts with munmap(), followed by a mmap() call. The first munmap() call deallocates 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far; we have seen a “throw away”/reservation mmap() call with the memory address set to 0 to let the operating system pick an address for the requested memory chunk, and a mmap() call to truly allocate the reserved memory for usage, which had a memory address set. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory can not be read and written. Also the flags number is different, flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. As part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called. As we just have seen, memory could be reserved. Also, when p1/2/3 are all zero there is no way to tell how much memory is freed, nor which memory allocation.
The next wait is the timing of a single mmap() call. Actually, the mmap() call allocates the previous mmaped memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmaped.
The last wait is a familiar one, it is the mmap() call with memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previous “reserved” memory. However, the p1/2/4 values are now NOT set in the same way as we saw earlier: only p1 is non zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.
Conclusion:
With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. Until now I only saw the system calls mmap() and munmap() inside the ‘PGA memory operation’.