Recently I have been presenting on what running Oracle on a large Intel-based NUMA system looks like (OTN EMEA tour in Düsseldorf and Milan, and I will be presenting about this at the Dutch AMIS 25th anniversary event in June). The investigation for this presentation was done on an SGI UV 300 machine with 24 terabytes of memory, 32 sockets (=NUMA nodes), 480 cores and 960 threads.
Recently I was given access to a new version of the UV 300, the UV 300 RL, in which the CPU has been upgraded from Ivy Bridge to Haswell and now has 18 cores per socket instead of 15, which means a fully equipped system has 576 cores and 1152 threads.
Now to get back to the actual purpose of this blogpost: SGA memory allocation on startup. One of the things you can do on such a system is allocate an extremely large amount of memory. In my case I chose to allocate 10 terabytes for the Oracle SGA. When you go (excessively) outside of normal numbers, you run into things simply because there is a (very) low chance that this has actually been tested, and very few people (or none) have actually done it before.
First let me show you the other technical details:
Operating system: Oracle Linux version 7.1
Oracle grid infrastructure version: 12.1.0.2.160419
Oracle database version: 12.1.0.2.160419
For the database instance the use_large_pages parameter is set to ONLY. It simply does not make sense not to use huge pages for a normal database; in my opinion it really is mandatory if you work with large memory databases.
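As a side note, the huge page configuration that use_large_pages=ONLY depends on can be read from /proc/meminfo. Here is a minimal sketch (the helper name is mine) that extracts the relevant counters from that file's text:

```python
def hugepage_info(meminfo_text):
    """Extract huge page statistics from /proc/meminfo contents."""
    stats = {}
    for line in meminfo_text.splitlines():
        if line.startswith(("HugePages_", "Hugepagesize")):
            key, value = line.split(":", 1)
            stats[key] = int(value.split()[0])  # drop the "kB" unit if present
    return stats

# Hypothetical sample in the exact format /proc/meminfo uses
sample = """\
HugePages_Total:    5120
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""
info = hugepage_info(sample)
# total huge page memory in GB: pages * page size (kB) / 1024 / 1024
print(info["HugePages_Total"] * info["Hugepagesize"] // (1024 * 1024))  # → 10
```

In real use you would pass `open("/proc/meminfo").read()` instead of the sample text; HugePages_Total must be sized to hold the entire SGA for use_large_pages=ONLY to allow the instance to start.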
Firing up the instance…
At first I set sga_target to 10737418240000, which is 10 terabytes, and started up the instance. My sqlplus session got busy after pressing enter, and it remained busy. I got coffee downstairs, got back to my office, and it was still busy. Long story short: after 30 minutes it was still busy and I wondered if something was wrong.
What can you do? I first checked the alert.log of the instance. It told me:
Tue May 24 07:52:35 2016
Starting ORACLE instance (normal) (OS id: 535803)
Tue May 24 07:52:35 2016
CLI notifier numLatches:131 maxDescs:3801
Not a lot of information, but also no indication anything was wrong.
Then I looked with the ‘top’ utility what was active at my system:
   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+  COMMAND
535803 oracle 20   0  9.765t  22492  17396 R  99.4  0.0  11:10.56 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
I see my local bequeath session running at or near 100% CPU. It's working, and that is the expected state when you fire up an Oracle database instance. But what is it doing?
Next logical thing: strace:
[oracle@skynet5 frits]$ strace -fp 535803
Process 535803 attached
Yes, this is meant to show that strace does not produce any output: the process is not executing any system calls. This does NOT mean the process is doing nothing; we saw with 'top' that it's busy, running close to 100% on a CPU (thread).
When strace does not show anything, the next layer to dive into is the C function layer with perf. The first thing I tried, and normally try first, is 'perf top'. With -p, perf top zooms in on one process:
[oracle@skynet5 frits]$ perf top -p 535803
   PerfTop:    4003 irqs/sec  kernel:99.9%  exact:  0.0% [4000Hz cycles],  (target_pid: 535803)
-----------------------------------------------------------------------------------------------
    95.86%  [kernel]  [k] clear_page_c_e
     1.58%  [kernel]  [k] clear_huge_page
     0.41%  [kernel]  [k] __ticket_spin_lock
     0.32%  [kernel]  [k] clear_page
     0.29%  [kernel]  [k] _cond_resched
     0.28%  [kernel]  [k] mutex_lock
     0.11%  [kernel]  [k] apic_timer_interrupt
     0.11%  [kernel]  [k] hugetlb_fault
Ah! This shows what the process is actually doing! We can see it’s executing a function called ‘clear_page_c_e’ in kernel mode in 96% of the samples perf took of the running process. The name of the function describes what it does quite well: it is a kernel function to clear a page.
So, this does show that the process starting up the instance is actually spending all of its time clearing memory, but it does not show how this relates to what the Oracle database is doing. In order to dig further into what Oracle is doing, we can use perf in another way: record backtraces, and later report on them. When called with -g ("call graph"), perf record samples full backtraces into a file called 'perf.data'. When perf report is later used on that file, it shows the function that was sampled most, and all the backtraces leading to that function.
[oracle@skynet5 frits]$ perf record -g -p 535803
...after some time: ^C, then:
[oracle@skynet5 frits]$ perf report
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..........................................
#
    97.14%  oracle  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--100.00%-- hugetlb_fault
               |            handle_mm_fault
               |            __do_page_fault
               |            do_page_fault
               |            page_fault
               |            slaac_int
               |            slrac
               |            sskgm_segment_notify_action
               |            skgmcrone
               |            skgm_allocate_areas
               |            skgmcreate
               |            ksmcrealm
               |            ksmcsg
               |            opistr_real
               |            opistr
               |            opiodr
               |            ttcpip
               |            opitsk
               |            opiino
               |            opiodr
               |            opidrv
               |            sou2o
               |            opimai_real
               |            ssthrdmain
               |            main
               |            __libc_start_main
                --0.00%-- [...]
What this shows, is again the kernel function ‘clear_page_c_e’ (it also shows [k] to indicate it’s a kernel function), and when we follow the backtrace down, we see
* hugetlb_fault: this is a page fault handling function, which shows that I am using huge pages.
* handle_mm_fault, __do_page_fault, do_page_fault, page_fault: these are linux kernel functions indicating it’s handling a page fault.
* slaac_int, slrac, sskgm_segment_notify_action: these are Oracle database functions. Obviously, slaac_int is the function actually ‘touching’ memory, because it results in a page_fault.
So what is a page fault? A page fault is a hardware exception that is raised when a program tries to access memory that has been allocated into the process' virtual address space, but has not actually been loaded into main memory. The exception causes the kernel to handle the page fault and make the page available. Essentially Linux, like most other operating systems, makes pages available once they are actually used instead of when they are 'just' allocated. (This is a summary; there is a lot to tell about page faults alone, but that's beyond the scope of this article.)
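This demand-paging behaviour is easy to demonstrate with a few lines of Python (a sketch, nothing Oracle-specific): an anonymous mmap() only reserves virtual address space, and the minor page fault counter only climbs once the pages are actually written to for the first time.

```python
import mmap
import resource

PAGE = resource.getpagesize()
N_PAGES = 256

def minor_faults():
    """Minor (soft) page faults incurred by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

# Anonymous mapping: only virtual address space is set up here,
# no physical pages are attached yet.
buf = mmap.mmap(-1, N_PAGES * PAGE)

before = minor_faults()
for i in range(N_PAGES):
    buf[i * PAGE] = 1          # first touch: the kernel must fault the page in
after = minor_faults()

print(after - before)          # roughly one minor fault per page touched
```

The difference in the fault counter is roughly one per page: every first touch traps into the kernel, exactly the page_fault → hugetlb_fault → clear_page_c_e path visible in the perf output above (huge pages in Oracle's case, regular 4kB pages here).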
It's important to realise there is no system call visible that allocates memory (think of mmap(); malloc() is a library function that uses such calls under the hood). If you want to know more about these calls, please read up on them using 'man SYSCALLNAME' (on Linux, obviously). So, the above backtrace looks like code that deliberately touches the Oracle SGA memory in order to get it truly allocated!
Actually, in my case, with NUMA enabled and a large amount of memory to be allocated, the above process of getting 10 terabytes 'touched' or 'page faulted' can be watched in progress in the proc 'numa_maps' file of the 'bequeathing' process (the process that is starting up the instance). The Linux 'watch' utility is handy for this:
[oracle@skynet5 frits]$ watch -n 1 -d cat /proc/535803/numa_maps
00400000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:1 anon=28 dirty=28 N1=28
129b4000 prefer:1 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=360268 N0=11259 N1=11259 N2=11259 N3=11259 N4=11259 N5=11259 N6=11259 N7=11259 N8=11259 N9=11259 N10=11259 N11=11259 N12=11258 N13=11258 N14=11258 N15=11258 N16=11258 N17=11258 N18=11258 N19=11258 N20=11258 N21=11258 N22=11258 N23=11258 N24=11258 N25=11258 N26=11258 N27=11258 N28=11258 N29=11258 N30=11258 N31=11258
fca0000000 prefer:1 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:1 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:1 anon=534 dirty=534 N1=73 N2=461
...
With ‘watch’, ‘-n 1’ means a 1 second interval, ‘-d’ means highlight differences.
When the instance is starting (with NUMA enabled, and with a huge amount of memory allocated for the SGA), you will first see it busy in the line indicated with '80000000' (which is a memory address). In my case it says 'interleave:0-31', which means the memory allocation is interleaved over all the indicated NUMA nodes. An entry of the form N&lt;node&gt;=&lt;pages&gt; means that &lt;pages&gt; pages are allocated on NUMA node &lt;node&gt;. In my case I see the numbers growing per NUMA node, jumping from one NUMA node to another for some time. This is the shared pool allocation, which is spread out over all NUMA nodes.
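The per-node counters are easy to pull apart programmatically. Here is a minimal, hypothetical parser for one numa_maps line (the sample is a shortened version of the interleaved SGA segment above; remember that with 2MB huge pages, 11259 pages on one node is roughly 22GB):

```python
def parse_numa_maps_line(line):
    """Split one /proc/<pid>/numa_maps line into address, policy and per-node page counts."""
    fields = line.split()
    entry = {"address": fields[0], "policy": fields[1], "nodes": {}}
    for field in fields[2:]:
        # per-node counters look like N<node>=<pages>, e.g. N0=11259
        if field.startswith("N") and "=" in field and field[1].isdigit():
            node, pages = field[1:].split("=")
            entry["nodes"][int(node)] = int(pages)
    return entry

# Shortened sample in the same shape as the interleaved SGA segment above
sample = "80000000 interleave:0-31 file=/SYSV00000000 huge dirty=360268 N0=11259 N1=11259 N2=11259"
entry = parse_numa_maps_line(sample)
print(entry["policy"], sum(entry["nodes"].values()))  # → interleave:0-31 33777
```

Summing the per-node counts over time (for example, inside a loop re-reading the file) shows exactly the allocation progress that 'watch -d' highlights visually.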
However, after some time, you’ll see this:
00400000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:0 anon=28 dirty=28 N1=28
129b4000 prefer:0 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=516352 N0=16136 N1=16136 N2=16136 N3=16136 N4=16136 N5=16136 N6=16136 N7=16136 N8=16136 N9=16136 N10=16136 N11=16136 N12=16136 N13=16136 N14=16136 N15=16136 N16=16136 N17=16136 N18=16136 N19=16136 N20=16136 N21=16136 N22=16136 N23=16136 N24=16136 N25=16136 N26=16136 N27=16136 N28=16136 N29=16136 N30=16136 N31=16136
fca0000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=144384 N25=144384
14320000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=143872 N26=143872
18960000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=12232 N27=12232
1cfa0000000 prefer:0 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:0 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:0 anon=534 dirty=534 N1=73 N2=461
Once the process is done allocating the shared pool, it starts allocating buffer cache memory. It does this per NUMA node. You can see it here at the lines fca0000000, 14320000000 and 18960000000. Actually, I copied these figures while the instance was still allocating memory at address 18960000000. You can see it allocates memory local to a specific NUMA node by the 'prefer:nr' policy.
At the end, when all memory was allocated, the sqlplus session showed the sizes of the different memory pools and opened the database. This took approximately 90 minutes. At first, I thought this had to do with the parameter ‘PRE_PAGE_SGA’. However, this parameter was set at its default value, FALSE.
Some investigating and some testing revealed that this behaviour (touching the entire SGA) is governed by the hidden parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION, which defaults to TRUE in 12.1.0.2 on Linux.
Next I tested setting it to FALSE, and my startup time reduced to 3.5 minutes (!).
I used perf record again to look where the time is spent when _TOUCH_SGA_PAGES_DURING_ALLOCATION is set to FALSE:
# Overhead  Command          Shared Object        Symbol
# ........  ...............  ...................  ...........................................
#
    41.25%  oracle_563318_s  [kernel.kallsyms]    [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |           handle_mm_fault
               |           __do_page_fault
               |           do_page_fault
               |           page_fault
               |           |
               |           |--93.26%-- __intel_new_memset
               |           |           |
               |           |           |--79.48%-- ksmnsin
               |           |           |           |
               |           |           |           |--95.98%-- kcbnfy
               |           |           |           |           kscnfy
               |           |           |           |           ksmcsg
               |           |           |           |           opistr_real
               |           |           |           |           opistr
               |           |           |           |           opiodr
               |           |           |           |           ttcpip
               |           |           |           |           opitsk
               |           |           |           |           opiino
               |           |           |           |           opiodr
               |           |           |           |           opidrv
               |           |           |           |           sou2o
               |           |           |           |           opimai_real
               |           |           |           |           ssthrdmain
               |           |           |           |           main
               |           |           |           |           __libc_start_main
               |           |           |           |
               |           |           |           |--3.32%-- kcbw_setup
This is a new backtrace, whose top part (between clear_page_c_e and page_fault) looks like the former backtrace, simply because it's doing exactly the same thing: handling a page fault. However, the Oracle functions are different. What happens here is that Oracle calls memset(), a C library function used to wipe a memory area. If we go further down the stack trace, we see (full speculation!):
* ksmnsin: kernel system memory numa segment initialisation?
* kcbnfy/kscnfy: kernel cache buffers numa FY?/kernel system cache numa FY?
* ksmcsg: kernel system memory create sga?
When looking at the memory allocations in numa_maps during startup of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION set to FALSE, far fewer memory pages are touched; numa_maps only shows pages that are truly allocated (paged in), not the logical memory allocations. The same allocation pattern is still visible: shared pool allocations spread over the NUMA nodes (passing over the nodes several times), followed by the buffer cache allocations per NUMA node. However, because far fewer pages are actually touched, the startup time of the instance is greatly reduced.
So, is this a solution to the long startup time?
On one hand yes, on the other hand no. What that means is: you need to page the memory, no matter what you do. If you choose not to page the memory during startup (_TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE), you need to do it later. Sounds logical, right: pay now, or pay later? But what does that mean?
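The pay-now-or-pay-later trade-off can be made visible with a small Python sketch (again generic demand paging, not Oracle code): a first pass over freshly mapped memory takes a minor fault for every page, while a second pass over the now-resident pages takes (almost) none.

```python
import mmap
import resource

PAGE = resource.getpagesize()
N_PAGES = 512

def minor_faults():
    """Minor (soft) page faults incurred by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

buf = mmap.mmap(-1, N_PAGES * PAGE)  # anonymous mapping, nothing paged in yet

start = minor_faults()
for i in range(N_PAGES):
    buf[i * PAGE] = 1          # "pay now": every first touch faults a page in
cold = minor_faults() - start

start = minor_faults()
for i in range(N_PAGES):
    buf[i * PAGE] = 2          # pages are resident: no faulting needed
warm = minor_faults() - start

print(cold, warm)              # cold is ~N_PAGES, warm is near zero
```

The first loop plays the role of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE; with it set to FALSE, that cost does not disappear, it simply moves to whichever session touches the page first.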
After startup of the instance with _TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE, a lot of the memory is not really allocated. If you start to use the new, pristine SGA with a normal session, you pay the price of paging which the bequeathing session otherwise would have paid. Let me show you a 'perf record -g' example of a full table scan (which is reading the Oracle blocks into the buffer cache, not doing a direct path read):
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ....................................
#
    61.25%  oracle_564482_s  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |           |
               |           |--99.86%-- handle_mm_fault
               |           |           __do_page_fault
               |           |           do_page_fault
               |           |           page_fault
               |           |           __intel_ssse3_rep_memcpy
               |           |           kcbzibmlt
               |           |           kcbzib
               |           |           kcbgtcr
               |           |           ktrget2
               |           |           kdst_fetch0
               |           |           kdst_fetch
               |           |           kdstf00000010000100kmP
               |           |           kdsttgr
               |           |           qertbFetch
               |           |           qergsFetch
If we follow the call stack from bottom to top:
* qergsFetch, qertbFetch: query execute row source code.
* kdsttgr: kernel data scan table get row.
* kdstf00000010000100kmP: kernel data scan table full, the ultra fast full table scan.
* kdst: other functions in kernel data scan table functions.
* ktrget2: kernel transaction layer.
* kcbgtcr: kernel cache buffers get consistent read.
* kcbz: kernel cache buffers Z, physical IO helper functions.
* __intel_ssse3_rep_memcpy: this is actually the function memcpy, but replaced by an optimised version for intel CPUs.
* page_fault, do_page_fault, __do_page_fault, handle_mm_fault, hugetlb_fault, clear_page_c_e: these are the same page fault kernel functions we saw earlier.
This clearly shows my full table scan now needs to do the paging!
If I set _TOUCH_SGA_PAGES_DURING_ALLOCATION to TRUE, start up the instance (during which the pages are touched and thus paged in), and profile a full table scan, I see:
# Overhead  Command          Shared Object  Symbol
# ........  ...............  .............  ........................................
#
    24.88%  oracle_577611_s  oracle         [.] ksl_get_shared_latch
            |
            --- ksl_get_shared_latch
               |
               |--78.46%-- kcbzfb
               |           kcbzgb
               |           |
               |           |--99.63%-- kcbzgm
               |           |           kcbzibmlt
               |           |           kcbzib
               |           |           kcbgtcr
               |           |           ktrget2
               |           |           kdst_fetch0
               |           |           kdst_fetch
               |           |           kdstf00000010000100kmP
               |           |           kdsttgr
               |           |           qertbFetch
               |           |           qergsFetch
This shows that the most prominent function called is 'ksl_get_shared_latch'. No paging to be seen.
At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had at this point is: what about PRE_PAGE_SGA? In essence, this parameter is supposed to solve more or less the same issue: having the SGA pages touched at startup to prevent paging for foreground sessions.
BTW, if you read about PRE_PAGE_SGA in the online documentation, it gives a reason for using PRE_PAGE_SGA which is not true (that page table entries are prebuilt for the SGA pages), and it indicates the paging (=page faults) is done at startup, which also is not true. It also claims 'every process that starts must access every page in the SGA'; again, this is not true.
From what I can see, what happens when PRE_PAGE_SGA is set to true, is that a background process is started, that starts touching all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is ‘sa00’. When recording the backtraces of that process, I see:
# Overhead  Command        Shared Object      Symbol
# ........  .............  .................  ..........................................
#
    97.57%  ora_sa00_slob  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--100.00%-- hugetlb_fault
               |            handle_mm_fault
               |            __do_page_fault
               |            do_page_fault
               |            page_fault
               |            ksmprepage_memory
               |            ksm_prepage_sga_seg
               |            skgmapply
               |            ksmprepage
               |            ksm_sslv_exec_cbk
               |            ksvrdp
               |            opirip
               |            opidrv
               |            sou2o
               |            opimai_real
               |            ssthrdmain
               |            main
               |            __libc_start_main
                --0.00%-- [...]
The kernel paging functions are exactly the same as we have seen several times now. It's clear the functions executed by this process are specific to the prepage functionality. The pre-paging done on behalf of _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen from the Oracle function names). PRE_PAGE_SGA seems to be a 'workaround' if you don't want to spend the long time paging during startup, but still want to page the memory as soon as possible after startup. Needless to say, this is not the same as _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE: PRE_PAGE_SGA paging is done serially by a single process after startup, when the database is already open for usage. So normal foreground processes that encounter non-paged memory, which means they use it before the sa00 process has paged it, still need to do the paging themselves.
If you want to allocate a large SGA with Oracle 12.1.0.2 (this may apply to earlier versions too), the startup time can be significant. The reason is that the bequeathing session pages the memory on startup. This can be turned off by setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE. As a result, foreground (normal user) sessions need to do the paging. You can set the PRE_PAGE_SGA parameter to TRUE to do the paging, however the paging is then done by a single process (sa00) that serially pages the memory after startup. Foreground processes that encounter non-paged memory, which means they use it before the sa00 process could page it, need to page it themselves.
Thanks to: Klaas-Jan Jongsma for proofreading.