Archive

Oracle EE

In a previous article called ‘memory allocation on startup’ I touched on the subject of NUMA; Non Uniform Memory Access. This article is about how to configure NUMA, how to look into NUMA usage and a real life case of NUMA optimisation using in-memory parallel execution.

At this point in time (start of the summer of 2016) we see that the CPU speed competition has stagnated and settled at somewhere below maximally 4 gigahertz, and instead the number of core’s and the size of memory is growing. The common used server type in the market I am in is a two socket server. It is not unthinkable that in the near future servers with more than two sockets will increase in popularity, which facilitates the increase in (parallel) computing capacity and maximal amount of memory.

On the other hand of course there are already 2+ socket count servers, of which some have been multi socket for quite a while: Oracle/Sun M series, IBM P series, and a large diversity of intel based 2+ socket servers, like the Oracle/Sun “-4” and “-8” series. The investigations in this blogpost are done done on a SGI UV 300RL, which is a system with (up to) 32 sockets, 576 cores, and 1152 CPU threads.

There already have been a lot of blogposts about Linux and NUMA:
Kevin Closson
Martin Bach: linux-large-pages-and-non-uniform-memory-distribution, _enable_numa_support and numactl
Yves Colin
Bertrand Drouvot
I am sure there will be more, these are the ones that I read in the past.

This article will take you through the configuration of NUMA for the Oracle database and how to measure specifics from the Linux side. It will touch some of the things mentioned in the other blogposts, in order to provide a consistent story.
The database version is: 12.1.0.2.160419
The operating system is Oracle Linux 7.1
Kernel: 3.8.13-98.4.1.el7uek.x86_64

1. NUMA or not.
In order to sensibly use NUMA optimisation with the Oracle database, you first need to have hardware that is actually NUMA capable. This means multiple sockets in use that have local memory (please mind this includes common two socket server hardware). The next thing needed to use NUMA is NUMA support enabled by the Linux kernel. With the current version of the oracle-preinstall rpm, NUMA is disabled on the kernel load line in grub (numa=off). When NUMA is disabled, it will show one NUMA node (=socket) with numactl.

First make sure you are on NUMA enabled hardware by issuing ‘numactl –hardware’ and see multiple nodes specified.

2. Default Oracle.
When you don’t modify (undocumented) NUMA parameters, Oracle will NOT run in NUMA aware mode. In other words, you explicitly need to enable NUMA in the database. This is also visible if you look in the numa_maps, which shows process memory and NUMA information:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3339 mapmax=904 N0=112 N1=98 N2=98 N3=98 N4=98 N5=98 N6=98 N7=97 N8=96 N9=96 N10=96 N11=101 N12=108 N13=106 N14=105 N15=105 N16=107 N17=107 N18=106 N19=109 N20=106 N21=106 N22=107 N23=107 N24=106 N25=107 N26=108 N27=108 N28=109 N29=112 N30=111 N31=113
9c460000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:1 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N0=1

Few things to note: I took the pmon process, and my instance name is slob (pmon_slob). The SGA only is allocated from huge pages, so I grepped for ‘huge’. numa_maps only shows the process’ paged memory, my SGA is way bigger. The important thing to see here is the main SGA memory allocation at 80000000 reads ‘interleave:0-31’, which means the memory is spread over all the NUMA nodes. This can also be seen on the same line, because the memory allocation per node is shown with Nnn=nr.pages.

3. Set parameters
In Oracle 12.1.0.2 you need to set two parameters to fully enable NUMA support in the database:

alter system set "_enable_NUMA_support"=true scope=spfile;
alter system set "_px_numa_support_enabled"=true scope=spfile;

These are undocumented parameters. However, these parameters are explicitly mentioned in MOS article ‘What Is Correct Setting for _px_numa_support_enabled for 12c Databases That Wish to Use NUMA?’ (Doc ID 1956463.1).

Once you have set these parameters, Oracle will start up in NUMA aware mode. One of the things that is different in NUMA aware mode, is how the buffercache is created. This can be seen by looking at the numa_maps file of pmon again:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3587 mapmax=904 N0=112 N1=112 N2=112 N3=112 N4=112 N5=112 N6=112 N7=112 N8=112 N9=112 N10=112 N11=112 N12=113 N13=112 N14=113 N15=112 N16=112 N17=112 N18=113 N19=112 N20=112 N21=112 N22=112 N23=112 N24=112 N25=112 N26=112 N27=112 N28=112 N29=112 N30=112 N31=112
8860000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=1536 N25=1536
d260000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=512 N26=512
11c40000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=512 N27=512
16620000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=512 N28=512
1b000000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=512 N29=512
1f9e0000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=256 N30=256
243c0000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=512 N31=512
28da0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=256 N0=256
2d780000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=512 N1=512
32160000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=256 N2=256
36b40000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=512 N3=512
3b520000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=256 N4=256
3ff00000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=512 N5=512
448e0000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=256 N6=256
492c0000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=512 N7=512
4dca0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=256 N8=256
52680000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=512 N9=512
57060000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=256 N10=256
5ba40000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=512 N11=512
60420000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=256 N12=256
64e00000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=512 N13=512
697e0000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=256 N14=256
6e1c0000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=512 N15=512
72ba0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=256 N16=256
77580000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=512 N17=512
7bf60000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=256 N18=256
80940000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=512 N19=512
85320000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=256 N20=256
89d00000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=512 N21=512
8e6e0000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=256 N22=256
930c0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=512 N23=512
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=512 N24=512
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:2 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N5=1

If we look at the output, a few segments are still allocated interleaved (these are the fixed SGA and the shared pool), and we see allocations for every NUMA node. These per NUMA node allocations are the buffer cache.

Okay. Now the database is setup for NUMA, and we checked how this looks like on the operating system level using Linux tools. However, we need to realise that NUMA is no magic that sets your database into the next gear. Most things you do are not NUMA optimised by the database, simply because it wouldn’t make sense.

For example, when I perform a buffered full table scan without parallel query, it will spread out the blocks over the allocated segments. Why? Because my session is running on a random CPU thread which could run on another CPU thread the next time it is scheduled. What I am saying here is: only specific things are NUMA optimised.

A next logical question would be: but then what is actually NUMA optimised? One obvious thing that is NUMA optimised is the database writer (this is mentioned in the alert.log at startup). Another NUMA optimised thing is parallel query. If you look at the parallel query worker processes, the status file gives a hint:

# grep allowed_list /proc/$(pgrep p000_slob)/status
Cpus_allowed_list:	450-467,1026-1043
Mems_allowed_list:	0-31

This shows that the first parallel query worker process is limited in what CPU (threads) it is allowed to use. All parallel query worker processes are limited to ranges of CPU threads. Please mind it is NOT limited in what memory it can use, so it can use both local and remote memory. It may look weird that it still can access remote memory, totally ignoring NUMA optimised (=local) memory access.

First the CPU thread ranges. If you look what the range mean (it’s rather obvious), you see this:

# lscpu | grep 450
NUMA node25 CPU(s):    450-467,1026-1043

The CPU thread range is exactly a NUMA node. So whatever happens, parallel query worker process p000 can only run on the CPU threads of NUMA node 25!

Then absence of limits in memory: this means the parallel query process can still access anything it needs. This means that anything that used to work with parallel query can still run as it did without or before the NUMA optimisations. In specific cases, parallel query can take advantage of the NUMA setting.

One of such cases is in-memory parallel execution. With in-memory parallel execution, the parallel query slaves read blocks in the buffer cache instead of performing a direct path read. In-memory parallel execution is enabled when the parameter parallel_degree_policy is set to auto, and heuristically determines feasibility.

I setup a table of approximately 700GB, and performed an in-memory parallel query scan on a freshly started instance, in order to have the parallel query worker processes have as little memory touched as possible. At the end of the scan, this is how numa_maps of parallel query worker process p000 looks like:

# grep huge /proc/$(pgrep p000_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=906 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=4608 N0=144 N1=144 N2=144 N3=144 N4=144 N5=144 N6=144 N7=144 N8=144 N9=144 N10=144 N11=144 N12=144 N13=144 N14=144 N15=144 N16=144 N17=144 N18=144 N19=144 N20=144 N21=144 N22=144 N23=144 N24=144 N25=144 N26=144 N27=144 N28=144 N29=144 N30=144 N31=144
8880000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=50176 N25=50176
d2a0000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N26=6145
11c80000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=4608 N27=4608
16660000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=70 N28=5121
1b040000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=14336 N29=14336
1fa20000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=72 N30=5121
24400000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=4608 N31=4608
28de0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N0=6145
2d7c0000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=4099 N1=4099
321a0000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=2528 mapmax=73 N2=2528
36b80000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=2222 N3=2222
3b560000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=2504 mapmax=69 N4=2504
3ff40000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=2249 N5=2249
44920000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=4754 mapmax=69 N6=4754
49300000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=2296 N7=2296
4dce0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=2463 mapmax=70 N8=2463
526c0000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=2246 N9=2246
570a0000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=2440 mapmax=72 N10=2440
5ba80000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=2233 N11=2233
60460000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=2469 mapmax=69 N12=2469
64e40000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=2241 N13=2241
69820000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=2434 mapmax=69 N14=2434
6e200000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=2342 N15=2342
72be0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=2461 mapmax=70 N16=2461
775c0000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=2329 N17=2329
7bfa0000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=2499 mapmax=71 N18=2499
80980000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=2334 N19=2334
85360000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=2466 mapmax=73 N20=2466
89d40000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=2328 N21=2328
8e720000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=2180 mapmax=69 N22=2180
930e0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=2208 mapmax=69 N23=2208
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=2167 mapmax=68 N24=2167
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=147 N26=2
9c480000000 prefer:0 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=906 N2=1

If you look at memory address 8880000000, you see parallel query worker process p000 used 50176 pages, while on the other (buffer cache/’prefer’) segments it touched somewhere between 2200-6150 pages. So it clearly used it’s local memory more than the remote memories. This is proof in-memory parallel query is NUMA optimised.

In fact, when I time the query using in-memory parallel query with NUMA enabled, it takes a little less than 8 seconds consistently once the data is loaded in the buffercache. When I disable NUMA optimisations (_enable_NUMA_support=false), and execute the same query, also doing in-memory parallel query, the time goes up to a little less than 24 seconds, only scanning the buffercache.

Recently I have been presenting on what running on a large intel based NUMA system looks like (OTN EMEA tour in Düsseldorf and Milan, and I will be presenting about this at the Dutch AMIS 25th anniversary event in june). The investigation of this presentation is done on a SGI UV 300 machine with 24 terabyte of memory, 32 sockets (=NUMA nodes), 480 core’s and 960 threads.

Recently I have been given access to a new version of the UV 300, the UV 300 RL, for which the CPU has improved from Ivy Bridge to Haswell, and now has 18 core’s per socket instead of 15, which means the number of core’s on a fully equipped system is 576, which makes 1152 threads.

Now the get back to the actual purpose of this blogpost: SGA memory allocation on startup. One of the things you can do on such a system is allocate an extremely high number of memory. In my case I chose to allocate 10 terabyte for the Oracle SGA. When you go (excessively) outside of normal numbers, you run into things simply because there is a (very) low chance that that has been actually tested, and very few (or none) have actually done it before.

First let me show you the other technical details:
Operating system: Oracle Linux version 7.1
Kernel: kernel-uek-3.8.13-98.4.1.el7uek.x86_64
Oracle grid infrastructure version: 12.1.0.2.160419
Oracle database version: 12.1.0.2.160419
For the database instance the use_large_pages parameter is set to ONLY. It simply does not make sense not to use it in a normal database, in my opinion it really is mandatory if you work with large memory databases.

Firing up the instance…
At first I set sga_target to 10737418240000, alias 10 terabyte, and started up the instance. My sqlplus session got busy after pressing enter, however it remained busy. I got coffee downstairs, got back to my office, and it was still busy. Long story short: after 30 minutes it was still busy and I wondered if something was wrong.

What can you do? I first checked the alert.log of the instance. It told me:

Tue May 24 07:52:35 2016
Starting ORACLE instance (normal) (OS id: 535803)
Tue May 24 07:52:35 2016
CLI notifier numLatches:131 maxDescs:3801

Not a lot of information, but also no indication anything was wrong.

Then I looked with the ‘top’ utility what was active at my system:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 535803 oracle    20   0  9.765t  22492  17396 R  99.4  0.0  11:10.56 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

I see my local bequeath session running at and near 100% CPU. It’s working, and that is the correct state when you fire up an Oracle database instance. But what is it doing?

Next logical thing: strace:

[oracle@skynet5 frits]$ strace -fp 535803
Process 535803 attached

Yes, this is meant to show that strace does not show any output. This means the process is not doing any system calls. This does NOT mean the process is doing nothing, we saw with ‘top’ it’s busy, running close to 100% on a CPU (thread).

When strace does not show anything, the next layer to dive into is the C function layer with perf. The first thing I tried, and normally try at first, is ‘perf top’. Perf top -p zooms in on one process:

[oracle@skynet5 frits]$ perf top -p 535803
   PerfTop:    4003 irqs/sec  kernel:99.9%  exact:  0.0% [4000Hz cycles],  (target_pid: 535803)
-----------------------------------------------------------------------------------------------------------------------------------

    95.86%  [kernel]   [k] clear_page_c_e
     1.58%  [kernel]   [k] clear_huge_page
     0.41%  [kernel]   [k] __ticket_spin_lock
     0.32%  [kernel]   [k] clear_page
     0.29%  [kernel]   [k] _cond_resched
     0.28%  [kernel]   [k] mutex_lock
     0.11%  [kernel]   [k] apic_timer_interrupt
     0.11%  [kernel]   [k] hugetlb_fault

Ah! This shows what the process is actually doing! We can see it’s executing a function called ‘clear_page_c_e’ in kernel mode in 96% of the samples perf took of the running process. The name of the function describes what it does quite well: it is a kernel function to clear a page.

So, this does describe that the process starting up the instance is actually spending all of its time clearing memory, but it does not show how this relates to what the Oracle database is actually doing. In order to dig further into what Oracle is doing, we can use perf in another way: record backtraces, and later report on that. What perf does, is sample full backtraces in a file called ‘perf.data’. When perf is later used to report using that file, it will show the current function that was sampled most in the perf.data file, and all the backtraces leading to that function (when perf record is called with -g “call-graph”).

[oracle@skynet5 frits]$ perf record -g -p 535803
...after some time: ^C, then:
[oracle@skynet5 frits]$ perf report
# Overhead  Command       Shared Object                                      Symbol
# ........  .......  ..................  ..........................................
#
    97.14%   oracle  [kernel.kallsyms]   [k] clear_page_c_e
             |
             --- clear_page_c_e
                |
                |--100.00%-- hugetlb_fault
                |          handle_mm_fault
                |          __do_page_fault
                |          do_page_fault
                |          page_fault
                |          slaac_int
                |          slrac
                |          sskgm_segment_notify_action
                |          skgmcrone
                |          skgm_allocate_areas
                |          skgmcreate
                |          ksmcrealm
                |          ksmcsg
                |          opistr_real
                |          opistr
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                 --0.00%-- [...]

What this shows, is again the kernel function ‘clear_page_c_e’ (it also shows [k] to indicate it’s a kernel function), and when we follow the backtrace down, we see
* hugetlb_fault: this is a page fault handling function, which shows that I am using huge pages.
* handle_mm_fault, __do_page_fault, do_page_fault, page_fault: these are linux kernel functions indicating it’s handling a page fault.
* slaac_int, slrac, sskgm_segment_notify_action: these are Oracle database functions. Obviously, slaac_int is the function actually ‘touching’ memory, because it results in a page_fault.

So what is a page fault? A page fault is an interrupt that is raised when a program tries to access memory that has already been allocated into the process’ virtual address space, but not actually loaded into main memory. The interrupt causes the kernel to handle the page fault, and make the page available. Essentially Linux, like most other operating systems, makes pages available once they are actually used instead of when they are ‘just’ allocated. (this is a summary, there is a lot to tell about page faults alone, but that’s beyond the scope of this article)

It’s important to realise there is no system call visible that either allocates memory (think about mmap(), malloc(), etc.). If you want to know more about these calls, please read up on them using: ‘man SYSCALLNAME’ (on linux obviously). So, the above backtrace looks like code that deliberately touches the Oracle SGA memory in order to get it truly allocated!

Actually, in my case, with NUMA enabled, and a large amount of memory to be alloced, the above process of getting 10 terabyte ‘touched’ or ‘page faulted’ can be watched in progress in the proc ‘numa_maps’ file of the ‘bequeathing’ process (the process that is starting up the instance). The linux ‘watch’ utility is handy for this:

[oracle@skynet5 frits]$ watch -n 1 -d cat /proc/535803/numa_maps
00400000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:1 anon=28 dirty=28 N1=28
129b4000 prefer:1 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=360268 N0=11259 N1=11259 N2=11259 N3=11259 N4=11259 N5=11259 N6=11259 N7=11259 N8
=11259 N9=11259 N10=11259 N11=11259 N12=11258 N13=11258 N14=11258 N15=11258 N16=11258 N17=11258 N18=11258 N19=11258 N20=11258 N21=11258 N22=11258 N23
=11258 N24=11258 N25=11258 N26=11258 N27=11258 N28=11258 N29=11258 N30=11258 N31=11258
fca0000000 prefer:1 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:1 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:1 anon=534 dirty=534 N1=73 N2=461
...

With ‘watch’, ‘-n 1’ means a 1 second interval, ‘-d’ means highlight differences.
When the instance is starting (with NUMA enabled, and with a huge amount of memory allocated for SGA), first you will see it’s busy in the line indicated with ‘80000000’ (which is a memory address). In my case it says ‘interleave:0-31’, which means the memory allocation is done over all the indicated NUMA nodes. Actually Nnr=nr means a memory allocation of nr of pages on NUMA node Nnr. In my case I see the number growing per NUMA node jumping from one NUMA node to another for some time. This is the shared pool allocation, that is spread out over all NUMA nodes.

However, after some time, you’ll see this:

00400000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:0 anon=28 dirty=28 N1=28
129b4000 prefer:0 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=516352 N0=16136 N1=16136 N2=16136 N3=16136 N4=16136 N5=16136 N6=16136 N7=16136 N8
=16136 N9=16136 N10=16136 N11=16136 N12=16136 N13=16136 N14=16136 N15=16136 N16=16136 N17=16136 N18=16136 N19=16136 N20=16136 N21=16136 N22=16136 N23
=16136 N24=16136 N25=16136 N26=16136 N27=16136 N28=16136 N29=16136 N30=16136 N31=16136
fca0000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=144384 N25=144384
14320000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=143872 N26=143872
18960000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=12232 N27=12232
1cfa0000000 prefer:0 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:0 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:0 anon=534 dirty=534 N1=73 N2=461

Once the process is done allocating shared pool, it starts allocating buffer cache memory. It does this per NUMA node. You can see it here at the lines fca0000000, 14320000000, 18960000000. Actually, I copied these figures when the instance was still allocating memory at memory address 18960000000. You can see it does specific allocation of memory local to a specific NUMA node by ‘prefer:nr’.

At the end, when all memory was allocated, the sqlplus session showed the sizes of the different memory pools and opened the database. This took approximately 90 minutes. At first, I thought this had to do with the parameter ‘PRE_PAGE_SGA’. However, this parameter was set at its default value, FALSE.

Some investigating and some testing revealed this behaviour (touching the entire SGA) was governed by the hidden parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION, which defaults to TRUE in 12.1.0.2 on Linux.

Next I tested setting it to FALSE, and my startup time reduced to 3.5 minutes (!).

I used perf record again to look where the time is spend when _TOUCH_SGA_PAGES_DURING_ALLOCATION is set to FALSE:

# Overhead          Command         Shared Object                                       Symbol
# ........  ...............  ....................  ...........................................
#
    41.25%  oracle_563318_s  [kernel.kallsyms]     [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--93.26%-- __intel_new_memset
               |          |          |
               |          |          |--79.48%-- ksmnsin
               |          |          |          |
               |          |          |          |--95.98%-- kcbnfy
               |          |          |          |          kscnfy
               |          |          |          |          ksmcsg
               |          |          |          |          opistr_real
               |          |          |          |          opistr
               |          |          |          |          opiodr
               |          |          |          |          ttcpip
               |          |          |          |          opitsk
               |          |          |          |          opiino
               |          |          |          |          opiodr
               |          |          |          |          opidrv
               |          |          |          |          sou2o
               |          |          |          |          opimai_real
               |          |          |          |          ssthrdmain
               |          |          |          |          main
               |          |          |          |          __libc_start_main
               |          |          |          |
               |          |          |          |--3.32%-- kcbw_setup

This is a new backtrace, which looks like the former backtrace on the top (between clear_page_c_e and page_fault), simply because it’s doing exactly the same, handling a page fault. However, the Oracle functions are different. What is happening here is the Oracle performs a system call, memset(), which is used to wipe a memory area. If we go further down the stack trace, we see (full speculation!):
* ksmnsin: kernel system memory numa segment initialisation?
* kcbnfy/kscnfy: kernel cache buffers numa FY?/kernel system cache numa FY?
* ksmcsg: kernel system memory create sga?

When looking at the memory allocations during startup of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION set to FALSE in numa_maps, way lesser memory pages are touched. numa_maps only shows truly allocated pages, instead of the logical memory allocations. This means the same memory allocations done per numa node (passing over the numa nodes several times) done for the shared pool, and then the buffer cache allocations per numa node, however now also passing over the different numa nodes several times too. Because of the much smaller amount of pages touched, the startup time of the instance is greatly reduced.

So, is this a solution to the long startup time?
On one hand yes, on the other hand no. What that means is: you need to page the memory, no matter what you do. If you choose not to page the memory during startup (_TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE), you need to do it later. Sounds logical, right: pay now, or pay later? But what does that mean?

After startup of the instance with _TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE, a lot of the memory if not really allocated. If you start to use the new prestine SGA with a normal session, you pay the price of paging which the bequeathing session otherwise took. Let me show you a ‘perf record -g’ example of a full table scan (which is reading the Oracle blocks into the buffercache, not doing a direct path read):

# Overhead          Command       Shared Object                                Symbol
# ........  ...............  ..................  ....................................
#
    61.25%  oracle_564482_s  [kernel.kallsyms]   [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          |
               |          |--99.86%-- handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          __intel_ssse3_rep_memcpy
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

If we follow the call stack from bottom to top:
* qergsFetch, qertbFetch: query execute row source code.
* kdsttgr: kernel data scan table get row.
* kdstf00000010000100kmP: kernel data scan table full, the ultra fast full table scan.
* kdst: other functions in kernel data scan table functions.
* ktrget2: kernel transaction layer.
* kcbgtcr: kernel cache buffers get consistent row.
* kcbz: kernel cache buffers Z, physical IO helper functions.
* __intel_ssse3_rep_memcpy: this is actually the function memcpy, but replaced by an optimised version for intel CPUs.
* page_fault, do_page_fault, __do_page_fault, handle_mm_fault, hugetlb_fault, clear_page_c_e: these are the same page fault kernel functions we saw earlier.

This clearly shows my full table scan now needs to do the paging!

If I set _TOUCH_SGA_PAGES_DURING_ALLOCATION to TRUE, startup the instance (for which the pages are touched and thus pages), and profile a full table scan, I see:

# Overhead          Command         Shared Object                                    Symbol
# ........  ...............  ....................  ........................................
#
    24.88%  oracle_577611_s  oracle                [.] ksl_get_shared_latch
            |
            --- ksl_get_shared_latch
               |
               |--78.46%-- kcbzfb
               |          kcbzgb
               |          |
               |          |--99.63%-- kcbzgm
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

This shows the most prominent function which is called is ‘ksl_get_shared_latch’. No paging to be seen.

At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had on this point is: but how about PRE_PAGE_SGA? In essence, this parameter is supposed to more or less solve the same issue, having the SGA pages being touched at startup to prevent paging for foreground sessions.

BTW, if you read about PRE_PAGE_SGA in the online documentation, it tells a reason for using PRE_PAGE_SGA, which is not true (page table entries are prebuilt for the SGA pages), and it indicates the paging (=page faults) are done at startup, which also is not true. It also claims ‘every process that starts must access every page in the SGA’, again this is not true.

From what I can see, what happens when PRE_PAGE_SGA is set to true, is that a background process is started, that starts touching all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is ‘sa00’. When recording the backtraces of that process, I see:

# Overhead        Command      Shared Object                                      Symbol
# ........  .............  .................  ..........................................
#
    97.57%  ora_sa00_slob  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--100.00%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          ksmprepage_memory
               |          ksm_prepage_sga_seg
               |          skgmapply
               |          ksmprepage
               |          ksm_sslv_exec_cbk
               |          ksvrdp
               |          opirip
               |          opidrv
               |          sou2o
               |          opimai_real
               |          ssthrdmain
               |          main
               |          __libc_start_main
                --0.00%-- [...]

The kernel paging functions are exactly the same as we have seen several times now. It’s clear the functions executed by this process are specifically for the prepage functionality. The pre-paging as done on behalf of _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen by the Oracle function names). PRE_PAGE_SGA seems to be a ‘workaround’ if you don’t want to spend the long time paging on startup, but still want to page the memory as soon as possible after startup. Needless to say, this is not the same as _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE, PRE_PAGE_SGA paging is done serially by a single process after startup when the database is open for usage. So normal foreground process that encounter non-paged memory, which means they use it before the sa00 process pages it, still need to do the paging.

Conclusion
If you want to allocate a large SGA with Oracle 12.1.0.2 (but may apply to earlier versions too), the startup time could be significant. The reason for that is the bequeathing session pages the memory on startup. This can be turned off by setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE. As a result, foreground (normal user) sessions need to do the paging. You can set PRE_PAGE_SGA parameter to TRUE to do paging, however the paging is done by a single process (sa00) that serially pages the memory after startup. Foreground processes that encounter non-paged memory, which means they use it before the sa00 process could page it, need to page it theirselves.

Thanks to: Klaas-Jan Jongsma for proofreading.

The intention of this blogpost is to show the Oracle wait time granularity and the Oracle database time measurement granularity. One of the reasons for doing this, is the Oracle database switched from using the function gettimeofday() up to version 11.2 to clock_gettime() to measure time.

This switch is understandable, as gettimeofday() is a best guess of the kernel of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is an monotonic increasing timer, which means it is more precise and does not have the option to drift backward, which gettimeofday() can do in certain circumstances, like time adjustments via NTP.

The first thing I wanted to proof, is the switch of the gettimeofday() call to the clock_gettime() call. This turned out not to be as simple as I thought.

Because a lot of applications and executables need timing information, which is traditionally done via the gettimeofday() call, the Linux kernel maintainers understood the importance of making this call as performant as possible. Calling a (hardware) clock means you request information from a resource on your computer system. Requesting something like that requires a program to switch to kernel mode first. In order to lower the resources and time needed to get time information, the Linux kernel includes a “trick” to get that information, which is known as a virtual system call or vsyscall. Essentially this means this information is provided in userspace, so there are lesser resources needed, and there is no need to switch to kernel mode. James Morle has an excellent article describing this, this line is a link to that. By staying in userspace, the gettimeofday() and clock_gettime() calls are “userland” calls, and do not show up when using “strace” to see system calls of a process executing.

However I said it wasn’t as easy as I thought. I was looking into this, and thought making the vsyscalls visible by echoing “0” in /proc/sys/kernel/vsyscall64. However, I am working with kernel version 3.8.13 for doing this part of the research….which does not have /proc/sys/kernel/vsyscall64, which means I can’t turn off the vsyscall optimisation and make both gettimeofday() and clock_gettime() visible as systemcall.

Searching for kernel.vsyscall64 on the internet I found out that with early versions Linux kernel version 3 vsyscall64 has been removed. This means I can’t use a simple switch to flip to make the calls visible. So, instead of going straight to the thing I wanted to research, I got stuck in doing the necessary preparing and essential preliminary investigation for it.

Can I do it in another way? Yes, this can be done using gdb, the GNU debugger. Start up a foreground (Oracle database) session, and fetch the process ID of that session and attach to it with gdb:

gdb -p PID
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...etc...
(gdb)

Now simply break on gettimeofday and clock_gettime, and make gdb notify you it encountered the call. First 11.2.0.4:

(gdb) break gettimeofday
Breakpoint 1 at 0x332fc9c7c0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>printf "gettimeofday\n"
>continue
>end
(gdb) break clock_gettime
Breakpoint 2 at 0x3330803e10
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "clock_gettime\n"
>continue
>end
(gdb) 

You can save this gdb macro to time.gdb by executing “save breakpoints time.gdb”. Now execute “c” (continue) and enter, to make the process you attached to running again. Execute something very simple, like:

SQL> select * from dual;

This results in Oracle 11.2.0.4:

(gdb) c
Continuing.
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
...etc...

That is expected, we already knew the Oracle database is executing the gettimeofday function a lot. Now let’s do exactly the same, but with Oracle version 12.1.0.2. If you saved the breakpoints and macro’s, you can attach to an Oracle 12.1.0.2 foreground process with gdb and execute ‘source time.gdb’ to set the breakpoints and macro’s. When the ‘select * from dual’ is executed in this version of the database, it looks like this:

(gdb) c
Continuing.
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
...etc...

It is clearly (mostly) executing the clock_gettime() function.

The clock_gettime() function can use a variety of time sources. If you read the manpage of clock_gettime you will see that the first argument is the clock source. You can see the clock sources in the kernel source file Linux/include/uapi/linux/time.h, which shows:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
 #define CLOCK_REALTIME                  0
 #define CLOCK_MONOTONIC                 1
 #define CLOCK_PROCESS_CPUTIME_ID        2
 #define CLOCK_THREAD_CPUTIME_ID         3
 #define CLOCK_MONOTONIC_RAW             4
 ...

The first argument of clock_gettime is the type of clock, so if I remove the macro with clock_gettime, execution stops when clock_gettime is called:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
        silent
        printf "clock_gettime\n"
        c
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
(gdb) c
Continuing.

Now execute something in the sqlplus session. What will happen in the gdb session is:

Breakpoint 2, 0x0000003330803e10 in clock_gettime () from /lib64/librt.so.1
(gdb)

Now look up the first argument of the call:

(gdb) print $rdi
$1 = 1

So Oracle is using CLOCK_MONOTONIC. Not the point of this article, but this means Oracle database time measurement is granular on the microsecond layer.

Now let’s look how much time the Oracle wait interface takes itself. The Oracle wait interface is using the functions kslwtbctx() (kernel service layer wait begin context) and kslwtectx() (kernel service layer wait end context) to start and stop measuring a wait event. Please mind that instead of looking at the time the wait interface provides, this means looking at the time that is taken executing in the kslwtbctx() and kslwtectx() functions. This can be done using systemtap:

global kslwtbctx_time, kslwtectx_time, kslwtbctx_count=0, kslwtbctx_tot=0, kslwtectx_count=0, kslwtectx_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	if ( pid() == target() ) {
		kslwtbctx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx").return {
	if ( pid() == target() ) {
		printf("kslwtbctx: %12d\n", local_clock_ns()-kslwtbctx_time)
		kslwtbctx_tot+=local_clock_ns()-kslwtbctx_time
		kslwtbctx_count++
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	if ( pid() == target() ) {
		kslwtectx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx").return {
	if ( pid() == target() ) {
		printf("kslwtectx: %12d\n", local_clock_ns()-kslwtectx_time)
		kslwtectx_tot+=local_clock_ns()-kslwtectx_time
		kslwtectx_count++
	}
}
probe end {
	printf("\nkslwtbctx: avg %12d\nkslwtectx: avg %12d\n",kslwtbctx_tot/kslwtbctx_count,kslwtectx_tot/kslwtectx_count)
}

This systemtap script has a ‘begin probe’, which executes once the systemtap script starts running. I simply print ‘Begin.’ with a newline. The idea is that it prompts me once the systemtap script is actually running.

Then there is a (userspace) process based probe for the oracle process. There are two probes for both the kslwtbctx and the kslwtectx function in the oracle process. The first one (.function(“kslwtbctx”)) fires when the function is entered, the second one (.function(“kslwtbctx”).return) fires when the function has ended.

The ‘if ( pid() == target() )’ function filters all the invocations and returns of the probed functions for the PID set by “-x PID” parameter. Otherwise any invocation of the probed function by any process would be picked up.

The entering probe simply records the time in nanoseconds in a variable. The returning probe subtracts the previous recorded time from the current time, which means the time between entering and returning is shown. Also, the returning probe adds the time the function took to another variable, and counts the number of times the return probe has fired.

The end probe shows the total time spend in each of the two functions, divided by the number of times the return probe was fired. This way the average time spend in the two functions is calculated. As you will see, the time spend in the function varies.

When this is executed against an Oracle foreground session, this is how it looks like:

# stap -x 2914 wait_interface.stap
Begin.
kslwtectx:         9806
kslwtbctx:         3182
kslwtectx:         1605
kslwtbctx:         1311
kslwtectx:         4200
kslwtbctx:         1126
kslwtectx:         1014
kslwtbctx:          840
kslwtectx:         4402
kslwtbctx:         2636
kslwtectx:         2023
kslwtbctx:         1586
^C
kslwtbctx: avg         2165
kslwtectx: avg         4305

The time measured is in nanoseconds. The average wait interface overhead is roughly 6 microseconds including systemtap overhead on my system.

The obvious thought you might have, is: “why is this important?”. Well, this is actually important, because the 6us dictates what the wait interface should measure, and what it should not measure. What I mean to say, is that anything that is called inside the Oracle database for which the time spend is in the same ballpark as the wait interface overhead or lower, should not be measured by the wait interface, because the measurement would influence the overall performance in a negative way.

A good example of this are latch gets. The Oracle database does not instrument individual latch gets via the wait interface, but rather exposes waiting for a latch via the wait interface when a process has spun for it (a latch is a spinlock), and decides to go to sleep (on a semaphore) waiting to get woken once the latch becomes available.

However, using systemtap we can actually measure latch gets! Here is a simple script to measure the latch gets for non-shared latches via the kslgetl() function:

global kslgetl_time, kslgetl_count=0, kslgetl_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl") {
	if ( pid() == target() )
		kslgetl_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl").return {
	if ( pid() == target() ) {
		printf("kslgetl: %12d\n", local_clock_ns()-kslgetl_time)
		kslgetl_tot+=local_clock_ns()-kslgetl_time
		kslgetl_count++
	}
}
probe end {
	printf("\nkslgetl: avg %12d\n",kslgetl_tot/kslgetl_count)
}

This systemtap script follows the exact same structure and logic of the previous systemtap script.

This is how this looks like on my system when executed against a database foreground session executing ‘select * from dual’:

# stap -x 2914 kslgetl.stap
Begin.
kslgetl:         3363
kslgetl:          786
kslgetl:          744
kslgetl:          782
kslgetl:          721
kslgetl:          707
kslgetl:         1037
kslgetl:          728
kslgetl:          711
kslgetl:          736
kslgetl:          719
kslgetl:          714
kslgetl:         1671
kslgetl:          929
kslgetl:          968
kslgetl:          919
kslgetl:          883
kslgetl:          869
kslgetl:         3030
kslgetl:          750
^C
kslgetl: avg         1362

As you can see, the measured average time spend inside the kslgetl function is 1.3us on my system, which includes systemtap overhead. Clearly the time for taking a latch is less than the wait interface takes, which means not instrumenting the kslgetl() function in the wait interface is a sensible thing.

This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine grained events, even though the timer can time on microsecond granularity. Please mind I am not saying that it does not make sense to detail the response time, I think with modern computer systems with a lot of memory the Oracle database can run more and more without needing to wait for things like disk IOs. This means modern database servers can spend a lot of time just running on CPU, making it hard to understand in what routines the time is spend. Tuning on CPU execution requires an insight into where time is spend. The only option to understand how CPU time in Oracle is composited, is by using external (operating system based) tools like perf and flame graphs to detail CPU time. It would be great if an option would exist in the database to detail on CPU time.

This is the second blogpost on using PL/SQL inside SQL. If you landed on this page and have not read the first part, click this link and read that first. I gotten some reactions on the first article, of which one was: how does this look like with ‘pragma udf’ in the function?

Pragma udf is a way to speed up using PL/SQL functions in (user defined function), starting from version 12. If you want to know more about the use of pragma udf, and when it does help, and when it doesn’t, please google for it.

create or replace function add_one( value number ) return number is
        pragma udf;
        l_value number(10):= value;
begin
        return l_value+1;
end;
/

select sum(add_one(id)) from t2;

As you can see, really the only thing you have to do is add ‘pragma udf’ in the declaration section of PL/SQL.

Here is how the flamegraph looks like:

What is visible, is that the functions between the plsql interpreter (pfrrun) and the function that makes the operand evaluation switch to PL/SQL (evapls) now is only one function, peidxrex. However, inside the evapls function there are two additional functions called (kkxmsagof, kkxmsagif, not readable) which take noticeable time. Conclusion at this point is pragma udf is doing it in yet another way than a native PL/SQL function and the subquery factoring.

Profiling this using the systemtap script:

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

Shows:

# stap -x 92509 plsql.stap
Begin.
^C
evapls time:      2211412
pfrrun time:       804178

So, that’s very close to using this function using subquery factoring, a bit longer (2192685). This is very strictly depending on what is actually done, so milage may vary for your own use.

While we are at it, let’s have a look how this looks like when no PL/SQL is used, so:

select sum(id+1) from t2;

Here it is:

The function used for adding is now evaaddrset. From the size of the kdst_fetch function can be seen that it takes way lesser time. Let’s measure it with a changed version of the systemtap script:

global evaaddrset_time, evaaddrset_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset") {
	if ( pid() == target() )
		evaaddrset_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset").return {
	if ( pid() == target() )
		evaaddrset_tot+=local_clock_us()-evaaddrset_time
}

probe end {
	printf("\nevaaddrset time: %12d\n", evaaddrset_tot)
}

This is how the output looks like:

# stap -x 92509 plsql.stap
Begin.
^C
evaaddrset time:        43389

A simple calculation shows that doing the addition native in SQL only takes 43389/2211412*100=2% of the runtime of PL/SQL with pragma udf.

Whenever you use PL/SQL in SQL statements, the Oracle engine needs to switch from doing SQL to doing PL/SQL, and switch back after it is done. Generally, this is called a “context switch”. This is an example of that:

-- A function that uses PL/SQL 
create or replace function add_one( value number ) return number is
        l_value number(10):= value;
begin
        return l_value+1;
end;
/
-- A SQL statement that uses the PL/SQL function
select sum(add_one(id)) from t2;

Of course the functionality of the function is superfluous, it can easily be done in ‘pure’ SQL with ‘select sum(id+1) from t2’. But that is not the point.
Also, I added a sum() function, for the sake of preventing output to screen per row.

The first thing to check, if there is a difference in performance between executing with sum(id+1) and sum(add_one(id)). If there isn’t we can stop here🙂

TS@frits > set timing on
TS@frits > select sum(id+1) from t2;

 SUM(ID+1)
----------
5.0000E+11

Elapsed: 00:00:00.19
TS@frits > select sum(add_one(id)) from t2;

SUM(ADD_ONE(ID))
----------------
      5.0000E+11

Elapsed: 00:00:01.13

This statement executes a full table scan, I’ve closely guarded the IO times were alike.
But it looks there is a whopping difference between including PL/SQL and not: 113/19*100=595%, or differently worded: almost six times slower.

How does this work? Let’s have a look using stapflames. The idea behind flame graphs in general is to understand in which (c code) functions (user mode and kernel mode) the time is spend. Because of the full backtrace (all the called functions on top of each other), it gives an insight on how a program is working.

The first thing we need to establish, is how PL/SQL looks like from the perspective of C-functions. For that reason, I created a bogus PL/SQL program to profile:

declare
t number:=0;
begin
while t < 1000000 loop
t:=t+1;
end loop;
end;
/

Yes, that is right, the only thing this anonymous PL/SQL block does, is declare a number to a variable named ‘t’, and then loop adding one to the variable until the variable reaches the number 1000000. Again, what this program does is not interesting nor functional, the only thing it needs to do is run, so when we profile the program we are sure it is doing PL/SQL.

I ran this anonymous PL/SQL block using my stapflame utility to generate a flamegraph, and this is how that looks like:
I have taken the flamegraph of all time considered on cpu by the Oracle database.

First of all, one important property of flamegraphs is shown: the sequence is random. If you look at the kpoal8 function, you see there are two different paths taken from this function: opiexe (oracle program interface execute) and opiosq0 (oracle program interface prepare to parse). Of course the PL/SQL block is first parsed and then executed, so the order is different than shown.

What is also very visible, is that almost all time executing, is done using a function ‘pfrrun’, which seems to be the main function executing PL/SQL. On top of that we can see some functions which roughly resemble the functionality used in the PL/SQL block: pfrinstr_ADDN (addition, t:=t+1) and pfrinstr_RELBRNCH (the loop). This also gives a fair indication PL/SQL is interpreted rather than compiled. Anyway, what is important is that from looking at the little test above, we can distinguish running PL/SQL from SQL by the pfrrun function.

Now let’s look at at a flamegraph of running a PL/SQL function in the SQL statement:
The flamegraph shows all the time considered running on CPU for executing the statement ‘select sum(add_one(id)) from t2’. There is a lot to see!

When we look on top of the function kpoal8, we see the function opifch2. This means the vast majority of the time is spend in the fetch phase of the SQL statement. On top of the opifch2 function we see two functions which start with qer. ‘qer’ probably means Query Execute Rowsource. ‘qertbFetch’ is the fetch procedure for table scans. Then we see the kdsttgr function (kernel data scan table get row), and then the ultra fast table scan function (kdstf; kernel data scan table full) followed by a lot of zero’s and/or one’s and ending with ‘km’ or ‘kmP’. There are a lot of kdstf functions in the Oracle executable, I assume the zero’s and one’s after ‘kdstf’ are a bitmap of potentially needed functions during the full scan, so the full table scan function can omit a lot of choices on runtime, leading to better CPU efficiency. See an article by Tanel on what this means.

In the full table scan function, there are two main functions which consume time, kdst_fetch and qesaFastAggNonDistSS. ‘kdst_fetch’ and deeper functions are functions related to doing IO for reading the data from the data file. ‘qesaFastAggNonDistSS’ is the main function for actually processing the data. The function qesaFastAggNonDistSS has two main functions consuming its time, evaopn2 and a function visible as ‘sl..’, which is actually a function called slnxsum, in other words, the sum() function. The function evaopn2 is a function to evaluate operands. This evaluation is the path towards executing the PL/SQL function.

Actually, when carefully looking at the evaopn2 function, we see the slnxsum function, and ‘evapls’, which is the function to switch to PL/SQL. The two main functions in ‘evapls’ are kgmexec and opiomc. Again here the order is switched; what happens here is first a cursor must be mapped for executing PL/SQL (opiomc function), after which it can be executed (kgmexec function).

In order to understand what the time taken by “switching” to PL/SQL is, we can take the total time the query engine is processing everything PL/SQL related, which is the total time taken by the ‘evapls’ function, and measure the time actually running PL/SQL, which is the time taken by the ‘pfrrun’ function. This can be accomplished by simple systemtap script:
(please mind you need a recent systemtap version, preferably gotten from https://sourceware.org/git/systemtap.git, and kernel version 3.10 to be able to use return probes)

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

This is how it looks like on my machine:

# stap -x 29680 plsql.stap
Begin.
^C
evapls time:      3203637
pfrrun time:       951830

So, the overhead or context switching time, which must be Oracle server code executing between the the evapls function, where it determines it needs to execute PL/SQL and the pfrrun function, which is the PL/SQL interpreter, is on my machine:
(1-951830/3203637)*100=70%
Wow!

One way of reducing this problem, is using subquery factoring, alias the ‘with clause’. To use the function that way, this is how the SQL should be written:

with
function add_one( value number ) return number is
	l_value number(10):= value;
begin
	return l_value+1;
end;
select sum(add_one(id)) from t2;
/

Let’s have a look at the flamegraph of this construction:
It becomes apparent that with subquery factoring, there are way lesser functions between the evapls and pfrrun functions.
Normal PLSQL: kgmexec, kkxmpexe, kkxdexe, peidxexe, peidxr_run, plsql_run
Subquery factoring: kkxmss_speedy_stub, peidxrex
Also mind there is no codepath for mapping a cursor.

Let’s have a look at the timing:

# stap -x 29680 plsql.stap
Begin.
^C
evapls time:      2192685
pfrrun time:       880230

The time spend in PL/SQL, by looking at total time spend in the evapls function reduced by 32% ((1-2192685/3203637)*100).
However, if you calculate the overhead, it is still pretty significant: (1-880230/2192685)*100=60%

Conclusion
The most simple conclusion I can make is: do not use PL/SQL if you can solve it in SQL, like in the example above. This does not mean you should never use PL/SQL, contrary: in a lot of cases processing should be done where the data is, and sometimes you need a procedural language for that.

I made a lot of assumptions in this little investigation. The function naming (the translation from the Oracle C function name to what functionality it is supposed to deliver) are speculations.

The context switch between SQL mode and PL/SQL mode looks like it is technically setting up the execution environment for PL/SQL. Indeed this takes time, and the true PL/SQL execution time when repeatedly executing PL/SQL is very low in my case. Please mind actual times will differ on different systems. However, the main conclusion is that using PL/SQL in SQL execution probably is not the most performant thing to do, including using subquery factoring.

There’s been a lot of work in the area of profiling. One of the things I have recently fallen in love with is Brendan Gregg’s flamegraphs. I work mainly on Linux, which means I use perf for generating stack traces. Luca Canali put a lot of effort in generating extended stack profiling methods, including kernel (only) stack traces and CPU state, reading the wait interface via direct SGA reading and kernel stack traces and getting userspace stack traces using libunwind and ptrace plus kernel stack and CPU state. I was inspired by the last method, but wanted more information, like process CPU state including runqueue time.

I started playing around with systemtap, and was able to read a process’ CPU state including run queue time. This involves using kernel tapset scheduler, which unfortunately needs the kernel debug info packages (kernel-euk-debuginfo and kernel-uek-debuginfo-common, available via oss.oracle.com). It is not hard to include wait interface information, this is work Luca and I collaborated on in the past. I created a systemtap script called cpu_and_wait_profile.stap, which shows the oracle database state transition between on cpu and in a wait, as well as kernel CPU state information. This is how that should be executed and what it outputs:

# stap -x 6641 cpu_and_wait_profile.stap
Begin.
w     - 388 (no begin)
c     1    	tot:         334	on:         333	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     2 384	tot:           5	on:           5	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
c     3    	tot:         644	on:         644	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     4 212	tot:          58	on:          41	off:          17	q:           5	ti:           9	tu:           0	#slices:    1
c     5    	tot:         371	on:         371	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     6 212	tot:         146	on:          58	off:          88	q:          14	ti:          69	tu:           0	#slices:    1
c     7    	tot:        1787	on:        1745	off:          42	q:          37	ti:           0	tu:           0	#slices:    2
w     8 212	tot:         265	on:          30	off:         234	q:          12	ti:         218	tu:           0	#slices:    1

The first column indicates if the process is inside an Oracle wait event(w), or is considered running on cpu (c) by the database.
The second column is a serial number. The third column is the wait event number if the process is considered inside a wait, or empty if on CPU.
The column ‘tot’ is the total time (in microseconds) spent on cpu or inside a wait event.
The column ‘on’ is the time spent truly running on CPU. Obviously, ‘off’ is all the time not spent running on the CPU.
The columns ‘q’, ‘ti’ and ‘tu’ are off CPU states. ‘q’ is time spend in the CPU runqueue. This is not a kernel state, a process gets the state ‘TASK_RUNNING’ to indicate it is willing to run, after which it’s the task of the scheduler to manage willing to run processes and get them onto a CPU. ‘ti’ is a kernel state, which means ‘TASK_INTERRUPTABLE’. This is a state after which the process is taken off the CPU, because it is waiting for something to complete. ‘Something’ means a disk IO, a timer to expire, etc. ‘tu’ means ‘TASK_UNINTERRUPTIBLE’, which is used if a process should only continue when a specific condition is met, and reacting to signals would be problematic.
The last column ‘#slices’ is the number of times the process has gotten on cpu.
If you look at the example output above, you see that the process started running, and remained running until sequence number 4 (sequence number is the second column). Sequence number 4 is an Oracle wait event, number 212 (direct path read). The earlier wait event number 384 was passed without actually waiting; total time is 5us, on cpu was 5us too (!). The total time spent in the wait event in sequence #4 is 58us, of which 41us was spent on cpu, and 17us off cpu. The off cpu time is composited of 5us run queue time (q) and 9us ‘TASK_INTERRUPTIBLE’ time, of which the total is 14us, which leaves 3us off cpu/unaccounted for. This is time taken by the state transitions and context switches. The actual sequence of events of the CPU state is: TASK_RUNNING (on cpu), then TASK_INTERRUPTIBLE is entered, which is actually waiting for IOs in this case (wait event ‘direct path read’, remember?). The ‘TASK_INTERRUPTIBLE’ state means the process is stopped from processing by the kernel (taken off cpu), which is logical, because it means the process is deliberately waiting for something before it can continue. Once the condition is met (IO(s) ready in this case), the process can continue. To continue, the process state is set to ‘TASK_RUNNING’, and put on a runqueue. This means there is no explicit process state ‘in run queue’. This state (state set to ‘TASK_RUNNING’ but not running on CPU yet) is shown with ‘q’. Once the process has enough priorities, the scheduler switches the process running on the CPU again.

Okay, so at this point we have a (systemtap) script that can very precisely count the time spend of a process. Wouldn’t it be great if we can see a flame graph per sequence number? I spent a great deal of time trying to figure out a way to combine the two. Until I learned about the ‘-T’ switch of perf record:

    -T, --timestamp       Sample timestamps

Great!! The way this works, is that perf includes ‘timestamps’ during recording (perf record), which are printed when the perf recording is externalised with the ‘perf script’ command:

oracle_92213_fv 92213 34075.900988: cycles:
        ffffffff810483da native_write_msr_safe ([kernel.kallsyms])
        ffffffff8102bf91 intel_pmu_enable_all ([kernel.kallsyms])
        ffffffff810263cc x86_pmu_enable ([kernel.kallsyms])
        ffffffff811221db perf_pmu_enable ([kernel.kallsyms])
        ffffffff81124d09 perf_event_context_sched_in ([kernel.kallsyms])
        ffffffff811252c5 __perf_event_task_sched_in ([kernel.kallsyms])
        ffffffff810962ce finish_task_switch ([kernel.kallsyms])
        ffffffff8159f81d __schedule ([kernel.kallsyms])
        ffffffff8159fec9 schedule ([kernel.kallsyms])
        ffffffff8119e56b pipe_wait ([kernel.kallsyms])
        ffffffff8119f030 pipe_read ([kernel.kallsyms])
        ffffffff81195c37 do_aio_read ([kernel.kallsyms])
        ...etc...

‘34075.900988’ is the timestamp. However, what is this number?? I searched for quite some time, and there is no clear description to be found. It clearly is not epoch time.

Some experimentation learned that -apparently- the number is seconds since startup with microsecond granularity. Further experimentation using systemtap learned that exactly the same number can be fetched with the systemtap local_clock_us() function. This makes it possible to link perf stacktraces with systemtap output!! I created a script (run_stapflame.sh) that runs perf record -g and systemtap at the same time, then combines the information from both tools (meaning the systemtap runtime data is pushed into the stack trace information), after which flame graphs are created.

When a process is not running, there will be no perf stack traces, because there is no process for perf to take the stack trace from. So only when running on CPU (TASK_RUNNING state excluding run queue time), there should be perf data. Also, the systemtap times are accurately measured, but the stack traces of perf are sampled. This means it is missing data (by definition: sampling means you are going to lookup something at a certain interval, which means you are not looking between the intervals). What I have done, is extrapolate the perf samples found for an Oracle CPU or wait interval relative to the time in the measured interval. This means that if the time in the interval is 100us, and I get two collapsed stack traces with 1 and 3 samples, the extrapolated time will be; 1: 1/(1+3)*100us=25us, and 3: 3/(1+3)*100us=75us. This is not scientific, but it is meant to give an idea. Also, in order to show something useful in flame graphs, all the data needs to be based on the same data type, so I need to transform the number of stack traces to time.

I created a github project stapflame for my files.

First, you need to install the debuginfo kernel packages, as has been mentioned in this blogpost.
Then, you need to execute eventsname.sql in order to generate eventsname.sed, which is used to translate wait event numbers to wait event names. Wait event numbers change between Oracle database versions, and could potentially change after PSU apply. eventsname.sed must be in the same directory as the run_stapflame.sh script.
Then, you need to fetch stackcollapse-perf.pl and flamegraph.pl from Brendan Gregg’s github flamegraph repository. These need to be in the same directory as the run_stapflame.sh script too, and have the execute bit set.

Once the requirements are met, you can use the run_stapflame.sh script:

# ./run_stapflame.sh 123

The first argument must the PID of an existing Oracle foreground process.
This will compile and run the systemtap script. Once both systemtap and perf are running, the following message is displayed:

profiling started, press enter to stop

Now execute what you want to be profiled in the Oracle session. Once you are done, return to the session where you ran run_stapflame.sh, and press enter to stop the profiling.
Depending on how much systemtap and perf information this generated, the script will process for some time (it is coded in bash script, it’s not highly optimised, in fact it’s quite inefficient with larger sets(!)).
There will be some messages from perf indicating how many samples it captured:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.213 MB perf.data (~9286 samples) ]

And systemtap complaining about missing symbols for the kernel module it just created (for itself :-)):

No kallsyms or vmlinux with build-id 532f482ae316500802639d35de6c302fdb685afa was found
[stap_b6a486219fd483305d4991b483feb397_16565] with build id 532f482ae316500802639d35de6c302fdb685afa not found, continuing without symbols

The stapflames in the example mentioned here are of a simple ‘select count(*) from table’ in oracle, resulting in a direct path read.

This is the resulting flame graph of my original idea. In short, this does not look very useful in this case, and might never be useful as soon as you got more than approximately 20-100 cpu-wait combinations.

However, other flame graphs are more useful; look at this flame graph, it’s about process CPU state (stack traces grouped by process CPU state, which means that Oracle events (ON CPU and the wait events) can be found for every CPU state). It’s nice to see that in the kernel state TASK_RUNNING (which is on CPU, RUNQUEUE is specified independently) is 46.86% of the total time. Of this time, almost all time (40.35% of the total time) is in the Oracle state ‘ON CPU’. After TASK_RUNNING, TASK_INTERRUPTIBLE time is 49.56% of the total time. Almost all time in TASK_INTERRUPTIBLE has gone to the wait event direct path read. There is not a great deal of time spend in the run queue (1.74% of the total time). If you look up through the stacktraces in TASK_RUNNING and ON CPU, you see that of the 40.35% of running on CPU, 15.33% is spend on executing the io_submit function. This means 15.33/40.35*100=37.99% of the time on CPU is spend on submitting IO requests. Also in TASK_RUNNING and ON CPU, 7.57% is spend in the function sxorchk, which is block checksumming (db_block_checksum; set to ‘typical’ by default).

Another interesting view is this flamegraph; this one is only the wait event ‘direct path read’. As you can see, most of the wait event is not spent running on CPU. That is quite logical for a wait event🙂. 96.03% of the time spent in the wait event ‘direct path read’ is in the ‘TASK_INTERRUPTIBLE’ state. 2.68% of the time in the wait event ‘direct path read’ is spend in TASK_RUNNING on CPU state. Despite being inside wait event time, it’s very logical to have some time spend on running on CPU, because you need to run on the CPU to reap IO requests.

I love to hear comments, additions, corrections or any other feedback!

Credits and other references:
# Brendan Gregg and his work on FlameGraphs.
# Luca Canali and his work on stack profiling, which inspired me to create this tool, and proofreading.
# Tanel Poder and his work on Oracle internals in general.
# Klaas-Jan Jongsma for proofreading.

Footnotes:
– The systemtap script contains the full path to the Oracle database executable in the userspace function probes. This obviously must be changed to reflect the path to the Oracle executable of the process the systemtap script is profiling. Alternatively, it can just list the executable name (“oracle”), but then it the executable must be in the $PATH.

This post is about manually calling and freeing a shared latch. Credits should go to Andrey Nikolaev, who has this covered in his presentation which was presented at UKOUG Tech 15. I am very sorry to see I did miss it.

Essentially, if you follow my Oracle 12 and shared latches part 2 blogpost, which is about shared latches, I showed how to get a shared latch:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x94af8768 1 0 2303 16
Function returned 1

Which works okay, but leaves a bit of a mess when freed:

SQL> oradebug call kslfre 0x93FC02B8
ORA-03113: end-of-file on communication channel
ORA-24323: value not allowed

(please read part 2 to get the full story on how to set it and how to use it for investigations)

Andrey Nikolaev shows there is another function, kslgetsl_w(latch address, wait, why, where, mode), which can be used to get a shared latch (alike the function ksl_get_shared_latch) but without the oradebug session blowing up in your face:

SQL> oradebug call kslgetsl_w 0x93faec78 1 0 2329 16
Function returned 1

And after testing the latch:

SQL> oradebug call kslfre 0x93faec78
Function returned 1

However, when querying v$latchholder from the same session as the oradebug call, an ORA-600 is thrown:

SQL> select * from v$latchholder;
select * from v$latchholder
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [504], [0x095306210], [2], [1], [session idle bit], [1], [0x093FAEC78], [cache buffers chains], [], [], [], []

After which the latch is freed.

The error message strongly hints at the latches gotten in an incompatible order. The session idle bit seems to be a latch gotten to indicate the session state switching from active to inactive (according to an article here), and indeed holding a cache buffer chains latch, a session should not switch state at all. However, a second sysdba session can be used to query v$latchholder.

Some very limited investigations show the function is kslgetsl_w is not used by the database itself, which rather uses ksl_get_shared_latch. Again, the investigation was very limited, there might be situations this function is used.

Conclusion
The reason for this blogpost is to point out an alternative for using the ksl_get_shared_latch function manually for doing investigation. This post also shows that you should not change or play with internals on production systems, because it might lead to all kinds of additional behaviour or plain errors, for which the ORA-600 is a good example.

Follow

Get every new post delivered to your Inbox.

Join 3,042 other followers

%d bloggers like this: