The Oracle database, in-memory parallel execution and NUMA

In a previous article called ‘memory allocation on startup’ I touched on the subject of NUMA: Non-Uniform Memory Access. This article is about how to configure NUMA for the Oracle database, how to look into NUMA usage, and a real-life case of NUMA optimisation using in-memory parallel execution.

At this point in time (the start of the summer of 2016) we see that the CPU clock speed race has stagnated and settled somewhere below 4 gigahertz, while the number of cores and the size of memory keep growing. The commonly used server type in the market I am in is a two-socket server. It is not unthinkable that servers with more than two sockets will increase in popularity in the near future, since they provide more (parallel) computing capacity and a larger maximum amount of memory.

On the other hand, there are of course already servers with more than two sockets, some of which have been multi-socket for quite a while: Oracle/Sun M series, IBM P series, and a large diversity of Intel-based 2+ socket servers, like the Oracle/Sun “-4” and “-8” series. The investigations in this blogpost are done on an SGI UV 300RL, which is a system with (up to) 32 sockets, 576 cores and 1152 CPU threads.

There already have been a lot of blogposts about Linux and NUMA:
Kevin Closson
Martin Bach: linux-large-pages-and-non-uniform-memory-distribution, _enable_numa_support and numactl
Yves Colin
Bertrand Drouvot
I am sure there are more; these are the ones I have read in the past.

This article will take you through the configuration of NUMA for the Oracle database and how to measure specifics from the Linux side. It will touch some of the things mentioned in the other blogposts, in order to provide a consistent story.
The database version is:
The operating system is Oracle Linux 7.1
Kernel: 3.8.13-98.4.1.el7uek.x86_64

1. NUMA or not.
In order to sensibly use NUMA optimisation with the Oracle database, you first need hardware that is actually NUMA capable. This means multiple sockets in use that have local memory (please mind this includes common two-socket server hardware). The next requirement is NUMA support enabled in the Linux kernel. With the current version of the oracle-preinstall rpm, NUMA is disabled on the kernel load line in grub (numa=off). When NUMA is disabled, numactl will show a single NUMA node (=socket).

First make sure you are on NUMA-enabled hardware by issuing ‘numactl --hardware’ and verifying that multiple nodes are listed.
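If you want to check this from a script rather than by eye, the node count can be parsed out of the numactl output. A minimal sketch (the sample output and the parsing function are my own illustration, not part of any Oracle or Linux tooling):

```python
import re

def numa_node_count(numactl_output: str) -> int:
    """Parse the 'available: N nodes' line of `numactl --hardware` output."""
    match = re.search(r"available:\s*(\d+)\s+nodes", numactl_output)
    return int(match.group(1)) if match else 0

# Illustrative output of a two-socket, NUMA-enabled server:
sample = """available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 49141 MB
node 1 cpus: 4 5 6 7
node 1 size: 49151 MB"""

print(numa_node_count(sample))  # 2; anything above 1 means multiple NUMA nodes
```

With numa=off on the kernel load line, the same parse would return 1, because the kernel presents all memory as a single node.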

2. Default Oracle.
When you don’t modify the (undocumented) NUMA parameters, Oracle will NOT run in NUMA-aware mode. In other words, you explicitly need to enable NUMA in the database. This is also visible if you look in numa_maps, which shows process memory and NUMA information:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3339 mapmax=904 N0=112 N1=98 N2=98 N3=98 N4=98 N5=98 N6=98 N7=97 N8=96 N9=96 N10=96 N11=101 N12=108 N13=106 N14=105 N15=105 N16=107 N17=107 N18=106 N19=109 N20=106 N21=106 N22=107 N23=107 N24=106 N25=107 N26=108 N27=108 N28=109 N29=112 N30=111 N31=113
9c460000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:1 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N0=1

A few things to note: I took the pmon process, and my instance name is slob (pmon_slob). The SGA is allocated exclusively from huge pages, so I grepped for ‘huge’. numa_maps only shows the memory the process has paged in; my SGA is way bigger. The important thing to see here is that the main SGA memory allocation at 80000000 reads ‘interleave:0-31’, which means the memory is spread over all the NUMA nodes. This can also be seen on the same line, because the memory allocation per node is shown as Nnn=nr.pages.
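The Nnn=nr.pages fields on such a line are easy to tally programmatically. As an illustration (the helper is my own, not an Oracle or Linux utility), using a truncated version of the 80000000 line above:

```python
import re

def pages_per_node(numa_maps_line: str) -> dict:
    """Extract the per-NUMA-node page counts (Nnn=pages) from one numa_maps line."""
    return {int(node): int(pages)
            for node, pages in re.findall(r"\bN(\d+)=(\d+)", numa_maps_line)}

# Truncated sample of the 80000000 line shown above:
line = ("80000000 interleave:0-31 file=/SYSV00000000 huge dirty=3339 mapmax=904 "
        "N0=112 N1=98 N2=98")

counts = pages_per_node(line)
print(counts)                # {0: 112, 1: 98, 2: 98}
print(sum(counts.values()))  # 308 pages touched across these three nodes
```

Summing the page counts of the full line and comparing them per node is a quick way to see whether an allocation is evenly interleaved or skewed towards one node.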

3. Set parameters
In Oracle you need to set two parameters to fully enable NUMA support in the database:

alter system set "_enable_NUMA_support"=true scope=spfile;
alter system set "_px_numa_support_enabled"=true scope=spfile;

These are undocumented parameters. However, these parameters are explicitly mentioned in MOS article ‘What Is Correct Setting for _px_numa_support_enabled for 12c Databases That Wish to Use NUMA?’ (Doc ID 1956463.1).

Once you have set these parameters, Oracle will start up in NUMA-aware mode. One of the things that is different in NUMA-aware mode is how the buffer cache is created. This can be seen by looking at the numa_maps file of pmon again:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3587 mapmax=904 N0=112 N1=112 N2=112 N3=112 N4=112 N5=112 N6=112 N7=112 N8=112 N9=112 N10=112 N11=112 N12=113 N13=112 N14=113 N15=112 N16=112 N17=112 N18=113 N19=112 N20=112 N21=112 N22=112 N23=112 N24=112 N25=112 N26=112 N27=112 N28=112 N29=112 N30=112 N31=112
8860000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=1536 N25=1536
d260000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=512 N26=512
11c40000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=512 N27=512
16620000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=512 N28=512
1b000000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=512 N29=512
1f9e0000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=256 N30=256
243c0000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=512 N31=512
28da0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=256 N0=256
2d780000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=512 N1=512
32160000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=256 N2=256
36b40000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=512 N3=512
3b520000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=256 N4=256
3ff00000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=512 N5=512
448e0000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=256 N6=256
492c0000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=512 N7=512
4dca0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=256 N8=256
52680000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=512 N9=512
57060000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=256 N10=256
5ba40000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=512 N11=512
60420000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=256 N12=256
64e00000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=512 N13=512
697e0000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=256 N14=256
6e1c0000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=512 N15=512
72ba0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=256 N16=256
77580000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=512 N17=512
7bf60000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=256 N18=256
80940000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=512 N19=512
85320000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=256 N20=256
89d00000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=512 N21=512
8e6e0000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=256 N22=256
930c0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=512 N23=512
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=512 N24=512
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:2 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N5=1

If we look at the output, a few segments are still allocated interleaved (these are the fixed SGA and the shared pool), and we see allocations for every NUMA node. These per NUMA node allocations are the buffer cache.

Okay. Now the database is set up for NUMA, and we checked what this looks like at the operating system level using Linux tools. However, we need to realise that NUMA support is no magic that shifts your database into the next gear. Most things you do are not NUMA optimised by the database, simply because it wouldn’t make sense.

For example, when I perform a buffered full table scan without parallel query, the blocks are spread out over the allocated segments. Why? Because my session is running on a random CPU thread and could be scheduled on another CPU thread the next time it runs. What I am saying here is: only specific things are NUMA optimised.

A next logical question would be: but then what is actually NUMA optimised? One obvious thing that is NUMA optimised is the database writer (this is mentioned in the alert.log at startup). Another NUMA optimised thing is parallel query. If you look at the parallel query worker processes, the status file gives a hint:

# grep allowed_list /proc/$(pgrep p000_slob)/status
Cpus_allowed_list:	450-467,1026-1043
Mems_allowed_list:	0-31

This shows that the first parallel query worker process is limited in which CPU threads it is allowed to use. All parallel query worker processes are limited to ranges of CPU threads. Please mind they are NOT limited in what memory they can use, so they can use both local and remote memory. It may look weird that such a process can still access remote memory, seemingly ignoring NUMA-optimised (=local) memory access; this is addressed below.

First the CPU thread ranges. If you look up what the range means, you see this:

# lscpu | grep 450
NUMA node25 CPU(s):    450-467,1026-1043

The CPU thread range is exactly a NUMA node. So whatever happens, parallel query worker process p000 can only run on the CPU threads of NUMA node 25!
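The cpulist notation used in Cpus_allowed_list (and in the lscpu output) can be expanded to individual CPU thread numbers if you want to process it further. A small sketch of such a parser (my own, assuming the standard kernel cpulist format of comma-separated values and ranges):

```python
def expand_cpulist(cpulist: str) -> list:
    """Expand a kernel cpulist like '450-467,1026-1043' into CPU thread numbers."""
    cpus = []
    for part in cpulist.split(","):
        if "-" in part:
            low, high = map(int, part.split("-"))
            cpus.extend(range(low, high + 1))
        else:
            cpus.append(int(part))
    return cpus

threads = expand_cpulist("450-467,1026-1043")
print(len(threads))  # 36: the 18 cores (two threads each) of NUMA node 25
```

The two ranges are the first and second hardware thread of the same 18 cores, which is why the list splits in two on a hyper-threaded system.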

Then the absence of memory limits: this means the parallel query process can still access anything it needs, so anything that used to work with parallel query still runs as it did before or without the NUMA optimisations. In specific cases, parallel query can take advantage of the NUMA setting.

One such case is in-memory parallel execution. With in-memory parallel execution, the parallel query slaves read blocks from the buffer cache instead of performing a direct path read. In-memory parallel execution is enabled when the parameter parallel_degree_policy is set to auto; the database then heuristically determines whether it is feasible.

I set up a table of approximately 700GB and performed an in-memory parallel query scan on a freshly started instance, so that the parallel query worker processes had touched as little memory as possible. At the end of the scan, this is how the numa_maps of parallel query worker process p000 looks:

# grep huge /proc/$(pgrep p000_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=906 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=4608 N0=144 N1=144 N2=144 N3=144 N4=144 N5=144 N6=144 N7=144 N8=144 N9=144 N10=144 N11=144 N12=144 N13=144 N14=144 N15=144 N16=144 N17=144 N18=144 N19=144 N20=144 N21=144 N22=144 N23=144 N24=144 N25=144 N26=144 N27=144 N28=144 N29=144 N30=144 N31=144
8880000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=50176 N25=50176
d2a0000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N26=6145
11c80000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=4608 N27=4608
16660000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=70 N28=5121
1b040000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=14336 N29=14336
1fa20000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=72 N30=5121
24400000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=4608 N31=4608
28de0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N0=6145
2d7c0000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=4099 N1=4099
321a0000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=2528 mapmax=73 N2=2528
36b80000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=2222 N3=2222
3b560000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=2504 mapmax=69 N4=2504
3ff40000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=2249 N5=2249
44920000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=4754 mapmax=69 N6=4754
49300000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=2296 N7=2296
4dce0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=2463 mapmax=70 N8=2463
526c0000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=2246 N9=2246
570a0000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=2440 mapmax=72 N10=2440
5ba80000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=2233 N11=2233
60460000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=2469 mapmax=69 N12=2469
64e40000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=2241 N13=2241
69820000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=2434 mapmax=69 N14=2434
6e200000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=2342 N15=2342
72be0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=2461 mapmax=70 N16=2461
775c0000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=2329 N17=2329
7bfa0000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=2499 mapmax=71 N18=2499
80980000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=2334 N19=2334
85360000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=2466 mapmax=73 N20=2466
89d40000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=2328 N21=2328
8e720000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=2180 mapmax=69 N22=2180
930e0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=2208 mapmax=69 N23=2208
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=2167 mapmax=68 N24=2167
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=147 N26=2
9c480000000 prefer:0 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=906 N2=1

If you look at memory address 8880000000, you see parallel query worker process p000 touched 50176 pages, while on the other (buffer cache/’prefer’) segments it touched far fewer pages per node. So it clearly used its local memory more than the remote memory. This is proof that in-memory parallel query is NUMA optimised.
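To put a number on ‘clearly more’, you can compare the local page count with the remote ones. A rough sketch using a subset of the per-node dirty page counts from the output above (node 25 is p000’s local node; the subset and the averaging are my own illustration):

```python
# Subset of p000's buffer cache ('prefer') dirty page counts from numa_maps above.
dirty_pages = {25: 50176, 26: 6145, 27: 4608, 28: 5121, 29: 14336, 30: 5121, 0: 6145}

local_node = 25
local = dirty_pages[local_node]
remote = [pages for node, pages in dirty_pages.items() if node != local_node]
avg_remote = sum(remote) / len(remote)

print(f"local node pages: {local}")                    # 50176
print(f"average remote node pages: {avg_remote:.0f}")
print(f"local/remote ratio: {local / avg_remote:.1f}x")
```

Even on this subset the local node was touched roughly seven times as often as an average remote node, which is exactly what you would hope for from a NUMA-aware scan.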

In fact, when I time the query using in-memory parallel query with NUMA enabled, it consistently takes a little less than 8 seconds once the data is loaded in the buffer cache. When I disable the NUMA optimisations (_enable_NUMA_support=false) and execute the same query, also using in-memory parallel query and only scanning the buffer cache, the time goes up to a little less than 24 seconds.

  1. Enjoyed the blog. Could you set up a larger table, since this is a 32 socket server with presumably a large amount of main memory, to see how NUMA behaves when the likelihood of remote memory access spanning sockets is greater?

  2. Hi! Thank you for reading!

    I am not sure if I understand your question.

    In the example above I took measurements and figures from parallel query worker p000 as an example. The actual number of worker processes I used was 64 (/*+ parallel(64) */). The figures above show worker process p000 mostly uses local memory. The way parallel query works in this case is that each worker process gets a range of the segment assigned to scan. Each worker process scans its assigned range and puts it in NUMA-local memory if it can. I used the figures of worker process p000 to illustrate how all the other worker processes work and use memory.

    This means that once the table was read in memory (the oracle buffer cache) by the worker processes, it is actually spread out over all the NUMA memory segments. Once the table blocks are in memory, subsequent usage of parallel query will read the data from local memory.

    The ability to use memory local to the parallel query worker processes, instead of memory randomly distributed over all the NUMA nodes (interleaved), means a performance difference (in my specific case, but it illustrates the point) of 8 seconds (NUMA optimised) versus 24 seconds (random memory distribution, no Oracle NUMA optimisation).

    ps. This is Oracle in-memory parallel execution specific, normal parallel query execution will not bring the blocks to the cache.

  3. “What I am saying here is: only specific things are NUMA optimised. […] One obvious thing that is NUMA optimised is the database writer (this is mentioned in the alert.log at startup).”

    The thing about making the database writer NUMA optimised is that you also need disk I/O connectivity from each and every NUMA node to your storage devices, so that each database writer can issue “node local” I/O operations. If each disk array is only connected to one and only one node each, then all of your disk I/O requests have to go over the “memory interconnect” between the nodes to the node the storage is attached to and you don’t gain anything in performance. You end up putting high volumes of relatively large disk block traffic over the inter-node memory interconnect. That is why per-node storage connectivity and disk multi-pathing in the operating system is also important to NUMA scalability for databases.

    You don’t mention it explicitly, but it would only make sense if your SGI system also has per-node disk I/O connections for such scalability benefits. Otherwise all disk I/O traffic would be routed over the memory interconnect as I described.

    • Thanks for that comment John. Yes, you are correct; if you want to optimise disk IO, you should be aware of how the physical connections to storage are made. An HBA uses a physical connection, which means it is hooked up locally to a socket or sockets. The most optimal way to hook up storage is for every NUMA node to have a device local to the node.

  4. shengyu shen said:

    Great blog about oracle on NUMA.
    I am wondering: can we partition a HUGE table into multiple partitions on multiple nodes, such that these nodes can work in parallel on these partitions to generate the final result for every possible type of query?

  5. Vinicius Heringer said:

    What should be the behavior of oracle database if I set the parameters like this:
    _enable_NUMA_optimization FALSE
    _enable_NUMA_support FALSE
    _px_numa_stealing_enabled TRUE
    _px_numa_support_enabled TRUE

    For an Oracle RAC 2-nodes SUSE Enterprise Linux 11
    Both operating system and hardware are NUMA capable (HP Proliant, 2 sockets ).

    dmesg | grep -i numa
    [ 0.000000] NUMA: Allocated memnodemap from c000 – cc80
    [ 0.000000] NUMA: Using 26 for the hash shift.

    numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
    node 0 size: 49141 MB
    node 0 free: 10073 MB
    node 1 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
    node 1 size: 49151 MB
    node 1 free: 9249 MB
    node distances:
    node 0 1
    0: 10 20
    1: 20 10

    Is it still using NUMA for parallel query or is NUMA off for the oracle database?
