Linux memory usage
One of the most important configuration decisions when running an Oracle database is making appropriate use of memory. Sizing the memory regions too small leads to increased IO; sizing them too big leads to inefficient use of memory and to increased memory latency, most notably because of swapping.
On Linux, a fair amount of memory information is available. However, it is not obvious how to use that information, which frequently leads to inefficient use of memory, especially in today’s world of consolidation.
The information about Linux server memory usage is available in /proc/meminfo, and looks like this:
$ cat /proc/meminfo
MemTotal: 3781616 kB
MemFree: 441436 kB
MemAvailable: 1056584 kB
Buffers: 948 kB
Cached: 625888 kB
SwapCached: 0 kB
Active: 500096 kB
Inactive: 447384 kB
Active(anon): 320860 kB
Inactive(anon): 8964 kB
Active(file): 179236 kB
Inactive(file): 438420 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 1048572 kB
SwapFree: 1048572 kB
Dirty: 4 kB
Writeback: 0 kB
AnonPages: 320644 kB
Mapped: 127900 kB
Shmem: 9180 kB
Slab: 45244 kB
SReclaimable: 26616 kB
SUnreclaim: 18628 kB
KernelStack: 3312 kB
PageTables: 6720 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 1786356 kB
Committed_AS: 767908 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 13448 kB
VmallocChunk: 34359721984 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
CmaTotal: 16384 kB
CmaFree: 4 kB
HugePages_Total: 1126
HugePages_Free: 1126
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 65472 kB
DirectMap2M: 4128768 kB
No matter how experienced you are, it’s not easy to get a good overview just by reading this information. The point is that these figures are not individual memory areas which you can simply add up to arrive at the total memory used by Linux. Not even all figures are in kB (kilobyte): the HugePages values are numbers of pages.
In fact, there is no absolute truth that I can find that gives a definite overview. Here is a description of what I think are the relevant memory statistics in /proc/meminfo:
MemFree: memory not being used, which should be low after a certain amount of time. Linux strives to use as much memory as possible for something useful, most notably as cache. If this number remains high, memory is not being used effectively.
KernelStack: memory being used for the kernel stacks of all tasks.
Slab: memory being used by the kernel for caching data structures.
SwapCached: memory being used as a cache for memory pages being swapped in and out.
Buffers: memory being used as an IO buffer for disk blocks, not page caching, and should be relatively low.
PageTables: memory used for virtual to physical memory address translation.
Shmem: memory allocated as small pages shared memory.
Cached: memory used for caching pages.
Mapped: memory allocated for mapping a file into an address space.
AnonPages: memory allocated for mapping memory that is not backed by a file (“anonymous”).
Hugepagesize: the size for huge pages blocks. Valid choices with current modern intel Xeon CPUs are 2M or 1G. The Oracle database can only use 2M HugePages on linux.
HugePages_Total: total number of pages explicitly allocated as huge pages memory.
HugePages_Rsvd: total number of pages reserved as huge pages memory, but not yet allocated (and thus reported as free).
HugePages_Free: total number of pages available as huge pages memory, includes HugePages_Rsvd.
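Because the HugePages counters are page counts while the other figures are in kB, it can help to convert them before comparing anything. The following is a minimal sketch of that conversion, not code taken from an existing tool; the function name hugepages_kb and its optional file argument are assumptions made for illustration.

```shell
# hugepages_kb [FILE]: print used, reserved and free huge page memory in kB.
# FILE is meminfo-formatted text and defaults to /proc/meminfo.
# Hypothetical helper name, for illustration only.
hugepages_kb() {
  awk '
    /^HugePages_Total:/ { total = $2 }
    /^HugePages_Free:/  { free = $2 }
    /^HugePages_Rsvd:/  { rsvd = $2 }
    /^Hugepagesize:/    { size_kb = $2 }
    END {
      # HugePages_Free includes the reserved pages, so subtract them
      # to get the pages that are neither allocated nor reserved.
      printf "used %.0f kB\n", (total - free) * size_kb
      printf "reserved %.0f kB\n", rsvd * size_kb
      printf "free %.0f kB\n", (free - rsvd) * size_kb
    }
  ' "${1:-/proc/meminfo}"
}
```

Running hugepages_kb without arguments reads the live /proc/meminfo; passing a saved copy of the file makes it possible to compare systems afterwards.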
Based on information in several blogposts and experimenting with the figures, I came up with this formula to get an overview of used memory. This is not an all-encompassing formula, but in my tests so far it gets me within 5% of what Linux is reporting as MemTotal.
Warning: the text “Another thing is that in most cases when the system has swapped out, ‘Cached’ (minus ‘Shmem’ and ‘Mapped’) gets negative, which I currently can’t explain.” is not true anymore!
By dividing Cached memory into Shmem and Cached+Mapped memory, there is no negative value anymore! I can’t find a way to make a distinction between true ‘Cached’ memory, meaning pages cached without any process attached purely for the sake of reusing them so they do not need to be physically read again, and true Mapped pages, meaning pages that are mapped into a process address space. I know there is a value ‘Mapped’, but I can’t work out reliably how to make the distinction between cache and true mapped. Maybe there even isn’t one.
This is that formula:
MemFree
+ KernelStack
+ Buffers
+ PageTables
+ AnonPages
+ Slab
+ SwapCached
+ Cached - Shmem
+ Shmem
+ HugePages used (HugePages_Total-HugePages_Free)*Hugepagesize
+ HugePages rsvd (HugePages_Rsvd*Hugepagesize)
+ HugePages free (HugePages_Free-HugePages_Rsvd)*Hugepagesize
-------------------------------------------------------------
= Approximate total memory usage
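To make the arithmetic concrete, here is a minimal sketch that applies the formula to meminfo-formatted text and reports what is left unaccounted for against MemTotal. It is not the memstat.sh script; the function name approx_mem_usage, the optional file argument and the output labels are assumptions for illustration.

```shell
# approx_mem_usage [FILE]: apply the formula to meminfo-formatted text.
# FILE defaults to /proc/meminfo. Hypothetical helper, not part of memstat.sh.
approx_mem_usage() {
  awk '
    # Store every "Name: value" pair. $1 keeps its trailing colon, so
    # Shmem: cannot be confused with names that merely start with Shmem.
    { v[$1] = $2 }
    END {
      hp_kb   = v["Hugepagesize:"]
      hp_used = (v["HugePages_Total:"] - v["HugePages_Free:"]) * hp_kb
      hp_rsvd = v["HugePages_Rsvd:"] * hp_kb
      hp_free = (v["HugePages_Free:"] - v["HugePages_Rsvd:"]) * hp_kb
      sum = v["MemFree:"] + v["KernelStack:"] + v["Buffers:"] \
          + v["PageTables:"] + v["AnonPages:"] + v["Slab:"] \
          + v["SwapCached:"] + (v["Cached:"] - v["Shmem:"]) + v["Shmem:"] \
          + hp_used + hp_rsvd + hp_free
      unknown = v["MemTotal:"] - sum
      printf "Accounted %.0f kB\n", sum
      printf "Unknown %.0f kB (%.1f%%)\n", unknown, 100 * unknown / v["MemTotal:"]
    }
  ' "${1:-/proc/meminfo}"
}
```

Applied to the /proc/meminfo listing shown earlier, the terms add up to 3750240 kB, which leaves 31376 kB (0.8%) of the 3781616 kB MemTotal unaccounted for.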
In order to easily use this, I wrote a shell script to apply this formula to your Linux system, available on gitlab: https://gitlab.com/FritsHoogland/memstat.git. You can use the script to get a (quite wide) overview every 3 seconds by running ./memstat.sh, or you can get an overview of the current situation by running ./memstat.sh --oneshot.
This is what a --oneshot run looks like on my test system (which is quite a small VM):
$ ./memstat.sh --oneshot
Free 773932
Shmem 2264
Mapped+Cached 359712
Anon 209364
Pagetables 28544
KernelStack 4256
Buffers 0
Slab 63716
SwpCache 21836
HP Used 2023424
HP Rsvd 75776
HP Free 206848
Unknown 11944 ( 0%)
Total memory 3781616
---------------------------
Total swp 1048572
Used swp 200852 ( 19%)
There is a lot to say about linux memory management. One important thing to realise is that when a system is running low on memory, it will not show as ‘Free’ declining towards zero. Linux will keep a certain amount of memory for direct use, dictated by ‘vm.min_free_kbytes’ as the absolute minimum.
In general, the ‘Cached’ pages (not Shmem pages at first) will be made available under memory pressure, since the Linux page cache only caches for a potential performance benefit; there is no process directly attached to ‘Cached’ pages. Please mind that in my experiments I could find no reliable way to make a distinction between true ‘Mapped’ pages, meaning pages which are in use as memory mapped files, and true ‘Cached’ pages, meaning 4KB disk pages (blocks) which are kept in memory for the sake of reusing them, not directly related to a process.
Once the number of page cache pages gets low and there is still a need for available pages, pages from the other categories start being moved to swap. This excludes huge pages, even if they are not used! The way pages are considered is based on an ageing mechanism. This works quite well for light memory pressure for a short amount of time.
In fact, this works so well that the default eagerness of the kernel to swap (vm.swappiness, 60 by default, I have seen 30 as a default value too, 0=not eager to swap, 100=maximal swap eagerness) seems appropriate on most systems, even ones which need strict performance requirements. In fact, when swappiness is set (too) low, the kernel will try to avoid swapping as long as possible, meaning that once there is no way around it, it probably needs to swap multiple pages, leading to noticeable delays, while paging out single pages more in advance will have a hardly noticeable overhead.
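Both settings mentioned here can be read from /proc/sys/vm. The snippet below is a small sketch for printing them side by side; the function name show_vm_settings and its optional directory argument (which only exists so the function can be pointed at a copy of the files) are assumptions for illustration.

```shell
# show_vm_settings [DIR]: print the two VM settings discussed above.
# DIR defaults to /proc/sys/vm. Hypothetical helper, for illustration only.
show_vm_settings() {
  local dir="${1:-/proc/sys/vm}" name
  for name in min_free_kbytes swappiness; do
    # Each setting is a single-value file named after the sysctl key.
    printf 'vm.%s = %s\n' "$name" "$(cat "$dir/$name")"
  done
}
```

The same values are available through sysctl vm.min_free_kbytes vm.swappiness; reading the files directly just avoids depending on the sysctl binary being in the path.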
However, please mind there is no way around consistent memory pressure! This means that if memory in active use exceeds the physically available memory, physical memory has to be shared at the cost of active memory pages being swapped to disk, for which processes have to wait.
To show the impact of memory pressure, and how hard it is to understand that from looking at the memory pages, let me show you an example. I ran ‘memstat.sh’ in one session, and the command ‘memhog’ (part of the numactl rpm package) in another. My virtual machine has 4G of memory, and has an Oracle database running which has the SGA allocated in huge pages.
First I started memstat.sh, then ran ‘memhog 1g’, which allocates 1 gigabyte of memory and then releases it. This is the memstat output:
$ ./memstat.sh
  Free   Shmem Mapped+Cached    Anon Pagetables KernelStack Buffers    Slab SwpCache HP Used HP Rsvd HP Free Unknown  %
 42764  435128        495656  387872      40160        4608      96   38600       24       0       0 2306048   30660  0
 42616  435128        495656  388076      40256        4640      96   38572       24       0       0 2306048   30504  0
 42988  435128        495656  388088      40256        4640      96   38576       24       0       0 2306048   30116  0
 42428  435128        495700  388108      40264        4640      96   38600       24       0       0 2306048   30580  0
894424  320960         99456   12704      40256        4640       0   35496    42352       0       0 2306048   25280  0
775188  321468        141804   79160      40260        4640       0   35456    70840       0       0 2306048    6752  0
698636  324248        201476   95044      40264        4640       0   35400    64744       0       0 2306048   11116  0
686452  324264        202388  107056      40260        4640       0   35392    66076       0       0 2306048    9040  0
682452  324408        204496  108504      40264        4640       0   35388    65636       0       0 2306048    9780  0
You can see memstat taking some measurements, then memhog is run, which quickly allocates 1G and releases it. In the output shown, this happens between the fourth and fifth measurement. First of all the free memory: once the process has allocated all the memory, it stops running, which means the memory is freed. Any private memory allocation mapped into the (now exited) process address space which is backed by a physical page is returned to the operating system as free, because it has effectively become available. So, counter-intuitive as it might seem, stopping a process that allocated a lot of non-shared (!) memory results in a lot of free memory being available.
As I indicated, ‘Cached’ memory is the first to be released to provide memory pages for direct use. Mapped+Cached contains this together with Mapped memory. The number of pages used by Mapped and Cached is drastically reduced by the memory pressure. ‘Anon’ pages are significantly reduced too, which means they are swapped to the swap device, and ‘Shmem’ is reduced too, which also means swapped to the swap device, but far less than ‘Mapped+Cached’ and ‘Anon’. ‘KernelStack’ and ‘Pagetables’ hardly changed and ‘Slab’ decreased somewhat. ‘SwpCache’ actually grew, which makes sense because that is related to the swapping that took place.
The main thing I wanted to point out is that between the time of no memory pressure (the first four measurements) and the time after memory pressure (the remaining measurements), there is no direct memory statistic showing whether a system is doing okay or has suffered. The only thing that directly indicates memory pressure is active swapping in and out, which can be seen with sar -W (pswpin/s and pswpout/s) or the vmstat si/so columns, which are not shown here.
Even after the memory pressure, when Linux memory management had swapped out a lot of pages to facilitate the 1G allocation (which was freed immediately after being allocated and returned as free memory), the majority of the pages on my system that were swapped out are still swapped out:
$ ./memstat.sh --oneshot
...
Total memory 3781616
--------------------------
Total swp 1048572
Used swp 407228 ( 38%)
This underlines an important Linux memory management principle: only do something if there is an immediate, direct need. My system now has no memory pressure anymore, but still 38% of my swap is allocated. These pages are only paged back in if they are needed. This underlines the fact that swapping cannot and should not be measured by looking at the amount of swap used; a significant amount of swap being used only indicates that memory pressure has occurred in the past. The only way to detect that swapping is taking place is by looking at the actual current number of pages being swapped in and out.
If you see (very) low numbers of pages being swapped out without pages being swapped in at the same time, it is the swappiness setting that causes pages which have not been used for some time to be moved out to the swap device. This is not a problem. If you see pages being swapped in without pages being swapped out at the same time, it means pages that were swapped out, either because of past memory pressure or proactive paging due to swappiness, are being read back in, which is not a problem either. Again, only if pages are actively being swapped both in and out at the same time, or if the rate is very high, is there a memory problem. The swapping actually is helping you not fail because of memory not being available at all.
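If sar or vmstat is not at hand, the same information can be derived from the cumulative pswpin and pswpout counters in /proc/vmstat by sampling them twice. The following is a rough sketch of the reasoning above; the function names swap_counters and classify_swap and the wording of the verdicts are assumptions for illustration, and the "very high rate" case is left out because no threshold is given here.

```shell
# swap_counters [FILE]: print "pswpin pswpout" (cumulative page counts).
# FILE is vmstat-formatted text and defaults to /proc/vmstat.
swap_counters() {
  awk '$1 == "pswpin"  { in_pages = $2 }
       $1 == "pswpout" { out_pages = $2 }
       END { print in_pages + 0, out_pages + 0 }' "${1:-/proc/vmstat}"
}

# classify_swap IN1 OUT1 IN2 OUT2 INTERVAL: turn two samples taken
# INTERVAL seconds apart (must be > 0) into per-second rates and a verdict.
classify_swap() {
  awk -v i1="$1" -v o1="$2" -v i2="$3" -v o2="$4" -v secs="$5" 'BEGIN {
    in_rate  = (i2 - i1) / secs
    out_rate = (o2 - o1) / secs
    if (in_rate > 0 && out_rate > 0)
      verdict = "swapping both in and out: active memory pressure"
    else if (out_rate > 0)
      verdict = "swap out only: proactive paging or pressure just starting"
    else if (in_rate > 0)
      verdict = "swap in only: earlier swapped pages being read back"
    else
      verdict = "no swap activity"
    printf "pswpin/s %.1f pswpout/s %.1f: %s\n", in_rate, out_rate, verdict
  }'
}

# Example use (bash), sampling 3 seconds apart:
#   read -r in1 out1 < <(swap_counters); sleep 3
#   read -r in2 out2 < <(swap_counters)
#   classify_swap "$in1" "$out1" "$in2" "$out2" 3
```

Keeping the sampling and the classification in separate functions means the classification can also be fed with counters saved from another system.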
On SLES, I was unable to run the script as it is. Had to change a little bit in line 35. Removed ‘$’ from CACHED=$(( CACHED-SHMEM ))
Also, I am getting unusual value in UNKNOWN.
bash-4.2$ ./memstat.sh --oneshot
Free 9106692
Shmem 29771108
Mapped+Cached 209692
Anon 6920
Pagetables 555920
KernelStack 30273400
Buffers 0
Slab 0
SwpCache 0
HP Used 4194304
HP Rsvd 59204
HP Free 0
Unknown 74177240 (2104472%)
Total memory 156
---------------------------
Total swp 0
Used swp 0 ( 0%)
bash-4.2$ top -b | grep Mem
Mem: 72438M total, 63543M used, 8895M free, 542M buffers
Given the huge amount of unknown, this is not helpful. This needs reviewing and mending to attribute where memory is going.
Looks better in interval output…
bash-4.2$ ./memstat.sh
Free Shmem Mapped+Cached Anon Pagetables KernelStack Buffers Slab SwpCache HP Used HP Rsvd HP Free Unknown %
9103176 29772080 209716 6920 555976 30274584 0 0 0 4194304 60484 0 0 0
9103376 29772080 209724 6920 555976 30274720 0 0 0 4194304 60140 0 0 0
9104296 29772080 209728 6920 555976 30274288 0 0 0 4194304 59648 0 0 0
9103624 29772080 209724 6920 555976 30274232 0 0 0 4194304 60380 0 0 0
Just a quick heads-up that the script can throw an error on newer systems.
The cause is the grep for SHMEM on line 18, which now matches multiple lines.
By adding a “:” to the grep, it is again limited to the correct field:
SHMEM=$( echo "$MEMINFO" | awk '/^Shmem:/ { print $2 }')