Linux memory usage

One of the principal important configuration settings for running an Oracle database is making appropriate use of memory. Sizing the memory regions too small leads to increased IO, sizing the memory regions too big leads to inefficient use of memory and an increase in memory latency most notably because of swapping.

On Linux, there is a fair amount of memory information available, however it is not obvious how to use that information, which frequently leads to inefficient use of memory, especially in today’s world of consolidation.

The information about linux server database usage is available in /proc/meminfo, and looks like this:

$ cat /proc/meminfo
MemTotal:        3781616 kB
MemFree:          441436 kB
MemAvailable:    1056584 kB
Buffers:             948 kB
Cached:           625888 kB
SwapCached:            0 kB
Active:           500096 kB
Inactive:         447384 kB
Active(anon):     320860 kB
Inactive(anon):     8964 kB
Active(file):     179236 kB
Inactive(file):   438420 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:        320644 kB
Mapped:           127900 kB
Shmem:              9180 kB
Slab:              45244 kB
SReclaimable:      26616 kB
SUnreclaim:        18628 kB
KernelStack:        3312 kB
PageTables:         6720 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1786356 kB
Committed_AS:     767908 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       13448 kB
VmallocChunk:   34359721984 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:          16384 kB
CmaFree:               4 kB
HugePages_Total:    1126
HugePages_Free:     1126
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       65472 kB
DirectMap2M:     4128768 kB

No matter how experienced you are, it’s not easy to get a good overview just be fetching this information. The point is these figures are not individual memory area’s which you simply can add up to understand to the total memory used by linux. Not even all figures are in kB (kilobyte), the HugePages values are in number of pages.

In fact, there is no absolute truth that I can find that gives a definite overview. Here is a description of what I think are the relevant memory statistics in /proc/meminfo:

MemFree: memory not being used, which should be low after a certain amount of time. Linux strives for using as much memory as much as possible for something useful, most notably as cache. If this number remains high, there is ineffective use of memory.
KernelStack: memory being used by the linux kernel.
Slab: memory being used by the kernel for caching data structures.
SwapCached: memory being used as a cache for memory pages being swapped in and out.
Buffers: memory being used as an IO buffer for disk blocks, not page caching, and should be relatively low.
PageTables: memory used for virtual to physical memory address translation.
Shmem: memory allocated as small pages shared memory.
Cached: memory used for caching pages.
Mapped: memory allocated for mapping a file into an address space.
AnonPages: memory allocated for mapping memory that is not backed by a file (“anonymous”).
Hugepagesize: the size for huge pages blocks. Valid choices with current modern intel Xeon CPUs are 2M or 1G. The Oracle database can only use 2M HugePages on linux.
HugePages_Total: total number of pages explicitly allocated as huge pages memory.
HugePages_Rsvd: total number of pages allocated as huge pages memory, but not yet allocated (and thus reported as free).
HugePages_Free: total number of pages available as huge pages memory, includes HugePages_Rsvd.

Based on information in several blogposts and experimenting with the figures, I came up with this formula to get an overview of used memory. This is not an all-conclusive formula, my tests so far get me within 5% of what Linux is reporting as MemTotal.

Warning: the text: “Another thing is that in most cases when the system has swapped out, ‘Cached’ (minus ‘Shmem’ and ‘Mapped’) gets negative, which I currently can’t explain.” is to true anymore!
By dividing Cached memory into Shmem and Cached+Mapped memory, there is no negative value anymore! I can’t find a way to make a distinction between true ‘Cached’ memory, meaning pages cached without any process attached purely for the sake of reusing them so they do not need to be physically read again, and true Mapped pages, meaning pages that are mapped into a process address space. I know there is a value ‘Mapped’, but I can’t work out reliably how to make the distinction between cache and true mapped. Maybe there even isn’t one.

This is that formula:

MemFree
+ KernelStack
+ Buffers
+ PageTables
+ AnonPages
+ Slab
+ SwapCached
+ Cached – Shmem
+ Shmem
+ HugePages used (HugePages_Total-Hugepages_Free)*Hugepagesize
+ Hugepages rsvd (Hugepages_Rsvd*Hugepagesize)
+ Hugepages free (Hugepages_Free-Hugepages_Rsvd)*Hugepagesize
————————————————————-
= Approximate total memory usage

In order to easily use this, I wrote a shell script to apply this formula to your Linux system, available on gitlab: https://gitlab.com/FritsHoogland/memstat.git. You can use the script to get a (quite wide) overview every 3 seconds by running ./memstat.sh, or you can get an overview of the current situation by running ./memstat.sh –oneshot.

This is how a –oneshot of my test system looks like (which is a quite small VM):

$ ./memstat.sh --oneshot
Free                 773932
Shmem                  2264
Mapped+Cached        359712
Anon                 209364
Pagetables            28544
KernelStack            4256
Buffers                   0
Slab                  63716
SwpCache              21836
HP Used             2023424
HP Rsvd               75776
HP Free              206848
Unknown               11944 (  0%)
Total memory        3781616
---------------------------
Total swp           1048572
Used swp             200852 ( 19%)

There is a lot to say about linux memory management. One important thing to realise is that when a system is running low on memory, it will not show as ‘Free’ declining towards zero. Linux will keep a certain amount of memory for direct use, dictated by ‘vm.min_free_kbytes’ as the absolute minimum.

In general, the ‘Cached’ pages (not Shmem pages at first) will be made available under memory pressure, since the Linux page cache really is only caching for potential performance benefit, there is no process directly attached to ‘Cached’ pages. Please mind my experimentations show there is no reliable way I could make a distinction between true ‘Mapped’ pages, meaning pages which are in use as memory mapped files, and true ‘Cached’ pages, meaning disk pages (blocks) sized 4KB which are kept in memory for the sake of reusing them, not directly related to a process.

Once the the number of page cache pages gets low, and there still is need for available pages, pages from the other categories are starting to get moved to swap. This excludes huge pages, even if they are not used! The way pages are considered is based on an ageing mechanism. This works quite well for light memory pressure for a short amount of time.

In fact, this works so well that the default eagerness of the kernel to swap (vm.swappiness, 60 by default, I have seen 30 as a default value too, 0=not eager to swap, 100=maximal swap eagerness) seems appropriate on most systems, even ones which need strict performance requirements. In fact, when swappiness is set (too) low, the kernel will try to avoid swapping as long as possible, meaning that once there is no way around it, it probably needs to swap multiple pages, leading to noticeable delays, while paging out single pages more in advance will have a hardly noticeable overhead.

However, please mind there is no way around consistent memory pressure! This means if memory in active use exceeds physical available memory, it results in physical memory to be shared at the cost of active memory pages being swapped to disk, for which process have to wait.

To show the impact of memory pressure, and how hard it is to understand that from looking at the memory pages, let me show you an example. I ran ‘memstat.sh’ in one session, and the command ‘memhog’ (part of the numactl rpm package) in another. My virtual machine has 4G of memory, and has an Oracle database running which has the SGA allocated in huge pages.

First I started memstat.sh, then ran ‘memhog 1g’, which allocates 1 gigabyte of memory and then releases it. This is the memstat output:

$ ./memstat.sh
          Free          Shmem  Mapped+Cached           Anon     Pagetables    KernelStack        Buffers           Slab       SwpCache        HP Used        HP Rsvd        HP Free        Unknown   %
         42764         435128         495656         387872          40160           4608             96          38600             24              0              0        2306048          30660   0
         42616         435128         495656         388076          40256           4640             96          38572             24              0              0        2306048          30504   0
         42988         435128         495656         388088          40256           4640             96          38576             24              0              0        2306048          30116   0
         42428         435128         495700         388108          40264           4640             96          38600             24              0              0        2306048          30580   0
        894424         320960          99456          12704          40256           4640              0          35496          42352              0              0        2306048          25280   0
        775188         321468         141804          79160          40260           4640              0          35456          70840              0              0        2306048           6752   0
        698636         324248         201476          95044          40264           4640              0          35400          64744              0              0        2306048          11116   0
        686452         324264         202388         107056          40260           4640              0          35392          66076              0              0        2306048           9040   0
        682452         324408         204496         108504          40264           4640              0          35388          65636              0              0        2306048           9780   0

You can see memstat taking some measurements, then memhog is run which quickly allocates 1g and releases it. This is done between rows 6 and 7. First of all the free memory: once the process has allocated all the memory, it stops running which means the memory is freed. Any private memory allocation mapped into the (now quitted) process address space which has backing by a physical page is returned to the operating system as free because it has effectively become available. So what might seem counter-intuitive, by stopping a process that allocated a lot of non-shared (!) memory, it results in a lot of free memory being available.

As I indicated, ‘Cached’ memory is first to be released to provide memory pages for direct use. Mapped+Cached does contain this together with Mapped memory. The amount of pages used by Mapped and Cached are drastically reduced by swapping. ‘Anon’ pages are significantly reduced too, which means they are swapped to the swap device, and ‘Shmem’ is reduced too, which means swapped to the swap device, but way lesser than ‘Mapped+Cached’ and ‘Anon’. ‘Kernel’ (kernel stack) and ‘Pagetables’ hardly decreased and ‘Slab’ decreased somewhat. ‘Swapcache’ actually grew, which makes sense because that is related to the swapping that took place.

The main thing I wanted to point out is that between the time of no memory pressure (lines 2-6) and past memory pressure (8-16), there is no direct memory statistic showing that a system is doing okay nor having suffered. The only thing that directly indicates memory pressure are active swapping in and swapping out, which can be seen with sar -W; pswpin/s and pgwpout/s, or vmstat si/so columns; which are not shown here.

Even past memory pressure, where prior linux memory management had swapped out a lot of pages to facilitate the 1G being allocated which immediately after been allocated was freed and returned as free memory, the majority of the pages on my system that have been swapped out are still swapped out:

$ ./memstat.sh --oneshot
...
Total memory       3781616
--------------------------
Total swp          1048572
Used swp            407228 ( 38%)

This underlines an important linux memory management principle: only do something if there is an immediate, direct need. My system now has no memory pressure anymore, but still 38% of my swap is allocated. Only if these pages are needed, they are paged back in. This underlines the fact that swapping can not and should not be measured by looking at the used amount of swap, a significant amount of swap being used only indicates that memory pressure has occurred in the past. The only way to detect swapping is taking place is by looking at the actual current amount of pages being swapped in and out.

If you see (very) low amounts of pages being swapped out without pages being swapped in at the same time, it’s the swappiness setting that makes pages being moved that have not been used for some time out to the swap device. This is not a problem. If you see pages being swapped in without pages being swapped out at the same time, it means pages that were swapped out either because of past memory pressure or proactive paging due to swappiness are read back in, which is not a problem too. Again, only if both pages are actively being swapped in and out at the same time or if the rate is very high there is a memory problem. The swapping actually is helping you not fail because of memory not being available at all.

4 comments
  1. Anjul said:

    On SLES, I was unable to run the script as it is. Had to change a little bit in line 35. Removed ‘$’ from CACHED=$(( CACHED-SHMEM ))

    Also, I am getting unusual value in UNKNOWN.

    bash-4.2$ ./memstat.sh –oneshot
    Free 9106692
    Shmem 29771108
    Mapped+Cached 209692
    Anon 6920
    Pagetables 555920
    KernelStack 30273400
    Buffers 0
    Slab 0
    SwpCache 0
    HP Used 4194304
    HP Rsvd 59204
    HP Free 0
    Unknown 74177240 (2104472%)
    Total memory 156
    —————————
    Total swp 0
    Used swp 0 ( 0%)

    bash-4.2$ top -b | grep Mem
    Mem: 72438M total, 63543M used, 8895M free, 542M buffers

    • Given the huge amount of unknown, this is not helpful. This needs reviewing and mending to attribute where memory is going.

  2. Anjul said:

    Looks better in interval output…

    bash-4.2$ ./memstat.sh
    Free Shmem Mapped+Cached Anon Pagetables KernelStack Buffers Slab SwpCache HP Used HP Rsvd HP Free Unknown %
    9103176 29772080 209716 6920 555976 30274584 0 0 0 4194304 60484 0 0 0
    9103376 29772080 209724 6920 555976 30274720 0 0 0 4194304 60140 0 0 0
    9104296 29772080 209728 6920 555976 30274288 0 0 0 4194304 59648 0 0 0
    9103624 29772080 209724 6920 555976 30274232 0 0 0 4194304 60380 0 0 0

  3. Freek D'Hooge said:

    Just a quick heads up about that the script can throw an error on newer systems
    Cause is the grep for SHMEM on line 18, which matches now multiple lines
    By adding a “:” to the grep, it is again limited to the correct field:

    SHMEM=$( echo “$MEMINFO” | awk ‘/^Shmem:/ { print $2 }’)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: