stx-puppet/puppet-manifests
Jim Somerville 6b64fc782a set sysvar vm_min_free_kbytes as function of total mem
Problem description:
During times of high memory pressure, the kernel will drop
into synchronous memory allocation mode if the amount of free
memory in the normal zone drops below the min watermark.
Synchronous mode reclaims some memory before allocating any,
and is thus vastly slower than asynchronous mode. If we drop to
synchronous mode under high memory pressure while many memory
allocation requests are being made, i.e. while there is
contention for locks in the memory allocator, and one of the
contending tasks is being monitored by pmon, that task can be
blocked long enough to trigger a watchdog timeout, which brings
down the entire node.

Solution:
One way to reduce the chances of dropping below the min
watermark in the normal zone is to increase the amount of
kernel reserved memory via the sysctl variable vm.min_free_kbytes.
The three watermarks, min, low, and high are all calculated and
set by the kernel based on the amount of kernel reserved memory.
While all three of the watermarks will be increased, the spread
between them will also be increased.  When the number of free
pages in the zone drops below the low watermark, kswapd runs and
reclaims pages (typically from the buffer cache) until the high
watermark is met.  This operates a lot like a water pressure
tank.  So by increasing the amount of kernel reserved memory,
kswapd should engage sooner and thus increase the odds that
we don't hit the min watermark.
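
As a rough illustration of why the spread grows with the reserve,
the sketch below uses the classic kernel spacing of one quarter of
the min watermark between adjacent watermarks.  This is a
simplification under stated assumptions: it treats the system as a
single zone, works in kB rather than pages, and ignores
watermark_scale_factor on newer kernels.  The function name
zone_watermarks is hypothetical and not part of this change.

```python
def zone_watermarks(min_free_kb):
    """Approximate min/low/high watermarks (in kB) for a given reserve.

    Assumption: watermarks are spaced min/4 apart, as in the classic
    kernel calculation; the per-zone split and watermark_scale_factor
    are deliberately ignored.
    """
    spread = min_free_kb // 4
    return {
        "min": min_free_kb,
        "low": min_free_kb + spread,       # kswapd starts reclaiming below this
        "high": min_free_kb + 2 * spread,  # kswapd stops once this is reached
    }


if __name__ == "__main__":
    for reserve_mb in (128, 512):
        print(reserve_mb, zone_watermarks(reserve_mb * 1024))
```

Under these assumptions, a larger reserve raises the low watermark
further above min, so kswapd gets more headroom to reclaim pages
before allocations fall into synchronous mode.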

Currently the amount of kernel reserved memory is set statically
to 128 MB except on a storage node where we set it to 256 MB.
This is really too low on systems with a lot of memory.  It
should be set as a function of total system memory.  There are
many possibilities for such a function, for example simply
setting it to a percentage of total memory such as 0.5% or 1%.

Here we take a stepped (quantized) approach to the function,
reserving 128 MB for every 25 GB of system memory, subject to a
specified minimum that we won't go below.  This works out to
approximately 0.5% of total memory.  Why this approach?  Because it
has no impact on typical small vbox nodes used in development.
Those nodes are typically around 25 GB in size, and it isn't
a great idea to be reclaiming pages from the buffer cache just
to have them sit unused, as these small emulated nodes need all
the I/O performance help they can get.
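
The step function described above can be sketched as follows.
This is a minimal illustration, not the manifest code itself: the
function name min_free_kbytes and its parameters are hypothetical,
and it assumes that only whole 25 GB steps are counted (rounding
down), which the description does not state explicitly.

```python
MB_IN_KB = 1024
GB_IN_KB = 1024 * 1024


def min_free_kbytes(total_mem_kb,
                    floor_kb=128 * MB_IN_KB,
                    step_kb=25 * GB_IN_KB,
                    per_step_kb=128 * MB_IN_KB):
    """Return the kernel reserve in kB for a node with total_mem_kb of RAM.

    Reserves per_step_kb for every full step_kb of memory (assumption:
    partial steps are rounded down) and never returns less than floor_kb.
    """
    steps = total_mem_kb // step_kb
    return max(floor_kb, steps * per_step_kb)


if __name__ == "__main__":
    for gb in (16, 25, 64, 100, 256):
        reserve_mb = min_free_kbytes(gb * GB_IN_KB) // MB_IN_KB
        print(f"{gb} GB -> {reserve_mb} MB reserved")
```

With these assumptions a 25 GB node stays at the 128 MB minimum,
while a 100 GB node reserves 512 MB.  Passing a floor_kb of 256 MB
would model the existing storage node minimum.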

This approach to the function is not absolute and is subject to
change as we learn more about how to best tune the settings.

Change-Id: I6244497b262c217dc9467010501a3041e102a288
Closes-Bug: 1940855
Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>
2021-08-31 13:35:33 -04:00