6b64fc782a
Problem description:

During times of high memory pressure, the kernel drops into
synchronous memory allocation mode if the amount of free memory in
the normal zone falls below the min watermark. Synchronous mode
reclaims some memory before allocating any, and is thus vastly slower
than asynchronous mode. If we drop to synchronous mode under high
memory pressure while many memory allocation requests are being made
(i.e. there is contention for locks in the memory allocator), and one
of the contending tasks is being monitored by pmon, that task can be
blocked long enough to trigger a watchdog timeout, which brings down
the entire node.

Solution:

One way to reduce the chances of dropping below the min watermark in
the normal zone is to increase the amount of kernel reserved memory
via the sysctl variable vm_min_free_kbytes. The three watermarks (min,
low, and high) are all calculated and set by the kernel based on the
amount of kernel reserved memory. Increasing it raises all three
watermarks and also widens the spread between them. When the number of
free pages in the zone drops below low, kswapd runs and reclaims pages
(typically from the buffer cache) until the high watermark is met.
This operates a lot like a water pressure tank. So with more kernel
reserved memory, kswapd should engage sooner, which increases the odds
that we never hit the min watermark.

Currently the amount of kernel reserved memory is set statically to
128 MB, except on storage nodes where it is set to 256 MB. This is too
low on systems with a lot of memory. It should instead be set as a
function of total system memory. There are many possibilities for such
a function, for example a fixed percentage of total memory such as
0.5% or 1%. Here we take a quantized (step function) approach,
reserving 128 MB for every 25 GB of system memory, subject to a
specified minimum that we won't go below.
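
The step function described above could be sketched as follows. This
is an illustration only, not the code in this change: the name
reserved_kbytes, its parameters, the rounding down to whole 25 GB
units, and the use of binary units (1 GB = 1024 MB) are all
assumptions.

```python
# Hypothetical sketch of the step function; names and rounding are assumptions.
MIB_KB = 1024          # kilobytes per MB (binary units assumed)
GIB_KB = 1024 * 1024   # kilobytes per GB (binary units assumed)


def reserved_kbytes(total_kb,
                    min_kb=128 * MIB_KB,
                    step_kb=25 * GIB_KB,
                    per_step_kb=128 * MIB_KB):
    """Return a candidate kernel reserved memory value in kilobytes.

    Reserves per_step_kb for every whole step_kb of system memory
    (rounded down, an assumption), but never less than min_kb.
    """
    steps = total_kb // step_kb
    return max(min_kb, steps * per_step_kb)


if __name__ == "__main__":
    for gb in (16, 25, 100, 512):
        kb = reserved_kbytes(gb * GIB_KB)
        print(f"{gb} GB total -> {kb // MIB_KB} MB reserved")
```

Under these assumptions a 25 GB node stays at the 128 MB minimum,
while a 100 GB node would reserve 512 MB. A storage node could pass
min_kb=256 * MIB_KB to keep its higher floor.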

The step of 128 MB per 25 GB works out to approximately 0.5% of total
system memory.

Why this approach? Because it has no impact on the typical small vbox
nodes used in development. Those nodes are typically around 25 GB in
size, and it isn't a great idea to reclaim pages from the buffer cache
just to have them sit unused, as these small emulated nodes need all
the I/O performance help they can get.

This choice of function is not final and is subject to change as we
learn more about how best to tune the settings.

Change-Id: I6244497b262c217dc9467010501a3041e102a288
Closes-Bug: 1940855
Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>