Sacrifice Ansible procs when OOM

When Linux runs out of memory and activates the OOM killer, it
scores processes based on how much memory they are using[1].  If
a job triggers an OOM by causing ansible-playbook to use a lot
of RAM, normally we would expect the OOM killer to kill Ansible.
However, if the executor is busy, it may be using a lot of RAM
as well, and its score may exceed the score of the smaller
Ansible process.  Nonetheless, we would still rather kill the
Ansible process.

This adjusts the score for the bubblewrap and ansible processes
so that they will have a score increased by an amount equal to
about 20% of system RAM.  This effectively means that as long
as the executor uses less than 20% of system RAM, it is guaranteed
to score lower than Ansible (and likely will continue to score
lower for some significant amount over that as well, depending
on how much RAM Ansible is using).
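
To make the arithmetic concrete, here is a minimal sketch of the
badness calculation in [1], assuming the v6.7 formula: pages in use
plus oom_score_adj scaled by one thousandth of total pages (RAM plus
swap).  The function name and the page counts are hypothetical and
are not part of this change.

```python
def badness(pages_used, oom_score_adj, total_pages):
    """Approximate the kernel's score: usage plus the scaled adjustment."""
    # Each oom_score_adj unit is worth total_pages // 1000 pages.
    return pages_used + oom_score_adj * (total_pages // 1000)


TOTAL = 4_000_000  # hypothetical total page count

# Executor using 17.5% of total pages, with no adjustment.
executor = badness(700_000, 0, TOTAL)
# Much smaller Ansible process, with the +200 adjustment.
ansible = badness(50_000, 200, TOTAL)
```

Under these assumed numbers Ansible outscores the executor even
though it uses far less memory, because the adjustment alone is
worth 20% of total pages.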

We read the executor's oom_score_adj when we initialize the bwrap
driver and add 200 to it in order to accommodate the situation where
the executor has its own oom_score_adj.  We always want the bwrap
children to have a higher score than the executor.
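
As a rough sketch of that calculation, assuming the kernel's upper
bound of 1000 for oom_score_adj (the helper names here are
hypothetical; the real logic lives in the bwrap driver's __init__):

```python
OOM_SCORE_ADJ = 200  # same constant this change introduces


def child_oom_score_adj(starting_score_adj, bump=OOM_SCORE_ADJ):
    """Return the adjustment for bwrap children, capped at 1000."""
    return min(1000, starting_score_adj + bump)


def read_own_oom_score_adj():
    """Read this process's current oom_score_adj (Linux only)."""
    with open("/proc/self/oom_score_adj") as f:
        return int(f.read().strip())
```

An executor at 0 yields 200 for its children and one at -500 yields
-300.  An executor already at 900 yields 1000, so in that case the
children get less than the full 200 margin.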

The choom program adjusts the OOM score for the command that it
executes, and this is inherited by child processes.  So we adjust
bwrap and expect ansible-playbook to inherit it.
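
A small sketch of the wrapping, plus an optional manual check of the
inheritance.  Both function names are hypothetical, and the check
assumes a Linux host with choom (util-linux) and sh installed; it is
not part of this change.

```python
import subprocess


def with_choom(command, oom_score_adj):
    """Prefix a command so it starts with the given oom_score_adj."""
    return ["choom", "-n", str(oom_score_adj), "--"] + list(command)


def grandchild_oom_score_adj(oom_score_adj):
    """Run sh under choom and have its child (cat) report its own value."""
    cmd = with_choom(["sh", "-c", "cat /proc/self/oom_score_adj"],
                     oom_score_adj)
    out = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return int(out.stdout.strip())
```

If inheritance works as described, grandchild_oom_score_adj(200)
should report the same value that was passed to choom.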

It is also possible to adjust the score of the executor process
lower (so the executor could be made less likely to be a target)
but that requires root privileges, so is not implemented in this
change.

[1] https://lxr.linux.no/#linux+v6.7.1/mm/oom_kill.c#L201

Change-Id: I3a3d116cf68b84b8a6f9ec13808d1d2c2008008f
James E. Blair 2024-05-30 16:29:00 -07:00
parent 1c72b68bae
commit c0484c9d7c
2 changed files with 19 additions and 3 deletions


@@ -82,6 +82,9 @@ class Executor(zuul.cmd.ZuulDaemonApp):
     def run(self):
         self.handleCommands()
+        self.setup_logging('executor', 'log_config')
+        self.log = logging.getLogger("zuul.Executor")
         self.configure_connections(source_only=True, check_bwrap=True)
         if self.config.has_option('executor', 'job_dir'):
@@ -96,9 +99,6 @@ class Executor(zuul.cmd.ZuulDaemonApp):
             if not os.path.exists(self.job_dir):
                 os.mkdir(self.job_dir)
-        self.setup_logging('executor', 'log_config')
-        self.log = logging.getLogger("zuul.Executor")
         self.finger_port = int(
             get_default(self.config, 'executor', 'finger_port',
                         zuul.executor.server.DEFAULT_FINGER_PORT)


@@ -33,6 +33,11 @@ from typing import Dict, List # noqa
 from zuul.driver import (Driver, WrapperInterface)
 from zuul.execution_context import BaseExecutionContext
+# Increase the OOM badness score by about 20% of the available RAM.
+# This should be sufficient to make child processes the target of the
+# oom killer.
+OOM_SCORE_ADJ = 200
 class WrappedPopen(object):
     def __init__(self, command, fds):
@@ -192,6 +197,14 @@ class BubblewrapDriver(Driver, WrapperInterface):
     bwrap_version_re = re.compile(r'^(\d+\.\d+\.\d+).*')
     def __init__(self, check_bwrap):
+        pid = os.getpid()
+        with open(f"/proc/{pid}/oom_score_adj") as f:
+            starting_score_adj = int(f.read().strip())
+        self.oom_score_adj = min(1000, (starting_score_adj) + OOM_SCORE_ADJ)
+        self.log.debug("Initializing bubblewrap with oom_score_adj "
+                       "starting: %s, final: %s",
+                       starting_score_adj, self.oom_score_adj)
         self.userns_enabled = self._is_userns_enabled()
         self.bwrap_version = self._parse_bwrap_version()
         self.bwrap_command = self._bwrap_command()
@@ -272,6 +285,9 @@ class BubblewrapDriver(Driver, WrapperInterface):
             'setpriv',
             '--ambient-caps',
             '-all',
+            'choom',
+            '-n', str(self.oom_score_adj),
+            '--',
             'bwrap',
             '--dir', '/tmp',
             '--tmpfs', '/tmp',