zuul/zuul/driver/bubblewrap
James E. Blair c0484c9d7c Sacrifice Ansible procs when OOM
When Linux runs out of memory and activates the OOM killer, it
scores processes based on how much memory they are using[1].  If
a job triggers an OOM by causing ansible-playbook to use a lot
of RAM, normally we would expect the OOM killer to kill Ansible.
However, if the executor is busy, it may be using a lot of RAM
as well, and its score may exceed the score of the smaller
Ansible process.  Nonetheless, we would still rather kill the
Ansible process.

This adjusts the score for the bubblewrap and ansible processes
so that they will have a score increased by an amount equal to
about 20% of system RAM.  This effectively means that as long
as the executor uses less than 20% of system RAM, it is guaranteed
to score lower than Ansible (and likely will continue to score
lower for some significant amount over that as well, depending
on how much RAM Ansible is using).
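
As a rough illustration of the arithmetic above, here is a simplified
model of the scoring, not the kernel code itself.  It assumes each unit
of oom_score_adj is worth one thousandth of total pages, as in the
oom_kill.c source referenced below.  The function name and the page
counts are invented for the example.

```python
def oom_badness(pages_used, oom_score_adj, total_pages):
    """Simplified OOM score: pages in use plus the scaled adjustment.

    Assumption: each unit of oom_score_adj counts as total_pages / 1000
    pages, so an adjustment of 200 is worth about 20% of memory.
    """
    return pages_used + oom_score_adj * (total_pages // 1000)


# Hypothetical machine with 1,000,000 pages of memory.
TOTAL_PAGES = 1_000_000

# A busy executor using 15% of memory, with no adjustment.
executor_score = oom_badness(150_000, 0, TOTAL_PAGES)

# A much smaller Ansible process using 2% of memory, adjusted by 200.
ansible_score = oom_badness(20_000, 200, TOTAL_PAGES)

# Ansible scores higher, so it would be the preferred victim.
print(executor_score, ansible_score)
```

In this model the executor stays below Ansible for as long as it uses
less than 20% of memory plus whatever Ansible itself is using.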

We read the executor's oom_score_adj when we initialize the bwrap
driver and add 200 to it in order to accommodate the situation where
the executor has its own oom_score_adj.  We always want the bwrap
children to have a higher score than the executor.
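
The calculation described above could be sketched as follows.  This is
not the code from this change: the function names are hypothetical, and
both the clamp to the kernel's maximum of 1000 and the fallback to 0
when the proc file is missing are assumptions made for the example.

```python
OOM_SCORE_ADJ_MAX = 1000  # upper bound the kernel accepts for oom_score_adj


def read_oom_score_adj(path="/proc/self/oom_score_adj"):
    """Return this process's oom_score_adj, or 0 if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # Assumption: treat a missing or unreadable file (for example on
        # a non-Linux system) as "no adjustment".
        return 0


def child_oom_score_adj(parent_adj, offset=200):
    """Return the adjustment for bwrap children: parent plus offset.

    Assumption: the result is clamped to the kernel's maximum so that
    the value stays valid.
    """
    return min(parent_adj + offset, OOM_SCORE_ADJ_MAX)


executor_adj = read_oom_score_adj()
print(child_oom_score_adj(executor_adj))
```

Because the offset is added to whatever the executor already has, the
children start from a higher adjustment than the executor unless the
executor is itself within 200 of the maximum.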

The choom program adjusts the OOM score for the command that it
executes, and this is inherited by child processes.  So we adjust
bwrap and expect ansible-playbook to inherit it.
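
A minimal sketch of wrapping a command with choom, assuming the
util-linux form "choom -n <adj> -- command [args...]".  The helper name
and the bwrap arguments shown are hypothetical and do not reflect the
actual command line the driver builds.

```python
def wrap_with_choom(command, oom_score_adj):
    """Prefix a command list so it runs under choom with the given adj."""
    return ["choom", "-n", str(oom_score_adj), "--"] + list(command)


# Hypothetical bwrap invocation that ends by running ansible-playbook.
bwrap_command = [
    "bwrap", "--ro-bind", "/usr", "/usr",
    "ansible-playbook", "playbook.yaml",
]

# choom sets the adjustment and then executes bwrap; ansible-playbook,
# started by bwrap, inherits the same value.
print(wrap_with_choom(bwrap_command, 200))
```

The resulting list can be passed to subprocess.Popen in place of the
original command.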

It is also possible to adjust the score of the executor process
lower (so the executor could be made less likely to be a target)
but that requires root privileges, so is not implemented in this
change.

[1] https://lxr.linux.no/#linux+v6.7.1/mm/oom_kill.c#L201

Change-Id: I3a3d116cf68b84b8a6f9ec13808d1d2c2008008f
2024-06-03 09:12:57 -07:00
..
__init__.py Sacrifice Ansible procs when OOM 2024-06-03 09:12:57 -07:00