If an Ansible command task produces a very large amount of data,
that data will be sent back to the ansible-playbook process on the
executor and deserialized from JSON. If it is sufficiently large, it
may cause an OOM. While we have adjusted settings to encourage the
oom-killer to kill ansible-playbook rather than zuul-executor, it is
still not a great situation to invoke the oom-killer in the first
place.
To avoid this in what we presume is an easily avoidable situation,
we will limit the output sent back to the executor to 1GiB. This
should be much larger than necessary; if anything, the limit may be
too high, but it seems unlikely to be too low.
Other methods were considered: limiting to 50% of the total RAM on
the executor (likely to produce a value even higher than 1GiB), or
to 50% of the available RAM on the executor (which may be too
variable depending on executor load). In the end, 1GiB seems like a
good starting point.
Because this affects the structured data returned by Ansible, which
later tasks in the same playbook may use to check the returned
values, we should treat hitting this limit as a task failure so that
users do not inadvertently rely on invalid data (consider a task
that checks for the presence of some token in stdout). To that end,
if we hit the limit, we will kill the command process and raise an
exception, which will cause Ansible to fail the task (incidentally,
the result will not include the oversized stdout/stderr). The cause
of the error will be visible in the JSON and text output of the job.
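
A sketch of that failure path, assuming the command is a
subprocess.Popen object and using an illustrative exception name
rather than the real one:

    class OutputLimitError(Exception):
        """Raised when a command's output exceeds the configured cap."""

    def abort_oversized_command(proc, limit):
        # Kill the still-running command so it stops producing data,
        # then raise; Ansible reports the exception as a failed task,
        # and the oversized stdout/stderr is never returned.
        proc.kill()
        proc.wait()
        raise OutputLimitError(
            'Command output exceeded the limit of %d bytes' % limit)

Because the task fails with that message, the reason is reported
alongside the task result rather than being silently truncated.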
This is not a setting that users or operators should be adjusting,
and normally we would not expose something like this through a
configuration option. But because we will fail the tasks, we provide
an escape valve for users who upgrade to this version and suddenly
find they are relying on 1GiB+ stdout values. A deprecated configuration
option is added to adjust the value used. We can remove it in a
later major version of Zuul.
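
For example, an operator who needs time to migrate could raise the
cap in zuul.conf; the section and option name here are illustrative
only, not necessarily what the deprecated option is actually called:

    [executor]
    # Hypothetical option name: raise the per-task output cap to 2GiB
    # while jobs that rely on oversized command output are fixed.
    output_max_bytes = 2147483648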
While we're working on the command module, make it more
memory-efficient for large values by using a BytesIO buffer instead
of concatenating strings. This reduces by one the number of complete
copies of the stdout/stderr values held on the remote node (but does
nothing for the ansible-playbook process on the executor).
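
A minimal illustration of the difference (not the module code
itself):

    import io

    chunks = [b'line one\n', b'line two\n']

    # Concatenation builds a new bytes object on every append, briefly
    # holding an extra complete copy of everything accumulated so far:
    data = b''
    for chunk in chunks:
        data += chunk

    # Writing into a BytesIO appends in place; only getvalue() at the
    # end materializes a single complete copy:
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    data = buf.getvalue()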
Finally, add a "-vvv" argument to the test invocation; this was useful
in debugging this change and will likely be so for future changes.
Change-Id: I3442b09946ecd0ad18817339b090e49f00d51e93