distcloud/distributedcloud/dccommon/subprocess_cleanup.py
Kyle MacLeod b24837a73d Registration-based subprocess cleanup on service shutdown
Introduce a helper class SubprocessCleanup in dccommon
which allows a worker to register a subprocess that must
be cleaned up (killed) upon service exit.

There are two parts to this mechanism:
1. Registration:
    - The subprocess is registered for cleanup when
      spawned (see utils.run_playbook_with_timeout)
    - The subprocess is also spawned using setsid in order to
      start a new process group + session
2. Cleanup: the service calls SubprocessCleanup.shutdown_cleanup()
   upon stopping.
    - All registered subprocesses are terminated
      using the os.killpg() call to terminate the
      entire subprocess process group.
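
The two parts above can be sketched as follows, assuming a POSIX
system. REGISTRY, spawn_in_new_session and cleanup_all are hypothetical
stand-ins for illustration, not the dccommon API. start_new_session=True
is the standard subprocess option that makes the child call setsid().

```python
import os
import signal
import subprocess

# Hypothetical stand-in for the registry kept by SubprocessCleanup.
REGISTRY = {}


def spawn_in_new_session(cmd):
    """Spawn cmd as the leader of a new session and process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so the child's pid is also its process group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    REGISTRY[proc.pid] = proc
    return proc


def cleanup_all():
    """Send SIGTERM to every registered process group (best-effort)."""
    for proc in list(REGISTRY.values()):
        if proc.poll() is None:  # still running
            try:
                # pid == pgid here, so this signals the whole group.
                os.killpg(proc.pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # group exited between poll() and killpg()
    REGISTRY.clear()


if __name__ == "__main__":
    proc = spawn_in_new_session(["sleep", "30"])
    cleanup_all()
    # A negative return code means the child was ended by that signal.
    print("return code:", proc.wait(timeout=5))
```

Because the child leads its own process group, a single killpg() call
reaches any grandchildren it spawned without also signalling the
parent service.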

Caveat: This mechanism only handles clean process
exit cases. If the process crashes or is killed
non-gracefully via SIGKILL, the cleanup will not happen.
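
The reason for this caveat can be illustrated with a small sketch: a
process can install a handler for SIGTERM and run cleanup from it, but
the kernel does not allow SIGKILL to be caught. How the service
actually wires its shutdown hook is not shown in this change, so
install_cleanup_handler and can_catch below are hypothetical helpers.

```python
import signal


def install_cleanup_handler(cleanup):
    """Run cleanup() on SIGTERM; return the previously installed handler."""
    def _handler(signum, frame):
        cleanup()
    return signal.signal(signal.SIGTERM, _handler)


def can_catch(signum):
    """Return True if a handler can be installed for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # SIGKILL (and SIGSTOP) are rejected by the operating system.
        return False
    if previous is not None:
        # Restore the earlier handler. None means it was not installed
        # from Python and cannot be restored through this API.
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    print("SIGTERM catchable:", can_catch(signal.SIGTERM))
    print("SIGKILL catchable:", can_catch(signal.SIGKILL))
```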

Closes-Bug: 1972013

Test Plan:

PASS:

Orchestrated prestaging:

* Perform system host-swact while prestaging packages in progress
  - ansible-playbook is terminated
  - prestaging task is marked as prestaging-failed

* Perform system host-swact while prestaging images in progress
  - ansible-playbook is terminated
  - prestaging task is marked as prestaging-failed

* Restart dcmanager-orchestrator service for the same
  two cases as above
  - behaviour is the same as for swact

* Kill dcmanager-orchestrator service while prestaging in progress

Non-Orchestrated prestaging:

* Perform host-swact and service restart for non-orchestrated prestaging
  - ansible-playbook is terminated
  - subcloud deploy status marked as prestaging-failed

Swact during large-scale subcloud add:
  - initiate large number of subcloud add operations
  - swact during 'installing' state
  - swact during 'bootstrapping' state
  - verify that ansible playbooks are killed
  - verify that deploy status is updated with -failed state

Not covered:

Tested a sudo 'pkill -9 dcmanager-manager' (ungraceful SIGKILL)
  - in this case the ansible subprocess tree is not cleaned up
  - this is expected - we aren't handling a non-clean shutdown

Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I714398017b71c99edeeaa828933edd8163fb67cd
2022-05-18 20:53:47 -04:00

74 lines
2.3 KiB
Python

#
# Copyright (c) 2022 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
import os
import signal
import time

from oslo_concurrency import lockutils
from oslo_log import log as logging

LOG = logging.getLogger(__name__)


class SubprocessCleanup(object):
    """Lifecycle manager for subprocesses spawned via python subprocess.

    Notes:
    - This is a best-effort cleanup. We need to preserve fast shutdown
      times in case of a SWACT.
    - There could potentially be multiple hundreds of subprocesses needing
      to be cleaned up here.
    """

    LOCK_NAME = 'subprocess-cleanup'
    # Maps the pid of each registered subprocess to its Popen object.
    SUBPROCESS_GROUPS = {}

    @staticmethod
    def register_subprocess_group(subprocess_p):
        """Register a subprocess to be killed on service shutdown."""
        SubprocessCleanup.SUBPROCESS_GROUPS[subprocess_p.pid] = subprocess_p

    @staticmethod
    def unregister_subprocess_group(subprocess_p):
        """Remove a subprocess from the registry, if present."""
        SubprocessCleanup.SUBPROCESS_GROUPS.pop(subprocess_p.pid, None)

    @staticmethod
    @lockutils.synchronized(LOCK_NAME)
    def shutdown_cleanup(origin='service'):
        """Terminate all registered subprocess groups."""
        SubprocessCleanup._shutdown_subprocess_groups(origin)

    @staticmethod
    def _shutdown_subprocess_groups(origin):
        num_process_groups = len(SubprocessCleanup.SUBPROCESS_GROUPS)
        if num_process_groups > 0:
            LOG.warning("Shutting down %d process groups via %s",
                        num_process_groups, origin)
            start_time = time.time()
            # Iterate over a snapshot so that a concurrent unregister
            # cannot change the dict during iteration.
            for subp in list(SubprocessCleanup.SUBPROCESS_GROUPS.values()):
                kill_subprocess_group(subp)
            LOG.info("Time for %s child processes to exit: %s",
                     num_process_groups,
                     time.time() - start_time)


def kill_subprocess_group(subp, logmsg=None):
    """Kill the subprocess and any children."""
    exitcode = subp.poll()
    # poll() returns None while the process is still running. An exit
    # code of 0 is falsy, so compare against None explicitly.
    if exitcode is not None:
        LOG.info("kill_subprocess_group: subprocess has already "
                 "terminated, pid: %s, exitcode=%s", subp.pid, exitcode)
        return

    if logmsg:
        LOG.warning(logmsg)
    else:
        LOG.warning("Killing subprocess group for pid: %s, args: %s",
                    subp.pid, subp.args)
    # Send a SIGTERM (normal kill). We do not verify that the processes
    # have shut down (best-effort), since we don't want to wait around
    # before issuing a SIGKILL (fast shutdown).
    try:
        os.killpg(subp.pid, signal.SIGTERM)
    except ProcessLookupError:
        # The group exited between poll() and killpg(); nothing to do.
        LOG.info("Process group already gone, pid: %s", subp.pid)
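
kill_subprocess_group relies on Popen.poll() to decide whether the
subprocess has already exited. poll() returns None while the child is
running and the exit code afterwards. An exit code of 0 is falsy, so a
truthiness check and an "is not None" check give different answers for
a child that exited cleanly. A small sketch of the difference follows;
has_exited is a hypothetical helper, not part of dccommon.

```python
import subprocess
import sys


def has_exited(proc):
    """True once the child has terminated, whatever its exit code."""
    return proc.poll() is not None


if __name__ == "__main__":
    # A child that exits immediately with status 0.
    done = subprocess.Popen([sys.executable, "-c", "pass"])
    done.wait()
    print("poll():", done.poll())            # exit code 0
    print("truthy:", bool(done.poll()))      # 0 is falsy
    print("has_exited:", has_exited(done))
```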