Michele Baldessari 3ca7e8f03f Use exec when spawning dnsmasq inside sidecar container
We see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs:

2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277
2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110
2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
+ trap 'exec 2>&4 1>&3' 0 1 2 3
+ exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.reload_cfg()
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.disable('HUP')
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent utils.execute(cmd, run_as_root=self.run_as_root)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent returncode=returncode)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + trap 'exec 2>&4 1>&3' 0 1 2 3
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent
2020-03-11 17:58:33.740 54481 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/a187b137-b68c-476e-bd37-39253158e762.pid.haproxy get_value_from_file /usr/lib/pyt

The issue is that the dhcp side containers are spawned with the following processes:
           |-conmon(375906)-+-dumb-init(375918)---bash(375935)---dnsmasq(375938)
           | `-{conmon}(375908)

Now when neutron wants to send a SIGHUP to the dnsmasq it actually invokes the following command:
nsenter --net=/run/netns/qdhcp-e11ac152-745f-4292-b423-d282fbf97f13 --preserve-credentials -m -t 1 podman kill --signal HUP 0b7371f6de52cfe377858...

The problem is that podman kill will send the signal to "dumb-init
--single-child" (pid1 for this container) which will then forward it only to
bash, which will cause dnsmasq to be terminated and will eventually be later
respawned with a different pid (stored in
/var/lib/neutron/dhcp/86884abc-f7d7-4118-923f-38b247fee8e9/pid).

So if multiple ports are created concurrently this is racy and one of them will fail to reload dnsmasq with the error above, because one process might use a pid file that is no longer valid.
TLDR: this all works if SIGHUP to the dnsmasq process does not change pids under the hood all of a sudden.
Otherwise a SIGHUP used to reload dnsmasq will be sent to the bash
process father of dnsmasq which will then terminate dnsmasq and break
all the things

Tested this on three runs and did not experience the issue any longer

Co-Authored-By: Bernard Cafarelli <bcafarel@redhat.com>
Co-Authored-By: Slawomir Kaplonski <skaplons@redhat.com>

Closes-Bug: #1867192

Change-Id: I1af2ecd9e3996de4f43224f66a8bdb81eab07022
2020-03-12 20:03:37 +01:00
..