3ca7e8f03f
We see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs: 2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277 2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110 2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec + trap 'exec 2>&4 1>&3' 0 1 2 3 + exec 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last): 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.reload_cfg() 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.disable('HUP') 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent utils.execute(cmd, run_as_root=self.run_as_root) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent returncode=returncode) 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + trap 'exec 2>&4 1>&3' 0 1 2 3 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + exec 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent 2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent 2020-03-11 17:58:33.740 54481 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/a187b137-b68c-476e-bd37-39253158e762.pid.haproxy get_value_from_file /usr/lib/pyt The issue is that the dhcp side containers are spawned with the following processes: |-conmon(375906)-+-dumb-init(375918)---bash(375935)---dnsmasq(375938) | `-{conmon}(375908) Now when neutron wants to send a SIGHUP to the dnsmasq it actually invokes the following command: nsenter --net=/run/netns/qdhcp-e11ac152-745f-4292-b423-d282fbf97f13 --preserve-credentials -m -t 1 podman kill --signal HUP 0b7371f6de52cfe377858... The problem is that podman kill will send the signal to "dumb-init --single-child" (pid1 for this container) which will then forward it only to bash, which will cause dnsmasq to be terminated and will eventually be later respawned with a different pid (stored in /var/lib/neutron/dhcp/86884abc-f7d7-4118-923f-38b247fee8e9/pid). So if multiple ports are created concurrently this is racy and one of them will fail to reload dnsmasq with the error above, because one process might use a pid file that is no longer valid. TLDR: this all works if SIGHUP to the dnsmasq process does not change pids under the hood all of a sudden. Otherwise a SIGHUP used to reload dnsmasq will be sent to the bash process father of dnsmasq which will then terminate dnsmasq and break all the things Tested this on three runs and did not experience the issue any longer Co-Authored-By: Bernard Cafarelli <bcafarel@redhat.com> Co-Authored-By: Slawomir Kaplonski <skaplons@redhat.com> Closes-Bug: #1867192 Change-Id: I1af2ecd9e3996de4f43224f66a8bdb81eab07022