Increase tolerance for declaring neutron agents down

The neutron server listens for heartbeats from the various
neutron agents running on worker nodes. The agents send
this heartbeat every 30s, but use a synchronous RPC, which
can take up to 60s to time out if the rabbitmq server
disappears (e.g. when a controller host is powered down
unexpectedly). The default agent_down_time is 75s, so if
two of these synchronous RPC messages time out in a row
(due to rabbitmq server issues related to a controller
power down or swact), the neutron agent will be
incorrectly declared down. This
causes the VIM to migrate instances away from the worker
node, which we want to avoid.

To make this more tolerant of temporary failures in the
rabbitmq server, I am increasing the timeout (agent_down_time)
to 150s.
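
The arithmetic behind the new value can be sketched as follows. This is
a simplified model, not neutron code: the function names are
hypothetical, it assumes each timed-out RPC blocks for the full 60s and
the next attempt starts immediately, it ignores the 30s report interval
and the server's own check period, and it assumes an agent counts as
down once the silence strictly exceeds agent_down_time.

```python
# Simplified model of the heartbeat timing described above.
# All names here are illustrative, not part of neutron.

RPC_TIMEOUT_S = 60  # worst-case blocking time of one heartbeat RPC


def blocked_seconds(consecutive_timeouts, rpc_timeout=RPC_TIMEOUT_S):
    """Time the agent spends blocked in back-to-back RPC timeouts."""
    return consecutive_timeouts * rpc_timeout


def declared_down(silence_s, agent_down_time_s):
    """Assumed rule: down once silence exceeds agent_down_time."""
    return silence_s > agent_down_time_s


silence = blocked_seconds(2)  # two timeouts in a row
print(silence)
print(declared_down(silence, 75))   # old default
print(declared_down(silence, 150))  # value set by this change
```

Under these assumptions, two consecutive timeouts block the agent for
120s, which exceeds the 75s default but stays within 150s.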

Change-Id: Iecd1a7d1034bc8c98853ba279336c26dc7bc3fe9
Closes-Bug: 1817935
Signed-off-by: Bart Wensley <barton.wensley@windriver.com>
This commit is contained in:
Bart Wensley
2019-03-18 10:03:27 -05:00
parent 0922358eca
commit 2fcb4f1570

@@ -893,6 +893,9 @@ data:
 enable_new_agents: false
 allow_automatic_dhcp_failover: true
 allow_automatic_l3agent_failover: true
+# Increase from default of 75 seconds to avoid agents being declared
+# down during controller swacts, reboots, etc...
+agent_down_time: 150
 agent:
 root_helper: sudo
 vhost: