Disable vrrp healthchecks by default

VRRP healthchecks were enabled by default starting in the 19.07 charm
release for network deployments which utilize l3ha or dvr+snat. The
healthchecks have specific expectations that are not satisfied in all
data centers. When they are not, failed healthchecks trigger router
failovers, which causes network problems.

This change sets the default value of the config option to 0, which
disables the VRRP healthchecks by default and requires users to opt in
to using them. The description of the option has been updated to
indicate that enabling them may lead to routers failing over if ICMP
pings are missed.

Change-Id: Ie281a311a95ba394d72c2dfeeb0a1a0a12847e77
Closes-Bug: #192101
Billy Olsen 2021-03-24 12:52:43 -07:00
parent 7b9b0de521
commit eb4e3e3bc3
2 changed files with 12 additions and 7 deletions

@@ -404,13 +404,18 @@ options:
       access. The charm will go into a blocked state if this is attempted.
   keepalived-healthcheck-interval:
     type: int
-    default: 30
+    default: 0
     description: |
-      By default all HA routers will check their external network gateway
-      by sending a ping and if that fails they trigger a vrrp transition. This
-      option defines how frequently this check is performed. Setting this value
-      to 0 will disable the healthchecks. Note that this only applies when
-      using l3ha and dvr_snat.
+      Specifies the frequency (in seconds) at which HA routers will check
+      their external network gateway by performing an ICMP ping between the
+      virtual routers. When the ping check fails, this will trigger the HA
+      routers to failover to another node. A value of 0 will disable this
+      check. This setting only applies when using l3ha and dvr_snat.
+      .
+      WARNING: Enabling the health checks should be done with caution as it
+      may lead to rapid failovers of HA routers. ICMP pings are low priority
+      and may be dropped or take longer than the 1 second afforded by neutron,
+      which leads to routers failing over to other nodes.
   of-inactivity-probe:
     type: int
     default: 10
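
The WARNING in the new description can be illustrated with a rough model. The sketch below is not how keepalived or neutron implement the check; it only counts how many ping samples would miss the 1 second window mentioned in the description, treating a dropped ping as a failure. The function name, the constant, and the sample latencies are hypothetical.

```python
from typing import Iterable, Optional

# The 1 second window mentioned in the option description (assumed fixed here).
PING_TIMEOUT_SECONDS = 1.0


def failed_checks(latencies: Iterable[Optional[float]],
                  timeout: float = PING_TIMEOUT_SECONDS) -> int:
    """Count ping samples that would fail the healthcheck.

    A sample of None models a dropped ICMP packet; a float is the
    round-trip time in seconds. Hypothetical helper for illustration only.
    """
    return sum(
        1 for latency in latencies
        if latency is None or latency > timeout
    )


if __name__ == "__main__":
    # Made-up samples: three fast replies, one drop, one slow reply.
    samples = [0.002, 0.004, None, 1.3, 0.003]
    print(failed_checks(samples))
```

In this simplified model each failed sample stands for a potential failover, which is why a busy network that deprioritizes ICMP can cause routers to move between nodes even though the gateway is reachable.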

@@ -278,7 +278,7 @@ class OVSPluginContextTest(CharmTestCase):
             'nsg_log_output_base': None,
             'nsg_log_rate_limit': None,
             'nsg_log_burst_limit': 25,
-            'keepalived_healthcheck_interval': 30,
+            'keepalived_healthcheck_interval': 0,
             'of_inactivity_probe': 10,
             'disable_mlockall': False,
         }
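
The opt-in semantics of the option can be sketched as follows. This is a minimal illustration, assuming the charm config is available as a plain mapping; the helper name `healthcheck_settings` is hypothetical and is not part of the charm. Only the option name and the meaning of 0 are taken from the diff above.

```python
from typing import Mapping, Tuple

OPTION = 'keepalived-healthcheck-interval'  # option name from config.yaml


def healthcheck_settings(config: Mapping[str, object]) -> Tuple[bool, int]:
    """Return (enabled, interval_seconds) for a charm-style config mapping.

    Hypothetical helper: a missing option falls back to the new default
    of 0, and 0 means the healthcheck is disabled.
    """
    interval = int(config.get(OPTION, 0))
    if interval < 0:
        raise ValueError(f'{OPTION} must be >= 0, got {interval}')
    return interval > 0, interval


if __name__ == "__main__":
    print(healthcheck_settings({}))            # new default: disabled
    print(healthcheck_settings({OPTION: 30}))  # explicit opt-in at the old value
```

With the new default, an operator who wants the previous behavior has to set the option to a positive value explicitly, for example the old default of 30.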