StarlingX High Availability/Process Monitoring/Service Management
Eric MacDonald cb5fa9510f Remove hbsAgent restart in failover failure recovery handling
A forced reboot of the active controller in an AIO DC system
puts SM into a failover failure recovery loop that prevents
maintenance from detecting the heartbeat failure of the just-
rebooted controller.

The SM's failover failure recovery handling algorithm includes
a self (sm process) restart preceded by a restart of the
hbsAgent, both added by the following update last year.

update: Add unhealthy state recovery audit to service management (sm)
review: https://review.opendev.org/c/starlingx/ha/+/735219

The self restart of SM was, and still is, required in this case.
The hbsAgent restart, however, was included only as a safety
measure to ensure SM received updated cluster state info, with the
longer-term intention of removing it once the hbsAgent cluster
state change notification improvement was implemented. That change
is now implemented and merged by the following update.

update: Mtce heartbeat cluster state change notification improvement
review: https://review.opendev.org/c/starlingx/metal/+/769936

Testing of the fix for the following issue in an AIO DC system
showed that the takeover controller did not detect the heartbeat
loss of the just-rebooted standby controller.

title: Force active controller reboot results in a second reboot
issue: https://bugs.launchpad.net/starlingx/+bug/1922584

The hbsAgent is not able to detect the heartbeat loss of the just-
rebooted controller because SM keeps restarting the hbsAgent before
it reaches the heartbeat loss state.

With the cluster notification improvement update now implemented
and merged, it's time to remove the hbsAgent restart from SM's
failover failure recovery algorithm.
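
The effect of the restart loop can be illustrated with a toy model.
The following is only a sketch, not the hbsAgent or SM code: the
HeartbeatMonitor class, the detects_loss helper, the miss threshold
and the restart interval are all hypothetical. It assumes the monitor
declares heartbeat loss after a number of consecutive missed periods
and that a restart discards the in-progress miss count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Toy stand-in for a heartbeat monitor; NOT the real hbsAgent."""
    loss_threshold: int  # consecutive misses needed to declare loss (assumed)
    misses: int = 0

    def restart(self) -> None:
        # Assumption: a process restart discards the in-progress miss count.
        self.misses = 0

    def tick(self, peer_responded: bool) -> bool:
        """Process one heartbeat period; return True once loss is declared."""
        self.misses = 0 if peer_responded else self.misses + 1
        return self.misses >= self.loss_threshold


def detects_loss(periods: int, loss_threshold: int,
                 restart_every: Optional[int]) -> bool:
    """Simulate a peer that never responds.

    restart_every: restart the monitor every N periods (None = never),
    standing in for a recovery loop that keeps restarting the monitor.
    """
    monitor = HeartbeatMonitor(loss_threshold)
    for period in range(1, periods + 1):
        if monitor.tick(peer_responded=False):
            return True
        if restart_every is not None and period % restart_every == 0:
            monitor.restart()
    return False


# Hypothetical numbers: loss needs 10 missed periods.
with_restart = detects_loss(periods=100, loss_threshold=10, restart_every=4)
without_restart = detects_loss(periods=100, loss_threshold=10, restart_every=None)
print(with_restart, without_restart)
```

In this model, a monitor restarted more often than its loss threshold
never declares the loss, while one left running does.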

Test Plan:

PASS: Active controller force reboot handling in AIO DC, DX and
      standard systems.
PASS: Standby controller force reboot handling in AIO DC, DX and
      standard systems.

Partial-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66
2021-04-27 13:58:32 +00:00
api-ref/source Switch to newer openstackdocstheme and reno versions 2020-06-04 14:27:03 +02:00
devstack Build layering, add layer build config file 2019-10-21 10:53:26 +08:00
doc Switch to newer openstackdocstheme and reno versions 2020-06-04 14:27:03 +02:00
releasenotes Switch to newer openstackdocstheme and reno versions 2020-06-04 14:27:03 +02:00
service-mgmt Remove hbsAgent restart in failover failure recovery handling 2021-04-27 13:58:32 +00:00
service-mgmt-api Add auto-version for remaining stx/ha packages 2020-12-17 13:27:02 -05:00
service-mgmt-client Merge "Add auto-version for remaining stx/ha packages" 2020-12-17 22:35:48 +00:00
service-mgmt-tools Add auto-version for remaining stx/ha packages 2020-12-17 13:27:02 -05:00
stx-ocf-scripts Add auto-version for remaining stx/ha packages 2020-12-17 13:27:02 -05:00
.gitignore [Doc] OpenStack API Reference Guide 2018-09-27 10:14:44 -07:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:24 +00:00
.zuul.yaml Use newer flake8 to run on ubuntu-focal Zuul machines 2020-09-16 13:01:03 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:26 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:26 -07:00
README.rst Followup opendev cleanup and test jobs 2019-04-21 14:31:33 -05:00
centos_build_layer.cfg Build layering, add layer build config file 2019-10-21 10:53:26 +08:00
centos_dev_wheels.inc Add sm-client-wheels to tarball 2019-11-14 10:55:52 -05:00
centos_iso_image.inc Config file changes to add 'stx-ocf-scripts ' after relocation from 'stx-upstream' 2019-09-04 15:59:21 -04:00
centos_pkg_dirs Remove version from sm folder 2019-09-26 14:11:31 -05:00
centos_stable_wheels.inc Add sm-client-wheels to tarball 2019-11-14 10:55:52 -05:00
github_sync.trigger Verify upload to GitHub mirror with a new commit 2020-02-04 11:54:18 -05:00
pylint.rc Adding pylint target to stx-ha 2018-10-04 09:20:06 -05:00
test-requirements.txt Fix zuul jobs broken due to pip upversion 2020-12-17 13:40:42 -06:00
tox.ini Fix zuul jobs broken due to pip upversion 2020-12-17 13:40:42 -06:00

README.rst

ha

StarlingX Service Management