ha/service-mgmt
Eric MacDonald cb5fa9510f Remove hbsAgent restart in failover failure recovery handling
A forced reboot of the active controller in an AIO DC system
puts SM into a failover failure recovery loop that prevents
maintenance from detecting the heartbeat failure of the just-
rebooted controller.

The SM's failover failure recovery handling algorithm includes
a self (sm process) restart preceded by a restart of the
hbsAgent, both added by the following update last year.

update: Add unhealthy state recovery audit to service management (sm)
review: https://review.opendev.org/c/starlingx/ha/+/735219

The self restart of SM was and is required in this case. However,
the hbsAgent restart was included only as a safety measure to ensure
SM received updated cluster state info, with the longer-term
intention of removing it once the hbsAgent cluster state change
notification improvement was implemented. That change is now
implemented and merged by the following update.

update: Mtce heartbeat cluster state change notification improvement
review: https://review.opendev.org/c/starlingx/metal/+/769936

Testing of the fix for the following issue in an AIO DC system
resulted in the takeover controller not detecting the heartbeat loss
of the just-rebooted standby controller.

title: Force active controller reboot results in a second reboot
issue: https://bugs.launchpad.net/starlingx/+bug/1922584

The hbsAgent is not able to detect the heartbeat loss of the just-
rebooted controller because SM keeps restarting the hbsAgent before
it reaches the heartbeat loss state.
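
The effect described above can be illustrated with a toy model. This
is only a sketch under stated assumptions: the HeartbeatMonitor class,
the miss threshold of 10 and the restart interval of 5 ticks are all
hypothetical and are not taken from the hbsAgent or SM sources. It
shows that a monitor whose miss counter is reset by a restart more
often than the threshold allows never declares heartbeat loss.

```python
# Toy model only; names and numbers are hypothetical, not mtce or SM code.
class HeartbeatMonitor:
    def __init__(self, miss_threshold):
        self.miss_threshold = miss_threshold
        self.misses = 0

    def restart(self):
        # A process restart discards the accumulated miss count.
        self.misses = 0

    def tick(self, peer_responded):
        # Returns True once enough consecutive heartbeats were missed.
        if peer_responded:
            self.misses = 0
        else:
            self.misses += 1
        return self.misses >= self.miss_threshold


def ticks_until_loss_detected(miss_threshold, restart_every, max_ticks=100):
    """Return the tick at which loss is declared, or None if it never is."""
    monitor = HeartbeatMonitor(miss_threshold)
    for tick in range(1, max_ticks + 1):
        if restart_every and tick % restart_every == 0:
            monitor.restart()
            continue  # this tick is spent restarting; no heartbeat counted
        if monitor.tick(peer_responded=False):
            return tick
    return None


# The peer never responds in either run (it was just rebooted).
without_restart = ticks_until_loss_detected(miss_threshold=10, restart_every=0)
with_restart = ticks_until_loss_detected(miss_threshold=10, restart_every=5)
```

In this model, the run without restarts declares loss once the
threshold is reached, while the run with periodic restarts never
does, which mirrors the recovery loop described in this commit.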

With the cluster notification improvement update now implemented
and merged, it's time to remove the hbsAgent restart from SM's
failover failure recovery algorithm.

Test Plan:

PASS: Active controller force reboot handling in AIO DC, DX and
      standard systems.
PASS: Standby controller force reboot handling in AIO DC, DX and
      standard systems.

Partial-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66
2021-04-27 13:58:32 +00:00
..
sm Remove hbsAgent restart in failover failure recovery handling 2021-04-27 13:58:32 +00:00
sm-common Add auto-version for remaining stx/ha packages 2020-12-17 13:27:02 -05:00
sm-db Fix SQLite3 concurrent access issue 2021-03-18 11:08:27 -04:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:26 -07:00