Remove hbsAgent restart in failover failure recovery handling

A forced reboot of the active controller in an AIO DC system
puts SM into a failover failure recovery loop that prevents
maintenance from detecting the heartbeat failure of the just-
rebooted controller.

The SM's failover failure recovery handling algorithm includes
a self (sm process) restart preceded by a restart of the
hbsAgent, both added by the following update last year.

update: Add unhealthy state recovery audit to service management (sm)
review: https://review.opendev.org/c/starlingx/ha/+/735219

The self restart of SM was, and still is, required in this case.
The hbsAgent restart, however, was included only as a safety
measure to ensure SM received updated cluster state info, with the
longer-term intention of removing it once the hbsAgent cluster
state change notification improvement was implemented. That
improvement is now implemented and merged by the following update.

update: Mtce heartbeat cluster state change notification improvement
review: https://review.opendev.org/c/starlingx/metal/+/769936

Testing the fix for the following issue in an AIO DC system showed
that the takeover controller did not detect the heartbeat loss of
the just-rebooted standby controller.

title: Force active controller reboot results in a second reboot
issue: https://bugs.launchpad.net/starlingx/+bug/1922584

The hbsAgent is not able to detect the heartbeat loss of the
just-rebooted controller because SM keeps restarting the hbsAgent
before it reaches the heartbeat loss state.
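
The effect described above can be illustrated with a minimal sketch.
It is not the hbsAgent implementation; it only assumes that a monitor
declares heartbeat loss after a number of consecutive missed pulses
and that restarting the monitor discards that count. The function
name, the threshold and the restart interval are all hypothetical.

```cpp
#include <cassert>

// Hypothetical model, not hbsAgent code: a monitor declares heartbeat
// loss after `threshold` consecutive missed pulses, and a restart of
// the monitor resets its miss counter to zero.
//
// Returns the tick at which loss is declared, or -1 if loss is never
// declared within `ticks`. A `restart_interval` of 0 means the monitor
// is never restarted.
int ticks_until_loss(int threshold, int restart_interval, int ticks)
{
    assert(threshold > 0);
    assert(restart_interval >= 0);

    int misses = 0;
    for (int tick = 1; tick <= ticks; ++tick)
    {
        ++misses;                 // peer is down: every pulse is missed
        if (misses >= threshold)
            return tick;          // loss is finally declared
        if (restart_interval > 0 && tick % restart_interval == 0)
            misses = 0;           // monitor restarted: history discarded
    }
    return -1;                    // loss never declared
}
```

With these assumed numbers, ticks_until_loss(10, 0, 100) returns 10,
while ticks_until_loss(10, 5, 100) returns -1: a restart that recurs
more often than the detection threshold keeps the monitor from ever
reaching the heartbeat loss state.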

With the cluster notification improvement update now implemented
and merged, it's time to remove the hbsAgent restart from SM's
failover failure recovery algorithm.

Test Plan:

PASS: Active controller force reboot handling in AIO DC, DX and
      standard systems.
PASS: Standby controller force reboot handling in AIO DC, DX and
      standard systems.

Partial-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66
Eric MacDonald 2021-04-27 09:43:00 -04:00
parent 05a01c2100
commit cb5fa9510f
1 changed file with 1 addition and 3 deletions

@@ -1,5 +1,5 @@
 //
-// Copyright (c) 2018 Wind River Systems, Inc.
+// Copyright (c) 2018-2021 Wind River Systems, Inc.
 //
 // SPDX-License-Identifier: Apache-2.0
 //
@@ -36,7 +36,6 @@ static const int SM_FAILOVER_FAILED_LOG_THROTTLE_THLD = 12;
 // processes to restart over a failover failed recovery
 #define MAX_RESTART_PROCESS_NAME_LEN 10
-#define PROCESS_HBSAGENT ((const char *)("hbsAgent"))
 #define PROCESS_SM ((const char *)("sm"))
 static struct timespec start_time;
@@ -198,7 +197,6 @@ SmErrorT SmFailoverFailedState::event_handler(SmFailoverEventT event, const ISmF
 DPRINTFI("** Failover Failed state recovery **");
 DPRINTFI("************************************");
 sm_node_utils_reset_unhealthy_flag();
-sm_failover_failed_process_restart(PROCESS_HBSAGENT);
 sm_failover_failed_process_restart(PROCESS_SM);
 for ( int i = 0 ; i < 10 ; i++ )
 {