Remove hbsAgent restart in failover failure recovery handling
A forced reboot of the active controller in an AIO DC system puts SM into a failover failure recovery loop that prevents maintenance from detecting the heartbeat failure of the just-rebooted controller.

The SM's failover failure recovery handling algorithm includes a self (sm process) restart preceded by a restart of the hbsAgent, both added by the following update last year.

update: Add unhealthy state recovery audit to service management (sm)
review: https://review.opendev.org/c/starlingx/ha/+/735219

The self restart of SM was and is required in this case. However, the restart of the hbsAgent was only included as a safety measure, at the time, to ensure SM received updated cluster state info. The hbsAgent restart was only added at that time with the longer term intention to have it removed once the hbsAgent cluster state change notification improvement was implemented.

That change is now implemented and merged by the following update.

update: Mtce heartbeat cluster state change notification improvement
review: https://review.opendev.org/c/starlingx/metal/+/769936

Testing of the fix for the following issue in an AIO DC system resulted in the takeover controller not detecting a heartbeat loss of the just rebooted standby controller.

title: Force active controller reboot results in a second reboot
issue: https://bugs.launchpad.net/starlingx/+bug/1922584

The hbsAgent is not able to detect the heartbeat loss of the just-booted controller because SM keeps restarting it before it reaches the heartbeat loss state.

With the cluster notification improvement update now implemented and merged it's time to remove the hbsAgent restart from SM's failover failure recovery algorithm.

Test Plan:

PASS: Active controller force reboot handling in AIO DC, DX and standard systems.
PASS: Standby controller force reboot handling in AIO DC, DX and standard systems.

Partial-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66
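The failure mode described above (a monitor that is restarted before it can reach its loss state) can be illustrated with a small model. This is only a sketch under stated assumptions: the function name, the tick-based timing, and the idea of a consecutive-miss threshold are hypothetical and are not taken from the hbsAgent source.

```cpp
// Hypothetical model, not hbsAgent code: a monitor declares heartbeat loss
// after `loss_threshold` consecutive missed heartbeats. Restarting the
// monitor resets its miss count.
//
// Returns the tick at which loss is declared, or -1 if it is never declared
// within `max_ticks`.
int ticks_until_loss_detected(int loss_threshold, int restart_period, int max_ticks)
{
    int missed = 0;
    for (int tick = 1; tick <= max_ticks; ++tick)
    {
        // restart_period <= 0 means the monitor is never restarted
        if (restart_period > 0 && tick % restart_period == 0)
        {
            missed = 0;   // a restart wipes the accumulated miss count
            continue;
        }
        if (++missed >= loss_threshold)
        {
            return tick;  // loss state reached
        }
    }
    return -1;            // loss state never reached
}
```

With illustrative numbers, a threshold of 10 misses is reached at tick 10 when the monitor is left alone, but is never reached when the monitor is restarted every 5 ticks, which is the shape of the problem this commit addresses.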
parent 05a01c2100
commit cb5fa9510f
@@ -1,5 +1,5 @@
 //
-// Copyright (c) 2018 Wind River Systems, Inc.
+// Copyright (c) 2018-2021 Wind River Systems, Inc.
 //
 // SPDX-License-Identifier: Apache-2.0
 //
@@ -36,7 +36,6 @@ static const int SM_FAILOVER_FAILED_LOG_THROTTLE_THLD = 12;
 
 // processes to restart over a failover failed recovery
 #define MAX_RESTART_PROCESS_NAME_LEN 10
-#define PROCESS_HBSAGENT ((const char *)("hbsAgent"))
 #define PROCESS_SM ((const char *)("sm"))
 
 static struct timespec start_time;
@@ -198,7 +197,6 @@ SmErrorT SmFailoverFailedState::event_handler(SmFailoverEventT event, const ISmF
 DPRINTFI("** Failover Failed state recovery **");
 DPRINTFI("************************************");
 sm_node_utils_reset_unhealthy_flag();
-sm_failover_failed_process_restart(PROCESS_HBSAGENT);
 sm_failover_failed_process_restart(PROCESS_SM);
 for ( int i = 0 ; i < 10 ; i++ )
 {
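The recovery sequence that remains after this change can be sketched as follows. The stub functions and the action log are hypothetical stand-ins introduced only for illustration; they are not the SM implementations of `sm_node_utils_reset_unhealthy_flag` or `sm_failover_failed_process_restart`.

```cpp
#include <string>
#include <vector>

// Hypothetical action log used only to make the sequence observable.
static std::vector<std::string> g_actions;

// Hypothetical stand-in for clearing the node's unhealthy flag.
static void reset_unhealthy_flag_stub()
{
    g_actions.push_back("reset_unhealthy_flag");
}

// Hypothetical stand-in for requesting a restart of a named process.
static void process_restart_stub(const char* name)
{
    g_actions.push_back(std::string("restart:") + name);
}

// Sequence after this change: the unhealthy flag is cleared and only the
// sm process is restarted; hbsAgent is no longer restarted.
std::vector<std::string> failover_failed_recovery_sketch()
{
    g_actions.clear();
    reset_unhealthy_flag_stub();
    process_restart_stub("sm");
    return g_actions;
}
```

Calling `failover_failed_recovery_sketch()` yields two recorded actions, the flag reset followed by the `sm` restart, with no `hbsAgent` entry.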