metal/mtce-common/src/common
Eric MacDonald da398e0c5f Debian: Make Mtce offline handler more resilient to slow shutdowns
The current offline handler assumes the node is offline after
'offline_search_count' reaches 'offline_threshold' count
regardless of whether mtcAlive messages were received during
the search window.

The offline algorithm is meant to require that no mtcAlive
messages be seen for the full offline_threshold count.

During a slow shutdown the mtcClient keeps running longer than
expected, which can lead maintenance to see the node as
recovered prematurely.

This update manages the offline search counter to ensure that
it only reaches the count threshold after seeing no mtcAlive
messages for the full search count. Any mtcAlive message seen
during the count triggers a count reset.
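
The counter behavior described above can be sketched roughly as
follows. This is a minimal illustration, not the actual mtcAgent
code: the OfflineSearch struct and its audit() method are
hypothetical names, and only 'offline_search_count' and
'offline_threshold' come from this commit message.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-node offline search state.
// Only the two field names are taken from the commit message.
struct OfflineSearch {
    int offline_search_count = 0; // consecutive audit periods with no mtcAlive
    int offline_threshold;        // silent periods required to declare offline

    explicit OfflineSearch(int threshold) : offline_threshold(threshold) {}

    // Called once per audit period (assumed cadence).
    // Returns true only after 'offline_threshold' consecutive
    // periods in which no mtcAlive message was seen.
    bool audit(bool mtc_alive_seen) {
        if (mtc_alive_seen) {
            // Any mtcAlive during the search window restarts the count.
            offline_search_count = 0;
            return false;
        }
        ++offline_search_count;
        return offline_search_count >= offline_threshold;
    }
};
```

With a threshold of 3, a single mtcAlive seen on the third period
resets the count, so the node is only declared offline after three
further silent periods.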

This update also:
1. Adjusts the reset retry cadence from 7 to 12 seconds
   to prevent unnecessary reboot thrash during
   the current shutdown.
2. Clears the hbsClient ready event at the start of the
   subfunction handler so the heartbeat soak is only
   started after seeing heartbeat client ready events
   that follow the main config.

Test Plan:

PASS: Debian and CentOS Build and DX install
PASS: Verify search count management
PASS: Verify the issue does not occur over lock/unlock soak (100+)
      - the same test without this update did show the issue.
PASS: Monitor alive logs for behavioral correctness
PASS: Verify recovery reset occurs after expected extended time.

Closes-Bug: 1993656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
2022-10-24 15:57:43 +00:00
Makefile Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
alarmUtil.cpp Debian: Make Mtce offline handler more resilient to slow shutdowns 2022-10-24 15:57:43 +00:00
alarmUtil.h Alarm Hostname controller function has in-service failure reported 2022-10-05 10:30:01 -04:00
bmcUtil.cpp Mtce: Add ActionInfo extension support for reset operations. 2022-10-13 17:40:05 +00:00
bmcUtil.h Mtce: Add ActionInfo extension support for reset operations. 2022-10-13 17:40:05 +00:00
fitCodes.h Add mtcAgent socket initialization failure retry handling. 2020-04-01 19:24:22 +00:00
hostClass.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
hostClass.h Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
hostUtil.cpp Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
hostUtil.h Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
httpUtil.cpp Mtce: Fix bmc password fetch error handling 2022-06-01 15:21:05 +00:00
httpUtil.h Remove all nova and libvirt files from mtce-common 2019-03-19 15:23:36 -05:00
ipmiUtil.cpp Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
ipmiUtil.h Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
jsonUtil.cpp Mtce: Add ActionInfo extension support for reset operations. 2022-10-13 17:40:05 +00:00
jsonUtil.h Remove all nova and libvirt files from mtce-common 2019-03-19 15:23:36 -05:00
keyClass.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
keyClass.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
logMacros.h Disable Redfish BMC audit and improve reinstall failure handling 2020-11-16 15:15:22 +00:00
msgClass.cpp Debian: Redfishtool requests fail when IPV4 address has square brackets 2022-10-06 22:21:38 +00:00
msgClass.h Debian: Redfishtool requests fail when IPV4 address has square brackets 2022-10-06 22:21:38 +00:00
nlEvent.cpp Fix heartbeat messaging when interface is set to 'lo' 2020-06-26 14:16:41 +00:00
nlEvent.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
nodeBase.cpp Add Debian packaging for mtce packages 2021-10-29 09:17:00 -05:00
nodeBase.h Improved maintenance handling of spontaneous active controller reboot 2021-04-30 15:35:53 +00:00
nodeEvent.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeEvent.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeMacro.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeTimers.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
nodeTimers.h Make Mtce Power-Off FSM verify power-off 2020-11-22 13:38:33 +00:00
nodeUtil.cpp Add Debian packaging for mtce packages 2021-10-29 09:17:00 -05:00
nodeUtil.h Prevent pmond process recovery when system is not running 2020-06-15 11:09:47 -04:00
pingUtil.cpp Fix BMC access loss handling 2020-01-03 09:34:37 -05:00
pingUtil.h Fix BMC access loss handling 2020-01-03 09:34:37 -05:00
redfishUtil.cpp Mtce: Add ActionInfo extension support for reset operations. 2022-10-13 17:40:05 +00:00
redfishUtil.h Mtce: Add ActionInfo extension support for reset operations. 2022-10-13 17:40:05 +00:00
regexUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
regexUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
returnCodes.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
secretUtil.cpp Mtce: Fix bmc password fetch error handling 2022-06-01 15:21:05 +00:00
secretUtil.h Mtce: Fix bmc password fetch error handling 2022-06-01 15:21:05 +00:00
threadUtil.cpp Improve mtcAgent interrupted thread cleanup 2021-03-15 10:51:16 -04:00
threadUtil.h Debian: Redfishtool requests fail when IPV4 address has square brackets 2022-10-06 22:21:38 +00:00
timeUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
timeUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
tokenUtil.cpp Remove references to ceilometer in maintenance 2019-04-30 14:28:12 -04:00
tokenUtil.h MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00