StarlingX Bare Metal and Node Management, Hardware Maintenance
5c043f7ca9
There is the potential for a race condition that can lead mtce to incorrectly fail hosts because of heartbeat failure event messages sourced from the inactive controller.

During a split-brain recovery scenario, a swact left the hbsAgent on the new standby controller believing it was still on the active controller. In this specific split-brain failure mode, the controller that was active and then (after the swact) standby was failing heartbeat to its peer and to other nodes in the system, even though the new active controller saw heartbeat working fine.

The problem was that the inactive controller detected a heartbeat loss and sent the loss message to mtce before mtce was able to update the inactive controller's heartbeat activity status, which would have gated sending the loss event.

This update adds a layer of protection by intentionally ignoring heartbeat events from the inactive controller that might slip through because of this activity state change race condition.

Also fixed a log flood in the hbsAgent on large systems.

Change-Id: I825a801166b3e80cbf67945c7f587851f4e0d90b
Closes-Bug: 1813976
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
api-ref/source
bsp-files
devstack
doc
installer
inventory
kickstart
mtce
mtce-common
mtce-compute
mtce-control
mtce-storage
python-inventoryclient
releasenotes
.gitignore
.gitreview
.zuul.yaml
centos_iso_image.inc
centos_pkg_dirs
CONTRIBUTORS.wrs
LICENSE
README.rst
test-requirements.txt
tox.ini
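
The protection described in the commit message above can be illustrated with a small model. This is only a sketch of the idea, not the mtce implementation: the `HeartbeatEvent` and `MaintenanceAgent` classes, their fields, and the host names are all hypothetical, and the real code lives in the C++ sources under `mtce`. The sketch assumes the maintenance side knows which controller is currently active and drops any heartbeat event whose sender is a different controller.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HeartbeatEvent:
    """Hypothetical heartbeat event as received by the maintenance side."""
    sender: str   # controller whose heartbeat agent reported the event
    target: str   # host the event is reported against
    kind: str     # "loss" is the only kind modelled here


@dataclass
class MaintenanceAgent:
    """Hypothetical stand-in for the maintenance process that fails hosts."""
    active_controller: str
    failed_hosts: set = field(default_factory=set)
    ignored_events: list = field(default_factory=list)

    def swact(self, new_active: str) -> None:
        # After a swact the maintenance side knows the new active controller
        # immediately, even if the old controller's agent has not yet been
        # told that it is now inactive.
        self.active_controller = new_active

    def handle_event(self, event: HeartbeatEvent) -> bool:
        # Extra layer of protection: ignore events that are not sourced
        # from the controller currently considered active.
        if event.sender != self.active_controller:
            self.ignored_events.append(event)
            return False
        if event.kind == "loss":
            self.failed_hosts.add(event.target)
        return True


agent = MaintenanceAgent(active_controller="controller-0")
agent.swact("controller-1")

# Stale loss report from the now-inactive controller: ignored.
stale = HeartbeatEvent(sender="controller-0", target="compute-3", kind="loss")
stale_accepted = agent.handle_event(stale)

# Loss report from the active controller: acted on.
real = HeartbeatEvent(sender="controller-1", target="compute-7", kind="loss")
real_accepted = agent.handle_event(real)
```

In this model the receiver-side check works even when the sender's own activity state is stale, which is the window the commit message describes: the sender-side gate depends on a status update that may arrive too late, while the receiver-side check depends only on state the receiver already holds.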
metal
StarlingX Bare Metal Management