StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald f2fedc0446 Add alarm retry support to maintenance alarm handling daemon
The maintenance alarm handling daemon (mtcalarmd) should not
drop alarm requests simply because the FM process is not running.
Instead, it should retry in that case and in other FM error cases
that are likely to succeed in time if they are retried.

Some error cases, however, do need to be dropped: those that
are unlikely to succeed with retries.

Reviewed FM return codes with the FM designer, which led to a list
of errors that should be dropped and others that should be retried.

This update implements that handling by posting alarms to, and
servicing them from, a first-in / first-out alarm queue.

The typical retry case is the NOCONNECT error code, which occurs
when FM is not running.

Alarm ordering and the first-try timestamp are maintained.
Retries and logs are throttled to avoid flooding.
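
The queue behaviour described above can be sketched roughly as follows.
This is an illustration in Python, not the mtcalarmd code: the class and
function names, the BAD_REQUEST code, and the 5 second retry interval are
all assumptions made for the example. Only NOCONNECT is named in the
commit message.

```python
import time
from collections import deque
from dataclasses import dataclass, field

# Result codes. Only NOCONNECT comes from the commit message;
# OK and BAD_REQUEST are hypothetical stand-ins.
OK = "OK"
NOCONNECT = "NOCONNECT"      # FM is not running: worth retrying
BAD_REQUEST = "BAD_REQUEST"  # assumed example of an error that retries cannot fix

RETRYABLE = {NOCONNECT}


@dataclass
class AlarmRequest:
    alarm_id: str
    # Stamped once when the request is created, never updated on retry.
    first_try: float = field(default_factory=time.time)
    attempts: int = 0


class AlarmQueue:
    """First-in / first-out alarm queue with throttled retries (sketch)."""

    def __init__(self, send, retry_interval=5.0, clock=time.monotonic):
        self._send = send                  # callable(AlarmRequest) -> result code
        self._queue = deque()
        self._retry_interval = retry_interval
        self._clock = clock
        self._next_try = 0.0

    def post(self, request):
        self._queue.append(request)

    def __len__(self):
        return len(self._queue)

    def service(self):
        """Send queued alarms in order; return (delivered, dropped) lists."""
        delivered, dropped = [], []
        if self._clock() < self._next_try:
            return delivered, dropped      # throttled: too soon to retry
        while self._queue:
            request = self._queue[0]
            request.attempts += 1
            result = self._send(request)
            if result == OK:
                delivered.append(self._queue.popleft())
            elif result in RETRYABLE:
                # Leave the request at the head so ordering is preserved,
                # and wait before trying again.
                self._next_try = self._clock() + self._retry_interval
                break
            else:
                dropped.append(self._queue.popleft())
        return delivered, dropped
```

Stopping at the first retryable failure, rather than skipping ahead, is
what keeps alarms in their posted order in this sketch. The first_try
field is set once and carried through every retry, so the alarm keeps
its original posted time when it is finally delivered.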

Test Plan:

PASS: Verify success path alarm handling End-to-End.
PASS: Verify retry handling while FM is not running.
PASS: Verify handling of all FM error codes (fit tool).
PASS: Verify alarm handling under stress soak (inject-alarm script).
PASS: Verify no memory leak over stress soak.
PASS: Verify logging (success, retry, failure).
PASS: Verify alarm posted date is maintained over retry success.

Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86
Closes-Bug: 1841653
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-07 09:07:49 -04:00
api-ref/source Clean up and standardize landing pages 2019-01-09 09:34:38 -08:00
bsp-files Support custom kickstart addon for install from USB 2019-09-20 12:42:22 -04:00
devstack Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
doc Fix the error links for metal docs 2019-07-03 09:20:25 -04:00
installer Configurable Host HTTP/HTTPS Port Binding 2019-02-06 16:04:07 -06:00
inventory Merge "Add inventory specfile for opensuse" 2019-09-20 14:23:16 +00:00
kickstart Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce Add alarm retry support to maintenance alarm handling daemon 2019-10-07 09:07:49 -04:00
mtce-common Add alarm retry support to maintenance alarm handling daemon 2019-10-07 09:07:49 -04:00
mtce-compute Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce-control Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce-storage Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
python-inventoryclient Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
releasenotes Update config for release notes to include project name 2019-02-05 14:14:17 -08:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Minor zuul and tox cleanup related to package re-org 2019-09-09 10:35:11 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst Followup opendev cleanup and test jobs 2019-04-22 16:42:03 +00:00
centos_iso_image.inc Remove Resource Monitor ; aka rmon, from the load 2019-03-19 16:12:38 -04:00
centos_pkg_dirs SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
test-requirements.txt pep8 job enable and fix pep8 reported issue 2018-09-06 09:45:51 +08:00
tox.ini Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00

README.rst

metal

StarlingX Bare Metal Management