metal/mtce/src/pmon
Eric MacDonald 6cf5e84825 Add alarmed process audit to Process Monitor
A failure to query process monitor alarms from
FM during process startup can lead to a stuck
failed process alarm.

Rather than hold up the process monitor startup
sequence due to an unresponsive fault manager,
this update introduces an in-service alarm audit
that looks for asserted alarms and compares that
readout to the process monitor's runtime view.

A difference in view is considered a state mismatch
that requires corrective action. The runtime state
of the process monitor always takes precedence over
what is found in the FM database.

A mismatch is declared and corrective action is
taken if:

 - FM has a process failure alarm for a process
   that pmond does not consider failed.
   Corrective Action: Clear alarm in FM database

 - FM has a process failure alarm with a severity
   that differs from the pmond runtime state.
   Corrective Action: Update severity in FM database

 - FM has a process failure alarm for a process
   that pmond does not recognize.
   Corrective Action: Clear alarm in FM database

This update runs the audit only at process startup,
repeating it until the first successful query.
A future update may enable the audit in-service.
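
The "retry until first successful query" behaviour can be sketched as a small
timer-driven helper. This is a sketch under assumptions, not the pmond
implementation: StartupAudit and tick are hypothetical names, the query is
modelled as a callback that returns true on success, and the 50 second
default is taken from the retry period noted in the test plan below. It also
assumes the query returns promptly. It does not address the case where the
query itself blocks.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: runs an audit query periodically until it first succeeds.
class StartupAudit {
public:
    explicit StartupAudit(std::function<bool()> query, long period_secs = 50)
        : query_(std::move(query)), period_secs_(period_secs)
    {
        assert(period_secs_ > 0);
    }

    // Call from the main loop with a monotonic time in seconds.
    void tick(long now_secs)
    {
        if (done_ || now_secs < next_due_)
            return;                              // finished, or not due yet
        if (query_())
            done_ = true;                        // first success ends the audit
        else
            next_due_ = now_secs + period_secs_; // retry after the period
    }

    bool done() const { return done_; }

private:
    std::function<bool()> query_;
    long period_secs_;
    long next_due_ = 0;
    bool done_ = false;
};
```

Once done() returns true, further tick() calls are no-ops, matching the
statement that the audit stops after the first successful query.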

Test Plan:

PASS: Verify all mismatch case handling
PASS: Verify handling of valid active alarm
PASS: Verify handling of severity mismatch; unsupported
PASS: Verify pmond failure handling regression soak
PASS: Verify pmond process restart regression soak
PASS: Verify alarm handling over pmond process restart
PASS: Verify alarmed state audit period and logging
PASS: Verify pmond process failure alarm remains ignored by pmond
PASS: Verify handling of persistently failed process over pmond restart
PASS: Verify audit handling while FM is not running
      - audit retries every 50 seconds until FM query is successful

COND: Verify audit handling while FM is stopped/blocked/stalled
      - alarm query blocks until FM runs again or is killed
      - this is the reason the audit is not run in-service.

Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
Closes-Bug: 1892884
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-09 08:22:32 -05:00
scripts        De-branding in starlingx/metal: Titanium Cloud -> StarlingX  2020-04-03 07:58:25 +02:00
Makefile       Add EXTRALDFLAGS to linker in a number of Makefiles          2019-02-28 22:34:54 -06:00
pmon.h         Add alarmed process audit to Process Monitor                 2021-03-09 08:22:32 -05:00
pmonAlarm.cpp  Add alarmed process audit to Process Monitor                 2021-03-09 08:22:32 -05:00
pmonAlarm.h    Add alarmed process audit to Process Monitor                 2021-03-09 08:22:32 -05:00
pmonFsm.cpp    Merge "Removing unused flag disable_worker_services"         2019-11-04 13:52:12 +00:00
pmonHdlr.cpp   Add alarmed process audit to Process Monitor                 2021-03-09 08:22:32 -05:00
pmonInit.cpp   Output error Full_init_reqd parameter value in a debug log   2019-05-11 15:07:28 +08:00
pmonMsg.cpp    Add 50 byte hostname support to maintenance                  2019-07-12 12:20:08 +00:00