A failure to query process monitor alarms from the
fault manager (FM) during process startup can lead
to a stuck process failure alarm.
Rather than hold up the process monitor startup
sequence due to an unresponsive fault manager,
this update introduces an in-service alarm audit
that looks for asserted alarms and compares that
readout to the process monitor's runtime view.
A difference in view is considered a state mismatch
that requires corrective action. The runtime state
of the process monitor always takes precedence over
what is found in the FM database.
A mismatch is declared and corrective action is
taken if:
- FM has a process failure alarm that pmond does not.
  Corrective Action: Clear alarm in FM database
- FM has a process failure alarm with a severity
  that differs from the pmond runtime state.
  Corrective Action: Update severity in FM database
- FM has a process failure alarm for a process
  that pmond does not recognize.
  Corrective Action: Clear alarm in FM database
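
The three mismatch cases above can be sketched as follows. This is an
illustrative Python model only, not the pmond implementation: the names
Action and audit_alarms, the dictionary shapes and the severity strings
are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical corrective action to apply to the FM database."""
    process: str
    kind: str            # "clear" or "update_severity"
    severity: str = ""   # only set for "update_severity"


def audit_alarms(fm_alarms: Dict[str, str],
                 pmond_state: Dict[str, Optional[str]]) -> List[Action]:
    """Compare alarms asserted in FM against the pmond runtime view.

    fm_alarms:   process name -> severity of the alarm asserted in FM
    pmond_state: process name -> severity of the failure pmond currently
                 holds for that process, or None if it is not failed.
                 A process missing from this map is unknown to pmond.
    The runtime view always wins; FM is corrected to match it.
    """
    actions: List[Action] = []
    for process, fm_severity in sorted(fm_alarms.items()):
        if process not in pmond_state:
            # Alarm for a process pmond does not recognize: clear it.
            actions.append(Action(process, "clear"))
            continue
        runtime_severity = pmond_state[process]
        if runtime_severity is None:
            # FM has an alarm that pmond does not: clear it.
            actions.append(Action(process, "clear"))
        elif runtime_severity != fm_severity:
            # Severity differs: update FM to the runtime severity.
            actions.append(
                Action(process, "update_severity", runtime_severity))
    return actions


if __name__ == "__main__":
    fm = {"sshd": "major", "ntpd": "critical", "ghost": "major"}
    pmond = {"sshd": None, "ntpd": "major"}
    for action in audit_alarms(fm, pmond):
        print(action)
```

An alarm whose severity matches the runtime state produces no action,
which corresponds to the valid active alarm case in the test plan.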
This update runs the audit only after process startup,
and only until the first successful FM query.
A future update may enable the audit in-service.
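
The run-until-first-success behaviour can be modelled roughly as below.
This is a hedged Python sketch, not the pmond code: StartupAudit,
query_fm and run_audit are hypothetical names, and the 50 second period
is taken from the test plan notes. The caller is assumed to invoke
tick() once per retry period.

```python
from typing import Callable, Dict, Optional

AUDIT_RETRY_SECS = 50  # retry period noted in the test plan


class StartupAudit:
    """Runs the alarm audit once, after the first successful FM query."""

    def __init__(self,
                 query_fm: Callable[[], Optional[Dict[str, str]]],
                 run_audit: Callable[[Dict[str, str]], None]) -> None:
        # query_fm returns None when the FM query fails (assumption).
        self._query_fm = query_fm
        self._run_audit = run_audit
        self.done = False

    def tick(self) -> bool:
        """Attempt the audit; return True once it has completed."""
        if self.done:
            # Audit already ran; it is not repeated in-service.
            return True
        alarms = self._query_fm()
        if alarms is None:
            # FM query failed; try again on the next period.
            return False
        self._run_audit(alarms)
        self.done = True
        return True
```

Note that this model assumes a failed query returns promptly. It does
not cover a query that blocks while FM is stalled.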
Test Plan:
PASS: Verify all mismatch case handling
PASS: Verify handling of valid active alarm
PASS: Verify handling of severity mismatch; unsupported
PASS: Verify pmond failure handling regression soak
PASS: Verify pmond process restart regression soak
PASS: Verify alarm handling over pmond process restart
PASS: Verify alarmed state audit period and logging
PASS: Verify pmond process failure alarm remains ignored by pmond
PASS: Verify handling of persistently failed process over pmond restart
PASS: Verify audit handling while FM is not running
  - audit retries every 50 seconds until the FM query is successful
COND: Verify audit handling while FM is stopped/blocked/stalled
  - alarm query blocks until FM runs again or is killed
  - this is the reason the audit is not run in-service
Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
Closes-Bug: 1892884
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>