A failure to query process monitor alarms from
FM (the fault manager) during process startup can
leave a failed process alarm stuck asserted.
Rather than hold up the process monitor startup
sequence waiting on an unresponsive fault manager,
this update introduces an alarm audit that queries
FM for asserted alarms and compares that readout
to the process monitor's runtime view.
A difference between the two views is considered a
state mismatch that requires corrective action. The
runtime state of the process monitor always takes
precedence over what is found in the FM database.
A mismatch is declared and corrective action is
taken (see the sketch below) if:
- FM has a process failure alarm that pmond does not
Corrective Action: Clear alarm in FM database
- FM has a process failure alarm with a severity
that differs from the pmond runtime state.
Corrective Action: Update severity in FM database
- FM has a process failure alarm for a process
that pmond does not recognize.
Corrective Action: Clear alarm in FM database
This update runs the audit only at process startup,
retrying until the first successful query.
A future update may enable the audit in-service.
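The following is a minimal sketch of the three mismatch checks,
using hypothetical 'FmAlarm' and 'ProcInfo' types and hypothetical
corrective-action helpers; the real audit uses the FM API and
pmond's internal process list.

    #include <cstdio>
    #include <list>
    #include <string>

    struct FmAlarm  { std::string process; int severity; };           // hypothetical
    struct ProcInfo { std::string name; bool failed; int severity; }; // hypothetical

    // placeholder corrective actions (hypothetical names)
    static void clear_alarm(const FmAlarm& a)
    { printf("clear alarm for %s\n", a.process.c_str()); }
    static void update_severity(const FmAlarm& a, int sev)
    { printf("update %s severity to %d\n", a.process.c_str(), sev); }

    // pmond's runtime view always takes precedence over the FM database
    void audit_alarms(const std::list<FmAlarm>& fm_alarms,
                      const std::list<ProcInfo>& procs)
    {
        for (const FmAlarm& alarm : fm_alarms)
        {
            const ProcInfo* proc = nullptr;
            for (const ProcInfo& p : procs)
                if (p.name == alarm.process) { proc = &p; break; }

            if (proc == nullptr)
                clear_alarm(alarm);    // process unknown to pmond: clear
            else if (!proc->failed)
                clear_alarm(alarm);    // pmond sees no failure: clear
            else if (proc->severity != alarm.severity)
                update_severity(alarm, proc->severity); // align severity
        }
    }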
Test Plan:
PASS: Verify all mismatch case handling
PASS: Verify handling of valid active alarm
PASS: Verify handling of severity mismatch ; unsupported
PASS: Verify pmond failure handling regression soak
PASS: Verify pmond process restart regression soak
PASS: Verify alarm handling over pmond process restart
PASS: Verify alarmed state audit period and logging
PASS: Verify pmond process failure alarm remains ignored by pmond
PASS: Verify handling of persistently failed process over pmond restart
PASS: Verify audit handling while FM is not running
- audit retries every 50 seconds until the FM query succeeds
COND: Verify audit handling while FM is stopped/blocked/stalled
- alarm query blocks until FM runs again or is killed
- this is the reason the audit is not run in-service.
Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
Closes-Bug: 1892884
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The maintenance process monitor (pmon) should only
recover failed processes when the system state is
'running' or 'degraded'.
The current implementation allowed process recovery
in other non-in-service states, including an unknown
state when systemd returned no data on the state query.
This update tightens up the system state check by
adding retries to the state query utility and
restricting accepted states to 'running' and 'degraded'.
This change prevents pmon from inadvertently killing
and recovering the mtcClient, which would indirectly
kill the mtcClient's fail-safe sysreq reboot child
thread, when the state query returns anything other
than 'running' or 'degraded' during a shutdown.
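A minimal sketch of the tightened check, assuming the state is read
with 'systemctl is-system-running' and that a few retries suffice;
the real utility name and retry count may differ.

    #include <cstdio>
    #include <cstring>
    #include <string>

    // Read the systemd system state; returns "" if no data is produced.
    static std::string query_system_state(void)
    {
        char buf[64] = {0};
        FILE* fp = popen("systemctl is-system-running", "r");
        if (fp == nullptr)
            return "";
        if (fgets(buf, sizeof(buf), fp) != nullptr)
            buf[strcspn(buf, "\n")] = '\0';
        pclose(fp);
        return buf;
    }

    // Process recovery is permitted only for the accepted states.
    bool recovery_allowed(void)
    {
        std::string state;
        for (int retry = 0; retry < 3 && state.empty(); ++retry)
            state = query_system_state(); // retry if systemd returns no data
        return ((state == "running") || (state == "degraded"));
    }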
Change-Id: I605ae8be06f8f8351a51afce98a4f8bae54a40fd
Closes-Bug: 1883519
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The pmon-restart service, through a call to respawn_process,
increments that process's restart counter but does not clear
that counter after a successful restart.
As a result, each pmon-restart mistakenly contributes to that
process's failure count, pre-loading that process's restart
counter by one for every pmon-restart of that process.
The effect is best described by example.
Say a process is pmon-restarted 4 times during one day,
incrementing that process's restart counter to 4. Assuming
its conf file specifies a threshold of 3, it has already
exceeded its threshold. Then, even days later, when that
process experiences a real failure, pmon will immediately
take the severity action because the failure threshold had
already been exceeded.
This update ensures a process's restart counter is cleared
after a successful pmon-restart operation, in the process PID
registration phase of recovery.
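A minimal sketch of the fix, with hypothetical names; the counter
is cleared once the restarted process has produced its pidfile and
its new PID is registered, i.e. only after the restart is known to
have succeeded.

    // hypothetical slimmed-down process control block
    struct process_config
    {
        int pid;
        int restarts_cnt; // failures counted against the action threshold
    };

    // PID registration phase of pmon-restart recovery
    void register_process_pid(process_config& proc, int new_pid)
    {
        proc.pid = new_pid;
        proc.restarts_cnt = 0; // the fix: a successful pmon-restart must
                               // not pre-load the failure count
    }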
Test Plan:
PASS: Verify pmon-restart continues to work.
PASS: Verify proper thresholding of failed process following
many pmon-restart operations.
PEND: Verify pmon-restart and process failure automated test script
against this update. 5 loops, all processes.
Change-Id: Ib01446f2e053846cd30cb0ca0e06d7c987cdf581
Closes-Bug: 1853330
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Each monitored process's config file contains a startuptime
label that specifies how many seconds it takes for that newly
started process to stabilize and produce its pidfile.
The pmon-restart feature needs to delay monitoring a
newly restarted process for 'startuptime' seconds.
Failing to do so can cause it to fail the restarted
process too early if pidfile creation is delayed.
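A minimal sketch of the delay, with hypothetical fields; the real
pmond drives this from its own timer service rather than raw
time() comparisons.

    #include <ctime>

    struct process_config
    {
        int startuptime;      // seconds to stabilize, from the conf file
        time_t monitor_after; // hypothetical: earliest time to monitor
    };

    // arm the delay when a pmon-restart completes
    void on_restart_complete(process_config& p)
    {
        p.monitor_after = time(nullptr) + p.startuptime;
    }

    // gate pidfile/health checks until the process has had
    // 'startuptime' seconds to stabilize
    bool monitoring_due(const process_config& p)
    {
        return (time(nullptr) >= p.monitor_after);
    }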
Test Plan:
PASS: Verify collectd pmon-restart function with soak ;
> 5000+ collectd pmon-restarts.
PASS: Verify pmond regression test suite (test-pmon-action.sh)
> restart command ; graceful restart all monitored processes. (5 loops)
> kill command ; kill and recover all monitored processes. (5 loops)
Regression:
PASS: Verify pmon-stop command/function
PASS: Verify pmon-start command/function also honors the startuptime.
PASS: Verify pmon-stop auto start after auto-start timeout
PASS: Verify System Install
PASS: Verify Patching (soak)
Change-Id: I9fd7bba8e49fe4c28281539ab4930bdac370ef11
Closes-Bug: 1844724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The following warnings are being addressed.
- hbsUtil.cpp: The correct value for MAX_ENTRY_STR_LEN should be 13,
considering that values can be higher than 9999, plus the final space
and the leading '\n'.
- hwmonAlarm.cpp: Following a discussion with Eric MacDonald, the
entire case for FM_ALARM_STATE_CLEAR was removed, as the reason
buffer is not used and thus there is no need to store a string there.
- pmonHdlr.cpp: A truncation warning was shown due to possible usage
of an uninitialized buffer. The fix here is to check for NULL.
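A minimal sketch of the pmonHdlr.cpp style of fix, with hypothetical
names: guard against a NULL source so the copy into the fixed-size
buffer is always well defined and terminated.

    #include <cstdio>
    #include <cstddef>

    void set_status_string(char* dest, size_t size, const char* src)
    {
        if (src == NULL)
        {
            dest[0] = '\0'; // leave a well-defined empty string
            return;
        }
        snprintf(dest, size, "%s", src); // bounded and always terminated
    }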
Change-Id: I3c80cce99b2f521f8c7a9de4ce2b6036960dfaf6
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
The maintenance process monitor is failing the hbsClient
process over config or process reload operations.
The issue relates to the hbsClient's subfunction being
'last-config' while pmon fails to properly gate the start
of the active monitoring FSM until the passive monitoring
phase is complete and in the MANAGE state.
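A minimal sketch of the gating condition, with hypothetical names;
the real FSM has more stages and per-process state.

    // hypothetical passive monitoring FSM stages
    enum passive_stage { STAGE_POLL, STAGE_MANAGE };

    struct process_info
    {
        bool active_monitoring; // conf file requests active monitoring
        passive_stage stage;    // current passive monitoring stage
    };

    // active monitoring may only start once passive monitoring has
    // reached the MANAGE stage for this process
    bool active_monitor_allowed(const process_info& p)
    {
        return (p.active_monitoring && (p.stage == STAGE_MANAGE));
    }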
Test Plan:
PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure
PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.
Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The commit shown below introduced a main loop audit that
mistakenly registers subfunction processes that are still
waiting for /var/run/.compute_config_complete in the
'polling' state during unlock enable.
Doing so inadvertently changes their monitor FSM stage
from 'Poll' to 'Manage' before configuration is complete.
Since config is not complete, the hbsClient has not initialized
its socket interface and is unable to service active monitoring
requests. This leads to quorum failure and a watchdog reboot.
commit 537935bb0c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date: Mon Jul 9 08:36:22 2018 -0400
Reorder process restart operations to prevent pmond futex deadlock
The Fix: Don't run the audit for processes that are still
waiting in the 'polling' state.
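A minimal sketch of the fix, with hypothetical names: the main loop
audit now skips any process still in the 'polling' stage.

    #include <list>

    enum monitor_stage { STAGE_POLL, STAGE_MANAGE }; // hypothetical
    struct process_info { monitor_stage stage; };    // hypothetical

    static void audit_process(process_info&) { /* registration (elided) */ }

    void main_loop_audit(std::list<process_info>& processes)
    {
        for (process_info& p : processes)
        {
            if (p.stage == STAGE_POLL)
                continue; // waiting on config complete; do not register
                          // or advance this process to 'Manage'
            audit_process(p);
        }
    }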
Test Plan:
Provision AIO, verify there is no quorum failure, and inspect
logs for correct behavior.
Change-Id: I179c78309517a34285783ee99bbb3d699915cb83
Closes-Bug: 1804318
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server and
guest-agent from mtce by splitting the guest component into
the stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include the devel header directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>