When a critical memory alarm is raised, the reporting plugin's
name is injected into the collectd plugin fault notifier's
degrade list over a collectd process restart.
The recent introduction of multiple instance-based memory alarms
has exposed a limitation in the management and content of the
degrade list. Because the content injected into the degrade list
for degradable events is not unique, the result can be either a
stuck degrade (this case) or a missing degrade.
This update modifies the content of the degrade list to ensure
all entries are unique by using an alarm's entity id rather than
the more generic plugin name.
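The effect of entry uniqueness can be shown with a minimal sketch.
It is not the notifier's code: the helper names and the entity id
strings below are assumptions made up for illustration, and only
the missing degrade side of the problem is shown.

```python
# Hypothetical sketch; helper names and entity ids are illustrative
# assumptions, not the notifier's actual code.

def add_degrade(degrade_list, entry):
    """Add an entry only if it is not already present."""
    if entry not in degrade_list:
        degrade_list.append(entry)


def clear_degrade(degrade_list, entry):
    """Remove an entry if it is present."""
    if entry in degrade_list:
        degrade_list.remove(entry)


# Keyed by plugin name: two instance alarms collapse into one entry,
# so clearing either alarm drops the degrade for both.
by_plugin = []
add_degrade(by_plugin, "memory")    # alarm raised for instance A
add_degrade(by_plugin, "memory")    # alarm raised for instance B
clear_degrade(by_plugin, "memory")  # instance A clears
# by_plugin is now empty although instance B is still in alarm.

# Keyed by entity id: each alarm owns its own entry.
by_entity = []
add_degrade(by_entity, "host=controller-0.numa=node0")
add_degrade(by_entity, "host=controller-0.numa=node1")
clear_degrade(by_entity, "host=controller-0.numa=node0")
# by_entity still holds the node1 entry, so the degrade persists.
```

With per-alarm entries, the degrade state can be derived from
whether the list is empty.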
An additional issue was identified with respect to how filesystem
usage overage alarms are managed, due to recent additions to the
list of monitored filesystems. Filesystem overage alarms are also
degrade list candidates, so the aforementioned degrade list change
needed to account for filesystems as well.
One recently added monitored filesystem name conflicted with
how filesystem instances were tracked, which led to a bouncing
alarm if that filesystem experienced overage. Given that there
was already special case handling for the root fs, rather
than add another special case to remedy this issue,
the method of mapping filesystem instance to mountpoint was
changed from a list to a dictionary. With that change there
is no longer a limitation or special case handling required for
filesystem mountpoints that conflict with how the stock
collectd plugin reports filesystem instances.
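A dictionary based mapping might look like the sketch below. It
is an illustration only, under these assumptions: the stock plugin
names an instance by reporting '/' as 'root' and otherwise dropping
the leading '/' and replacing each remaining '/' with '-'; the
mountpoint '/opt/my-data' is invented for the example; and
naive_mountpoint() shows one way a name conflict could arise, not
the previous implementation.

```python
# Hypothetical sketch; function names and mountpoints are
# illustrative assumptions, not the plugin's actual code.

def to_instance(mountpoint):
    """Derive an instance name under the assumed naming rule:
    '/' becomes 'root', otherwise drop the leading '/' and
    replace each remaining '/' with '-'."""
    if mountpoint == "/":
        return "root"
    return mountpoint[1:].replace("/", "-")


# '/opt/my-data' is made up; its name already contains a '-'.
MOUNTPOINTS = ["/", "/var/log", "/opt/my-data"]

# Build the instance -> mountpoint dictionary once, up front.
INSTANCE_TO_MOUNTPOINT = {to_instance(m): m for m in MOUNTPOINTS}


def naive_mountpoint(instance):
    """Reverse the naming rule by string substitution. This needs a
    root special case and breaks when a mountpoint contains '-'."""
    if instance == "root":
        return "/"
    return "/" + instance.replace("-", "/")


def lookup_mountpoint(instance):
    """Dictionary lookup: no special cases; None if unknown."""
    return INSTANCE_TO_MOUNTPOINT.get(instance)
```

Here lookup_mountpoint('opt-my-data') returns '/opt/my-data',
while naive_mountpoint('opt-my-data') returns '/opt/my/data'.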
Test Plan:
PASS: Verify existing alarm and degrade management of
instance and non-instance based alarms during normal
runtime as well as over a collectd process restart.
PASS: Verify handling of non-instance based alarm(s)
over process restart when the alarm condition
no longer exists following the process restart.
PASS: Verify degrade list management and content.
PASS: Verify filesystem instance to mountpoint mapping.
PASS: Verify data model content using state audit and
list management with debug options turned on.
PASS: Verify alarm and degrade handling of an overage on a
filesystem that follows the active controller.
PASS: Verify update as patch.
Regression:
PASS: Verify alarm and degrade handling of 'all' collectd
plugins including over collectd process restarts.
PASS: Verify alarm and degrade management stress soak
that involved multiple plugins asserting/clearing
multiple alarm and degradable conditions over a
24 hour period.
Change-Id: I5ea389fb092a6404616d7ea0e8d54daa64ad7ea2
Closes-Bug: 1903731
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>