StarlingX system monitoring and reporting tools
Eric MacDonald c6cab97ee0 Fix memory instance handling over collectd process restart
With a critical memory alarm raised, the collectd plugin fault
notifier's degrade list is injected with the reporting plugin's
name over a collectd process restart.

The recent introduction of multiple instance based memory alarms
has exposed a limitation in the management and content of the
degrade list that can lead to both stuck degrade (this case)
as well as missing degrade due to the lack of uniqueness of the
content injected into the degrade list based on degradable events.

This update modifies the content of the degrade list to ensure
all entries are unique by using an alarm's entity id rather than
the more generic plugin name.
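
The uniqueness problem described above can be illustrated with a minimal
sketch. This is not the plugin's actual code: the helper names and the
entity id strings below are hypothetical and chosen only to show why a
key shared by several alarm instances loses information.

```python
# Hypothetical sketch; helper names and entity id strings are illustrative
# assumptions, not taken from the collectd-extensions source.

def add_degrade(entries, key):
    """Add a key to the degrade list unless it is already present."""
    if key not in entries:
        entries.append(key)


def clear_degrade(entries, key):
    """Remove a key from the degrade list if it is present."""
    if key in entries:
        entries.remove(key)


# Keyed by plugin name: two memory instances share a single entry.
by_plugin = []
add_degrade(by_plugin, "memory")    # instance A raises a critical alarm
add_degrade(by_plugin, "memory")    # instance B raises a critical alarm
clear_degrade(by_plugin, "memory")  # instance A clears
# The list is now empty although instance B is still critical,
# which is the "missing degrade" case.

# Keyed by entity id: each instance owns its own entry.
by_entity = []
add_degrade(by_entity, "host=controller-0.numa=node0")
add_degrade(by_entity, "host=controller-0.numa=node1")
clear_degrade(by_entity, "host=controller-0.numa=node0")
# The node1 entry remains, so the host stays degraded until it clears too.
```

With unique keys, an entry is only removed by the clear event of the same
alarm instance that added it.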

An additional issue was identified with respect to how filesystem
usage overage alarms are managed, due to recent additions to the
list of monitored filesystems. Filesystem overage alarms are also
degrade list candidates so the aforementioned degrade list change
needed to account for filesystem as well.

One recently added monitored filesystem name conflicted with
how filesystem instances were tracked, which led to a bouncing
alarm if that filesystem experienced overage. Given that there
was already special case handling for the root fs, rather
than add an additional special case to remedy this issue,
the method of mapping filesystem-instance to mountpoint was
changed from a list to a dictionary. With that change there
is no longer a limitation or special case handling required for
filesystem mountpoints that conflict with how the stock
collectd plugin reports filesystem instances.
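
A dictionary-based mapping of this kind might look like the sketch below.
It assumes the stock collectd df plugin names an instance by reporting "/"
as "root" and replacing the remaining "/" separators with "-"; the function
names and the example mountpoints are hypothetical, and the commit does not
say which mountpoint actually conflicted.

```python
# Hypothetical sketch; function names and mountpoints are illustrative.
# Assumed instance naming: "/" becomes "root", otherwise the leading "/"
# is dropped and the remaining "/" separators become "-".

def df_instance(mountpoint):
    """Return the assumed df plugin instance name for a mountpoint."""
    if mountpoint == "/":
        return "root"
    return mountpoint.lstrip("/").replace("/", "-")


def build_instance_map(mountpoints):
    """Map each instance name back to its original mountpoint."""
    return {df_instance(mp): mp for mp in mountpoints}


def naive_reverse(instance):
    """Reverse the naming by string substitution alone (lossy)."""
    return "/" + instance.replace("-", "/")


monitored = ["/", "/var/log", "/var/lib/docker-distribution"]
instance_to_mountpoint = build_instance_map(monitored)

# The dictionary lookup needs no special case for the root filesystem
# and is not confused by a "-" that is part of the mountpoint name.
root_mountpoint = instance_to_mountpoint["root"]
dashed_mountpoint = instance_to_mountpoint["var-lib-docker-distribution"]

# String substitution cannot tell an original "-" from a converted "/".
wrong_guess = naive_reverse("var-lib-docker-distribution")
```

Because the dictionary is built from the list of monitored mountpoints, the
original path is recovered by lookup rather than reconstructed from the
instance name.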

Test Plan:

PASS: Verify existing alarm and degrade management of
      instance and non-instance based alarms for both normal
      runtime as well as over a collectd process restart.

PASS: Verify handling of non-instance based alarm(s)
      over process restart when the alarm condition
      no longer exists following the process restart.

PASS: Verify degrade list management and content.

PASS: Verify filesystem instance to mountpoint mapping.

PASS: Verify data model content using state audit and
      list management with debug options turned on.

PASS: Verify alarm and degrade handling of a filesystem
      and overage that follows the active controller.

PASS: Verify update as patch

Regression:

PASS: Verify alarm and degrade handling of 'all' collectd
      plugins including over collectd process restarts.

PASS: Verify alarm and degrade management stress soak
      that involved multiple plugins asserting/clearing
      multiple alarm and degradable conditions over a
      24 hour period.

Change-Id: I5ea389fb092a6404616d7ea0e8d54daa64ad7ea2
Closes-Bug: 1903731
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-30 11:30:15 -05:00
collectd-extensions Fix memory instance handling over collectd process restart 2020-11-30 11:30:15 -05:00
influxdb-extensions De-branding in starlingx/monitoring: Titanium Cloud -> StarlingX 2020-04-06 10:33:18 +02:00
kube-cpusets Add kube-cpusets tool to summarize kubernetes container cpusets 2020-06-17 13:14:50 -04:00
monitor-tools Refactor patches for initscripts package 2018-11-21 01:26:32 +00:00
vm-topology Enable Flake8 Docstring Errors 2019-04-18 11:50:45 -04:00
.gitignore Adding zuul jobs for new repo 2019-09-09 14:37:23 -05:00
.gitreview Add a .gitreview file to the new repo 2019-09-09 09:35:13 -05:00
.zuul.yaml Tox and Zuul job for the bandit code scan in stx/monitoring 2020-07-14 15:48:17 +00:00
CONTRIBUTING.rst Adding zuul jobs for new repo 2019-09-09 14:37:23 -05:00
HACKING.rst Adding zuul jobs for new repo 2019-09-09 14:37:23 -05:00
centos_build_layer.cfg Build layering, add layer build config file 2019-10-15 19:21:39 +08:00
centos_iso_image.inc Add kube-cpusets tool to summarize kubernetes container cpusets 2020-06-17 13:14:50 -04:00
centos_pkg_dirs Add kube-cpusets tool to summarize kubernetes container cpusets 2020-06-17 13:14:50 -04:00
github_sync.trigger Trigger upload job to upload repo to GitHub 2020-02-07 10:13:03 -05:00
requirements.txt Adding zuul jobs for new repo 2019-09-09 14:37:23 -05:00
test-requirements.txt Use newer flake8 on python3.8 zuul systems 2020-11-05 15:33:28 -06:00
tox.ini Use newer flake8 on python3.8 zuul systems 2020-11-05 15:33:28 -06:00