160 Commits

Author SHA1 Message Date
Leonardo Fagundes Luz Serrano
462750e14b Add debian package for monitor-tools
Add debian packaging infrastructure for monitor-tools
to build a debian package.

Test Plan: build pkg; build image; compare with RPM

PASS pkg builds
PASS image builds
PASS same contents and permissions as RPM

Story: 2009101
Task: 43960

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I2ba30a627cf2c64c88a3d0586d97fcafe117e669
2022-01-24 18:55:28 +00:00
Zuul
821aff9947 Merge "Re-enable important py3k checks for monitoring" 2021-11-03 14:02:30 +00:00
Zuul
3ff8c48cc3 Merge "Re-enable important py3k checks for monitoring kube-memory" 2021-10-28 14:29:49 +00:00
Bernardo Decco
7a028a29c0 Re-enable important py3k checks for monitoring
Re-enabling some of the disabled tox warnings present on
the pylint.rc file

Re-enabling:

W1638: range-builtin-not-iterating
W1636: map-builtin-not-iterating
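
For context, both checks flag code that relies on the Python 2 behaviour
of map() and range() returning lists. A minimal sketch follows; the
variable names are illustrative and not taken from the monitoring code:

```python
# On Python 3, map() and range() return lazy iterables, not lists.
cpus = range(4)
labels = map(str, cpus)

# Wrap in list() wherever list behaviour (indexing, len, reuse) is needed.
cpu_list = list(cpus)
label_list = list(labels)

# A map object is exhausted after one pass, so a second pass yields nothing.
leftover = list(labels)
```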

Test Plan: Sanity test run on AIO-SX:

PASS: test_system_health_pre_session[pods]
PASS: test_system_health_pre_session[alarms]
PASS: test_system_health_pre_session[system_apps]
PASS: test_wr_analytics[deploy_and_remove]
PASS: test_horizon_host_inventory_display
PASS: test_lock_unlock_host[controller]
PASS: test_pod_to_pod_connection
PASS: test_pod_to_service_connection
PASS: test_host_to_service_connection

Story: 2006796
Task: 43443
Signed-off-by: Bernardo Decco <bernardo.deccodesiqueira@windriver.com>
Change-Id: I6c13ae171ee4a41377dad55ed3c519ee710b4d88
2021-10-21 12:34:55 +00:00
Bernardo Decco
a469d8ad9b Re-enable important py3k checks for monitoring kube-memory
Re-enabling some of the disabled tox warnings present on
the pylint.rc file

Re-enabling:

W1619: old-division
W1633: round-builtin
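
For context, these two checks flag arithmetic whose result differs
between Python 2 and Python 3. A minimal sketch with made-up values
(not taken from the kube-memory code), as evaluated on Python 3:

```python
used = 7
total = 2

# old-division: "/" is true division on Python 3; use "//" when the
# Python 2 integer (floor) result is what the code actually wants.
ratio = used / total
whole = used // total

# round-builtin: Python 3 round() rounds halves to the nearest even
# number and returns an int when no digit count is given.
rounded = round(2.5)
```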

Test Plan: Sanity test run on AIO-SX:

PASS: test_system_health_pre_session[pods]
PASS: test_system_health_pre_session[alarms]
PASS: test_system_health_pre_session[system_apps]
PASS: test_wr_analytics[deploy_and_remove]
PASS: test_horizon_host_inventory_display
PASS: test_lock_unlock_host[controller]
PASS: test_pod_to_pod_connection
PASS: test_pod_to_service_connection
PASS: test_host_to_service_connection

Story: 2006796
Task: 43444
Signed-off-by: Bernardo Decco <bernardo.deccodesiqueira@windriver.com>
Change-Id: I00dc37bbd8f60f475f85e4f0463b7c066a719f1f
2021-10-21 12:34:43 +00:00
Zuul
ea8ab4adf6 Merge "Add flux-helm to list of platform namespaces." 2021-10-08 20:16:07 +00:00
Tracey Bogue
b6096a6f98 Add flux-helm to list of platform namespaces.
Story: 2009138
Task: 43078

Signed-off-by: Tracey Bogue <tracey.bogue@windriver.com>
Change-Id: I24071faa51d90276b5b5787310ea7132e18cdb05
2021-10-04 08:29:30 -05:00
Bernardo Decco
e325886708 Removing py36 gates from zuul for monitoring
Removing redundant py36 Zuul jobs since we now have py39 Zuul jobs in
place with the debian nodeset

Story: 2006796
Task: 43489
Signed-off-by: Bernardo Decco <bernardo.deccodesiqueira@windriver.com>
Change-Id: I3e6fe3a146b3ac01218eb1428c3bad35b87c5c9c
2021-09-30 10:17:52 -03:00
Zuul
e5bdd73802 Merge "Add pylint py3 portability checks for the monitoring/kube-memory repo" 2021-09-15 21:10:17 +00:00
Fabricio Henrique Ramos
95715f15a4 Add pylint py3 portability checks for the monitoring/kube-memory repo
A lot of work has gone into making sure that StarlingX is python3
compatible. To ensure future compatibility, enable the python3
portability checks. Disable the checks that are raising errors.
Another set of commits will address the offending code.
Add following suppress warnings in pylint.rc:
- W1618: no-absolute-import
- W1619: old-division
- W1633: round-builtin

Story: 2006796
Task: 43134

Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: Ib3c97263d34328f6ffc27ef08690d23325654b42
2021-09-13 09:55:08 +00:00
Fabricio Henrique Ramos
95f00ae668 Add pylint py3 portability checks for the monitoring/kube-cpusets repo
A lot of work has gone into making sure that StarlingX is python3
compatible. To ensure future compatibility, enable the python3
portability checks. Disable the checks that are raising errors.
Another set of commits will address the offending code.
Add following suppress warnings in pylint.rc:
- W1618: no-absolute-import
- W1636: map-builtin-not-iterating
- W1638: range-builtin-not-iterating

Story: 2006796
Task: 43135
Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: I6ebe9c7215f1e4622a81b0dd79b36cfcc6a7d86f
2021-09-13 09:54:54 +00:00
Zuul
f8648f0b41 Merge "py3: Add support for python 3.9" 2021-08-31 13:45:14 +00:00
Charles Short
5b62a25ca4 py3: Add support for python 3.9
Enable python 3.9 in tox and zuul gate.

Story: 2009101
Task: 43104

Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: I3ebb23574ad34a1078fae3fbbc65ce5457d46c69
2021-08-27 11:42:44 -04:00
Eric MacDonald
fcc8ddda66 Change platform memory usage instance type to 'memory'
The platform memory data-set type is currently set to 'percent'.

It is possible to over subscribe platform memory usage to more
than 100%.

Collectd drops sample values that are greater than 100 when its
data-set type is 'percent'. Collectd considers a percent value
greater than 100 to be an invalid value.

This update changes the data-set type for platform memory usage
from 'percent' to 'memory' to allow memory usage values greater
than 100 to be handled.
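
The effect can be sketched as a simple range check. The bounds below are
an assumption modelled on collectd's stock types.db and should be checked
against the types.db shipped on the target; accept_sample is a
hypothetical helper, not a collectd API:

```python
# Assumed (min, max) bounds per data-set type, modelled on stock types.db.
DATASET_BOUNDS = {
    "percent": (0.0, 100.1),
    "memory": (0.0, 281474976710656.0),
}


def accept_sample(ds_type, value):
    """Return True if the value falls inside the data-set's valid range."""
    low, high = DATASET_BOUNDS[ds_type]
    return low <= value <= high


# An over-subscribed reading of 120 is out of range for 'percent'
# but in range for 'memory'.
dropped = not accept_sample("percent", 120.0)
kept = accept_sample("memory", 120.0)
```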

Test Plan:

PASS: Verify that platform memory overage alarm value is reported
             as the 'actual' value in the alarm Reason Text.
PASS: Verify platform memory usage values that exceed the major
             threshold are alarmed 'major'.
PASS: Verify platform memory usage values that exceed the critical
             threshold are alarmed 'critical', even if the
             debounced value exceeds 100.
PASS: Verify ridiculously large values are still alarmed and that
             value is still included in the alarm Reason Text.

Change-Id: I7189671e20c92656f820fda74c4871504d89e73a
Closes-Bug: 1940875
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-08-23 20:05:20 -04:00
John Kung
ecd744ba0a Handle kube ApiException during collectd platform monitoring
During stress test/high platform load it is possible that the
kube-apiserver responds with a kube ApiException.

As platform monitoring of cpu and memory should not be affected by
unresponsive kube-api server, allow the kube ApiException to be handled
and the remaining platform resource utilization monitoring to proceed.

This could help identify the issue by allowing the raise of
the platform alarm (e.g. 100.101 Platform CPU threshold exceeded,
100.103 Memory threshold exceeded).
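
A minimal sketch of the handling described above. The ApiException class
here is a local stand-in for the kubernetes client exception so the
sketch runs without that package, and sample_platform_usage and its
arguments are hypothetical names, not the plugin's actual functions:

```python
class ApiException(Exception):
    """Stand-in for the kubernetes client's ApiException."""


def sample_platform_usage(list_pods, read_platform_cpu):
    """Return (platform_cpu, pod_count).

    pod_count is None when the kube API call failed for this interval.
    """
    try:
        pod_count = len(list_pods())
    except ApiException:
        # kube-apiserver is unresponsive: skip pod data for this interval.
        pod_count = None
    # Platform monitoring proceeds whether or not the kube query worked.
    return read_platform_cpu(), pod_count
```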

Verified:
  o Platform CPU Alarm is raised with stress test
  o Platform CPU Alarm is raised with stress test
    and intermittent ApiException
  o Memory Alarm is raised with stress test
  o Memory Alarm is raised with stress test
    and intermittent ApiException
  o the above alarm conditions are cleared after
    debounce when stress condition is removed

Closes-Bug: 1939172
Signed-off-by: John Kung <john.kung@windriver.com>
Change-Id: I2c9c39a390af1d7ae752ad00db18384479cf6e99
2021-08-11 08:00:41 -05:00
Andrei Grosu
aa8665cebf Fix startup issues for collectd.
- Use encodeutils from the oslo library to handle string encodings.
- Expand the generator into a list.
- Use the python3 iterator __next__().
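
The last two items can be sketched as follows. read_samples is a
hypothetical generator used only for illustration; the plugin's own
generators are not shown in this commit message:

```python
def read_samples():
    """Hypothetical generator standing in for a plugin data source."""
    for value in (1, 2, 3):
        yield value


gen = read_samples()

# The next() builtin calls __next__() on Python 3; the gen.next() method
# exists only on Python 2.
first = next(gen)

# Expanding the generator into a list allows indexing, len() and reuse.
rest = list(gen)
```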

Note: there needs to be a separate task to remove the Encoding parameter from python_plugins.conf which can be cherry-picked only for python3 deployments.

Story: 2008454
Task: 42647
Depends-On: Iaa7bd0cadd3b1d097b276dcc37ebceaeb208a6a5

Signed-off-by: Andrei Grosu <andrei.grosu@windriver.com>
Change-Id: I58cd4829806e98b1e15471ce97a7c7ba6a2fe135
(cherry picked from commit f2e5263206c87dde9b602c3660b8191398d9c555)
2021-07-27 08:46:39 -04:00
Fernando Theirs
27db764e67 Remove InfluxDB
InfluxDB was not fully productized, nor is it used by other end-users.
It should therefore be removed from all deployments so that it does not
consume unnecessary resources (cpu, memory and storage).

Parts of the system's dependencies on InfluxDB were removed here.

Story: 2009018
Task: 42761
Depends-On: https://review.opendev.org/799502

Signed-off-by: Fernando Theirs <Fernando.Theirs@windriver.com>
Change-Id: I85acf8a94e54171162b9be6fbf816532cf602831
2021-07-15 12:02:38 -03:00
Zuul
80585f539d Merge "Fix zuul errors due to changes in dependencies" 2021-06-21 15:20:33 +00:00
Zuul
1900c2fdfa Merge "Better repair action for alarm 100.104" 2021-05-31 12:12:54 +00:00
Jerry Sun
b425fe849a Better repair action for alarm 100.104
This commit adds a better proposed repair action for filesystem
threshold alarm 100.104.

Closes-Bug: 1927155
Signed-off-by: Jerry Sun <jerry.sun@windriver.com>
Change-Id: I1a27d4bc438b98c00d0fe4eb3b30e4672552f90a
2021-05-28 12:51:07 -04:00
Zuul
ffafbeae6a Merge "Add kube-memory tool to summarize memory usage" 2021-05-25 17:32:57 +00:00
Enzo Candotti
36c8ae8395 Add kube-memory tool to summarize memory usage
This tool gathers memory usage information for all
kubernetes containers and system services displayed
in cgroup memory, that are running on current host.

This displays the total resident set size per namespace and container,
the aggregate memory usage per system service, and the platform memory
usage.
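
The per-namespace aggregation can be sketched as below. The tuple layout
and the function name are assumptions for illustration; the tool's real
data structures are not described in this message:

```python
from collections import defaultdict


def rss_per_namespace(containers):
    """Sum resident set size per namespace.

    containers: iterable of (namespace, container_name, rss_bytes).
    """
    totals = defaultdict(int)
    for namespace, _container, rss_bytes in containers:
        totals[namespace] += rss_bytes
    return dict(totals)
```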

Closes-Bug: 1886868

Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
Change-Id: Id130ed0d2794cdd555bdb068e8453cb8e9bd29d2
2021-05-22 18:51:26 -03:00
Takamasa Takenaka
2ef5451f44 Format 2 lines ntpq data into 1 lines
The logic expected one line of data per ntp server entry in the
ntpq result, but each entry spanned 2 lines. When a peer server
was selected, the script checked whether its refid was reliable
but could not find the refid because it is on the following line.
This fix formats the 2 lines of data into 1 line.
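
A minimal sketch of such a join, assuming a wrapped entry puts the
remote name alone on its first line and the remaining fields on the
next. The function name and the sample data in the usage note are
hypothetical, not the script's actual code or real ntpq output:

```python
def join_wrapped_entries(lines):
    """Merge a line holding a single field with the line that follows it."""
    merged = []
    pending = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if pending is not None:
            # Second half of a wrapped entry: append it to the remote name.
            merged.append(pending + " " + " ".join(fields))
            pending = None
        elif len(fields) == 1:
            # Only the remote name is present: wait for the next line.
            pending = fields[0]
        else:
            merged.append(" ".join(fields))
    if pending is not None:
        merged.append(pending)
    return merged
```

For example, the two lines "*2001:db8::1" and "  .GPS. 1 u 10 64 377"
would be returned as the single entry "*2001:db8::1 .GPS. 1 u 10 64 377".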

The minor alarm "NTP cannot reach external time source;
syncing with peer controller only" is removed because
NTP does not prioritize external time source over peer.

Closes-Bug: 1889101

Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc
2021-04-29 10:38:05 -03:00
Charles Short
6a9358c261 Fix zuul errors due to changes in dependencies
Pin hacking to < 4.0.1 to fix zuul gate issues.

Test:
Ran tox -e flake8 command to validate the flake8 job and result.

Related-Bug: 1926172

Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: Ia2e746ba513c0d073b60e76b2d2afdfe8b6c9745
2021-04-26 11:45:02 -04:00
Eric MacDonald
d37490b814 Add alarm audit to starlingx collectd fm notifier plugin
This update adds common plugin support for alarm state auditing.
The audit is able to detect and correct the following alarm
state errors:

   Error Case                Correction Action
   -----------------------   -----------------
 - stale alarm             ; delete alarm
 - missing alarm           ; assert alarm
 - alarm severity mismatch ; refresh alarm
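
The three cases can be sketched as a comparison of two maps. The
function name and the entity id strings are illustrative assumptions;
the plugin's actual data model is not shown in this message:

```python
def audit_alarms(expected, raised):
    """Compare the plugin's view with the alarms currently raised.

    Both arguments map entity id -> severity.
    Returns a list of (action, entity_id) corrections.
    """
    actions = []
    for entity_id in sorted(set(expected) | set(raised)):
        if entity_id not in expected:
            actions.append(("delete", entity_id))    # stale alarm
        elif entity_id not in raised:
            actions.append(("assert", entity_id))    # missing alarm
        elif expected[entity_id] != raised[entity_id]:
            actions.append(("refresh", entity_id))   # severity mismatch
    return actions
```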

The common audit is enabled for the fm_notifier plugin that supports
alarm management for the following resources.

 - CPU with alarm id 100.101
 - Memory with alarm id 100.103
 - Filesystem with alarm id 100.104

Other plugins may use this common audit in the future but only the
above resources have the audit enabled for them by this update.

Test Plan:

PASS: Verify stale alarm detection/correction handling
PASS: Verify missing alarm detection/correction  handling
PASS: Verify alarm severity mismatch detection/correction handling
PASS: Verify host only audits its own specified alarms
PASS: Verify success path of monitoring a single and mix
      of base and instance alarms of varying severity while
      such alarm conditions come and go
PASS: Verify alarm audit of mix of base and instance alarms
      over a collectd process restart
PASS: Verify audit handling of alarm that migrates from
      major to critical to major to clear
PASS: Verify audit handling transition between alarm and
      no alarm conditions
PASS: Verify soak of random cpu, memory and filesystem
      overage alarm assertions and clears that also involve
      manual alarm deletions, assertions and severity changes
      that exercise new audit features

Regression:

PASS: Verify alarm and audit handling over Swact with mounted
      filesystem that has active alarm
PASS: Verify collectd logs following a system install and
      while alarms are managed during above soak
PASS: Verify behavior while FM is killed or stopped/started
PASS: Verify Standard system install with Sanity and Regression
PASS: Verify AIO DX/DC systems install with Sanity and Regression

Closes-Bug: 1925210
Change-Id: I1cafd17ad07ec769240de92ae4e67cb1357f0992
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-20 11:48:51 -04:00
Zuul
3628db6e77 Merge "Bandit should only be installed in py3 env" vr/stx.5.0 2021-04-12 18:04:21 +00:00
albailey
14e1a9a82b Bandit should only be installed in py3 env
Running tox for linters fails since the bandit being pulled
in is python3 only. This is similar to other bugs where a new
version is released which drops py2 support.

In this env, we only include bandit if we are testing and running
in py3.

Partial-Bug: 1922590
Change-Id: I11b7d974ae3b64e7846e1420521dee0d48128fc5
Signed-off-by: albailey <Al.Bailey@windriver.com>
2021-04-07 17:53:55 -04:00
Gerry Kopec
19460ecbd2 Add platform namespaces to collectd
Add missing platform namespaces (armada, cert-manager, portieris, vault
and notification) to collectd kubernetes system list.

Change-Id: I341d802210388e5e1f3fd2d7a11fa0593c44fa68
Closes-Bug: 1922629
Signed-off-by: Gerry Kopec <gerry.kopec@windriver.com>
2021-04-06 22:23:39 -04:00
Eric MacDonald
a2a2a88887 Avoid loading collectd's default plugins
The current opensource collectd rpm installs
several default plugins, some that overlap
starlingx developed plugins and others that
simply collect way too much data.

The plugins in question are:

/etc/collectd.d/90-default-plugins-syslog.conf
/etc/collectd.d/90-default-plugins-memory.conf
/etc/collectd.d/90-default-plugins-load.conf
/etc/collectd.d/90-default-plugins-interface.conf
/etc/collectd.d/90-default-plugins-cpu.conf

This update moves the value added starlingx
plugins to /etc/collectd.d/starlingx and
relies on another puppet update to change
collectd's plugin search path accordingly.

Test Plan:

PASS: Verify default plugins are not loaded
      and their samples are not collected.
PASS: Verify patch apply and remove.
      Note: this is a reboot-required patch.
PASS: Verify the daily influxdb usage
      drops by 80-85%.

Regression:

PASS: Verify collectd alarm/degrade regression soak

Change-Id: Ic7884ae69014fa274f0bd0515adec90b08747c67
Closes-Bug: 1905581
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-02-03 12:03:06 -05:00
Eric MacDonald
ea4b515f91 Add node ready check to collectd plugins
This update adds a second collectd plugin
initialization enhancement. First update
added a config complete gate:

https://review.opendev.org/c/starlingx/monitoring/+/736817

Turns out that not all plugins are ready to sample
immediately following the config complete state.
One example is FM on the active controller needs
time to get going before plugins can query their
alarms on startup. Also, some plugins need more
time than others.

To account for both cases this update adds a
thresholded node ready gate that can be tailored
to a plugin to hold off fm access and sampling
until its ready threshold is reached.
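
A thresholded gate of this kind can be sketched as a per-plugin
counter. The class name and the idea of counting audit intervals are
assumptions made for illustration, not the plugin's actual code:

```python
class NodeReadyGate(object):
    """Hold off sampling until ready() has been polled 'threshold' times."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def ready(self):
        """Call once per audit interval; True once the threshold is met."""
        if self.count < self.threshold:
            self.count += 1
            return False
        return True


# A plugin that needs more time simply uses a larger threshold.
gate = NodeReadyGate(threshold=2)
states = [gate.ready() for _ in range(4)]
```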

Test Plan:

PASS: Verify AIO SX and DX system install
PROG: Verify Storage system install
PASS: Verify AIO SX node lock and unlock
PASS: Verify AIO Standby controller lock and unlock
PASS: Verify Standard controller lock and unlock
PASS: Verify Compute and Storage node lock and unlock
PASS: Verify Dead-Office-Recovery (AIO DX)
PASS: Verify collectd sampling and logs

Partial-Bug: 1872979
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I044d812542a4222214c7d13e231ac4024cca9800
2021-01-26 12:00:16 -05:00
Zuul
b66c85287d Merge "Increase field widths of PID for schedtop" 2021-01-04 19:57:14 +00:00
Don Penney
3809c69d81 Add auto-version for remaining stx/monitoring packages
Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
the version is incremented above the hardcoded version.

Story: 2008455
Task: 41463
Signed-off-by: Don Penney <don.penney@windriver.com>
Change-Id: If41d630c97354014b12424ed305d6c5cbb022a5a
2020-12-17 13:25:29 -05:00
Zuul
1e951176df Merge "Fix memory instance handling over collectd process restart" 2020-12-10 14:23:24 +00:00
Carmen Rata
81b7727a2e Fix influxdb log file permissions
Update /var/log/influxdb/influxd.log permissions to 640 from 644
to disallow world readable but at the same time to allow group
read access.
The changes are made to comply as much as possible with
openscap rules security requirements.
Verified that installation is successful for AIO-SX
and Standard 2+2 system configurations.

Story: 2008037
Task: 40694

Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
Change-Id: I284fc6882043b4a4d271bd5963fca94bc7a1e390
2020-12-02 13:08:03 -05:00
Eric MacDonald
c6cab97ee0 Fix memory instance handling over collectd process restart
With a critical memory alarm raised, the collectd plugin fault
notifier's degrade list is injected with the reporting plugin's
name over a collectd process restart.

The recent introduction of multiple instance based memory alarms
has exposed a limitation in the management and content of the
degrade list that can lead to both stuck degrade (this case)
as well as missing degrade due to the lack of uniqueness of the
content injected into the degrade list based on degradable events.

This update modifies the content of the degrade list to ensure
all entries are unique by using an alarm's entity id rather than
the more generic plugin name.

An additional issue was identified with respect to how filesystem
usage overage alarms are managed, due to recent additions to the
list of monitored filesystems. Filesystem overage alarms are also
degrade list candidates so the aforementioned degrade list change
needed to account for filesystem as well.

One recently added monitored filesystem name conflicted with
how filesystem instances were tracked, which led to a bouncing
alarm if that filesystem experienced overage. Given that there
was already a special case handling for the root fs, rather
than add an additional special case to remedy this issue,
the method of mapping filesystem-instance to mountpoint was
changed from a list to a dictionary. With that change there
is no longer a limitation or special case handling required for
filesystem mountpoints that conflicted with how the stock
collectd plugin reports filesystem instances.
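
A minimal sketch of a dictionary based mapping. It assumes the stock
plugin names the root filesystem 'root' and otherwise replaces '/' with
'-' in the mountpoint; the function name and the sample mountpoints are
hypothetical:

```python
def build_instance_map(mountpoints):
    """Map each reported filesystem instance name to its mountpoint."""
    instance_map = {}
    for mountpoint in mountpoints:
        if mountpoint == "/":
            instance = "root"
        else:
            instance = mountpoint.lstrip("/").replace("/", "-")
        instance_map[instance] = mountpoint
    return instance_map


# A lookup recovers the mountpoint even when its name contains '-',
# which reversing the '-' substitution could not do reliably.
fs_map = build_instance_map(["/", "/var/log", "/opt/my-data"])
```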

Test Plan:

PASS: Verify existing alarm and degrade management of
      instance and non-instance based alarms during normal
      runtime as well as over a collectd process restart.

PASS: Verify handling of non-instance based alarm(s)
      over process restart when the alarm condition
      no longer exists following the process restart.

PASS: Verify degrade list management and content.

PASS: Verify filesystem instance to mountpoint mapping.

PASS: Verify data model content using state audit and
      list management with debug options turned on.

PASS: Verify alarm and degrade handling of a filesystem
      and overage that follows the active controller.

PASS: Verify update as patch

Regression:

PASS: Verify alarm and degrade handling of 'all' collectd
      plugins including over collectd process restarts.

PASS: Verify alarm and degrade management stress soak
      that involved multiple plugins asserting/clearing
      multiple alarm and degradable conditions over a
      24 hour period.

Change-Id: I5ea389fb092a6404616d7ea0e8d54daa64ad7ea2
Closes-Bug: 1903731
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-30 11:30:15 -05:00
albailey
ee7ae99d41 Use newer flake8 on python3.8 zuul systems
flake8 2.5.5 fails on ubuntu-focal zuul machines running python3.8
with the following error:
AttributeError: 'FlakesChecker' object has no attribute 'CONSTANT'

Suppresses the following:
 W503 line break before binary operator
 W504 line break after binary operator
 W605 invalid escape sequence '\d'

 E117 over-indented
 E266 too many leading '#' for block comment
 E305 expected 2 blank lines after class or function definition, found 1
 E402 module level import not at top of file
 E722 do not use bare 'except'
 E741 ambiguous variable name 'I'

 F632 use ==/!= to compare constant literals
 F821 undefined name 'dpdk' (this is a flake8 bug)

Change-Id: I6c2ef05d765b57b7be0b038d6e384cb2af589054
Partial-Bug: 1895054
Signed-off-by: albailey <Al.Bailey@windriver.com>
2020-11-05 15:33:28 -06:00
Jim Gauld
23489af038 Increase field widths of PID for schedtop
This increases field width of TID, PID, and PPID to 7 wide for schedtop
engineering tool. Newer systems support larger PIDs.

Change-Id: I706b60d83e8ce341a7d07c4c067a74e7049acdad
Closes-Bug: 1902954
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2020-11-04 17:22:34 -05:00
Sharath Kumar K
8ef034919c Tox and Zuul job for the bandit code scan in stx/monitoring
Setting up the bandit tool to scan for HIGH severity issues
in the python code under the Starlingx/monitoring folder.
This merge is expected to enable the zuul job for CI/CD of the bandit scan.

Configuration files:
1. tox.ini for adding bandit environment and command.
2. test-requirements.txt for adding bandit version.
3. .zuul.yaml file for adding bandit job and configuring under
   check job to run code scan every time before code commit.

Test:
Run tox -e bandit command inside the fault folder to validate the
bandit scan and result.

Story: 2007541
Task: 39684
Depends-On: https://review.opendev.org/#/c/721294/

Change-Id: Ibcbe1dd2e380f80c4cbf6f2a7cf49065dc890803
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
2020-07-14 15:48:17 +00:00
Zuul
4d9f256bb5 Merge "Add consistent init and config complete checks to collectd plugins" v4.0.0.rc0 2020-06-30 11:26:21 +00:00
Jim Gauld
1bdd9200bb collectd cpu plugin does not always initialize
This changes the initialization of per cgroup cpuacct timings
to account for cgroup directories that may not be present at the time
the plugin starts. As an example, the docker cgroup is often created
much later or not at all.
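
One way to tolerate a missing cgroup directory is sketched below. It
assumes the cgroup v1 cpuacct.usage file is the source of the timing;
read_cpuacct_usage is a hypothetical helper, not the plugin's function:

```python
import os


def read_cpuacct_usage(cgroup_dir):
    """Return cumulative cpu usage in ns, or None if the cgroup is absent."""
    usage_file = os.path.join(cgroup_dir, "cpuacct.usage")
    try:
        with open(usage_file) as f:
            return int(f.read().strip())
    except (IOError, OSError):
        # The cgroup has not been created (yet); retry on a later sample.
        return None
```

A caller can skip a None result and initialize that cgroup's baseline
the first time a value is returned.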

Change-Id: Iaf279e650cc16966b40c24a9f55f53fa4696a92b
Closes-Bug: 1855733
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2020-06-26 17:01:30 -04:00
Eric MacDonald
63c8d1e55a Add consistent init and config complete checks to collectd plugins
Some of the collectd plugins are not waiting for configuration
complete before starting to monitor or communicate with external
services such as fm. This leads to the collectd networking plugin
being triggered to run before or while the host is being configured
which has been seen to lead to collectd segfaults/coredumps within
the collectd's internal networking plugin.

To solve this issue, reduce startup thrash and a slew of plugin
startup error logs, this update adds consistent initialization
and configuration complete checks to all of the starlingX
plugins so monitoring and external service access is not
performed until the host configuration is complete.
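
A configuration complete check of this kind can be sketched as a guard
around each plugin's read function. The flag file approach, the path
argument and the function names are assumptions for illustration only:

```python
import os


def config_complete(flag_path):
    """Treat the presence of a flag file as 'host configuration complete'."""
    return os.path.exists(flag_path)


def guarded_read(flag_path, sample):
    """Run the plugin's sample function only once configuration is done."""
    if not config_complete(flag_path):
        # No monitoring and no external service access before this point.
        return None
    return sample()
```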

Test Plan:

PASS: Verify no plugin sampling till after config is complete
PASS: Verify alarm assert and clear cycle for all plugins
PASS: Verify AIO SX system install
PASS: Verify AIO DX system install
PEND: Verify Standard system install
PASS: Verify logging

Change-Id: I90a5d1c8c3be77269a571738c9499b2e908e1fc5
Closes-Bug: 1872979
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-06-24 14:59:45 -04:00
Jim Gauld
1a5e6c4c3d Add kube-cpusets tool to summarize kubernetes container cpusets
This tool gathers cpuset usage information for all kubernetes
containers that are running on the current host.

With kubernetes CPUManager policy:
- 'none' -- the k8s-infra cpuset is used for all pods
- 'static' -- pods get exclusive cpuset for QoS Guaranteed or
  using isolcpus, otherwise pods inherit DefaultCPUSet.

This displays the cpusets per container and the mapping to numa nodes.
This displays the aggregate cpuset usage per system-level groupings
(i.e., platform, isolated, guaranteed, default), per numa-node.
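
Cpusets are written in the kernel's list format (for example "0-3,8").
A minimal sketch of expanding that format into cpu numbers follows;
parse_cpuset is a hypothetical helper, not the tool's actual function:

```python
def parse_cpuset(text):
    """Expand a cpu list such as '0-2,8' into a sorted list of cpu ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)
```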

Story: 2006999
Task: 39579

Change-Id: I7f1b12e2bbcf7d0b1606c1c948c545216ec454c5
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2020-06-17 13:14:50 -04:00
Eric MacDonald
f7437000c7 Platform Memory usage alarm calculation incorrect
This update removes hugepage memory monitoring, sampling and
alarming for over usage.

Hugepage memory is only used by k8s pods or openstack vm's.
Therefore its usage and alarming should not be tied to the
platform.

Change-Id: Iab8104ff56fdd641c058a4fdc587313cbeec9faf
Closes-Bug: 1880605
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-05-25 16:51:53 -04:00
Thomas Gao
f9688c62f4 Added retry mechanism to clear port alarm
When the link state transitions from DOWN to UP, the current collectd
process attempts to clear the alarm once and once only. If such attempt
failed, no further attempts will be made, and the alarm will persist in
fm alarm-list.

This fix added an additional check to ensure that as long as port alarm
is not cleared and the link state is UP, it will attempt to clear the
port alarm.
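
The added check can be sketched as below. The state dictionary, the
function name and the clear_alarm callable are hypothetical stand-ins
for the plugin's own objects and its fm call:

```python
def audit_port(state, link_up, clear_alarm):
    """Retry the clear on every audit while the link is UP and alarmed.

    state: dict with an 'alarmed' boolean.
    clear_alarm: callable returning True when the clear succeeded.
    Returns True if the alarm is still raised after this audit.
    """
    if link_up and state["alarmed"]:
        if clear_alarm():
            state["alarmed"] = False
    return state["alarmed"]
```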

Closes-Bug: 1871453

Change-Id: Iaa65f64808272a5760e655a33c14810df51e28b1
Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>
2020-05-05 14:52:05 -04:00
Kristine Bujold
58845f67b2 Increase the polling frequency for the ptp audit
Increase the polling frequency for the ptp audit from 300 secs to
30 secs.

Story: 2006759
Task: 39412

Change-Id: Ib40c02dfdcf19b2d2c66de33da1f04f77be515f0
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
2020-04-15 09:16:29 -04:00
Sharath Kumar K
0b8b39cb4e De-branding in starlingx/monitoring: Titanium Cloud -> StarlingX
1. Rename Titanium Cloud to StarlingX for .spec files

Test:
After the de-brand change, bootimage.iso was built in the flock layer
and installed on the dev machine to validate the changes.

Please note, de-brand changes are being done in batches; this is the batch6 change.

Story: 2006387
Task: 39276

Change-Id: I0a0c0619530746f7fe2da4d8fc704f9b97a20241
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
2020-04-06 10:33:18 +02:00
Zuul
e09eafffdf Merge "Throttle collectd OVS interface plugin startup wait log" 2020-03-12 15:52:22 +00:00
Eric MacDonald
4ab570a850 Throttle collectd OVS interface plugin startup wait log
This update turns a flooding failure log into a single
waiting log while the collectd OVS interface plugin
initialization sequence waits for a running ovs daemon.
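
Throttling of this kind can be sketched with a single flag. The class
name, the log callable and the message text are illustrative
assumptions, not the plugin's actual code:

```python
class WaitLogger(object):
    """Log one 'waiting' message per wait period instead of one per poll."""

    def __init__(self, log):
        self.log = log          # callable taking the message string
        self.waiting = False

    def poll(self, daemon_running):
        """Return True when the daemon is running and init can proceed."""
        if daemon_running:
            self.waiting = False
            return True
        if not self.waiting:
            self.log("waiting for ovs daemon")
            self.waiting = True
        return False
```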

A few pep8 long line warnings are fixed.

Test Plan:

PASS: Verify plugin behavior of compute system install without Openstack
PASS: Verify plugin behavior in AIO-SX with Openstack
PASS: Verify plugin failure handling and recovery with Openstack
             Note: the ovs-vswitch pmon conf file was changed to allow
             process failure recovery to verify the plugin was able to
             handle transition from not running/waiting to running once
             process started.
PASS: Verify plugin failure handling when ovs-vswitchd process fails.
             Note: pmond does not try and recover and the following
             collectd log is produced every 10 seconds.
             'err ovs interface plugin failed to dump ports br-phy1 desc'

Change-Id: I95d308f771ebabc77dbeb5113feae283538d37d3
Closes-Bug: 1855597
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-03-02 15:00:14 +00:00
Zuul
a9f46a032c Merge "Support non-zero domains" 2020-02-18 16:36:24 +00:00
David Sullivan
e74efea7fe Support non-zero domains
Update the ptp extension to use the ptp4l conf during pmc commands. This
will allow the collectd extension to work when a non-zero domain is
specified.

Change-Id: Ied0fad0e1ef2998d791619df4e9a548d3d9a3f18
Story: 2006759
Task: 38772
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
2020-02-15 11:55:33 -05:00