Removing redundant py36 Zuul jobs since we now have py39 Zuul jobs in
place with the debian nodeset
Story: 2006796
Task: 43489
Signed-off-by: Bernardo Decco <bernardo.deccodesiqueira@windriver.com>
Change-Id: I3e6fe3a146b3ac01218eb1428c3bad35b87c5c9c
A lot of work has gone into making sure that StarlingX is python3
compatible. To ensure future compatibility, enable the python3
portability checks. Disable the checks that are raising errors.
Another set of commits will address the offending code.
Add the following suppressed warnings to pylint.rc:
- W1618: no-absolute-import
- W1619: old-division
- W1633: round-builtin
Story: 2006796
Task: 43134
Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: Ib3c97263d34328f6ffc27ef08690d23325654b42
A lot of work has gone into making sure that StarlingX is python3
compatible. To ensure future compatibility, enable the python3
portability checks. Disable the checks that are raising errors.
Another set of commits will address the offending code.
Add the following suppressed warnings to pylint.rc:
- W1618: no-absolute-import
- W1636: map-builtin-not-iterating
- W1638: range-builtin-not-iterating
Story: 2006796
Task: 43135
Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: I6ebe9c7215f1e4622a81b0dd79b36cfcc6a7d86f
Enable python 3.9 in tox and zuul gate.
Story: 2009101
Task: 43104
Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: I3ebb23574ad34a1078fae3fbbc65ce5457d46c69
The platform memory data-set type is currently set to 'percent'.
It is possible to oversubscribe platform memory usage to more
than 100%.
Collectd drops sample values that are greater than 100 when its
data-set type is 'percent'. Collectd considers a percent value
greater than 100 to be an invalid value.
This update changes the data-set type for platform memory usage
from 'percent' to 'memory' to allow memory usage values greater
than 100 to be handled.
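The effect of the data-set type on an oversubscribed reading can be
sketched as follows. This is an illustration only: the bounds are
assumed to mirror the stock collectd types.db entries for 'percent'
and 'memory', and the accepts() helper is hypothetical, not collectd
code.

```python
# Illustrative only: assumed (min, max) bounds per collectd data-set type.
# The real bounds come from collectd's types.db on the target system.
ASSUMED_BOUNDS = {
    "percent": (0.0, 100.1),
    "memory": (0.0, 281474976710656.0),
}


def accepts(dataset_type, value):
    """Return True if value falls inside the assumed range for the type."""
    low, high = ASSUMED_BOUNDS[dataset_type]
    return low <= value <= high


# A made-up oversubscribed platform memory reading.
reading = 135.0
print(accepts("percent", reading))  # outside the range, so the sample is lost
print(accepts("memory", reading))   # inside the range, so the sample is kept
```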
Test Plan:
PASS: Verify that platform memory overage alarm value is reported
as the 'actual' value in the alarm Reason Text.
PASS: Verify platform memory usage values that exceed the major
threshold are alarmed 'major'.
PASS: Verify platform memory usage values that exceed the critical
threshold are alarmed 'critical', even if the
debounced value exceeds 100.
PASS: Verify ridiculously large values are still alarmed and that
value is still included in the alarm Reason Text.
Change-Id: I7189671e20c92656f820fda74c4871504d89e73a
Closes-Bug: 1940875
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
During stress test/high platform load it is possible that the
kube-apiserver responds with a kube ApiException.
As platform monitoring of cpu and memory should not be affected by
an unresponsive kube-apiserver, allow the kube ApiException to be handled
and the remaining platform resource utilization monitoring to proceed.
This could help identify the issue by allowing the raise of
the platform alarm (e.g. 100.101 Platform CPU threshold exceeded,
100.103 Memory threshold exceeded).
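A minimal sketch of the handling described above, assuming a
stand-in ApiException class and hypothetical read_pod_usage() and
sample_platform() helpers (the real plugin code is not shown here):

```python
class ApiException(Exception):
    """Stand-in for the kubernetes client ApiException (assumption)."""


def read_pod_usage(api_up):
    """Hypothetical pod query that fails when kube-apiserver is overloaded."""
    if not api_up:
        raise ApiException("kube-apiserver did not respond")
    return {"kube-system": 12.5}


def sample_platform(api_up, platform_cpu):
    """Return the readings that can still be reported this interval."""
    readings = {"platform": platform_cpu}
    try:
        readings.update(read_pod_usage(api_up))
    except ApiException as ex:
        # Skip only the pod data; platform monitoring must still proceed.
        print("pod usage unavailable: %s" % ex)
    return readings
```

With the exception handled, the platform reading is still returned and
can be compared against the alarm thresholds.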
Verified:
o Platform CPU Alarm is raised with stress test
o Platform CPU Alarm is raised with stress test
  and intermittent ApiException
o Memory Alarm is raised with stress test
o Memory Alarm is raised with stress test
  and intermittent ApiException
o the above alarm conditions are cleared after
  debounce when stress condition is removed
Closes-Bug: 1939172
Signed-off-by: John Kung <john.kung@windriver.com>
Change-Id: I2c9c39a390af1d7ae752ad00db18384479cf6e99
- Use encodeutils from oslo library to handle string encodings.
- Expand the generator into a list.
- Use python3 iterator __next__().
Note: there needs to be a separate task to remove the Encoding parameter from python_plugins.conf which can be cherry-picked only for python3 deployments.
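The three changes listed above follow common python3 idioms, roughly
as sketched here. To stay self-contained the sketch uses a
hypothetical to_text() helper built on bytes.decode() as a stand-in
for the oslo encodeutils helpers; squares() is also made up for
illustration.

```python
def to_text(raw, encoding="utf-8"):
    """Decode bytes to str; a stdlib stand-in for oslo's encodeutils."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def squares(limit):
    """Small generator used to show list expansion and next()."""
    for n in range(limit):
        yield n * n


values = list(squares(4))      # expand the generator so it can be indexed
first = next(iter(values))     # next() calls __next__() on python3
line = to_text(b"cpu usage")   # bytes read from a process become text
```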
Story: 2008454
Task: 42647
Depends-On: Iaa7bd0cadd3b1d097b276dcc37ebceaeb208a6a5
Signed-off-by: Andrei Grosu <andrei.grosu@windriver.com>
Change-Id: I58cd4829806e98b1e15471ce97a7c7ba6a2fe135
(cherry picked from commit f2e5263206c87dde9b602c3660b8191398d9c555)
InfluxDB was not fully productized, nor is it used by other end-users.
It should therefore be removed from all deployments to avoid
consuming unnecessary resources (cpu, memory and storage).
The parts of the system that depend on InfluxDB are removed here.
Story: 2009018
Task: 42761
Depends-On: https://review.opendev.org/799502
Signed-off-by: Fernando Theirs <Fernando.Theirs@windriver.com>
Change-Id: I85acf8a94e54171162b9be6fbf816532cf602831
This commit adds a better proposed repair action for filesystem
threshold alarm 100.104.
Closes-Bug: 1927155
Signed-off-by: Jerry Sun <jerry.sun@windriver.com>
Change-Id: I1a27d4bc438b98c00d0fe4eb3b30e4672552f90a
This tool gathers memory usage information for all
kubernetes containers and system services displayed
in cgroup memory that are running on the current host.
This displays the total resident set size per namespace and container,
the aggregate memory usage per system service, and the platform memory
usage.
Closes-Bug: 1886868
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
Change-Id: Id130ed0d2794cdd555bdb068e8453cb8e9bd29d2
The logic expected one line of data per ntp server entry in the
ntpq result, but each entry spanned 2 lines. When a peer server
was selected, the script checked whether its refid was reliable
but could not find the refid because it was on the following line.
This fix formats the 2 lines of data into 1 line.
The minor alarm "NTP cannot reach external time source; syncing
with peer controller only" is removed because NTP does not
prioritize external time sources over the peer.
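One way to fold the two lines into one is sketched below. The layout
is an assumption: the remote address sits alone on the first line and
the remaining columns, starting with refid, sit on the indented line
that follows. join_wrapped_lines() and the sample data are
hypothetical, not the plugin's code or captured ntpq output.

```python
def join_wrapped_lines(lines):
    """Merge each two-line peer entry into a single line.

    Assumption: a wrapped entry has the remote address alone on the
    first line and the remaining columns, starting with refid, on the
    indented line that follows.
    """
    merged = []
    for line in lines:
        if merged and line[:1].isspace() and len(merged[-1].split()) == 1:
            merged[-1] = merged[-1] + " " + line.strip()
        else:
            merged.append(line.rstrip())
    return merged


# Made-up sample in the assumed layout, not captured ntpq output.
sample = [
    "*fd00::1234:5678:9abc:def0",
    "                 192.0.2.10   2 u   41   64  377  0.210  0.034  0.012",
    "+192.0.2.20      192.0.2.11   2 u   12   64  377  0.310  0.051  0.020",
]
entries = join_wrapped_lines(sample)
refid = entries[0].split()[1]   # refid is now on the same line as the peer
```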
Closes-Bug: 1889101
Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc
Pin hacking to < 4.0.1 to fix zuul gate issues.
Test:
Ran tox -e flake8 command to validate the flake8 job and result.
Related-Bug: 1926172
Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: Ia2e746ba513c0d073b60e76b2d2afdfe8b6c9745
This update adds common plugin support for alarm state auditing.
The audit is able to detect and correct the following alarm
state errors:
  Error Case                   Correction Action
  -----------------------      -----------------
  - stale alarm              ; delete alarm
  - missing alarm            ; assert alarm
  - alarm severity mismatch  ; refresh alarm
The common audit is enabled for the fm_notifier plugin that supports
alarm management for the following resources.
- CPU with alarm id 100.101
- Memory with alarm id 100.103
- Filesystem with alarm id 100.104
Other plugins may use this common audit in the future but only the
above resources have the audit enabled for them by this update.
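The three error cases and their corrections can be sketched as a
comparison between the alarms the plugin believes are raised and the
alarms FM reports. The audit_alarms() helper, the dictionary layout
and the action names are hypothetical and are not the fm_notifier
implementation.

```python
def audit_alarms(expected, actual):
    """Compare the plugin's view of alarms with what FM reports.

    expected and actual map an alarm entity id to a severity string.
    Both structures and the returned action names are hypothetical.
    """
    actions = []
    for entity, severity in sorted(actual.items()):
        if entity not in expected:
            actions.append(("delete", entity))       # stale alarm
        elif expected[entity] != severity:
            actions.append(("refresh", entity))      # severity mismatch
    for entity in sorted(expected):
        if entity not in actual:
            actions.append(("assert", entity))       # missing alarm
    return actions
```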
Test Plan:
PASS: Verify stale alarm detection/correction handling
PASS: Verify missing alarm detection/correction handling
PASS: Verify alarm severity mismatch detection/correction handling
PASS: Verify host only audits its own specified alarms
PASS: Verify success path of monitoring a single and mix
of base and instance alarms of varying severity while
such alarm conditions come and go
PASS: Verify alarm audit of mix of base and instance alarms
over a collectd process restart
PASS: Verify audit handling of alarm that migrates from
major to critical to major to clear
PASS: Verify audit handling transition between alarm and
no alarm conditions
PASS: Verify soak of random cpu, memory and filesystem
overage alarm assertions and clears that also involve
manual alarm deletions, assertions and severity changes
that exercise new audit features
Regression:
PASS: Verify alarm and audit handling over Swact with mounted
filesystem that has active alarm
PASS: Verify collectd logs following a system install and
while alarms are managed during above soak
PASS: Verify behavior while FM is killed or stopped/started
PASS: Verify Standard system install with Sanity and Regression
PASS: Verify AIO DX/DC systems install with Sanity and Regression
Closes-Bug: 1925210
Change-Id: I1cafd17ad07ec769240de92ae4e67cb1357f0992
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Running tox for linters fails since the bandit being pulled
in is python3 only. This is similar to other bugs where a new
version is released which drops py2 support.
In this env, we only include bandit if we are testing and running
in py3.
Partial-Bug: 1922590
Change-Id: I11b7d974ae3b64e7846e1420521dee0d48128fc5
Signed-off-by: albailey <Al.Bailey@windriver.com>
The current opensource collectd rpm installs
several default plugins, some that overlap
starlingx developed plugins and others that
simply collect way too much data.
The plugins in question are:
/etc/collectd.d/90-default-plugins-syslog.conf
/etc/collectd.d/90-default-plugins-memory.conf
/etc/collectd.d/90-default-plugins-load.conf
/etc/collectd.d/90-default-plugins-interface.conf
/etc/collectd.d/90-default-plugins-cpu.conf
This update moves the value added starlingx
plugins to /etc/collectd.d/starlingx and
relies on another puppet update to change
collectd's plugin search path accordingly.
Test Plan:
PASS: Verify default plugins are not loaded
and their samples are not collected.
PASS: Verify patch apply and remove.
Note: this is a reboot required patch
PASS: Verify the daily influxdb usage
drops by 80-85%.
Regression:
PASS: Verify collectd alarm/degrade regression soak
Change-Id: Ic7884ae69014fa274f0bd0515adec90b08747c67
Closes-Bug: 1905581
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update adds a second collectd plugin
initialization enhancement. The first update
added a config complete gate:
https://review.opendev.org/c/starlingx/monitoring/+/736817
Turns out that not all plugins are ready to sample
immediately following the config complete state.
One example is that FM on the active controller needs
time to get going before plugins can query their
alarms on startup. Also, some plugins need more
time than others.
To account for both cases this update adds a
thresholded node ready gate that can be tailored
to a plugin to hold off fm access and sampling
until its ready threshold is reached.
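A thresholded gate of this kind could look roughly like the sketch
below. ReadyGate is a hypothetical helper; it assumes the gate counts
audit intervals seen after config complete and opens once the count
reaches the threshold chosen for the plugin.

```python
class ReadyGate(object):
    """Hold off sampling until a per-plugin ready threshold is reached.

    Hypothetical helper: the gate counts audit intervals seen after the
    host reports config complete and opens once the count reaches the
    plugin's threshold.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def is_ready(self, config_complete):
        if not config_complete:
            self.count = 0          # start over if config is not done
            return False
        if self.count < self.threshold:
            self.count += 1
        return self.count >= self.threshold
```

A plugin that needs more time before querying FM would simply be given
a larger threshold.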
Test Plan:
PASS: Verify AIO SX and DX system install
PROG: Verify Storage system install
PASS: Verify AIO SX node lock and unlock
PASS: Verify AIO Standby controller lock and unlock
PASS: Verify Standard controller lock and unlock
PASS: Verify Compute and Storage node lock and unlock
PASS: Verify Dead-Office-Recovery (AIO DX)
PASS: Verify collectd sampling and logs
Partial-Bug: 1872979
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I044d812542a4222214c7d13e231ac4024cca9800
Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
the version is incremented above the hardcoded version.
Story: 2008455
Task: 41463
Signed-off-by: Don Penney <don.penney@windriver.com>
Change-Id: If41d630c97354014b12424ed305d6c5cbb022a5a
Update /var/log/influxdb/influxd.log permissions from 644 to 640
to remove world read access while still allowing group
read access.
The changes are made to comply as much as possible with
openscap rules security requirements.
Verified that installation is successful for AIO-SX
and Standard 2+2 system configurations.
Story: 2008037
Task: 40694
Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
Change-Id: I284fc6882043b4a4d271bd5963fca94bc7a1e390
With a critical memory alarm raised, the collectd plugin fault
notifier's degrade list is injected with the reporting plugin's
name over a collectd process restart.
The recent introduction of multiple instance based memory alarms
has exposed a limitation in the management and content of the
degrade list that can lead to both stuck degrade (this case)
as well as missing degrade due to the lack of uniqueness of the
content injected into the degrade list based on degradable events.
This update modifies the content of the degrade list to ensure
all entries are unique by using an alarm's entity id rather than
the more generic plugin name.
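Why unique entries matter can be seen in a small sketch. The
update_degrade_list() helper and the entity id strings are
hypothetical; the point is only that two instance alarms from the
same plugin need two separate entries.

```python
def update_degrade_list(degrade_list, entity_id, degraded):
    """Add or remove one alarm's entity id from the degrade list."""
    if degraded and entity_id not in degrade_list:
        degrade_list.append(entity_id)
    elif not degraded and entity_id in degrade_list:
        degrade_list.remove(entity_id)
    return degrade_list


# Two made-up memory instance alarms reported by the same plugin.
platform = "host=controller-0.memory=platform"
node0 = "host=controller-0.numa=node0"

degrade = []
update_degrade_list(degrade, platform, True)
update_degrade_list(degrade, node0, True)
update_degrade_list(degrade, platform, False)   # node0 must stay degraded
```

Keyed by plugin name, both alarms would share one entry and clearing
either would drop the degrade for both.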
An additional issue was identified with respect to how filesystem
usage overage alarms are managed, due to recent additions to the
list of monitored filesystems. Filesystem overage alarms are also
degrade list candidates so the aforementioned degrade list change
needed to account for filesystem as well.
One recently added monitored filesystem name conflicted with
how filesystem instances were tracked, which led to a bouncing
alarm if that filesystem experienced overage. Given that there
was already special case handling for the root fs, rather
than add an additional special case to remedy this issue,
the method of mapping filesystem-instance to mountpoint was
changed from a list to a dictionary. With that change there
is no longer a limitation or special case handling required for
filesystem mountpoints that conflict with how the stock
collectd plugin reports filesystem instances.
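A dictionary based mapping might be built as sketched below. The
naming rule is an assumption about the stock df plugin ('/' reported
as 'root', other mountpoints with the leading '/' dropped and '/'
replaced by '-'), and build_instance_map() is a hypothetical helper.
In this sketch the naming rule is applied once while the dictionary is
built, so each later lookup is a plain key access.

```python
def build_instance_map(mountpoints):
    """Map each reported filesystem instance name to its mountpoint.

    Assumption: the stock df plugin reports '/' as 'root' and other
    mountpoints with the leading '/' dropped and '/' replaced by '-'.
    """
    mapping = {}
    for mountpoint in mountpoints:
        if mountpoint == "/":
            instance = "root"
        else:
            instance = mountpoint.lstrip("/").replace("/", "-")
        mapping[instance] = mountpoint
    return mapping


instance_map = build_instance_map(["/", "/var/log", "/opt/platform"])
mountpoint = instance_map["var-log"]   # plain lookup, no string matching
```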
Test Plan:
PASS: Verify existing alarm and degrade management of
instance and non-instance based alarms over both normal
runtime as well as over a collectd process restart.
PASS: Verify handling of non-instance based alarm(s)
over process restart when the alarm condition
no longer exists following the process restart.
PASS: Verify degrade list management and content.
PASS: Verify filesystem instance to mountpoint mapping.
PASS: Verify data model content using state audit and
list management with debug options turned on.
PASS: Verify alarm and degrade handling of a filesystem
and overage that follows the active controller.
PASS: Verify update as patch
Regression:
PASS: Verify alarm and degrade handling of 'all' collectd
plugins including over collectd process restarts.
PASS: Verify alarm and degrade management stress soak
that involved multiple plugins asserting/clearing
multiple alarm and degradable conditions over a
24 hour period.
Change-Id: I5ea389fb092a6404616d7ea0e8d54daa64ad7ea2
Closes-Bug: 1903731
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
flake8 2.5.5 fails on ubuntu-focal zuul machines running python3.8
with the following error:
AttributeError: 'FlakesChecker' object has no attribute 'CONSTANT'
Suppresses the following:
W503 line break before binary operator
W504 line break after binary operator
W605 invalid escape sequence '\d'
E117 over-indented
E266 too many leading '#' for block comment
E305 expected 2 blank lines after class or function definition, found 1
E402 module level import not at top of file
E722 do not use bare 'except'
E741 ambiguous variable name 'I'
F632 use ==/!= to compare constant literals
F821 undefined name 'dpdk' (this is a flake8 bug)
Change-Id: I6c2ef05d765b57b7be0b038d6e384cb2af589054
Partial-Bug: 1895054
Signed-off-by: albailey <Al.Bailey@windriver.com>
This increases the field width of TID, PID, and PPID to 7 characters for
the schedtop engineering tool. Newer systems support larger PIDs.
Change-Id: I706b60d83e8ce341a7d07c4c067a74e7049acdad
Closes-Bug: 1902954
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
Set up the bandit tool to scan for HIGH severity issues
in the Python code under the Starlingx/monitoring folder.
This merge is expected to enable the zuul job for CI/CD bandit scans.
Configuration files:
1. tox.ini for adding bandit environment and command.
2. test-requirements.txt for adding bandit version.
3. .zuul.yaml file for adding bandit job and configuring under
check job to run code scan every time before code commit.
Test:
Run tox -e bandit command inside the fault folder to validate the
bandit scan and result.
Story: 2007541
Task: 39684
Depends-On: https://review.opendev.org/#/c/721294/
Change-Id: Ibcbe1dd2e380f80c4cbf6f2a7cf49065dc890803
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
This changes the initialization of per cgroup cpuacct timings
to account for cgroup directories that may not be present at the time
the plugin starts. As an example, the docker cgroup is often created
much later or not at all.
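Tolerating late or absent cgroup directories can be sketched as an
initialization step that is safe to repeat on every sample. The
init_cgroup_timings() helper and the names and paths passed to it are
hypothetical.

```python
import os


def init_cgroup_timings(timings, cgroup_paths):
    """Start tracking any cgroup whose directory exists now.

    Safe to call on every sample: groups that are already tracked are
    left alone and groups that are still absent are simply skipped.
    Names and paths passed in are hypothetical.
    """
    for name, path in cgroup_paths.items():
        if name not in timings and os.path.isdir(path):
            timings[name] = 0
    return timings
```

A group such as docker is then picked up on the first sample after its
directory appears, and never tracked if it does not.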
Change-Id: Iaf279e650cc16966b40c24a9f55f53fa4696a92b
Closes-Bug: 1855733
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
Some of the collectd plugins are not waiting for configuration
complete before starting to monitor or communicate with external
services such as fm. This leads to the collectd networking plugin
being triggered to run before or while the host is being configured
which has been seen to lead to collectd segfaults/coredumps within
the collectd's internal networking plugin.
To solve this issue, and to reduce startup thrash and a slew of
plugin startup error logs, this update adds consistent initialization
and configuration complete checks to all of the StarlingX
plugins so monitoring and external service access is not
performed until the host configuration is complete.
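The check can be pictured as a gate at the top of each plugin's read
path. The Plugin class below is a hypothetical skeleton, and the flag
file path is whatever the caller supplies; neither is taken from the
actual plugins.

```python
import os


class Plugin(object):
    """Hypothetical plugin skeleton with a config complete gate."""

    def __init__(self, flag_file):
        self.flag_file = flag_file    # path is an assumption of the caller
        self.config_complete = False

    def read(self, sample):
        """Call sample() only once the config complete flag exists."""
        if not self.config_complete:
            if not os.path.exists(self.flag_file):
                return None           # no sampling, no external access
            self.config_complete = True
        return sample()
```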
Test Plan:
PASS: Verify no plugin sampling till after config is complete
PASS: Verify alarm assert and clear cycle for all plugins
PASS: Verify AIO SX system install
PASS: Verify AIO DX system install
PEND: Verify Standard system install
PASS: Verify logging
Change-Id: I90a5d1c8c3be77269a571738c9499b2e908e1fc5
Closes-Bug: 1872979
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This tool gathers cpuset usage information for all kubernetes
containers that are running on the current host.
With kubernetes CPUManager policy:
- 'none' -- the k8s-infra cpuset is used for all pods
- 'static' -- pods get exclusive cpuset for QoS Guaranteed or
using isolcpus, otherwise pods inherit DefaultCPUSet.
This displays the cpusets per container and the mapping to numa nodes.
This displays the aggregate cpuset usage per system-level groupings
(i.e., platform, isolated, guaranteed, default), per numa-node.
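The core of such a report is expanding cpuset list strings and
grouping the cpus by numa node, roughly as sketched here.
parse_cpulist() and cpus_per_numa_node() are hypothetical helpers and
are not taken from the tool itself.

```python
def parse_cpulist(text):
    """Expand a cpuset list string such as '0-3,8' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_per_numa_node(cpuset, node_cpus):
    """Group a container's cpus by the numa node that owns each one."""
    return {node: [c for c in cpuset if c in owned]
            for node, owned in node_cpus.items()}
```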
Story: 2006999
Task: 39579
Change-Id: I7f1b12e2bbcf7d0b1606c1c948c545216ec454c5
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
This update removes hugepage memory monitoring, sampling and
alarming for over usage.
Hugepage memory is only used by k8s pods or openstack VMs.
Therefore its usage and alarming should not be tied to the
platform.
Change-Id: Iab8104ff56fdd641c058a4fdc587313cbeec9faf
Closes-Bug: 1880605
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
When the link state transitions from DOWN to UP, the current collectd
process attempts to clear the alarm once and once only. If that attempt
fails, no further attempts are made and the alarm persists in
fm alarm-list.
This fix adds an additional check to ensure that, as long as the port
alarm is not cleared and the link state is UP, the process keeps
attempting to clear the port alarm.
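The retry behavior amounts to keying the clear attempt on the current
state rather than on the DOWN to UP transition. A minimal sketch,
assuming a hypothetical clear_alarm callable that returns True on
success:

```python
def audit_port_alarm(link_up, alarm_raised, clear_alarm):
    """Return the alarm state after one audit pass.

    clear_alarm is a hypothetical callable that returns True when the
    alarm was cleared. A failed clear leaves alarm_raised set, so the
    next pass tries again for as long as the link stays UP.
    """
    if link_up and alarm_raised:
        if clear_alarm():
            alarm_raised = False
    return alarm_raised
```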
Closes-Bug: 1871453
Change-Id: Iaa65f64808272a5760e655a33c14810df51e28b1
Signed-off-by: Thomas Gao <Thomas.Gao@windriver.com>
Increase the polling frequency for the ptp audit by reducing the
polling interval from 300 secs to 30 secs.
Story: 2006759
Task: 39412
Change-Id: Ib40c02dfdcf19b2d2c66de33da1f04f77be515f0
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
1. Rename Titanium Cloud to StarlingX for .spec files
Test:
After the de-brand change, bootimage.iso was built in the flock layer
and installed on the dev machine to validate the changes.
Note: the de-brand changes are being made in batches; this is batch 6.
Story: 2006387
Task: 39276
Change-Id: I0a0c0619530746f7fe2da4d8fc704f9b97a20241
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
This update turns a flooding failure log into a single
waiting log while the collectd OVS interface plugin
initialization sequence waits for a running ovs daemon.
A few pep8 long line warnings are fixed.
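Collapsing a repeated failure log into one waiting log can be done
with a small latch, as in this sketch. OvsWaiter is a hypothetical
helper and the log text is made up; it is not the plugin's code.

```python
class OvsWaiter(object):
    """Log a single waiting message while the OVS daemon is not running."""

    def __init__(self, log):
        self.log = log            # any callable that accepts a string
        self.waiting = False

    def check(self, ovs_running):
        if ovs_running:
            self.waiting = False  # allow a new message after a later failure
            return True
        if not self.waiting:
            self.log("waiting for the ovs daemon to start")
            self.waiting = True
        return False
```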
Test Plan:
PASS: Verify plugin behavior of compute system install without Openstack
PASS: Verify plugin behavior in AIO-SX with Openstack
PASS: Verify plugin failure handling and recovery with Openstack
Note: the ovs-vswitch pmon conf file was changed to allow
process failure recovery to verify the plugin was able to
handle transition from not running/waiting to running once
process started.
PASS: Verify plugin failure handling when ovs-vswitchd process fails.
Note: pmond does not try to recover and the following
collectd log is produced every 10 seconds.
'err ovs interface plugin failed to dump ports br-phy1 desc'
Change-Id: I95d308f771ebabc77dbeb5113feae283538d37d3
Closes-Bug: 1855597
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Update the ptp extension to use the ptp4l conf during pmc commands. This
will allow the collectd extension to work when a non-zero domain is
specified.
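Passing the ptp4l configuration to pmc could look like the sketch
below. build_pmc_command() is a hypothetical helper, and the request
string and configuration path in the usage line are assumptions; only
the pmc options '-u', '-b' and '-f' are relied on.

```python
def build_pmc_command(request, conf_file):
    """Build a pmc argument list that reads settings from conf_file.

    '-u' selects the local UNIX domain socket, '-b 0' limits the request
    to the local clock, and '-f' names the configuration file, so a
    domainNumber set there is used. The path passed in is an assumption.
    """
    return ["pmc", "-u", "-b", "0", "-f", conf_file, request]


# Hypothetical usage; the path and request are examples only.
command = build_pmc_command("GET PORT_DATA_SET", "/etc/ptp4l.conf")
```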
Change-Id: Ied0fad0e1ef2998d791619df4e9a548d3d9a3f18
Story: 2006759
Task: 38772
Signed-off-by: David Sullivan <david.sullivan@windriver.com>