269 Commits

Author SHA1 Message Date
Jim Gauld
0232b8b9dc Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
The collectd cpu plugin and monitor-tools are updated to
support diagnosing high cpu usage on shorter time scale.
This includes tools that help SystemEngineering determine
the source of CPU usage.

This collectd cpu plugin is updated to support Kubernetes services
under system.slice or k8splatform.slice.

This changes the frequency of read function sampling to 1 second.
We now see logs with instantaneous cpu spikes at the cgroup level.
This dispatch of results still occurs at the original plugin
interval of 30 seconds.  The logging of the 1 second sampling is
configurable via /etc/collectd.d/starlingx/python_plugins.conf
field 'hires = <true|false>'. The high-resolution samples are always
collected and used for a histogram, but it is not always desired
to log this due to the volume of output.
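
The sampling and dispatch cadence described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the class name HiresSampler, the dispatch callback, and the choice to dispatch the mean of the 1 second samples are all assumptions made for the example. Only the 1 second and 30 second intervals and the 'hires' switch come from the commit message.

```python
from typing import Callable, List, Optional

PLUGIN_INTERVAL = 30  # seconds between dispatches (from the commit message)
SAMPLE_INTERVAL = 1   # seconds between read-function samples


class HiresSampler:
    """Hypothetical helper: sample every second, dispatch every 30 seconds."""

    def __init__(self, dispatch: Callable[[float], None], hires: bool = False):
        self.dispatch = dispatch          # assumed callback that publishes a value
        self.hires = hires                # mirrors the 'hires' config field
        self.samples: List[float] = []
        self.logged: List[str] = []       # stands in for the collectd log

    def read(self, occupancy: float) -> Optional[float]:
        """Record one 1-second sample; return the dispatched value, if any."""
        # Samples are always collected, whether or not they are logged.
        self.samples.append(occupancy)
        if self.hires:
            self.logged.append("hires occupancy: %.1f%%" % occupancy)
        if len(self.samples) * SAMPLE_INTERVAL < PLUGIN_INTERVAL:
            return None
        # Assumption: the dispatched value is the mean over the interval.
        mean = sum(self.samples) / len(self.samples)
        self.samples.clear()
        self.dispatch(mean)
        return mean
```

With hires disabled, the samples still accumulate and feed the dispatch; only the per-second log lines are suppressed.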

This adds new logs for occupancy wait. This is similar to cpu
occupancy, but instead of realtime used, it measures the aggregate
percent of time a given cgroup is waiting to schedule. This is a
measure of CPU contention.
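
As a rough illustration of the occupancy wait measurement, the percentage can be computed from two readings of a cumulative wait-time counter. The function name and the nanosecond counters are assumptions for this sketch; the commit message does not say where the plugin reads its wait time from.

```python
def wait_occupancy_percent(wait_ns_prev: int, wait_ns_now: int,
                           elapsed_ns: int) -> float:
    """Percent of the elapsed interval that a cgroup spent waiting to run.

    The counters are assumed to be cumulative wait time in nanoseconds,
    summed over all tasks in the cgroup. Because the value is an aggregate,
    it can exceed 100% when several tasks wait at the same time.
    """
    if elapsed_ns <= 0:
        return 0.0
    return 100.0 * (wait_ns_now - wait_ns_prev) / elapsed_ns
```

For example, 250 ms of accumulated wait over a 1 second sample corresponds to an occupancy wait of 25%.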

This adds new logs for occupancy histograms for all cgroups and
aggregated groupings based on the 1 second occupancy samples.
The histograms are displayed in hirunner order. This displays
the histogram, the mean, 95th-percentile, and max value.
The histograms are logged at 5 minute intervals.
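
A histogram summary of this kind can be sketched as below. This is only an illustration under stated assumptions: the function name, the 10% bin width, and the nearest-rank percentile method are choices made for the example, not details taken from the plugin.

```python
from typing import Dict, List


def summarize_occupancy(samples: List[float],
                        bin_width: float = 10.0) -> Dict[str, object]:
    """Summarize occupancy samples as a histogram plus mean, p95 and max."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank 95th percentile, using integer arithmetic for ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    hist: Dict[int, int] = {}
    for value in ordered:
        # Key each bin by its lower edge, e.g. 0, 10, 20, ...
        lower = int(value // bin_width * bin_width)
        hist[lower] = hist.get(lower, 0) + 1
    return {
        "hist": hist,
        "mean": sum(ordered) / n,
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

At one sample per second, a 5 minute logging interval would summarize 300 samples per cgroup.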

This reduces the collectd cgroup CPUShares to 256 (from 1024).
This smooths out the behaviour of poorly behaved audits.

The 'schedtop' tool is updated to display a 'cgroup' field. This
is the systemd cgroup name, or the abbreviated pod name. This also
handles Kernel sched output format changes for 6.6.

New tool 'portscanner' is added to monitor-tools to diagnose
local host processes that are using specific ports. This has been
instrumental in discovering gunicorn/keystone API users.

New tool 'k8smetrics' is added to monitor-tools to display
the delay histogram and percentiles for kube-apiserver and
etcdserver. This gives a way to quantify performance as
a result of system load.

Partial-Bug: 2084714

TEST PLAN:
AIO-SX, AIO-DX, Standard, Storage, DC:
PASS: Fresh install ISO
PASS: Verify /var/log/collectd.log for 1 second cpu/wait logs,
      and contains: etcd, kubelet, and containerd services.
PASS: Verify we are dispatching at 30 second granularity.
PASS: Verify we are displaying histograms every 5 minutes.
PASS: Verify we can enable/disable the display of hiresolution
      logs with /etc/collectd.d/starlingx/python_plugins.conf
      field 'hires = <true|false>'.
PASS: Verify schedtop contains 'cgroup' output.
PASS: Verify output from 'k8smetrics'.
      Cross check against Prometheus GUI for apiserver percentile.
PASS: Verify output from portscanner with port 5000.
      Verify 1-to-1 mapping against /var/log/keystone/keystone-all.log.

Change-Id: I82d4f414afdf1cecbcc99680b360cbad702ba140
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
vr/stx.10.0
2024-11-15 02:11:55 -05:00
Caio Bruchert
8f404ea66c Fix GM inconsistent data
In some uncommon cases, after disabling GNSS, the CGU information shows
the hardware state unstable for over 20 minutes. GNSS-1PPS and DPLL
status cycles through several states until stabilizing. During this
time, if the clock class is 6 and DPLL moves to freerun, collectd will
wrongly change the clock class to 248 but leave clockAccuracy,
offsetScaledLogVariance, timeSource, timeTraceable and
frequencyTraceable with the previous values.

This commit improves this situation with 3 fixes:
1. When changing the clock class to 248 when DPLL is in freerun, set
   other GM variables accordingly
2. During collectd cycle, read CGU only once per PCI address
3. Fix clock class moving to 248 and back when restarting collectd
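
Fix 2 amounts to caching the CGU reading per PCI address for the duration of one collectd cycle. A minimal sketch of that idea follows; the class name, the reader callback, and the shape of the returned data are hypothetical and not taken from the plugin.

```python
from typing import Callable, Dict


class CguCycleCache:
    """Hypothetical per-cycle cache: read each PCI address at most once."""

    def __init__(self, reader: Callable[[str], dict]):
        self._reader = reader             # assumed function that reads the CGU
        self._cache: Dict[str, dict] = {}

    def start_cycle(self) -> None:
        """Drop cached readings at the start of each collectd cycle."""
        self._cache.clear()

    def get(self, pci_addr: str) -> dict:
        """Return the CGU data for pci_addr, reading it only on first use."""
        if pci_addr not in self._cache:
            self._cache[pci_addr] = self._reader(pci_addr)
        return self._cache[pci_addr]
```

Every consumer within one cycle then sees the same snapshot, which also avoids mixing values read at different moments while the hardware state is unstable.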

Test Plan:
PASS: ensure GM variables are correct when going from clock class 6 to
      248 (freerun)
PASS: ensure CGU is read once per PCI address
PASS: ensure clock class does not change when restarting collectd

Closes-Bug: 2084933

Change-Id: Iad43b883938a5cc7218c1a53a6318ff23cd77cfe
Signed-off-by: Caio Bruchert <caio.bruchert@windriver.com>
vf/caracal
2024-10-18 14:43:08 -03:00
Cole Walker
79cd2ee2b0 Correct logic for checking ptp4l source lock
An update to the check_ptp_regular() function resulted in a case where
the ptp collectd plugin could crash by hitting a KeyError.

This fix prevents this by checking that the key is present before trying
to read the GNSS/SMA status. If the key is not present then we know that
GNSS/SMA is not configured for that NIC and we can skip the check.
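
The guard described above can be sketched as follows. The dictionary layout, the function name, and the interface and source names are assumptions for illustration; only the idea of checking for the key before reading it comes from the commit message.

```python
from typing import Dict, Optional


def read_source_status(status: Dict[str, Dict[str, str]],
                       nic: str, source: str) -> Optional[str]:
    """Return the lock status of `source` on `nic`, or None if not configured.

    A missing key means GNSS/SMA is not configured for that NIC, so the
    caller can skip the check instead of raising KeyError.
    """
    nic_entry = status.get(nic)
    if nic_entry is None or source not in nic_entry:
        return None
    return nic_entry[source]
```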

Additional improvement: Throttle logs related to UTC offset checking.
There is no need to print these logs every cycle as the state is not
expected to change frequently.
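
One simple way to throttle a repeating log is to emit it only on every Nth call. The sketch below assumes a count-based throttle with a hypothetical LogThrottle class; the commit message does not say which throttling scheme the plugin uses.

```python
from typing import Callable


class LogThrottle:
    """Hypothetical count-based throttle: emit the first call, then every Nth."""

    def __init__(self, emit: Callable[[str], None], every: int = 10):
        if every < 1:
            raise ValueError("every must be >= 1")
        self._emit = emit
        self._every = every
        self._calls = 0

    def log(self, message: str) -> bool:
        """Emit the message if this call is due; return True when emitted."""
        emitted = self._calls % self._every == 0
        if emitted:
            self._emit(message)
        self._calls += 1
        return emitted
```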

Test plan:
Pass: Verify collectd plugin startup and operation
Pass: Verify ptp source lock alarm is raised when ptp master is lost
Pass: Verify UTC logs are throttled

Closes-Bug: 2070071

Change-Id: I60eb1602671b388b54ac5938e4a306e4321742e7
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2024-09-10 11:35:14 -04:00
Andre Mauricio Zelak
60b16c4b8b Ice driver 1.14.9.x support
Support the new NMEA serial port name, which changed
from /dev/ttyGNSS_XXXX_X to /dev/gnssX. The name
change impacted how the PCI address of the GNSS
device is derived.

Test plan:
PASS: Basic collectd start up and verify the ts2phc instance
log.

Story: 2011056
Task: 50885

Depends-On: https://review.opendev.org/c/starlingx/kernel/+/926116

Change-Id: Ib4a77b29cff437a7ae268e31587d0e9b080e27b6
Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>
2024-08-21 17:56:26 -03:00
Andre Mauricio Zelak
a1d13a000f Fix not locked to remote PTP Grand Master alarm
Fix the not locked to remote PTP Grand Master alarm raise logic. Even
when the NMEA serial port is set for a given network interface card,
the other NICs can be undisciplined by a local clock source. For those
cases where the NIC isn't locked to a clock source the PTP status is
checked, and the corresponding alarms are raised.

Test Plan:
Configure two ptp4l instances, one whose interfaces are on a NIC
disciplined by a GNSS 1PPS clock (ptp4l-inst1), and the other
ptp4l instance (ptp4l-inst2) whose interfaces are on a NIC not
disciplined by either a GNSS or an SMA1 clock.
Configure a phc2sys and set the nmea serial port to the NIC
disciplined by GNSS. It's important to set the nmea serial port,
so a primary NIC is configured.
Start remote PTP sources for both ptp4l instances.

Pass: At start, as ptp4l-inst2 is not disciplined by local clock source
check the 1PPS signal loss state alarm is set.
Pass: Remove the remote GM connected to ptp4l-inst2 and check the not
locked to remote PTP Grand Master alarm is raised.
Pass: After some minutes, check the Precision Time Protocol (PTP)
clocking is out of tolerance alarm for ptp-inst2 is set.
Pass: Configure SMA1 local clock on the NIC linked to ptp4l-inst2, and
check both not locked to remote PTP Grand Master and the 1PPS signal
loss state alarms are cleared.
Pass: Remove the SMA1 configuration on the NIC linked to ptp4l-inst2,
and check both not locked to remote PTP Grand Master and the 1PPS
signal loss state alarms are raised.
Pass: Start the remote PTP connected to ptp4l-inst2 and check not
locked to remote PTP Grand Master alarm is cleared.
Pass: Remove the remote PTP connected to ptp4l-inst1. As it is
disciplined by a local source, check no alarms are raised. Then
restart the remote PTP.

Closes-Bug: 2070071

Change-Id: I82c0c745d65817a02d921d61e87c91c7aa3aae43
Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>
2024-07-12 13:00:12 -03:00
Andre Mauricio Zelak
629b92d26f PTP: fix handling query_pmc response
Depending on the query_pmc response it could cause the PTP plugin
to crash and become unavailable for some time. For example, when
ptp4l is unavailable or unable to answer the pmc query.

Test Plan:
Pass: Verify ptp plugin startup and operation
Pass: Verify operation when restarting ptp service
Pass: Verify operation when ptp4l is down
Pass: verify operation when a ptp4l is manually started and
stopped.

Closes-Bug: 2068025

Change-Id: I7f54377443489669c019d352682ba57f583976fe
Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>
2024-06-04 13:29:23 -03:00
Scott Little
1132ebd11b Remove CentOS/OpenSUSE build support
StarlingX stopped supporting CentOS builds after release 7.0.
This update will strip CentOS from our code base. It will also remove
references to the failed OpenSUSE feature.

Story: 2011110
Task: 49957
Change-Id: I8eabea83f2c33630027a618bced17d339591bb8b
Signed-off-by: Scott Little <scott.little@windriver.com>
2024-04-26 14:14:49 -04:00
Zuul
fc72fcadac Merge "Raise additional phc2sys alarms on source loss" vf/bookworm vr/stx.9.0 2024-02-06 18:06:35 +00:00
Cole Walker
2708c7accf Raise additional phc2sys alarms on source loss
This change updates the behaviour for raising alarms for HA phc2sys. For
the original behaviour, if phc2sys was observed to have no valid source
clocks, a single no-source-clock alarm was raised and no further phc2sys
alarms were processed because all sources are degraded.

To improve visibility and troubleshooting operations, this change allows
for additional phc2sys alarms to be raised when there are no valid
sources, allowing operators to more easily identify the faults.

Updated the alarms for source selection change, forced lock, and low
priority interface, to allow them to be checked and raised even when no
valid sources are available.

Test Plan:
Pass: Verify ptp plugin startup and operation
Pass: Verify operation of no-valid-source alarm
Pass: Verify operation of clock-source-no-lock alarm
Pass: Verify source selection change, forced lock and low priority
source alarms can be raised when no valid sources are present

Story: 2010723
Task: 49486

Depends-on: https://review.opendev.org/c/starlingx/monitoring/+/906762

Change-Id: I7f11b951df3fa176782b31ccd0f0a1ff70fbbabf
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2024-02-02 15:03:55 -05:00
Zuul
f7a05e8f66 Merge "Improve logic for checking the status of multiple dplls" 2024-02-02 19:30:55 +00:00
Cole Walker
549b0cb3b0 Improve logic for checking the status of multiple dplls
Testing on systems with 3+ PTP NICs and 2+ GNSS connections requires
enhancements to the dpll status alarming logic.

This change provides the following enhancements:
1. Update the alarm string for 1-PPS-signal-loss and GNSS-signal-loss to
be distinct. Systems with multiple GNSS sources may be configured in
such a way that a given NIC can have both GNSS for time sync and SMA for
fallback 1PPS sync. Distinguishing these alarms allows them to be raised
independently.

2. In systems with multiple NICs, ptp monitoring needs to store a list
of PCI slots for each instance so that the associated DPLLs can be checked
for locked states.

3. Create a list of PCI slots for the ts2phc and clock instance types
which is populated during initialization. Collectd then reads the DPLLs
for these slots on each cycle and updates their state in the dpll_status
dict. The dpll_status dict is then consumed by various functions to
raise alarms if a DPLL is in a degraded state.

4. When initializing the alarm objects on plugin startup, ensure that
interface related alarms are created for each interface discovered in
ptp config files. These alarm objects can then be accessed using the
alarm type+interface name as keys. This fixes an earlier assumption that
there would only be 1 secondary interface for ts2phc and clock type
instances. It now properly supports n interfaces.
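
Enhancement 4 can be pictured as a table of alarm objects keyed by (alarm type, interface). The sketch below is illustrative only: the function name, the dictionary used as an alarm object, and the alarm type and interface names in the usage note are hypothetical.

```python
from itertools import product
from typing import Dict, Iterable, Tuple


def build_alarm_table(alarm_types: Iterable[str],
                      interfaces: Iterable[str]) -> Dict[Tuple[str, str], dict]:
    """Create one alarm record per (alarm type, interface) pair."""
    return {
        (alarm_type, interface): {
            "type": alarm_type,
            "interface": interface,
            "raised": False,
        }
        for alarm_type, interface in product(alarm_types, interfaces)
    }
```

With two alarm types and three interfaces the table holds six entries, and a lookup such as alarms[("gnss-signal-loss", "ens2f0")] works for any number of secondary interfaces rather than assuming exactly one.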

Test plan:
Pass: Verify collectd ptp plugin startup and operation
Pass: Verify no-lock alarms for GNSS/1PPS raised in configs with 1,2,3+
NICs
Pass: Verify alarms clear in above case
Pass: Verify clock class is updated for multiple ptp4l instances with
1,2,3+ NICs
Pass: Verify alarm objects are created for each configured interface and
can be accessed via (alarm type, interface) keys.

Story: 2010723
Task: 49469

Change-Id: I974a0873e2947befec446a0abdebb6a711fc54dc
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2024-02-01 11:59:50 -05:00
Andre Mauricio Zelak
4afb9305c6 Handle multiple HA domain numbers
Fix the domain number logic to get them for every
interface configured in the phc2sys.

Test plan:
PASS: Verify ptp plugin startup and general operation
PASS: Verify that alarm is raised when clock source has
no lock.
PASS: Verify that alarm is cleared when clock source
has lock.

Story: 2010723
Task: 49440

Change-Id: Ic604cf17626ec295eb5de19555b0bf9e32b8be26
Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>
vf/kernel-6.6
2024-01-22 14:20:38 -03:00
Cole Walker
53651e906f Update source selection alarm to 'msg' type
Change the alarm type for HA phc2sys source selection changes to type
'msg'. It is not necessary for this event to raise/clear an alarm, as
this warning is intended for informational purposes and does not require
intervention from the user. This allows for source selection changes to
be audited without being treated as a system fault.

Test plan:
PASS: Verify ptp plugin startup and general operation
PASS: Verify that source selection changes generate a log entry in 'fm
event-list'
PASS: Verify other HA phc2sys alarms are raised correctly and display in
fm alarm-list

Story: 2010723
Task: 49393

Change-Id: I7632e79ffd7522eb4e7caa34a1959aa62d3e8879
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2024-01-12 11:28:16 -05:00
Cole Walker
357314ef63 Support ha_phc2sys 'valid sources' alarm and multi-domain configs
Add support for checking the ha_phc2sys 'valid sources' query so that
collectd can properly raise an alarm when none of the configured clock
sources meet the quality requirements for selection by ha_phc2sys.

Add support for selecting the correct domainNumber value for pmc queries
when ha_phc2sys is using multiple ptp4l sources with different domain
numbers.

Bonus fix:
Adds special handling to ptp4l config parsing to ignore the
'unicast_master_table' section if present. This section may be included
in certain ptp4l deployments where users want to use unicast
transmission instead of the default multicast behaviour. There is
currently no need to handle this section in collectd.
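
Skipping a section while parsing a ptp4l-style config can be sketched as below. This is a simplified stand-in for the plugin's parser, assuming bracketed section headers, whitespace-separated key/value pairs, and '#' comments; the function name and the sample values in the usage note are hypothetical.

```python
from typing import Dict, Iterable


def parse_ptp4l_config(text: str,
                       ignored: Iterable[str] = ("unicast_master_table",)
                       ) -> Dict[str, Dict[str, str]]:
    """Parse sections into dicts, dropping any section named in `ignored`."""
    ignored_names = set(ignored)
    sections: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            # Entering an ignored section: discard lines until the next header.
            current = None if name in ignored_names else sections.setdefault(name, {})
            continue
        if current is None:
            continue
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    return sections
```

Ignoring the section also sidesteps its repeated keys (for example several UDPv4 entries), which a plain key/value dictionary could not represent.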

Test plan:
Pass: Verify alarm is raised when ha_phc2sys has no valid
sources to select, and cleared when a valid source becomes available
Pass: Verify pmc commands succeed when ha_phc2sys has
- Global domainNumber
- Per interface domainNumber
- No domainNumber defined
Pass: Verify that ptp4l configs with and without 'unicast_master_table'
are parsed correctly

Story: 2010723
Task: 49012

Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: I3a2d106cfa5d01b9538a7d315d3565b06050a8ce
2023-11-02 12:36:32 -04:00
Caio Bruchert
fe1b81ff5e PTP: fix handling phc2sys offset
When using phc2sys's -O 0 command line option, the out of tolerance
alarm is incorrectly raised.

This fix adds support for the phc2sys offset configuration alongside
ptp4l's.

If phc2sys is disciplining the clock and the offset is configured using
the -O command line option, it will be used instead of ptp4l's offset.

Log errors will be generated in case there is more than one phc2sys
instance disciplining the clock and when there is a mismatch between
phc2sys and ptp4l's offset configuration.
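
The precedence rule above (a phc2sys -O value overrides ptp4l's offset) can be sketched as follows. The function name and the way the command line is obtained are assumptions; the sketch only handles the -O option given on the command line, not the config file case.

```python
import shlex


def effective_offset(phc2sys_cmdline: str, ptp4l_offset: int) -> int:
    """Return the phc2sys -O offset if present, else fall back to ptp4l's."""
    args = shlex.split(phc2sys_cmdline)
    for index, arg in enumerate(args):
        if arg == "-O" and index + 1 < len(args):
            return int(args[index + 1])       # form: -O 37
        if arg.startswith("-O") and len(arg) > 2:
            return int(arg[2:])               # form: -O37
    return ptp4l_offset
```

With "-O 0" the function returns 0 rather than the ptp4l value, which is the case the commit describes as previously raising the out of tolerance alarm incorrectly.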

Test Plan:
PASS: check cmd line options: no -O, -O 0, -O 36, -O 37
PASS: check cmd line options: no -c, -c CLOCK_REALTIME, -c /dev/ptp1
PASS: check config file configuration
PASS: verify that out-of-threshold alarm is raised when PHC is skewed,
      tested using 37s offset and 0s offset

Depends-on: https://review.opendev.org/c/starlingx/monitoring/+/895543
Closes-bug: 2038463

Change-Id: Idb6e1b68361f77a8a29b81b86026a0a6dd42054f
Signed-off-by: Caio Bruchert <caio.bruchert@windriver.com>
2023-10-05 15:28:22 +00:00
Cole Walker
91b4ee01ae Add support for additional G.8275.x Announce fields
Add additional logic to set the appropriate PTP Announce message fields
to conform with the G.8275.1 spec.

This change ensures that a PTP node operating as a GM sets values for
these fields when G.8275.x profile is enabled in ptp4l:
clockAccuracy
offsetScaledLogVariance
timeSource
timeTraceable
frequencyTraceable
currentUtcOffsetValid

This is handled by the addition of two new functions:

intitialize_ptp4l_state_fields() determines initial/static values after
the ptp4l config file is read

and

handle_ptp4l_g8275_fields() updates the dynamic values based on the
state of the Primary Reference Time Clock

An additional fix was implemented to remove the dependency on the
hardcoded leapsecond value and instead read it dynamically from the
running ptp4l instances.

Test Plan:
PASS: Identify PRTC presence and type
PASS: GM node sets correct Announce field values based on PRTC lock
state
PASS: Verify BC nodes receive and pass on Announce message values from a
configured GM
PASS: Ensure that BC nodes report a clockClass of 165 when connection to
GM is lost

Depends-on: https://review.opendev.org/c/starlingx/config/+/895547
Closes-bug: 2036431

Change-Id: I1f62b4081aedb654d80876f041afd2e274b1eb16
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2023-10-04 17:29:51 -04:00
sshathee
6bf4c23db0 Generalize platform pods check using platform label
This commit introduces a check for platform label, which if
present, the cpu and memory for these pods will be
considered in platform resources. Earlier, pod was
considered as part of platform resources by hardcoded namespaces,
which are tiresome to maintain. Now pods are treated as 'platform'
if the namespace is in set of hard-coded namespaces, or if
the component label is 'app.starlingx.io/component=platform'.
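
The classification rule can be sketched as a small predicate. The label key and value come from the commit message; the function name and the contents of the namespace set are illustrative assumptions, not the plugin's actual list.

```python
from typing import Mapping, Optional

PLATFORM_LABEL_KEY = "app.starlingx.io/component"
PLATFORM_LABEL_VALUE = "platform"
# Illustrative subset only; the real hard-coded list lives in the plugin.
PLATFORM_NAMESPACES = frozenset({"kube-system", "cert-manager"})


def is_platform_pod(namespace: str,
                    labels: Optional[Mapping[str, str]]) -> bool:
    """True if the pod counts toward platform cpu and memory resources."""
    if namespace in PLATFORM_NAMESPACES:
        return True
    return (labels or {}).get(PLATFORM_LABEL_KEY) == PLATFORM_LABEL_VALUE
```

A pod in any other namespace is then treated as platform only when it carries the label, so new platform applications do not require a change to the hard-coded list.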

Test Plan:
   Pass: Code changes on AIO-SX and collectd was restarted successfully
   Pass: Verify that /var/log/collectd.log contains cpu and memory
  consumption info for platform pods.

Story: 2010904
Task: 48718
Change-Id: Ia8442717009f92dbe022f9512e226913c45d9473
Signed-off-by: sshathee <shunmugam.shatheesh@windriver.com>
2023-09-11 01:55:36 -04:00
Marcos Paulo Oliveira Silva
ccb106deec Add sriov-fec-system namespace to the platform infra list in collectd
Currently the pods installed in the sriov-fec-operator namespace run
on application cores. The sriov-fec-operator App is seen
as a platform app and therefore its pods need to run on platform
cores.

Accordingly, add the sriov-fec-system namespace to the list of
platform namespaces in collectd.

Test Plan:
PASS: Execute kube-memory tool and ensure its output contains fec
      namespace data.
PASS: Edit collectd tool source code to display the list of namespaces
      classified as K8S_NAMESPACE_SYSTEM and verify if the
      sriov-fec-operator namespace was found.

Story: 2010826
Task: 48639

Change-Id: I85a07e7a30018b28ea49e96f0100294d40ce4433
Signed-off-by: Marcos Paulo Oliveira Silva <Marcos.PauloOliveiraSilva@windriver.com>
2023-08-31 13:46:49 -03:00
Zuul
0d1ac162dc Merge "Enhance ptp alarming logic" 2023-08-29 17:02:07 +00:00
Cole Walker
b0171eb056 Enhance ptp alarming logic
Fix 1:
As per the HA phc2sys design, the query for checking the forced lock
state of HA phc2sys has been updated to 'forced lock'

Fix 2:
Correct the handling for setting the ptp4l clockClass when there is no
ts2phc instance configured to set time on that NIC.

Fix 3:
Upgrade the logic in process_ptp_synce(). NICs may be configured with
both a GNSS connection and an SMA connection, so it is necessary to
check for a GNSS lock first. If GNSS is locked, no further checking is
required. If GNSS is lost, check if SMA is configured and locked before
raising alarm.
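
The ordering in Fix 3 can be expressed as a short decision function. This is a sketch of the described logic only; the function and parameter names are hypothetical and do not reflect the signature of process_ptp_synce().

```python
def should_raise_signal_loss(gnss_locked: bool,
                             sma_configured: bool,
                             sma_locked: bool) -> bool:
    """Decide whether a signal-loss alarm is warranted for one NIC."""
    if gnss_locked:
        # GNSS is checked first; if it is locked, nothing else matters.
        return False
    if sma_configured and sma_locked:
        # GNSS is lost, but the SMA fallback is providing a locked signal.
        return False
    return True
```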

Test plan:
Pass: Validate that 'forced lock' query works correctly
Pass: Validate that ptp4l clockClass remains 248 when there is no ts2phc
instance configured to set time on a NIC
Pass: Validate that a NIC with both GNSS and SMA connection configured
does not incorrectly raise an alarm if GNSS is present

Story: 2010723
Task: 48602

Change-Id: I92869c7c28c786c5e68cd493ae6582b9b4884c21
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2023-08-29 12:43:11 -04:00
Alyson Deives Pereira
847fad4eb8 Add intel-power and power-metrics to list of platform namespaces
The intel-power and power-metrics namespaces are used by
Kubernetes Power Manager [1] and Power Metrics [2] StarlingX
platform applications. Therefore, their pods have to run on platform
cores.

This change enables collectd logs and kube-memory output to display
cpu and memory consumption from these namespaces.

[1] https://opendev.org/starlingx/app-kubernetes-power-manager
[2] https://opendev.org/starlingx/app-power-metrics

Test Plan:
- PASS: Execute kube-memory tool and ensure its output contains
  intel-power and power-metrics namespace info.
- PASS: Verify that /var/log/collectd.log contains cpu and memory
  consumption info from power-metrics and intel-power namespace
  processes.

Story: 2010773
Task: 48415

Depends-On: https://review.opendev.org/c/starlingx/monitoring/+/887744

Change-Id: Ifefa7950bd9ecb1e4177e44ca51743ae8837fc87
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
2023-08-11 14:13:17 +00:00
Zuul
f39d41bb5d Merge "Add the NFD namespace to the platform infrastructure list in collectd" 2023-08-11 12:17:51 +00:00
Zuul
23939e2a53 Merge "[PTP] Handle case where phc2sys has no source interface" 2023-08-10 21:25:45 +00:00
Marcos Paulo Oliveira Silva
419a0686ab Add the NFD namespace to the platform infrastructure list in collectd
Currently, the pods installed in the node-feature-discovery namespace run
on application cores. However, the Node Feature Discovery App is seen
as a platform app, and therefore its pods need to run on platform
cores.

This change adds the node-feature-discovery namespace to the collectd
platform infrastructure list, following the changes made in kubelet.

Test Plan:
PASS: Execute kube-memory tool and ensure its output contains nfd namespace data
PASS: Edit collectd tool source code to display the list of namespaces classified as K8S_NAMESPACE_SYSTEM and verify if the node-feature-discovery namespace was found.

Story: 2010769
Task: 48326

Depends-On: https://review.opendev.org/c/starlingx/integ/+/887743

Change-Id: Idc31c36fc4e12d4f91b2fa45ed9dee663cb024f9
Signed-off-by: Marcos Paulo Oliveira Silva <Marcos.PauloOliveiraSilva@windriver.com>
2023-07-31 20:10:42 +00:00
Cole Walker
24e74e2f3e [PTP] Handle case where phc2sys has no source interface
In preparation for the upcoming HA Phc2sys functionality, the ptp plugin
must handle the case where phc2sys reports that it has no source clock
interface.

When the string "None" is returned from the phc2sys communication
socket, query_phc2sys_socket() now returns None, which can be properly
handled by the existing logic.
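
The conversion described above is small enough to show directly. The helper below is a hypothetical stand-in for the relevant step inside query_phc2sys_socket(); how the reply is actually read from the socket is not covered by the commit message and is left out.

```python
from typing import Optional


def normalize_source_reply(reply: str) -> Optional[str]:
    """Map the literal string "None" from phc2sys to Python's None."""
    value = reply.strip()
    if value == "None":
        return None
    return value
```

Callers that already test for None then treat a missing source interface the same way as any other absent result.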

Test Plan:
Pass: Verify that a phc2sys source interface value of "None" is accepted
and triggers the phc2sys no-source-clock alarm
Pass: Alarm clears when phc2sys selects a source interface

Story: 2010723
Task: 48520

Change-Id: I2b39e92ca9f6fe36b29c4c7b8ddf12a9206121a1
Signed-off-by: Cole Walker <cole.walker@windriver.com>
2023-07-31 15:40:18 -04:00
Zuul
ad30d38bab Merge "Add collectd alarming support for HA phc2sys" vf/antelope 2023-07-28 20:18:25 +00:00
Cole Walker
74754a243c Add collectd alarming support for HA phc2sys
Extend the ptp collectd plugin to support the required alarms for the
Redundant / HA PTP timing clock sources feature.

This change consists of three main parts. The first is the introduction
of the TimingInstance class which is used to read the phc2sys
configuration, as well as store and update phc2sys state data. The goal
of the TimingInstance class is to provide an interface that can be
extended for all ptp instance types in the future as the ptp plugin is
enhanced. This will serve to manage the config data and state data for
all instance types in a similar way.

The second part was to create the required alarms for monitoring HA
phc2sys configurations.
These alarms are handled in process_phc2sys_ha().

The new alarms are:

ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_SELECTION_CHANGE
ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_LOW_PRIORITY
ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_LOSS
ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_NO_LOCK
ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_OOT
ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_FORCED_SELECTION

The third change was to enhance the ts2phc monitoring to handle multiple
ts2phc instances, as this configuration must be supported for this
feature.

Test plan:
PASS: Verify existing alarm operation is unaffected
PASS: Verify new phc2sys alarms operate correctly
PASS: Verify alarming works with multiple ts2phc instances
PASS: Ensure that new phc2sys alarming only operates when a HA phc2sys
instance is configured on the system

Failure paths:
PASS: Ensure that ptp plugin correctly disables phc2sys HA monitoring
when HA mode is disabled or when phc2sys has no configured interfaces to
monitor

PASS: Ensure that loss of phc2sys socket is handled. Source loss alarm
raised, and cleared when socket connection is available again

PASS: Ensure that loss of PMC is handled. Source lock alarm is raised
and cleared once PMC returns.

Story: 2010723
Task: 48441
Change-Id: I794d8d8b54ab3435181c1bae814d72852b2fce0a
2023-07-28 12:33:54 -04:00
cpompeud
a15a1ad560 Fixes collectd memory extension
This change corrects the variable name that was misspelled
in the collectd memory plugin causing it to fail:

Error message:
Unhandled python exception in read callback: NameError:
 name 'slab' is not defined
Traceback (most recent call last):
  File "/opt/collectd/extensions/python/memory.py",
  line 703, in read_func
    obj.normal_nodes = calc_normal_memory_nodes()
  File "/opt/collectd/extensions/python/memory.py",
  line 354, in calc_normal_memory_nodes
    normal_nodes[node]['slab_MiB'] = slab
NameError: name 'slab' is not defined

Working Logs:
2023-05-18T18:09:59.124 controller-0 collectd[280000]: info
 4K memory usage: Anon: 18.6%, Anon: 4222.6 MiB,
 cgroup-rss: 4271.8 MiB, Avail: 18460.1 MiB,
 Total: 22682.7 MiB Slab: 867.3 MiB
2023-05-18T18:09:59.124 controller-0 collectd[280000]: info
 4K numa memory usage: node0, Anon: 18.15%,
 Anon: 4222.6 MiB, Avail: 19041.8 MiB,
 Total: 23264.4 MiB Slab: 867.3 MiB

Test Plan:
   - PASS: Build an image, install and bootstrap successfully
   - PASS: Apply monitor pods so addon logs would be installed.
   - PASS: Check that log entries are correctly displayed.

Closes-Bug: 2019007

Change-Id: Ic0089fd1c6922fe8ec02e9161f57421f9bb77209
Signed-off-by: cpompeud <Cesar.PompeudeBarrosBombonate@windriver.com>
2023-07-15 02:54:21 +00:00
Cesar Bombonate
fc336f95b6 Add additional logging for Collectd and fix non descriptive output.
This change adds additional logging for pods not in the kube-system
or in the kube-addon namespace that are logged every 30 minutes.

Additionally we have added additional information for pods
 where the UID was not found.

The logs now include entries for pods outside of
 kube-addon and kube-system namespaces:
2023-05-12T15:00:42.351 controller-0 collectd[72599]: info The pod:
cm-cert-manager-55659b97c7-w52bq running in
namespace:cert-manager has the following
processes{95662: {'rss': 55248.0, 'name': 'controller'}
, 95352: {'rss': 4.0, 'name': 'pause'}}

Non descriptive logs exemplified below:
2023-05-08T13:10:50.059 controller-0 collectd[72636]: info
 platform memory usage: uid 261d40cea94de12fc54c41279cf269c9 not found
2023-05-08T13:10:50.059 controller-0 collectd[72636]: info
 platform memory usage: uid e90a2332-5753-48bc-a706-f611b9fa4f2e not found
2023-05-08T13:10:50.059 controller-0 collectd[72636]: info
 platform memory usage: uid f38297b6-6940-437d-996b-addacb2cb330 not found

Thus we have changed this to now include the podname and namespace:
collectd.warning('%s: uid %s for pod %s not found in namespace %s' % (
                    PLUGIN, uid, pod.name, pod.namespace))



Test Plan:
   - PASS: Build an image, install and bootstrap successfully
   - PASS: Apply monitor pods so addon logs would be installed.
   - PASS: Check that log entries are correctly displayed.

Closes-Bug: 2019007
Signed-off-by: Cesar Bombonate <Cesar.PompeudeBarrosBombonate@windriver.com>
Change-Id: If9207b8d23aefe010d0475e36b0644343df911ea
2023-06-05 17:18:11 +00:00
Zuul
99a76c1d00 Merge "Fix github mirroring for this repo" 2023-05-01 20:51:03 +00:00
Zuul
bacf4e23c2 Merge "Remove python2 jobs from zuul for this repo" 2023-05-01 20:34:58 +00:00
Zuul
58a5726e58 Merge "Fix to prevent truncating IPv6 value when NTP alarm is triggered." 2023-05-01 20:09:35 +00:00
Al Bailey
9c48ac6611 Remove python2 jobs from zuul for this repo
- Remove the python2 jobs from zuul for this repo
 - Remove python2 entries from test-requirements and tox
 - Removed redundant basepython and other tox.ini entries
 - Updated the upper constraints for the newer python
 - Fix the test-requirements so 'cover' can run
 - Update .gitignore to show a clean repo after running tox
 - Added prettytable to the requirements files
 - Updated the versions of python in setup.cfg

These changes should only affect tox and zuul.
However, since the requirements.txt files were updated, an
ISO was also booted to verify no runtime impact.

Test Plan:
  PASS: Build packages and ISO
  PASS: Boot AIO-SX, bootstrap and unlocked.
  PASS: tox (able to run tox for all 3 tox.ini files)
  PASS: run kube-memory and kube-cpusets on controller

Story: 2010642
Task: 47882
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I4359f3659e75ddfda4208524a6b74360dfe5ee0c
2023-04-28 18:24:13 +00:00
Davlet Panech
f5437709d1 Fix github mirroring for this repo
Updating the rsa ssh host key based on:
https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/

Note: In the future, StarlingX should have a zuul job and
secret setup for all repos so we do not need to do this
for every repo.

Needed to rename the secret, because zuul fails if like-named
secrets have different values in different branches of the same
repo.

Partial-Bug: #2015246
Change-Id: I66254eebd2788ee510f5f9768b1c4275e1c7cfe5
Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
2023-04-28 12:38:52 -04:00
Cristian Mondo
99c893e966 Fix to prevent truncating IPv6 value when NTP alarm is triggered.
This fix handles the case where the ntpq command output returns an
IPv6 address in an invalid format. In some cases the truncated IPv6
address ends with a single colon. Internally, a method is invoked to
validate the IP family for that address and, since the format is not
valid, the validation fails. This behavior causes the returned IP to
always be a truncated IP.

The logic is corrected to validate only when the IP is version 4.

Test Plan:

PASS: Configure NTP with unreachable IPv6 peers to trigger the
NTP alarm
PASS: Configure NTP with reachable IPv6 peers to avoid alarms
PASS: Configure NTP with unreachable IPv4 peers to trigger the
NTP alarm
PASS: Configure NTP with reachable IPv4 peers to avoid alarms

Closes-Bug: 2004043

Review Ref: https://review.opendev.org/c/starlingx/monitoring/+/872036
Change-Id: I8b5b0080a4714cc864a4bdd0a7e8ad558e18adfa
Signed-off-by: Cristian Mondo <cristian.mondo@windriver.com>
2023-04-21 13:30:14 -03:00
cpompeud
d5aa0bf737 Collectd top 10 k8s system process list incorrectly has k8s addon
This change corrects the process list so that only
processes from the kube_system are displayed.

The list was changed from this:
2023-01-09T22:25:32.172 controller-0 collectd[153770]: info The top
10 memory rss processes for the Kubernetes System are :
[('java', '36.72 GiB')
, ('java', '26.87 GiB')
, ('java', '4.25 GiB')
, ('java', '2.71 GiB')
, ('autodetect', '860.24 MiB')
, ('java', '826.97 MiB')
, ('kube-apiserver', '801.15 MiB')
, ('autodetect', '606.67 MiB')
, ('java', '363.57 MiB')
, ('metricbeat', '249.55 MiB')
]

To this, after the fix was implemented:
2023-03-07T16:40:49.669 controller-0 collectd[65421]: info The top
10 memory rss processes for the Kubernetes System are :
[('kube-apiserver', '609.29 MiB')
, ('kube-controller', '137.29 MiB')
, ('helm-controller', '93.80 MiB')
, ('uwsgi', '88.61 MiB')
, ('uwsgi', '88.60 MiB')
, ('uwsgi', '88.60 MiB')
, ('uwsgi', '88.55 MiB')
, ('cephcsi', '81.06 MiB')
, ('cephcsi', '80.25 MiB')
, ('source-controll', '79.47 MiB')
]

Closes-Bug: 2009877

Test Plan:

PASS: Build an image, install and bootstrap successfully
PASS: Apply monitor pods so addon logs would be installed.
PASS: Ensure only Kubernetes System processes are displayed in the
top 10 Kubernetes System list.

Signed-off-by: cpompeud <Cesar.PompeudeBarrosBombonate@windriver.com>
Change-Id: I1361de835003fdaa7f70941f83b9dd79bfe75c60
2023-03-27 15:39:52 -03:00
Al Bailey
f26bbf8842 Update vm-topology debian package ver based on git
Update debian package versions to use git commits for:
 - vm-topology

Old version was: 1
New version is: 15

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Note: vm-topology is not currently set up to build or
install on Debian.
This is because it requires libvirt python components to
run, which are not installed on the platform.
In order to build this package, the debian_pkg_dirs file
needed to be temporarily updated to include vm-topology.

Test Plan:
  PASS: build-pkgs -p vm-topology

Story: 2010550
Task: 47410
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I8aa233a6c3daaa68f0a2ad1af33365d320a68665
2023-02-22 17:05:02 +00:00
Zuul
0746efd0b0 Merge "Update kube-cpusets debian package ver based on git" 2023-02-22 14:06:08 +00:00
Zuul
5d51338ad0 Merge "Update kube-memory debian package ver based on git" 2023-02-22 13:58:24 +00:00
Zuul
07b5b87d8f Merge "Update monitor-tools debian package ver based on git" 2023-02-22 13:54:57 +00:00
Al Bailey
3b36737fae Update monitor-tools debian package ver based on git
Update debian package versions to use git commits for:
 - monitor-tools

Old version was: 3
New version is: 7

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
  PASS: build-pkgs -p monitor-tools

Story: 2010550
Task: 47409

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I2fe106da8acfeaf28c371d92941023c356b95889
2023-02-21 21:39:42 +00:00
Al Bailey
7e8a560aaa Update kube-memory debian package ver based on git
Update debian package versions to use git commits for:
 - kube-memory

Old version was: 1
New version is: 10

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
  PASS: build-pkgs -p kube-memory

Story: 2010550
Task: 47408

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I45cd41b2082707c218ce020e87c0d0428a412d6e
2023-02-21 21:33:09 +00:00
Al Bailey
08efb18479 Update kube-cpusets debian package ver based on git
Update debian package versions to use git commits for:
 - kube-cpusets

Old version was: 1
New version is: 7

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
  PASS: build-pkgs -p kube-cpusets

Story: 2010550
Task: 47407
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I4bc094abdafbde265d3abd64bdae3d456ffced86
2023-02-21 21:19:55 +00:00
Mohammad Issa
8cd2324e18 Update collectd-extensions pkg ver based on git
Update debian package versions to use git commits for:
- collectd-extensions

Old version was: 2
New version is: 42

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
  PASS: build-pkgs -p collectd-extensions

Story: 2010550
Task: 47406

Signed-off-by: Mohammad Issa <mohammad.issa@windriver.com>
Change-Id: I233e7b644b253a4f8361c1029ce191eada4b49e1
2023-02-17 18:53:43 +00:00
Cristian Mondo
e652193434 Fix to prevent truncating IPv6 value when NTP alarm is triggered.
When the NTP alarm is triggered indicating that the peer is not
reachable and if it is IPv6, the IP value is truncated.
This occurs because the NTP plugin relies on the output of the
ntpq -np command, which shows the truncated IPv6 as well.
This causes the IPv6 in the alarm to be truncated, showing its
partial information.
To fix this, a mechanism was implemented to invoke the ntpq
command but specifying the association corresponding to the IPv6
which is truncated. In this way, detailed information of the
association is retrieved, including the full IPv6.
That IPv6 will be the one that will be used as the value for
the alarm.

Closes-Bug: 2004043

Test Plan:

PASS: Configure NTP with unreachable IPv6 peers to trigger the
NTP alarm
PASS: Configure NTP with reachable IPv6 peers to avoid alarms
PASS: Configure NTP with unreachable IPv4 peers to trigger the
NTP alarm
PASS: Configure NTP with reachable IPv4 peers to avoid alarms

Signed-off-by: Cristian Mondo <cristian.mondo@windriver.com>
Change-Id: Id7e0af4f130f04c5eb037e5ff0d0a0cc5ce71b3e
vr/stx.8.0 __v.stx.test2
2023-01-30 13:54:01 +00:00
Cole Walker
cbf2f17f44 Update logic for PTP holdover transition on secondary NICs
If GNSS signal is lost on the primary NIC for the system, then all ptp4l
instances configured to take time from GNSS should transition to
holdover. This was previously only being applied to the ptp4l instance
associated with the primary GNSS NIC.

This fix resolves this by making ptp4l instances on secondary NICs check
the status of the primary NIC GNSS in order to determine if they should
transition to HOLDOVER.

Test plan:
Configure a node with ts2phc, two ptp4l instances (one per NIC) and SMA
connections.

PASS: Disable GNSS and verify that all ptp4l instances transition from
clockClass 6 -> 7 (HOLDOVER) -> 140 (FREERUN)

PASS: Restore GNSS and verify that all ptp4l instances return to
clockClass 6

PASS: Disable SMA connection and verify that ptp4l instances on
secondary NIC transition from 6 -> 7 -> 140

PASS: Restore SMA connection and verify that ptp4l instances return to
clockClass 6

Closes-Bug: 1995011

Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: I29ac222ab0660c5a6e5691bd463b0a4332290839
2023-01-23 15:30:48 -05:00
Gustavo Pereira
f586043ff5 Fix for TypeError in kube-memory
This change fixes the TypeError issue
in kube-memory stacktrace feature.
Resolved issue by adding a decode
method to its function return.
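
The commit does not show which function was changed. As a hedged
illustration of the general pattern, the sketch below assumes the
value comes from subprocess output, which is bytes under Python 3 and
raises TypeError when combined with str. The helper name run_text is
hypothetical.

```python
import subprocess
import sys


def run_text(cmd):
    """Run `cmd` and return its stdout decoded to str (assumes UTF-8)."""
    raw = subprocess.check_output(cmd)  # bytes under Python 3
    return raw.decode("utf-8")


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "print('ok')"]
    raw = subprocess.check_output(cmd)
    try:
        "prefix: " + raw  # str + bytes is not allowed
    except TypeError as err:
        print("without decode:", err)
    print("with decode:", "prefix: " + run_text(cmd).strip())
```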

Test Plan:
Pass: Installed AIO-SX with full deployment,
and executed command kube-memory without errors.
Pass: Installed AIO-DX with full deployment,
and executed command kube-memory without errors.

Closes-bug: 1999673

Signed-off-by: Gustavo Pereira <gustavo.lyrapereira@windriver.com>
Change-Id: Ie18ab617cd38a0aad1020af7ffea388dbfa5e830
2023-01-04 11:13:38 -05:00
Al Bailey
4e2a8fcba8 Update tox.ini to work with tox 4
This change will allow this repo to pass zuul now
that this has merged:
https://review.opendev.org/c/zuul/zuul-jobs/+/866943

Tox 4 deprecated whitelist_externals.
Replace whitelist_externals with allowlist_externals

Also fixed the zuul configuration.

Partial-Bug: #2000399

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: Id7f7fb8e75a98df73bf4d14caeeecb9dc5ffb976
2022-12-27 01:20:28 +00:00
Eric MacDonald
4fad452db5 Enable Anon memory alarming
The StarlingX collectd memory monitoring plugin is no longer
alarming Anon memory overage due to this previous commit.

https://opendev.org/starlingx/monitoring/commit/fcc8ddda66b507e747a6e5f32c2300b84e4f7ad6

The Anon (Anonymous) memory 'val.type' dispatched also needs
to be changed from 'percent' to 'memory' like the platform
memory was in that commit so that the reading notification
is sent to the fm_notifier which manages alarm and degrade.

Test Plan: for both total and numa nodes

PASS: Verify Anon memory major alarming and clear
PASS: Verify Anon memory critical alarming, degrade and clear
PASS: Verify Anon memory alarms/degrade clear over collectd restart
PASS: Verify Anon memory degrade handling over multiple alarm
      severity threshold assertion/clear changes across different
      eids. Test for stuck degrade case.

Closes-Bug: 2000251
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I7c436a64886ecb619d2db751a1f92f2ffb1c4e9b
2022-12-21 10:29:48 -05:00
ksingh
f79dcc176f Fixed the Unhandled python exception in collectd.log
UnboundLocalError due to uid referenced before assignment.
As a result, the top 10 memory rss processes were not reflected
in collectd log file.

The collectd memory.py plugin main memory data structure is made
consistent for all groupings.
This also addressed a few minor logic fixes.
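
For readers unfamiliar with this class of error, the sketch below
shows how a variable assigned only inside a conditional branch can
raise UnboundLocalError, and how initializing it first avoids that.
It is illustrative only: the function names and the cgroup-style
paths are hypothetical and are not taken from memory.py.

```python
def pod_uid_buggy(path):
    """Return the pod uid from a path; fails when no component matches."""
    for part in path.split("/"):
        if part.startswith("pod"):
            uid = part[len("pod"):]
    return uid  # UnboundLocalError if the branch above never ran


def pod_uid_fixed(path):
    """Same lookup, but `uid` always has a value before it is returned."""
    uid = None
    for part in path.split("/"):
        if part.startswith("pod"):
            uid = part[len("pod"):]
    return uid


if __name__ == "__main__":
    print(pod_uid_fixed("/kubepods/burstable/podabc123"))
    print(pod_uid_fixed("/system.slice/collectd.service"))
    try:
        pod_uid_buggy("/system.slice/collectd.service")
    except UnboundLocalError as err:
        print("buggy variant raised:", err)
```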

Test Plan:

PASS: AIO-SX: Verify collectd memory logs for top rss processes.
PASS: AIO-SX: Verify collectd memory logs contain pods.
PASS: Storage: Verify collectd memory logs for top rss processes.

Closes-Bug: 1999433

Signed-off-by: ksingh <kirti.singh@windriver.com>

Change-Id: Ibf8cb4bc9dae9baa7652c3160e34b29d51ac5c60
2022-12-13 11:25:16 -05:00