monitoring

Author	SHA1	Message	Date
Jim Gauld	0232b8b9dc	Update collectd cpu plugin and monitor-tools to diagnose cpu spikes The collectd cpu plugin and monitor-tools are updated to support diagnosing high cpu usage on a shorter time scale. This includes tools that help SystemEngineering determine where CPU time is coming from. The collectd cpu plugin is updated to support Kubernetes services under system.slice or k8splatform.slice. This changes the frequency of read function sampling to 1 second. We now see logs with instantaneous cpu spikes at the cgroup level. The dispatch of results still occurs at the original plugin interval of 30 seconds. The logging of the 1 second sampling is configurable via /etc/collectd.d/starlingx/python_plugins.conf field 'hires = <true|false>'. The hiresolution samples are always collected and used for a histogram, but it is not always desired to log this due to the volume of output. This adds new logs for occupancy wait. This is similar to cpu occupancy, but instead of realtime used, it measures the aggregate percent of time a given cgroup is waiting to schedule. This is a measure of CPU contention. This adds new logs for occupancy histograms for all cgroups and aggregated groupings based on the 1 second occupancy samples. The histograms are displayed in hirunner order. This displays the histogram, the mean, 95th-percentile, and max value. The histograms are logged at 5 minute intervals. This reduces collectd cgroup to 256 CPUShare (from 1024). This smooths out the behaviour of poorly behaved audits. The 'schedtop' tool is updated to display the 'cgroup' field. This is the systemd cgroup name, or abbreviated pod-name. This also handles Kernel sched output format changes for 6.6. New tool 'portscanner' is added to monitor-tools to diagnose local host processes that are using specific ports. This has been instrumental in discovering gunicorn/keystone API users. 
New tool 'k8smetrics' is added to monitor-tools to display the delay histogram and percentiles for kube-apiserver and etcdserver. This gives a way to quantify performance as a result of system load. Partial-Bug: 2084714 TEST PLAN: AIO-SX, AIO-DX, Standard, Storage, DC: PASS: Fresh install ISO PASS: Verify /var/log/collectd.logs for 1 second cpu/wait logs, and contains: etcd, kubelet, and containerd services. PASS: Verify we are dispatching at 30 second granularity. PASS: Verify we are displaying histograms every 5 minutes. PASS: Verify we can enable/disable the display of hiresolution logs with /etc/collectd.d/starlingx/python_plugins.conf field 'hires = <true|false>'. PASS: Verify schedtop contains 'cgroup' output. PASS: Verify output from 'k8smetrics'. Cross check against Prometheus GUI for apiserver percentile. PASS: Verify output from portscanner with port 5000. Verify 1-to-1 mapping against /var/log/keystone/keystone-all.log. Change-Id: I82d4f414afdf1cecbcc99680b360cbad702ba140 Signed-off-by: Jim Gauld <James.Gauld@windriver.com> vr/stx.10.0	2024-11-15 02:11:55 -05:00
Caio Bruchert	8f404ea66c	Fix GM inconsistent data In some uncommon cases, after disabling GNSS, the CGU information shows the hardware state unstable for over 20 minutes. GNSS-1PPS and DPLL status cycles through several states until stabilizing. During this time, if the clock class is 6 and DPLL moves to freerun, collectd will wrongly change the clock class to 248 but leave clockAccuracy, offsetScaledLogVariance, timeSource, timeTraceable and frequencyTraceable with the previous values. This commit improves this situation with 3 fixes: 1. When changing the clock class to 248 when DPLL is in freerun, set other GM variables accordingly 2. During collectd cycle, read CGU only once per PCI address 3. Fix clock class moving to 248 and back when restarting collectd Test Plan: PASS: ensure GM variables are correct when going from clock class 6 to 248 (freerun) PASS: ensure CGU is read once per PCI address PASS: ensure clock class does not change when restarting collectd Closes-Bug: 2084933 Change-Id: Iad43b883938a5cc7218c1a53a6318ff23cd77cfe Signed-off-by: Caio Bruchert <caio.bruchert@windriver.com> vf/caracal	2024-10-18 14:43:08 -03:00
Cole Walker	79cd2ee2b0	Correct logic for checking ptp4l source lock An update to the check_ptp_regular() function resulted in a case where the ptp collectd plugin could crash by hitting a KeyError. This fix prevents this by checking that the key is present before trying to read the GNSS/SMA status. If the key is not present then we know that GNSS/SMA is not configured for that NIC and we can skip the check. Additional improvement: Throttle logs related to UTC offset checking. There is no need to print these logs every cycle as the state is not expected to change frequently. Test plan: Pass: Verify collectd plugin startup and operation Pass: Verify ptp source lock alarm is raised when ptp master is lost Pass: Verify UTC logs are throttled Closes-Bug: 2070071 Change-Id: I60eb1602671b388b54ac5938e4a306e4321742e7 Signed-off-by: Cole Walker <cole.walker@windriver.com>	2024-09-10 11:35:14 -04:00
Andre Mauricio Zelak	60b16c4b8b	Ice driver 1.14.9.x support Support the new NMEA serial port name, which changed from /dev/ttyGNSS_XXXX_X to /dev/gnssX. The name change impacted how the PCI address of the GNSS device is derived. Test plan: PASS: Basic collectd start up and verify the ts2phc instance log. Story: 2011056 Task: 50885 Depends-On: https://review.opendev.org/c/starlingx/kernel/+/926116 Change-Id: Ib4a77b29cff437a7ae268e31587d0e9b080e27b6 Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>	2024-08-21 17:56:26 -03:00
Andre Mauricio Zelak	a1d13a000f	Fix not locked to remote PTP Grand Master alarm Fix the not locked to remote PTP Grand Master alarm raise logic. Even when the NMEA serial port is set for a given network interface card, the other NICs can be undisciplined by a local clock source. For those cases where the NIC isn't locked to a clock source the PTP status is checked, and the corresponding alarms are raised. Test Plan: Configure two ptp4l instances, one whose interfaces are on a NIC disciplined by a GNSS 1PPS clock (ptp4l-inst1), and the other ptp4l instance (ptp4l-inst2) whose interfaces are on a NIC disciplined by neither a GNSS nor an SMA1 clock. Configure a phc2sys and set the nmea serial port to the NIC disciplined by GNSS. It's important to set the nmea serial port, so a primary NIC is configured. Start remote PTP sources for both ptp4l instances. Pass: At start, as ptp4l-inst2 is not disciplined by a local clock source, check the 1PPS signal loss state alarm is set. Pass: Remove the remote GM connected to ptp4l-inst2 and check the not locked to remote PTP Grand Master alarm is raised. Pass: After some minutes, check the Precision Time Protocol (PTP) clocking is out of tolerance alarm for ptp-inst2 is set. Pass: Configure SMA1 local clock on the NIC linked to ptp4l-inst2, and check both not locked to remote PTP Grand Master and the 1PPS signal loss state alarms are cleared. Pass: Remove the SMA1 configuration on the NIC linked to ptp4l-inst2, and check both not locked to remote PTP Grand Master and the 1PPS signal loss state alarms are raised. Pass: Start the remote PTP connected to ptp4l-inst2 and check not locked to remote PTP Grand Master alarm is cleared. Pass: Remove the remote PTP connected to ptp4l-inst1. As it is disciplined by a local source, check no alarms are raised. Then restart the remote PTP. 
Closes-Bug: 2070071 Change-Id: I82c0c745d65817a02d921d61e87c91c7aa3aae43 Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>	2024-07-12 13:00:12 -03:00
Andre Mauricio Zelak	629b92d26f	PTP: fix handling query_pmc response Depending on the query_pmc response, the PTP plugin could crash and become unavailable for some time, for example when ptp4l is unavailable or unable to answer the pmc query. Test Plan: Pass: Verify ptp plugin startup and operation Pass: Verify operation when restarting ptp service Pass: Verify operation when ptp4l is down Pass: Verify operation when a ptp4l is manually started and stopped. Closes-Bug: 2068025 Change-Id: I7f54377443489669c019d352682ba57f583976fe Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com>	2024-06-04 13:29:23 -03:00
Scott Little	1132ebd11b	Remove CentOS/OpenSUSE build support StarlingX stopped supporting CentOS builds after release 7.0. This update will strip CentOS from our code base. It will also remove references to the failed OpenSUSE feature. Story: 2011110 Task: 49957 Change-Id: I8eabea83f2c33630027a618bced17d339591bb8b Signed-off-by: Scott Little <scott.little@windriver.com>	2024-04-26 14:14:49 -04:00
Zuul	fc72fcadac	Merge "Raise additional phc2sys alarms on source loss" vf/bookworm vr/stx.9.0	2024-02-06 18:06:35 +00:00
Cole Walker	2708c7accf	Raise additional phc2sys alarms on source loss This change updates the behaviour for raising alarms for HA phc2sys. In the original behaviour, if phc2sys was observed to have no valid source clocks, a single no-source-clock alarm was raised and no further phc2sys alarms were processed because all sources are degraded. To improve visibility and troubleshooting operations, this change allows for additional phc2sys alarms to be raised when there are no valid sources, allowing operators to more easily identify the faults. Updated the alarms for source selection change, forced lock, and low priority interface, to allow them to be checked and raised even when no valid sources are available. Test Plan: Pass: Verify ptp plugin startup and operation Pass: Verify operation of no-valid-source alarm Pass: Verify operation of clock-source-no-lock alarm Pass: Verify source selection change, forced lock and low priority source alarms can be raised when no valid sources are present Story: 2010723 Task: 49486 Depends-on: https://review.opendev.org/c/starlingx/monitoring/+/906762 Change-Id: I7f11b951df3fa176782b31ccd0f0a1ff70fbbabf Signed-off-by: Cole Walker <cole.walker@windriver.com>	2024-02-02 15:03:55 -05:00
Zuul	f7a05e8f66	Merge "Improve logic for checking the status of multiple dplls"	2024-02-02 19:30:55 +00:00
Cole Walker	549b0cb3b0	Improve logic for checking the status of multiple dplls Testing on systems with 3+ PTP NICs and 2+ GNSS connections requires enhancements to the dpll status alarming logic. This change provides the following enhancements: 1. Update the alarm string for 1-PPS-signal-loss and GNSS-signal-loss to be distinct. Systems with multiple GNSS sources may be configured in such a way that a given NIC can have both GNSS for time sync and SMA for fallback 1PPS sync. Distinguishing these alarms allows them to be raised independently. 2. In systems with multiple NICs, ptp monitoring needs to store a list of PCI slots for each instance so that the associated DPLLs can be checked for locked states. 3. Create a list of PCI slots for the ts2phc and clock instance types which is populated during initialization. Collectd then reads the DPLLs for these slots on each cycle and updates their state in the dpll_status dict. The dpll_status dict is then consumed by various functions to raise alarms if a DPLL is in a degraded state. 4. When initializing the alarm objects on plugin startup, ensure that interface related alarms are created for each interface discovered in ptp config files. These alarm objects can then be accessed using the alarm type+interface name as keys. This fixes an earlier assumption that there would only be 1 secondary interface for ts2phc and clock type instances. It now properly supports n interfaces. Test plan: Pass: Verify collectd ptp plugin startup and operation Pass: Verify no-lock alarms for GNSS/1PPS raised in configs with 1,2,3+ NICs Pass: Verify alarms clear in above case Pass: Verify clock class is updated for multiple ptp4l instances with 1,2,3+ NICs Pass: Verify alarm objects are created for each configured interface and can be accessed via (alarm type, interface) keys. Story: 2010723 Task: 49469 Change-Id: I974a0873e2947befec446a0abdebb6a711fc54dc Signed-off-by: Cole Walker <cole.walker@windriver.com>	2024-02-01 11:59:50 -05:00
Andre Mauricio Zelak	4afb9305c6	Handle multiple HA domain numbers Fix the domain number logic to get them for every interface configured in the phc2sys. Test plan: PASS: Verify ptp plugin startup and general operation PASS: Verify that alarm is raised when clock source has no lock. PASS: Verify that alarm is cleared when clock source has lock. Story: 2010723 Task: 49440 Change-Id: Ic604cf17626ec295eb5de19555b0bf9e32b8be26 Signed-off-by: Andre Mauricio Zelak <andre.zelak@windriver.com> vf/kernel-6.6	2024-01-22 14:20:38 -03:00
Cole Walker	53651e906f	Update source selection alarm to 'msg' type Change the alarm type for HA phc2sys source selection changes to type 'msg'. It is not necessary for this event to raise/clear an alarm, as this warning is intended for informational purposes and does not require intervention from the user. This allows for source selection changes to be audited without being treated as a system fault. Test plan: PASS: Verify ptp plugin startup and general operation PASS: Verify that source selection changes generate a log entry in 'fm event-list' PASS: Verify other HA phc2sys alarms are raised correctly and display in fm alarm-list Story: 2010723 Task: 49393 Change-Id: I7632e79ffd7522eb4e7caa34a1959aa62d3e8879 Signed-off-by: Cole Walker <cole.walker@windriver.com>	2024-01-12 11:28:16 -05:00
Cole Walker	357314ef63	Support ha_phc2sys 'valid sources' alarm and multi-domain configs Add support for checking the ha_phc2sys 'valid sources' query so that collectd can properly raise an alarm when none of the configured clock sources meet the quality requirements for selection by ha_phc2sys. Add support for selecting the correct domainNumber value for pmc queries when ha_phc2sys is using multiple ptp4l sources with different domain numbers. Bonus fix: Adds special handling to ptp4l config parsing to ignore the 'unicast_master_table' section if present. This section may be included in certain ptp4l deployments where users want to use unicast transmission instead of the default multicast behaviour. There is currently no need to handle this section in collectd. Test plan: Pass: Verify alarm is raised when ha_phc2sys has no valid sources to select, and cleared when a valid source becomes available Pass: Verify pmc commands succeed when ha_phc2sys has - Global domainNumber - Per interface domainNumber - No domainNumber defined Pass: Verify that ptp4l configs with and without 'unicast_master_table' are parsed correctly Story: 2010723 Task: 49012 Signed-off-by: Cole Walker <cole.walker@windriver.com> Change-Id: I3a2d106cfa5d01b9538a7d315d3565b06050a8ce	2023-11-02 12:36:32 -04:00
Caio Bruchert	fe1b81ff5e	PTP: fix handling phc2sys offset When using phc2sys's -O 0 command line option, the out of tolerance alarm is incorrectly raised. This fix adds support for the phc2sys offset configuration along with ptp4l's. If phc2sys is disciplining the clock and the offset is configured using the -O command line option, it will be used instead of ptp4l's offset. Error logs will be generated if more than one phc2sys instance is disciplining the clock and when there is a mismatch between phc2sys and ptp4l's offset configuration. Test Plan: PASS: check cmd line options: no -O, -O 0, -O 36, -O 37 PASS: check cmd line options: no -c, -c CLOCK_REALTIME, -c /dev/ptp1 PASS: check config file configuration PASS: verify that out-of-threshold alarm is raised when PHC is skewed, tested using 37s offset and 0s offset Depends-on: https://review.opendev.org/c/starlingx/monitoring/+/895543 Closes-bug: 2038463 Change-Id: Idb6e1b68361f77a8a29b81b86026a0a6dd42054f Signed-off-by: Caio Bruchert <caio.bruchert@windriver.com>	2023-10-05 15:28:22 +00:00
Cole Walker	91b4ee01ae	Add support for additional G.8275.x Announce fields Add additional logic to set the appropriate PTP Announce message fields to conform with the G.8275.1 spec. This change ensures that a PTP node operating as a GM sets values for these fields when G.8275.x profile is enabled in ptp4l: clockAccuracy offsetScaledLogVariance timeSource timeTraceable frequencyTraceable currentUtcOffsetValid This is handled by the addition of two new functions: intitialize_ptp4l_state_fields() determines initial/static values after the ptp4l config file is read and handle_ptp4l_g8275_fields() updates the dynamic values based on the state of the Primary Reference Time Clock An additional fix was implemented to remove the dependency on the hardcoded leapsecond value and instead read it dynamically from the running ptp4l instances. Test Plan: PASS: Identify PRTC presence and type PASS: GM node sets correct Announce field values based on PRTC lock state PASS: Verify BC nodes receive and pass on Announce message values from a configured GM PASS: Ensure that BC nodes report a clockClass of 165 when connection to GM is lost Depends-on: https://review.opendev.org/c/starlingx/config/+/895547 Closes-bug: 2036431 Change-Id: I1f62b4081aedb654d80876f041afd2e274b1eb16 Signed-off-by: Cole Walker <cole.walker@windriver.com>	2023-10-04 17:29:51 -04:00
sshathee	6bf4c23db0	Generalize platform pods check using platform label This commit introduces a check for the platform label; if it is present, the cpu and memory for these pods are counted as platform resources. Previously, a pod was counted as part of platform resources based on hardcoded namespaces, which are tiresome to maintain. Now pods are treated as 'platform' if the namespace is in the set of hard-coded namespaces, or if the component label is 'app.starlingx.io/component=platform'. Test Plan: Pass: Code changes on AIO-SX and collectd was restarted successfully Pass: Verify that /var/log/collectd.log contains cpu and memory consumption info for platform pods. Story: 2010904 Task: 48718 Change-Id: Ia8442717009f92dbe022f9512e226913c45d9473 Signed-off-by: sshathee <shunmugam.shatheesh@windriver.com>	2023-09-11 01:55:36 -04:00
Marcos Paulo Oliveira Silva	ccb106deec	Add sriov-fec-system namespace to the platform infra list in collectd Currently the pods installed in the sriov-fec-operator namespace run on application cores. The sriov-fec-operator App is seen as a platform app and therefore its pods need to run on platform cores. Accordingly, add the sriov-fec-system namespace to the list of platform namespaces in collectd. Test Plan: PASS: Execute kube-memory tool and ensure its output contains fec namespace data. PASS: Edit collectd tool source code to display the list of namespaces classified as K8S_NAMESPACE_SYSTEM and verify if the sriov-fec-operator namespace was found. Story: 2010826 Task: 48639 Change-Id: I85a07e7a30018b28ea49e96f0100294d40ce4433 Signed-off-by: Marcos Paulo Oliveira Silva <Marcos.PauloOliveiraSilva@windriver.com>	2023-08-31 13:46:49 -03:00
Zuul	0d1ac162dc	Merge "Enhance ptp alarming logic"	2023-08-29 17:02:07 +00:00
Cole Walker	b0171eb056	Enhance ptp alarming logic Fix 1: As per the HA phc2sys design, the query for checking the forced lock state of HA phc2sys has been updated to 'forced lock' Fix 2: Correct the handling for setting the ptp4l clockClass when there is no ts2phc instance configured to set time on that NIC. Fix 3: Upgrade the logic in process_ptp_synce(). NICs may be configured with both a GNSS connection and an SMA connection, so it is necessary to check for a GNSS lock first. If GNSS is locked, no further checking is required. If GNSS is lost, check if SMA is configured and locked before raising alarm. Test plan: Pass: Validate that 'forced lock' query works correctly Pass: Validate that ptp4l clockClass remains 248 when there is no ts2phc instance configured to set time on a NIC Pass: Validate that a NIC with both GNSS and SMA connection configured does not incorrectly raise an alarm if GNSS is present Story: 2010723 Task: 48602 Change-Id: I92869c7c28c786c5e68cd493ae6582b9b4884c21 Signed-off-by: Cole Walker <cole.walker@windriver.com>	2023-08-29 12:43:11 -04:00
Alyson Deives Pereira	847fad4eb8	Add intel-power and power-metrics to list of platform namespaces The intel-power and power-metrics namespaces are used by Kubernetes Power Manager [1] and Power Metrics [2] StarlingX platform applications. Therefore, their pods have to run at platform cores. This change enables collectd logs and kube-memory output to display cpu and memory consumption from these namespaces. [1] https://opendev.org/starlingx/app-kubernetes-power-manager [2] https://opendev.org/starlingx/app-power-metrics Test Plan: - PASS: Execute kube-memory tool and ensure its output contains intel-power and power-metrics namespace info. - PASS: Verify that /var/log/collectd.log contains cpu and memory consumption info from power-metrics and intel-power namespace processes. Story: 2010773 Task: 48415 Depends-On: https://review.opendev.org/c/starlingx/monitoring/+/887744 Change-Id: Ifefa7950bd9ecb1e4177e44ca51743ae8837fc87 Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>	2023-08-11 14:13:17 +00:00
Zuul	f39d41bb5d	Merge "Add the NFD namespace to the platform infrastructure list in collectd"	2023-08-11 12:17:51 +00:00
Zuul	23939e2a53	Merge "[PTP] Handle case where phc2sys has no source interface"	2023-08-10 21:25:45 +00:00
Marcos Paulo Oliveira Silva	419a0686ab	Add the NFD namespace to the platform infrastructure list in collectd Currently, the pods installed in the node-feature-discovery namespace run on application cores. However, the Node Feature Discovery App is seen as a platform app, and therefore its pods need to run on platform cores. So, in this change, the node-feature-discovery namespace is added to the collectd platform infrastructure list to follow the changes made in kubelet. Test Plan: PASS: Execute kube-memory tool and ensure its output contains nfd namespace data PASS: Edit collectd tool source code to display the list of namespaces classified as K8S_NAMESPACE_SYSTEM and verify if the node-feature-discovery namespace was found. Story: 2010769 Task: 48326 Depends-On: https://review.opendev.org/c/starlingx/integ/+/887743 Change-Id: Idc31c36fc4e12d4f91b2fa45ed9dee663cb024f9 Signed-off-by: Marcos Paulo Oliveira Silva <Marcos.PauloOliveiraSilva@windriver.com>	2023-07-31 20:10:42 +00:00
Cole Walker	24e74e2f3e	[PTP] Handle case where phc2sys has no source interface In preparation for the upcoming HA Phc2sys functionality, the ptp plugin must handle the case where phc2sys reports that it has no source clock interface. When the string "None" is returned from the phc2sys communication socket, query_phc2sys_socket() now returns None, which can be properly handled by the existing logic. Test Plan: Pass: Verify that a phc2sys source interface value of "None" is accepted and triggers the phc2sys no-source-clock alarm Pass: Alarm clears when phc2sys selects a source interface Story: 2010723 Task: 48520 Change-Id: I2b39e92ca9f6fe36b29c4c7b8ddf12a9206121a1 Signed-off-by: Cole Walker <cole.walker@windriver.com>	2023-07-31 15:40:18 -04:00
Zuul	ad30d38bab	Merge "Add collectd alarming support for HA phc2sys" vf/antelope	2023-07-28 20:18:25 +00:00
Cole Walker	74754a243c	Add collectd alarming support for HA phc2sys Extend the ptp collectd plugin to support the required alarms for the Redundant / HA PTP timing clock sources feature. This change consists of three main parts. The first is the introduction of the TimingInstance class which is used to read the phc2sys configuration, as well as store and update phc2sys state data. The goal of the TimingInstance class is to provide an interface that can be extended for all ptp instance types in the future as the ptp plugin is enhanced. This will serve to manage the config data and state data for all instance types in a similar way. The second part was to create the required alarms for monitoring HA phc2sys configurations. These alarms are handled in process_phc2sys_ha(). The new alarms are: ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_SELECTION_CHANGE ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_LOW_PRIORITY ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_LOSS ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_NO_LOCK ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_OOT ALARM_CAUSE__PHC2SYS_CLOCK_SOURCE_FORCED_SELECTION The third change was to enhance the ts2phc monitoring to handle multiple ts2phc instances, as this configuration must be supported for this feature. Test plan: PASS: Verify existing alarm operation is unaffected PASS: Verify new phc2sys alarms operate correctly PASS: Verify alarming works with multiple ts2phc instances PASS: Ensure that new phc2sys alarming only operates when a HA phc2sys instance is configured on the system Failure paths: PASS: Ensure that ptp plugin correctly disables phc2sys HA monitoring when HA mode is disabled or when phc2sys has no configured interfaces to monitor PASS: Ensure that loss of phc2sys socket is handled. Source loss alarm raised, and cleared when socket connection is available again PASS: Ensure that loss of PMC is handled. Source lock alarm is raised and cleared once PMC returns. 
Story: 2010723 Task: 48441 Change-Id: I794d8d8b54ab3435181c1bae814d72852b2fce0a	2023-07-28 12:33:54 -04:00
cpompeud	a15a1ad560	Fixes collectd memory extension This change corrects the variable name that was misspelled in the collectd memory plugin causing it to fail: Error message: Unhandled python exception in read callback: NameError: name 'slab' is not defined Traceback (most recent call last): File "/opt/collectd/extensions/python/memory.py", line 703, in read_func obj.normal_nodes = calc_normal_memory_nodes() File "/opt/collectd/extensions/python/memory.py", line 354, in calc_normal_memory_nodes normal_nodes[node]['slab_MiB'] = slab NameError: name 'slab' is not defined Working Logs: 2023-05-18T18:09:59.124 controller-0 collectd[280000]: info 4K memory usage: Anon: 18.6%, Anon: 4222.6 MiB, cgroup-rss: 4271.8 MiB, Avail: 18460.1 MiB, Total: 22682.7 MiB Slab: 867.3 MiB 2023-05-18T18:09:59.124 controller-0 collectd[280000]: info 4K numa memory usage: node0, Anon: 18.15%, Anon: 4222.6 MiB, Avail: 19041.8 MiB, Total: 23264.4 MiB Slab: 867.3 MiB Test Plan: - PASS: Build an image, install and bootstrap successfully - PASS: Apply monitor pods so addon logs would be installed. - PASS: Check that log entries are correctly displayed. Closes-Bug: 2019007 Change-Id: Ic0089fd1c6922fe8ec02e9161f57421f9bb77209 Signed-off-by: cpompeud <Cesar.PompeudeBarrosBombonate@windriver.com>	2023-07-15 02:54:21 +00:00
Cesar Bombonate	fc336f95b6	Add additional logging for Collectd and fix non descriptive output. This change adds additional logging for pods not in the kube-system or in the kube-addon namespace that are logged every 30 minutes. Additionally we have added additional information for pods where the UID was not found. The logs now include entries for pods outside of kube-addon and kube-system namespaces: 2023-05-12T15:00:42.351 controller-0 collectd[72599]: info The pod: cm-cert-manager-55659b97c7-w52bq running in namespace:cert-manager has the following processes{95662: {'rss': 55248.0, 'name': 'controller'} , 95352: {'rss': 4.0, 'name': 'pause'}} Non descriptive logs exemplified below: 2023-05-08T13:10:50.059 controller-0 collectd[72636]: info platform memory usage: uid 261d40cea94de12fc54c41279cf269c9 not found 2023-05-08T13:10:50.059 controller-0 collectd[72636]: info platform memory usage: uid e90a2332-5753-48bc-a706-f611b9fa4f2e not found 2023-05-08T13:10:50.059 controller-0 collectd[72636]: info platform memory usage: uid f38297b6-6940-437d-996b-addacb2cb330 not found Thus we have changed this to now include the podname and namespace: collectd.warning('%s: uid %s for pod %s not found in namespace %s' % ( PLUGIN, uid, pod.name, pod.namespace)) Test Plan: - PASS: Build an image, install and bootstrap successfully - PASS: Apply monitor pods so addon logs would be installed. - PASS: Check that log entries are correctly displayed. Closes-Bug: 2019007 Signed-off-by: Cesar Bombonate <Cesar.PompeudeBarrosBombonate@windriver.com> Change-Id: If9207b8d23aefe010d0475e36b0644343df911ea	2023-06-05 17:18:11 +00:00
Zuul	99a76c1d00	Merge "Fix github mirroring for this repo"	2023-05-01 20:51:03 +00:00
Zuul	bacf4e23c2	Merge "Remove python2 jobs from zuul for this repo"	2023-05-01 20:34:58 +00:00
Zuul	58a5726e58	Merge "Fix to prevent truncating IPv6 value when NTP alarm is triggered."	2023-05-01 20:09:35 +00:00
Al Bailey	9c48ac6611	Remove python2 jobs from zuul for this repo - Remove the python2 jobs from zuul for this repo - Remove python2 entries from test-requirements and tox - Removed redundant basepython and other tox.ini entries - Updated the upper constraints for the newer python - Fix the test-requirements so 'cover' can run - Update .gitignore to show a clean repo after running tox - Added prettytable to the requirements files - Updated the versions of python in setup.cfg These changes should only affect tox and zuul. However, since the requirements.txt files were updated, an ISO was also booted to verify no runtime impact. Test Plan: PASS: Build packages and ISO PASS: Boot AIO-SX, bootstrap and unlocked. PASS: tox (able to run tox for all 3 tox.ini files) PASS: run kube-memory and kube-cpusets on controller Story: 2010642 Task: 47882 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I4359f3659e75ddfda4208524a6b74360dfe5ee0c	2023-04-28 18:24:13 +00:00
Davlet Panech	f5437709d1	Fix github mirroring for this repo Updating the rsa ssh host key based on: https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/ Note: In the future, StarlingX should have a zuul job and secret setup for all repos so we do not need to do this for every repo. Needed to rename the secret, because zuul fails if like-named secrets have different values in different branches of the same repo. Partial-Bug: #2015246 Change-Id: I66254eebd2788ee510f5f9768b1c4275e1c7cfe5 Signed-off-by: Davlet Panech <davlet.panech@windriver.com>	2023-04-28 12:38:52 -04:00
Cristian Mondo	99c893e966	Fix to prevent truncating IPv6 value when NTP alarm is triggered. This fix is to allow the validation of IPv6 when ntpq command output returns an invalid IPv6 format. In some cases the truncated IPv6 only ends with a single colon; internally a method is invoked to validate the IP family corresponding to that format and, since it is not a valid format, it fails. This behavior causes the returned IP to always be a truncated IP. The logic is corrected to validate only when the IP is version 4. Test Plan: PASS: Configure NTP with unreachable IPv6 peers to trigger the NTP alarm PASS: Configure NTP with reachable IPv6 peers to avoid alarms PASS: Configure NTP with unreachable IPv4 peers to trigger the NTP alarm PASS: Configure NTP with reachable IPv4 peers to avoid alarms Closes-Bug: 2004043 Review Ref: https://review.opendev.org/c/starlingx/monitoring/+/872036 Change-Id: I8b5b0080a4714cc864a4bdd0a7e8ad558e18adfa Signed-off-by: Cristian Mondo <cristian.mondo@windriver.com>	2023-04-21 13:30:14 -03:00
cpompeud	d5aa0bf737	Collectd top 10 k8s system process list incorrectly has k8s addon This change corrects the process list so that only processes from the kube_system are displayed. The list was changed from this: 2023-01-09T22:25:32.172 controller-0 collectd[153770]: info The top 10 memory rss processes for the Kubernetes System are : [('java', '36.72 GiB') , ('java', '26.87 GiB') , ('java', '4.25 GiB') , ('java', '2.71 GiB') , ('autodetect', '860.24 MiB') , ('java', '826.97 MiB') , ('kube-apiserver', '801.15 MiB') , ('autodetect', '606.67 MiB') , ('java', '363.57 MiB') , ('metricbeat', '249.55 MiB') ] To this after this fix was implemented. 2023-03-07T16:40:49.669 controller-0 collectd[65421]: info The top 10 memory rss processes for the Kubernetes System are : [('kube-apiserver', '609.29 MiB') , ('kube-controller', '137.29 MiB') , ('helm-controller', '93.80 MiB') , ('uwsgi', '88.61 MiB') , ('uwsgi', '88.60 MiB') , ('uwsgi', '88.60 MiB') , ('uwsgi', '88.55 MiB') , ('cephcsi', '81.06 MiB') , ('cephcsi', '80.25 MiB') , ('source-controll', '79.47 MiB') ] Closes-Bug: 2009877 Test Plan: PASS: Build an image, install and bootstrap successfully PASS: Apply monitor pods so addon logs would be installed. PASS: Ensure only Kubernetes System processes are displayed in the top 10 Kubernetes System list. Signed-off-by: cpompeud <Cesar.PompeudeBarrosBombonate@windriver.com> Change-Id: I1361de835003fdaa7f70941f83b9dd79bfe75c60	2023-03-27 15:39:52 -03:00
Al Bailey	f26bbf8842	Update vm-topology debian package ver based on git Update debian package versions to use git commits for: - vm-topology Old version was: 1 New version is: 15 The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Note: vm-topology is not currently setup to build or install on Debian. This is because it requires libvirt python components to run which are not installed on the platform. In order to build this package, the debian_pkg_dirs file needed to be temporarily updated to include vm-topology Test Plan: PASS: build-pkgs -p vm-topology Story: 2010550 Task: 47410 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I8aa233a6c3daaa68f0a2ad1af33365d320a68665	2023-02-22 17:05:02 +00:00
Zuul	0746efd0b0	Merge "Update kube-cpusets debian package ver based on git"	2023-02-22 14:06:08 +00:00
Zuul	5d51338ad0	Merge "Update kube-memory debian package ver based on git"	2023-02-22 13:58:24 +00:00
Zuul	07b5b87d8f	Merge "Update monitor-tools debian package ver based on git"	2023-02-22 13:54:57 +00:00
Al Bailey	3b36737fae	Update monitor-tools debian package ver based on git Update debian package versions to use git commits for: - monitor-tools Old version was: 3 New version is: 7 The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Test Plan: PASS: build-pkgs -p monitor-tools Story: 2010550 Task: 47409 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I2fe106da8acfeaf28c371d92941023c356b95889	2023-02-21 21:39:42 +00:00
Al Bailey	7e8a560aaa	Update kube-memory debian package ver based on git Update debian package versions to use git commits for: - kube-memory Old version was: 1 New version is: 10 The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Test Plan: PASS: build-pkgs -p kube-memory Story: 2010550 Task: 47408 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I45cd41b2082707c218ce020e87c0d0428a412d6e	2023-02-21 21:33:09 +00:00
Al Bailey	08efb18479	Update kube-cpusets debian package ver based on git Update debian package versions to use git commits for: - kube-cpusets Old version was: 1 New version is: 7 The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Test Plan: PASS: build-pkgs -p kube-cpusets Story: 2010550 Task: 47407 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I4bc094abdafbde265d3abd64bdae3d456ffced86	2023-02-21 21:19:55 +00:00
Mohammad Issa	8cd2324e18	Update collectd-extensions pkg ver based on git Update debian package versions to use git commits for: - collectd-extensions Old version was: 2 New version is: 42 The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Test Plan: PASS: build-pkgs -p collectd-extensions Story: 2010550 Task: 47406 Signed-off-by: Mohammad Issa <mohammad.issa@windriver.com> Change-Id: I233e7b644b253a4f8361c1029ce191eada4b49e1	2023-02-17 18:53:43 +00:00
Cristian Mondo	e652193434	Fix to prevent truncating IPv6 value when NTP alarm is triggered. When the NTP alarm is triggered indicating that the peer is not reachable and if it is IPv6, the IP value is truncated. This occurs because the NTP plugin relies on the output of the ntpq -np command, which shows the truncated IPv6 as well. This causes the IPv6 in the alarm to be truncated, showing its partial information. To fix this, a mechanism was implemented to invoke the ntpq command but specifying the association corresponding to the IPv6 which is truncated. In this way, detailed information of the association is retrieved, including the full IPv6. That IPv6 will be the one that will be used as the value for the alarm. Closes-Bug: 2004043 Test Plan: PASS: Configure NTP with unreachable IPv6 peers to trigger the NTP alarm PASS: Configure NTP with reachable IPv6 peers to avoid alarms PASS: Configure NTP with unreachable IPv4 peers to trigger the NTP alarm PASS: Configure NTP with reachable IPv4 peers to avoid alarms Signed-off-by: Cristian Mondo <cristian.mondo@windriver.com> Change-Id: Id7e0af4f130f04c5eb037e5ff0d0a0cc5ce71b3e vr/stx.8.0 __v.stx.test2	2023-01-30 13:54:01 +00:00
Cole Walker	cbf2f17f44	Update logic for PTP holdover transition on secondary NICs If GNSS signal is lost on the primary NIC for the system, then all ptp4l instances configured to take time from GNSS should transition to holdover. This was previously only being applied to the ptp4l instance associated with the primary GNSS NIC. This fix resolves this by making ptp4l instances on secondary NICs check the status of the primary NIC GNSS in order to determine if they should transition to HOLDOVER. Test plan: Configure a node with ts2phc, two ptp4l instances (one per NIC) and SMA connections. PASS: Disable GNSS and verify that all ptp4l instances transition from clockClass 6 -> 7 (HOLDOVER) -> 140 (FREERUN) PASS: Restore GNSS and verify that all ptp4l instances return to clockClass 6 PASS: Disable SMA connection and verify that ptp4l instances on secondary NIC transition from 6 -> 7 -> 140 PASS: Restore SMA connection and verify that ptp4l instances return to clockClass 6 Closes-Bug: 1995011 Signed-off-by: Cole Walker <cole.walker@windriver.com> Change-Id: I29ac222ab0660c5a6e5691bd463b0a4332290839	2023-01-23 15:30:48 -05:00
Gustavo Pereira	f586043ff5	Fix for TypeError in kube-memory This change fixes the TypeError issue in kube-memory stacktrace feature. Resolved issue by adding a decode method to its function return. Tested Plan: Pass: Installed AIO-SX with full deployment, and executed command kube-memory without errors. Pass: Installed AIO-DX with full deployment, and executed command kube-memory without errors. Closes-bug: 1999673 Signed-off-by: Gustavo Pereira <gustavo.lyrapereira@windriver.com> Change-Id: Ie18ab617cd38a0aad1020af7ffea388dbfa5e830	2023-01-04 11:13:38 -05:00
Al Bailey	4e2a8fcba8	Update tox.ini to work with tox 4 This change will allow this repo to pass zuul now that this has merged: https://review.opendev.org/c/zuul/zuul-jobs/+/866943 Tox 4 deprecated whitelist_externals. Replace whitelist_externals with allowlist_externals Also fixed the zuul configuration. Partial-Bug: #2000399 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: Id7f7fb8e75a98df73bf4d14caeeecb9dc5ffb976	2022-12-27 01:20:28 +00:00
Eric MacDonald	4fad452db5	Enable Anon memory alarming The starlingX collectd memory monitoring plugin is no longer alarming Anon memory overage due to this previous commit. https://opendev.org/starlingx/monitoring/commit/fcc8ddda66b507e747a6e5f32c2300b84e4f7ad6 The Anon (Anonymous) memory 'val.type' dispatched also needs to be changed from 'percent' to 'memory' like the platform memory was in that commit so that the reading notification is sent to the fm_notifier which manages alarm and degrade. Test Plan: for both total and numa nodes PASS: Verify Anon memory major alarming and clear PASS: Verify Anon memory critical alarming, degrade and clear PASS: Verify Anon memory alarms/degrade clear over collectd restart PASS: Verify Anon memory degrade handling over multiple alarm severity threshold assertion/clear changes across different eids. Test for stuck degrade case. Closes-Bug: 2000251 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: I7c436a64886ecb619d2db751a1f92f2ffb1c4e9b	2022-12-21 10:29:48 -05:00
ksingh	f79dcc176f	Fixed the Unhandled python exception in collectd.log UnboundLocalError due to uid referenced before assignment. As a result, the top 10 memory rss processes were not reflected in collectd log file. The collectd memory.py plugin main memory data structure is made consistent for all groupings. This also addressed a few minor logic fixes. Test Plan: PASS: AIO-SX: Verify collectd memory logs for top rss processes. PASS: AIO-SX: Verify collectd memory logs contain pods. PASS: Storage: Verify collectd memory logs for top rss processes. Closes-Bug: 1999433 Signed-off-by: ksingh <kirti.singh@windriver.com> Change-Id: Ibf8cb4bc9dae9baa7652c3160e34b29d51ac5c60	2022-12-13 11:25:16 -05:00


269 Commits