monitoring/collectd-extensions/src
Jim Gauld 0232b8b9dc Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
The collectd cpu plugin and monitor-tools are updated to
support diagnosing high cpu usage on a shorter time scale.
This includes tools that help System Engineering determine
where CPU time is being consumed.

This collectd cpu plugin is updated to support Kubernetes services
under system.slice or k8splatform.slice.

This changes the sampling interval of the read function to 1 second.
We now see logs with instantaneous cpu spikes at the cgroup level.
The dispatch of results still occurs at the original plugin
interval of 30 seconds. The logging of the 1 second samples is
configurable via /etc/collectd.d/starlingx/python_plugins.conf
field 'hires = <true|false>'. The high-resolution samples are always
collected and used for a histogram, but logging them is not always
desired due to the volume of output.
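
As an illustration only, the split between 1 second sampling and
30 second dispatch could be sketched as below. The class name, the
averaging of samples, and the log format are assumptions and are not
taken from cpu.py.

```python
from statistics import mean

SAMPLE_INTERVAL = 1    # seconds between read-function samples
PLUGIN_INTERVAL = 30   # seconds between dispatches of results


class OccupancySampler:
    """Hypothetical helper: buffer 1 second samples, release one value per plugin interval."""

    def __init__(self, hires=False, log=print):
        self.hires = hires      # mirrors the 'hires' config field
        self.log = log
        self.samples = []

    def read(self, occupancy_percent):
        """Record one sample; return the value to dispatch, or None if not yet due."""
        self.samples.append(occupancy_percent)
        if self.hires:
            # Samples are always collected; only the logging is optional.
            self.log("hires occupancy: %.1f%%" % occupancy_percent)
        if len(self.samples) * SAMPLE_INTERVAL < PLUGIN_INTERVAL:
            return None
        value = mean(self.samples)   # assumption: dispatch the interval average
        self.samples = []
        return value
```

With hires disabled the sampler stays silent for 29 reads and returns
a single averaged value on the 30th.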

This adds new logs for occupancy wait. This is similar to cpu
occupancy, but instead of the real time used, it measures the aggregate
percent of time a given cgroup is waiting to be scheduled. This is a
measure of CPU contention.
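
A minimal sketch of the occupancy wait arithmetic, assuming cumulative
per-task wait counters in nanoseconds (for example, the run-queue wait
field of /proc/<pid>/schedstat). The function name and input format
are hypothetical; the plugin may obtain its counters differently.

```python
def wait_occupancy_percent(prev_wait_ns, curr_wait_ns, elapsed_s):
    """Aggregate percent of elapsed time that a group's tasks spent waiting to run.

    prev_wait_ns and curr_wait_ns map a task id to its cumulative wait
    time in nanoseconds (hypothetical input format).
    """
    if elapsed_s <= 0:
        return 0.0
    delta_ns = 0
    for task, curr in curr_wait_ns.items():
        # A task seen for the first time contributes nothing this round.
        prev = prev_wait_ns.get(task, curr)
        delta_ns += max(curr - prev, 0)
    return 100.0 * delta_ns / (elapsed_s * 1e9)
```

Because the deltas are summed over all tasks in the group, the result
can exceed 100% when several tasks wait at the same time.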

This adds new logs for occupancy histograms for all cgroups and
aggregated groupings based on the 1 second occupancy samples.
The histograms are displayed in hirunner order. This displays
the histogram, the mean, 95th-percentile, and max value.
The histograms are logged at 5 minute intervals.
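
For illustration, the per-cgroup summary could be computed roughly as
follows. The 10% bin width, the nearest-rank percentile, and ordering
by mean are assumptions, and both function names are hypothetical.

```python
from collections import Counter


def summarize(samples, bin_width=10):
    """Return (histogram, mean, p95, max) for a non-empty list of percent samples."""
    ordered = sorted(samples)
    # Histogram keyed by the lower edge of each bin.
    hist = Counter(int(s // bin_width) * bin_width for s in ordered)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    avg = sum(ordered) / len(ordered)
    return dict(sorted(hist.items())), avg, ordered[rank - 1], ordered[-1]


def hirunner_order(groups):
    """Sort cgroup names by mean occupancy, highest first (assumed ordering key)."""
    return sorted(groups, key=lambda name: summarize(groups[name])[1], reverse=True)
```

At 1 second sampling, a 5 minute logging interval gives about 300
samples per cgroup per histogram.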

This reduces the collectd cgroup CPUShares from 1024 to 256.
This smooths out the behaviour of poorly behaved audits.

The 'schedtop' tool is updated to display a 'cgroup' field. This
is the systemd cgroup name, or abbreviated pod-name. This also
handles kernel sched output format changes for 6.6.

New tool 'portscanner' is added to monitor-tools to diagnose
local host processes that are using specific ports. This has been
instrumental in discovering gunicorn/keystone API users.

New tool 'k8smetrics' is added to monitor-tools to display
the delay histogram and percentiles for kube-apiserver and
etcdserver. This gives a way to quantify performance as
a result of system load.
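
As a sketch only, a percentile can be estimated from cumulative
histogram buckets by linear interpolation inside the bucket that
contains the target rank. This is the general technique behind
Prometheus-style quantile estimates; it is an assumption that
k8smetrics works this way, and the function name is hypothetical.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q (0..1) from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with an infinite bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if math.isinf(bound):
                # The last finite bound is the best available estimate.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly within the bucket holding the target rank.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

The estimate is only as fine as the bucket boundaries, so two loads
that fall in the same bucket can report similar percentiles.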

Partial-Bug: 2084714

TEST PLAN:
AIO-SX, AIO-DX, Standard, Storage, DC:
PASS: Fresh install ISO
PASS: Verify /var/log/collectd.log has 1 second cpu/wait logs,
      and contains: etcd, kubelet, and containerd services.
PASS: Verify we are dispatching at 30 second granularity.
PASS: Verify we are displaying histograms every 5 minutes.
PASS: Verify we can enable/disable the display of high-resolution
      logs with /etc/collectd.d/starlingx/python_plugins.conf
      field 'hires = <true|false>'.
PASS: Verify schedtop contains 'cgroup' output.
PASS: Verify output from 'k8smetrics'.
      Cross check against Prometheus GUI for apiserver percentile.
PASS: Verify output from portscanner with port 5000.
      Verify 1-to-1 mapping against /var/log/keystone/keystone-all.log.

Change-Id: I82d4f414afdf1cecbcc99680b360cbad702ba140
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
2024-11-15 02:11:55 -05:00
File                 Last updated                Last commit
collectd.conf.pmon   2022-07-30 20:41:27 +00:00  Extend startuptime in collectd's pmon config file
collectd.service     2024-11-15 02:11:55 -05:00  Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
cpu.conf             2019-10-25 19:39:59 +00:00  Add alarm debounce support to collectd alarm notifier
cpu.py               2024-11-15 02:11:55 -05:00  Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
df.conf              2022-06-21 18:22:36 -04:00  Update collectd disk usage checks for debian
example.conf         2019-10-25 19:39:59 +00:00  Add alarm debounce support to collectd alarm notifier
example.py           2021-01-26 12:00:16 -05:00  Add node ready check to collectd plugins
fm_notifier.py       2022-06-21 18:22:36 -04:00  Update collectd disk usage checks for debian
interface.conf       2019-10-25 19:39:59 +00:00  Add alarm debounce support to collectd alarm notifier
interface.py         2021-01-26 12:00:16 -05:00  Add node ready check to collectd plugins
LICENSE              2018-07-03 11:06:24 -04:00  Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1
memory.conf          2021-08-23 20:05:20 -04:00  Change platform memory usage instance type to 'memory'
memory.py            2024-11-15 02:11:55 -05:00  Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
ntpq.conf            2019-10-25 19:39:59 +00:00  Add alarm debounce support to collectd alarm notifier
ntpq.py              2023-04-21 13:30:14 -03:00  Fix to prevent truncating IPv6 value when NTP alarm is triggered.
ovs_interface.conf   2019-10-30 19:11:00 +08:00  OVS collectd interface/port state monitoring
ovs_interface.py     2024-11-15 02:11:55 -05:00  Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
plugin_common.py     2024-11-15 02:11:55 -05:00  Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
ptp.conf             2019-10-25 19:39:59 +00:00  Add alarm debounce support to collectd alarm notifier
ptp.py               2024-10-18 14:43:08 -03:00  Fix GM inconsistent data
python_plugins.conf  2024-11-15 02:11:55 -05:00  Update collectd cpu plugin and monitor-tools to diagnose cpu spikes
README               2021-02-03 12:03:06 -05:00  Avoid loading collectd's default plugins
remotels.conf        2019-10-25 19:39:59 +00:00  Add alarm debounce support to collectd alarm notifier
remotels.py          2021-01-26 12:00:16 -05:00  Add node ready check to collectd plugins
service_res.conf     2021-11-30 17:12:40 -05:00  Add new collectd plugin to monitor a service status
service_res.py       2022-06-16 21:53:13 +00:00  Debian: Fix collectd service config on Debian

The upstream default plugin loader files in this
directory are not used.

The /etc/collectd.conf file is updated to load its
plugins from /etc/collectd.d/starlingx while the
starlingx collectd-extensions package is installed.