docs/doc/source/introduction/fault-and-performance-management-940c6f6b3f6e.rst
Oliver 6c78cd0121 Documentation enhancements for Security and Fault Management
Fixed commit message
Update with review comments for Patch Set 1
Added a new fault and performance section to the Introduction guide.
Change-Id: Ia9e38ab9e8b88430bf7f475200e8bb65b12e5a75
2022-06-28 00:01:45 +00:00

Performance and Fault Management

The platform provides a number of tools that let system administrators manage performance and troubleshoot system issues.

Performance Management

The platform uses collectd (https://collectd.org/) to capture the following platform statistics and to generate threshold events based on them:

  • CPU Usage of Platform Cores of hosts
  • Platform Memory Usage of hosts
  • Platform File Systems Usage
  • Platform Interface Usage
  • PTP Clock Skew Monitor

Each collectd threshold event causes fault management to set or clear the corresponding customer alarm.
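
The relationship between a monitored statistic, a threshold event, and an alarm set/clear action can be illustrated with a minimal Python sketch. This is not the platform's implementation and it does not call collectd. The class ``ThresholdMonitor``, the function ``alarm_actions``, the monitor name, and the 90 percent threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ThresholdMonitor:
    """Hypothetical monitor that tracks one statistic against one threshold."""

    name: str
    threshold: float      # assumed value, not taken from the documentation
    alarmed: bool = False

    def evaluate(self, reading: float) -> Optional[str]:
        """Return "set" or "clear" when the alarm state changes, else None."""
        if reading >= self.threshold and not self.alarmed:
            self.alarmed = True
            return "set"
        if reading < self.threshold and self.alarmed:
            self.alarmed = False
            return "clear"
        return None


def alarm_actions(
    monitor: ThresholdMonitor, readings: List[float]
) -> List[Tuple[str, str, float]]:
    """Collect (action, monitor name, reading) for each state change."""
    actions = []
    for reading in readings:
        action = monitor.evaluate(reading)
        if action is not None:
            actions.append((action, monitor.name, reading))
    return actions


if __name__ == "__main__":
    cpu = ThresholdMonitor(name="platform-cpu", threshold=90.0)
    for action in alarm_actions(cpu, [50.0, 95.0, 97.0, 80.0]):
        print(action)
```

In this sketch an alarm action is produced only when the state changes, so consecutive readings above the threshold yield a single set, and the first reading back below it yields a single clear.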

Fault Management

For an overview of fault management, see fault-management-overview.

For a listing of all fault management resources, including alarm log messages, see 'Alarm messages' and 'Log messages' in the Fault Management Contents <index-fault-kub-f45ef76b6f16> page.