Merge "Update documentation for 1.0"
commit 1c12b277bf
File diff suppressed because it is too large
|
@ -368,12 +368,21 @@ file. This file has the following sections:
|
||||||
to that category of nodes. For example::
|
to that category of nodes. For example::
|
||||||
|
|
||||||
node_cluster_alarms:
|
node_cluster_alarms:
|
||||||
controller:
|
controller-nodes:
|
||||||
cpu: ['cpu-critical-controller', 'cpu-warning-controller']
|
apply_to_node: controller
|
||||||
root-fs: ['root-fs-critical', 'root-fs-warning']
|
alerting: enabled
|
||||||
log-fs: ['log-fs-critical', 'log-fs-warning']
|
members:
|
||||||
|
cpu:
|
||||||
|
alarms: ['cpu-critical-controller', 'cpu-warning-controller']
|
||||||
|
root-fs:
|
||||||
|
alarms: ['root-fs-critical', 'root-fs-warning']
|
||||||
|
log-fs:
|
||||||
|
alarms: ['log-fs-critical', 'log-fs-warning']
|
||||||
|
hdd-errors:
|
||||||
|
alerting: enabled_with_notification
|
||||||
|
alarms: ['hdd-errors-critical']
|
||||||
|
|
||||||
Creates three alarm groups for the cluster of nodes called 'controller':
|
Creates four alarm groups for the cluster of controller nodes:
|
||||||
|
|
||||||
* The *cpu* alarm group is mapped to two alarms defined in the ``alarms``
|
* The *cpu* alarm group is mapped to two alarms defined in the ``alarms``
|
||||||
section known as the 'cpu-critical-controller' and
|
section known as the 'cpu-critical-controller' and
|
||||||
|
@ -388,6 +397,13 @@ file. This file has the following sections:
|
||||||
section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These
|
section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These
|
||||||
alarms monitor the file system where the logs are created on the
|
alarms monitor the file system where the logs are created on the
|
||||||
controller nodes.
|
controller nodes.
|
||||||
|
* The *hdd-errors* alarm group is mapped to the 'hdd-errors-critical' alarm
|
||||||
|
defined in the ``alarms`` section. This alarm monitors the ``kern.log``
|
||||||
|
log entries containing critical IO errors detected by the kernel.
|
||||||
|
The *hdd-errors* alarm group is assigned the *enabled_with_notification* alerting
|
||||||
|
attribute, meaning that the operator will be notified if any of the
|
||||||
|
controller nodes encounters a disk failure. Other alarms do not trigger
|
||||||
|
notification per node but at an aggregated cluster level.
|
||||||
|
|
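Editor's sketch (not part of the commit): the new ``node_cluster_alarms`` layout shown above can be read as a cluster-level ``alerting`` default with optional per-member settings. The Python snippet below mirrors the YAML example as a plain dict and resolves the alerting mode of each alarm group. It assumes that a member-level ``alerting`` value overrides the cluster-level one and that ``enabled`` is the fallback when neither is set; both are assumptions drawn from the example, not behavior confirmed by the documentation. The function name ``effective_alerting`` is hypothetical.

```python
# Mirror of the YAML example above as a plain Python dict (no YAML parser needed).
node_cluster_alarms = {
    "controller-nodes": {
        "apply_to_node": "controller",
        "alerting": "enabled",
        "members": {
            "cpu": {"alarms": ["cpu-critical-controller", "cpu-warning-controller"]},
            "root-fs": {"alarms": ["root-fs-critical", "root-fs-warning"]},
            "log-fs": {"alarms": ["log-fs-critical", "log-fs-warning"]},
            "hdd-errors": {
                "alerting": "enabled_with_notification",
                "alarms": ["hdd-errors-critical"],
            },
        },
    }
}


def effective_alerting(clusters):
    """Map (cluster, member) to an alerting mode.

    Assumption: a member-level 'alerting' value overrides the cluster-level
    one, and 'enabled' is used when the cluster defines none.
    """
    result = {}
    for cluster_name, cluster in clusters.items():
        default = cluster.get("alerting", "enabled")  # assumed fallback
        for member_name, member in cluster.get("members", {}).items():
            result[(cluster_name, member_name)] = member.get("alerting", default)
    return result


modes = effective_alerting(node_cluster_alarms)
for (cluster, member), mode in sorted(modes.items()):
    print(f"{cluster}/{member}: {mode}")
```

Under these assumptions, only the *hdd-errors* group resolves to ``enabled_with_notification``; the other three groups inherit ``enabled`` from the cluster.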
||||||
.. note:: An *alarm group* is a mere implementation artifact (although it
|
.. note:: An *alarm group* is a mere implementation artifact (although it
|
||||||
has functional value) that is primarily used to distribute the alarms
|
has functional value) that is primarily used to distribute the alarms
|
||||||
|
@ -425,7 +441,7 @@ structure of that file.
|
||||||
important to keep exactly the same copy of
|
important to keep exactly the same copy of
|
||||||
``/etc/hiera/override/gse_filters.yaml`` across all the nodes of the
|
``/etc/hiera/override/gse_filters.yaml`` across all the nodes of the
|
||||||
OpenStack environment including the node(s) where Nagios is installed.
|
OpenStack environment including the node(s) where Nagios is installed.
|
||||||
|
|
||||||
The aggregation rules and correlation policies are defined in the ``/etc/hiera/override/gse_filters.yaml`` configuration file.
|
The aggregation rules and correlation policies are defined in the ``/etc/hiera/override/gse_filters.yaml`` configuration file.
|
||||||
|
|
||||||
This file has the following sections:
|
This file has the following sections:
|
||||||
|
@ -590,6 +606,7 @@ the service cluster aggregation rules::
|
||||||
output_metric_name: cluster_service_status
|
output_metric_name: cluster_service_status
|
||||||
interval: 10
|
interval: 10
|
||||||
warm_up_period: 20
|
warm_up_period: 20
|
||||||
|
alerting: enabled_with_notification
|
||||||
clusters:
|
clusters:
|
||||||
nova-api:
|
nova-api:
|
||||||
policy: highest_severity
|
policy: highest_severity
|
||||||
|
@ -638,6 +655,10 @@ Where
|
||||||
| The number of seconds after a (re)start that the GSE plugin will wait
|
| The number of seconds after a (re)start that the GSE plugin will wait
|
||||||
before emitting its metric messages.
|
before emitting its metric messages.
|
||||||
|
|
||||||
|
| alerting
|
||||||
|
| Type: string (one of 'disabled', 'enabled' or 'enabled_with_notification').
|
||||||
|
| The alerting configuration of the service clusters.
|
||||||
|
|
||||||
| clusters
|
| clusters
|
||||||
| Type: list
|
| Type: list
|
||||||
| The list of service clusters that the plugin handles. See
|
| The list of service clusters that the plugin handles. See
|
||||||
|
@ -720,6 +741,7 @@ cluster aggregation rules::
|
||||||
output_metric_name: cluster_node_status
|
output_metric_name: cluster_node_status
|
||||||
interval: 10
|
interval: 10
|
||||||
warm_up_period: 80
|
warm_up_period: 80
|
||||||
|
alerting: enabled_with_notification
|
||||||
clusters:
|
clusters:
|
||||||
controller:
|
controller:
|
||||||
policy: majority_of_members
|
policy: majority_of_members
|
||||||
|
@ -768,6 +790,10 @@ Where
|
||||||
| The number of seconds after a (re)start that the GSE plugin will wait
|
| The number of seconds after a (re)start that the GSE plugin will wait
|
||||||
before emitting its metric messages.
|
before emitting its metric messages.
|
||||||
|
|
||||||
|
| alerting
|
||||||
|
| Type: string (one of 'disabled', 'enabled' or 'enabled_with_notification').
|
||||||
|
| The alerting configuration of the node clusters.
|
||||||
|
|
||||||
| clusters
|
| clusters
|
||||||
| Type: list
|
| Type: list
|
||||||
| The list of node clusters that the plugin handles. See
|
| The list of node clusters that the plugin handles. See
|
||||||
|
|
|
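Editor's sketch (not part of the commit): the hunks above document ``alerting`` as a string restricted to 'disabled', 'enabled' or 'enabled_with_notification'. A minimal check of that constraint could look like the following Python snippet. The names ``ALLOWED_ALERTING`` and ``check_alerting`` are hypothetical and do not come from the plugin's code.

```python
# The three values listed in the documentation for the 'alerting' attribute.
ALLOWED_ALERTING = ("disabled", "enabled", "enabled_with_notification")


def check_alerting(value):
    """Return the value unchanged if it is a documented alerting mode,
    otherwise raise ValueError naming the allowed modes."""
    if value not in ALLOWED_ALERTING:
        raise ValueError(
            f"invalid alerting value {value!r}; expected one of {ALLOWED_ALERTING}"
        )
    return value


# Example: the value used in both aggregation-rule examples above.
print(check_alerting("enabled_with_notification"))
```

Running such a check before distributing ``gse_filters.yaml`` would catch a misspelled mode early, since the same file has to be kept identical across all nodes.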
@ -10,6 +10,34 @@ Release notes
|
||||||
Version 1.0.0
|
Version 1.0.0
|
||||||
+++++++++++++
|
+++++++++++++
|
||||||
|
|
||||||
|
The StackLight Collector plugin 1.0.0 for Fuel contains the following updates:
|
||||||
|
|
||||||
|
New alarms:
|
||||||
|
|
||||||
|
* Monitor RabbitMQ from the Pacemaker point of view
|
||||||
|
* Monitor all partitions and OSD disk(s)
|
||||||
|
* Horizon HTTP 5xx errors
|
||||||
|
* Keystone slow response times
|
||||||
|
* HDD errors
|
||||||
|
* SWAP percent usage
|
||||||
|
* Network packet drops
|
||||||
|
* Local OpenStack API checks
|
||||||
|
* Local checks for services: Apache, Memcached, MySQL, RabbitMQ, Pacemaker
|
||||||
|
|
||||||
|
Alarm enhancements:
|
||||||
|
|
||||||
|
* Added the ``group by`` attribute support for alarm rules
|
||||||
|
* Added support for ``pattern matching`` to filter metric dimensions
|
||||||
|
|
||||||
|
Bug fixes:
|
||||||
|
|
||||||
|
* Fixed the concurrent execution of logrotate.
|
||||||
|
See `#1455104 <https://bugs.launchpad.net/lma-toolchain/+bug/1455104>`_.
|
||||||
|
* Implemented the capability for the Elasticsearch bulk size to increase when
|
||||||
|
required. See `#1617211 <https://bugs.launchpad.net/lma-toolchain/+bug/1617211>`_.
|
||||||
|
* Implemented the capability to use RabbitMQ management API in place of the
|
||||||
|
:command:`rabbitmqctl` command.
|
||||||
|
|
||||||
Version 0.10.0
|
Version 0.10.0
|
||||||
++++++++++++++
|
++++++++++++++
|
||||||
|
|
||||||
|
|