[docs] Edits Alarms and Appendix

Edits the following sections of the StackLight Collector plugin 0.10.0
documentation:

* Configuring alarms
* Appendix

Change-Id: I534611a4eae9aeb97bfedb3971d7a8ec76e20bac
parent 5c0d43aaec
commit 8581289600
@@ -1,9 +1,13 @@
 .. _alarms:
 
+.. raw:: latex
+
+   \pagebreak
+
 List of built-in alarms
 -----------------------
 
-Here is a list of all the alarms that are built-in in StackLight::
+The following is a list of StackLight built-in alarms::
 
   alarms:
   - name: 'cpu-critical-controller'
@@ -732,5 +736,4 @@ Here is a list of all the alarms that are built-in in StackLight::
     threshold: 5
     window: 60
     periods: 0
-    function: min
-
+    function: min
@@ -3,8 +3,8 @@
 List of metrics
 ---------------
 
-Here is a list of metrics that are emitted by the StackLight Collector.
-They are listed by category then by metric name.
+The following is a list of metrics that are emitted by the StackLight Collector.
+The metrics are listed by category, then by metric name.
 
 System
 ++++++
@@ -63,7 +63,7 @@ Clusters
 
 .. include:: metrics/clusters.rst
 
-Self Monitoring
+Self-monitoring
 +++++++++++++++
 
 .. include:: metrics/lma.rst
@@ -78,4 +78,4 @@ Elasticsearch
 InfluxDB
 ++++++++
 
-.. include:: metrics/influxdb.rst
\ No newline at end of file
+.. include:: metrics/influxdb.rst
@@ -3,139 +3,130 @@
 Overview
 --------
 
-The process of running alarms in StackLight is not centralized
-(as it is often the case in more conventional monitoring systems)
-but distributed across all the StackLight Collectors.
+The process of running alarms in StackLight is not centralized, as it is often
+the case in more conventional monitoring systems, but distributed across all
+the StackLight Collectors.
 
-Each Collector is individually responsible for monitoring the
-resources and the services that are deployed on the node and for reporting
-any anomaly or fault it has detected to the Aggregator.
+Each Collector is individually responsible for monitoring the resources and
+services that are deployed on the node and for reporting any anomaly or fault
+it has detected to the Aggregator.
 
-The anomaly and fault detection logic in StackLight is designed
-more like an *expert system* in that the Collector and the Aggregator
-use artifacts we could refer to as *facts* and *rules*.
+The anomaly and fault detection logic in StackLight is designed more like an
+*expert system* in that the Collector and the Aggregator use artifacts we
+can refer to as *facts* and *rules*.
 
 The *facts* are the operational data ingested in the StackLight's
-stream processing pipeline.
-The *rules* are either alarm rules or aggregation rules.
-They are declaratively defined in YAML files that can be modified.
-Those rules are turned into a collection of Lua plugins
-that are executed by the Collector and the Aggregator.
-They are generated dynamically using the Puppet modules of the StackLight
-Collector Plugin.
+stream-processing pipeline. The *rules* are either alarm rules or aggregation
+rules. They are declaratively defined in YAML files that can be modified.
+Those rules are turned into a collection of Lua plugins that are executed by
+the Collector and the Aggregator. They are generated dynamically using the
+Puppet modules of the StackLight Collector Plugin.
 
-There are two types of Lua plugins related to the processing
-of alarms.
+The following are the two types of Lua plugins related to the processing of
+alarms:
 
-* The **AFD plugin** for Anomaly and Fault Detection plugin.
-* The **GSE plugin** for Global Status Evaluation plugin.
+* The **AFD plugin** -- Anomaly and Fault Detection plugin
+* The **GSE plugin** -- Global Status Evaluation plugin
 
-These plugins create a special type of metric called respectively
-the **AFD metric** and the **GSE metric**.
+These plugins create special types of metrics, as follows:
 
-* The AFD metric contains information about the health status
-  of a node or service in the OpenStack environment.
-  The AFD metrics are sent on a regular basis to the Aggregator
-  where they are further processed by the GSE plugins.
-* The GSE metric contains information about the health status
-  of a cluster in the OpenStack environment. A cluster is a
-  logical grouping of nodes or services. We call
-  them node clusters and service clusters hereafter.
-  A service cluster can be anything like a cluster of API endpoints
-  or a cluster of workers. A cluster of nodes is a grouping of
-  nodes that have the same role. For example 'compute' or 'storage'.
+* The **AFD metric**, which contains information about the health status of a
+  node or service in the OpenStack environment. The AFD metrics are sent on a
+  regular basis to the Aggregator where they are further processed by the GSE
+  plugins.
+* The **GSE metric**, which contains information about the health status of a
+  cluster in the OpenStack environment. A cluster is a logical grouping of
+  nodes or services. We call them node clusters and service clusters hereafter.
+  A service cluster can be anything like a cluster of API endpoints or a
+  cluster of workers. A cluster of nodes is a grouping of nodes that have the
+  same role. For example, *compute* or *storage*.
 
-.. note:: The AFD and GSE metrics are new types of metrics introduced
-   in StackLight version 0.8.
-   They contain detailed information about the fault and anomalies
-   detected by StackLight. Please refer to the
+.. note:: The AFD and GSE metrics are new types of metrics introduced in
+   StackLight version 0.8. They contain detailed information about the fault
+   and anomalies detected by StackLight. For more information about the
+   message structure of these metrics, refer to
    `Metrics section of the Developer Guide
-   <http://lma-developer-guide.readthedocs.io/en/latest/metrics.html>`_
-   for more information about the message structure of these metrics.
+   <http://lma-developer-guide.readthedocs.io/en/latest/metrics.html>`_.
 
-The StackLight stream processing pipeline workflow is shown in the figure below:
+The following figure shows the StackLight stream-processing pipeline workflow:
 
 .. figure:: ../../images/AFD_and_GSE_message_flow.*
    :width: 800
    :alt: Message flow for the AFD and GSE metrics
   :align: center
 
+.. raw:: latex
+
+   \pagebreak
+
 The AFD and GSE plugins
 -----------------------
 
-In the current version of StackLight, there are three types of GSE plugins:
+The current version of StackLight contains the following three types of GSE
+plugins:
 
-* The **Service Cluster GSE Plugin** which receives AFD metrics for services
+* The **Service Cluster GSE Plugin**, which receives AFD metrics for services
   from the AFD plugins.
-* The **Node Cluster GSE Plugin** which receives AFD metrics for nodes
+* The **Node Cluster GSE Plugin**, which receives AFD metrics for nodes
   from the AFD plugins.
-* The **Global Cluster GSE Plugin** which receives GSE metrics from the
-  GSE plugins above. It aggregates and correlates the GSE metrics to issue a global
-  health status for the top-level clusters like Nova, MySQL and so forth.
+* The **Global Cluster GSE Plugin**, which receives GSE metrics from the
+  GSE plugins above. It aggregates and correlates the GSE metrics to issue a
+  global health status for the top-level clusters like Nova, MySQL, and others.
 
-The health status exposed in the GSE metrics is as follow:
+The health status exposed in the GSE metrics is as follows:
 
-* *Down*: One or several primary functions of a cluster has failed or is failing.
-  For example, the API service for Nova or Cinder isn't accessible.
-* *Critical*: One or several primary functions of a
-  cluster are severely degraded. The quality
-  of service delivered to the end-user is severely impacted.
-* *Warning*: One or several primary functions of the
-  cluster are slightly degraded. The quality
-  of service delivered to the end-user is slightly
-  impacted.
-* *Unknown*: There is not enough data to infer the actual
-  health status of the cluster.
-* *Okay*: None of the above was found to be true.
+* ``Down``: One or several primary functions of a cluster has failed or is
+  failing. For example, the API service for Nova or Cinder is not accessible.
+* ``Critical``: One or several primary functions of a cluster are severely
+  degraded. The quality of service delivered to the end user is severely
+  impacted.
+* ``Warning``: One or several primary functions of the cluster are slightly
+  degraded. The quality of service delivered to the end user is slightly
+  impacted.
+* ``Unknown``: There is not enough data to infer the actual health status of
+  the cluster.
+* ``Okay``: None of the above was found to be true.
 
 The AFD and GSE persisters
 --------------------------
 
-The AFD and GSE metrics are also consumed by other types
-of Lua plugins called the **persisters**.
+The AFD and GSE metrics are also consumed by other types of Lua plugins called
+**persisters**:
 
-* The **InfluxDB persister** transforms the GSE metrics
-  into InfluxDB data-points and Grafana annotations. They
-  are used in Grafana to graph the health status of
-  the OpenStack clusters.
-* The **Elasticsearch persister** transforms the AFD metrics
-  into events that are indexed in Elasticsearch. Using Kibana,
-  these events can be searched to display a fault or an anomaly
-  that occured in the environment (not implemented yet).
-* The **Nagios persister** transforms the GSE and AFD metrics
-  into passive checks that are sent to Nagios for alerting and
-  escalation.
+* The **InfluxDB persister** transforms the GSE metrics into InfluxDB data
+  points and Grafana annotations. They are used in Grafana to graph the health
+  status of the OpenStack clusters.
+* The **Elasticsearch persister** transforms the AFD metrics into events that
+  are indexed in Elasticsearch. Using Kibana, these events can be searched to
+  display a fault or an anomaly that occurred in the environment (not yet
+  implemented).
+* The **Nagios persister** transforms the GSE and AFD metrics into passive
+  checks that are sent to Nagios for alerting and escalation.
 
-New persisters could be created easely to feed other
-systems with the operational insight contained in the
-AFD and GSE metrics.
+New persisters can be easily created to feed other systems with the
+operational insight contained in the AFD and GSE metrics.
 
 .. _alarm_configuration:
 
 Alarms configuration
 --------------------
 
-StackLight comes with a predefined set of alarm rules.
-We have tried to make these rules as comprehensive and relevant
-as possible, but your mileage may vary depending on the specifics of
-your OpenStack environment and monitoring requirements.
-Therefore, it is possible to modify those predefined rules
-and create new ones.
-To do so, you will be required to modify the
-``/etc/hiera/override/alarming.yaml`` file
-and apply the :ref:`Puppet manifest <puppet_apply>`
-that will dynamically generate Lua plugins known as
-the AFD Plugins which are the actuators of the alarm rules.
-But before you proceed, you need to understand the structure
-of that file.
+StackLight comes with a predefined set of alarm rules. We have tried to make
+these rules as comprehensive and relevant as possible, but your mileage may
+vary depending on the specifics of your OpenStack environment and monitoring
+requirements. Therefore, it is possible to modify those predefined rules and
+create new ones. To do so, modify the ``/etc/hiera/override/alarming.yaml``
+file and apply the :ref:`Puppet manifest <puppet_apply>` that will dynamically
+generate Lua plugins, known as the AFD Plugins, which are the actuators of the
+alarm rules. But before you proceed, verify that you understand the structure
+of that file.
 
 .. _alarm_structure:
 
 Alarm structure
 +++++++++++++++
 
-An alarm rule is defined declaratively using the YAML syntax
-as shown in the example below::
+An alarm rule is defined declaratively using the YAML syntax. For example::
 
   name: 'fs-warning'
   description: 'Filesystem free space is low'
@@ -180,7 +171,7 @@ as shown in the example below::
 
 | logical_operator
 | Type: Enum('and' | '&&' | 'or' | '||')
-| The conjonction relation for the alarm rules.
+| The conjunction relation for the alarm rules
 
 | metric
 | Type: unicode
@@ -192,24 +183,25 @@ as shown in the example below::
 
 | fields
 | Type: list
-| List of field name / value pairs (a.k.a dimensions) used to select
-  a particular device for the metric such as a network interface name or file
-  system mount point. If value is specified as an empty string (""), then the rule
-  is applied to all the aggregated values for the specified field name. For example
-  the file system mount point.
-  If value is specified as the '*' wildcard character,
-  then the rule is applied to each of the metrics matching the metric name and field name.
-  For example, the alarm definition sample given above would run the rule
-  for each of the file system mount points associated with the *fs_space_percent_free* metric.
+| List of field name / value pairs, also known as dimensions, used to select
+  a particular device for the metric, such as a network interface name or
+  file system mount point. If the value is specified as an empty string (""),
+  then the rule is applied to all the aggregated values for the specified
+  field name. For example, the file system mount point. If the value is
+  specified as the '*' wildcard character, then the rule is applied to each
+  of the metrics matching the metric name and field name. For example, the
+  alarm definition sample given above would run the rule for each of the
+  file system mount points associated with the *fs_space_percent_free*
+  metric.
 
 | window
 | Type: integer
-| The in memory time-series analysis window in seconds
+| The in-memory time-series analysis window in seconds
 
 | periods
 | Type: integer
-| The number of prior time-series analysis window to compare the window with (this is
-| not implemented yet)
+| The number of prior time-series analysis window to compare the window with
+| (this is not implemented yet).
 
 | function
 | Type: enum('last' | 'min' | 'max' | 'sum' | 'count' | 'avg' | 'median' | 'mode' | 'roc' | 'mww' | 'mww_nonparametric')
@@ -232,46 +224,49 @@ as shown in the example below::
 | returns the value that occurs most often in all the values
 | (not implemented yet)
 | roc:
-| The 'roc' function detects a significant rate
-  of change when comparing current metrics values with historical data.
-  To achieve this, it computes the average of the values in the current window,
-  and the average of the values in the window before the current window and
-  compare the difference against the standard deviation of the
-  historical window. The function returns true if the difference
+| The 'roc' function detects a significant rate of change when comparing
+  current metrics values with historical data. To achieve this, it
+  computes the average of the values in the current window and the
+  average of the values in the window before the current window and
+  compares the difference against the standard deviation of the
+  historical window. The function returns ``true`` if the difference
   exceeds the standard deviation multiplied by the 'threshold' value.
   This function uses the rate of change algorithm already available in the
-  anomaly detection module of Heka. It can only be applied on normal
-  distributions.
-  With an alarm rule using the 'roc' function, the 'window' parameter
-  specifies the duration in seconds of the current window and the 'periods'
-  parameter specifies the number of windows used for the historical data.
-  You need at least one period and so, the 'periods' parameter must not be zero.
-  If you choose a period of 'p', the function will compute the rate of
-  change using an historical data window of ('p' * window) seconds.
-  For example, if you specify in the alarm rule:
+  anomaly detection module of Heka. It can only be applied to normal
+  distributions. With an alarm rule using the 'roc' function, the
+  'window' parameter specifies the duration in seconds of the current
+  window, and the 'periods' parameter specifies the number of windows
+  used for the historical data. You need at least one period and the
+  'periods' parameter must not be zero. If you choose a period of 'p',
+  the function will compute the rate of change using a historical data
+  window of ('p' * window) seconds. For example, if you specify the
+  following in the alarm rule:
 
 | window = 60
 | periods = 3
 | threshold = 1.5
 
-| The function will store in a circular buffer the value of the metrics
+| the function will store in a circular buffer the value of the metrics
   received during the last 300 seconds (5 minutes) where:
 
 | Current window (CW) = 60 sec
 | Previous window (PW) = 60 sec
 | Historical window (HW) = 180 sec
 
-| And apply the following formula:
+| and apply the following formula:
 
 | abs(avg(CW) - avg(PW)) > std(HW) * 1.5 ? true : false
 | mww:
-| returns the result (true, false) of the Mann-Whitney-Wilcoxon test function
-  of Heka that can be used only with normal distributions (not implemented yet)
+| returns the result (true, false) of the Mann-Whitney-Wilcoxon test
+  function of Heka that can be used only with normal distributions (not
+  implemented yet)
 | mww-nonparametric:
-| returns the result (true, false) of the Mann-Whitney-Wilcoxon
-  test function of Heka that can be used with non-normal distributions (not implemented yet)
+| returns the result (true, false) of the Mann-Whitney-Wilcoxon test
+  function of Heka that can be used with non-normal distributions (not
+  implemented yet)
 | diff:
-| returns the difference between the last value and the first value of all the values
+| returns the difference between the last value and the first value of
+  all the values
 
 | threshold
 | Type: float
@@ -281,15 +276,13 @@ as shown in the example below::
 Modify or create an alarm
 +++++++++++++++++++++++++
 
-To modify (or create) an alarm, you need to edit the
-``/etc/hiera/override/alarming.yaml`` file.
-This file has four sections:
+To modify or create an alarm, edit the ``/etc/hiera/override/alarming.yaml``
+file. This file has the following sections:
 
-1. The *alarms* section contains a global list of alarms that
-   are executed by the Collectors. These alarms are global to
-   the LMA toolchain and should be kept identical
-   on all nodes of the OpenStack environment.
-   Here is another example of the definition of an alarm::
+#. The ``alarms`` section contains a global list of alarms that are executed
+   by the Collectors. These alarms are global to the LMA toolchain and should
+   be kept identical on all nodes of the OpenStack environment. The following
+   is another example of the definition of an alarm::
 
      alarms:
      - name: 'cpu-critical-controller'
@@ -312,30 +305,29 @@ This file has four sections:
       periods: 0
       function: avg
 
-   This alarm is called 'cpu-critical-controller'.
-   It says that CPU activity is critical (severity: 'critical')
-   if any of the rules in the alarm definition evaluates to true.
+   This alarm is called 'cpu-critical-controller'. It says that CPU activity
+   is critical (severity: 'critical') if any of the rules in the alarm
+   definition evaluate to true.
 
-   The rule says that the alarm
-   will evaluate to 'true' if the value of the metric *cpu_idle*
-   has been in average (function: avg) below or equal
+   The rule says that the alarm will evaluate to 'true' if the value of the
+   metric ``cpu_idle`` has been on average (function: avg), below or equal
    (relational_operator: <=) to 5 for the last 5 minutes (window: 120).
 
    OR (logical_operator: 'or')
 
-   If the value of the metric **cpu_wait** has been in average
-   (function: avg) superior or equal (relational_operator: >=) to 35
-   for the last 5 minutes (window: 120)
+   If the value of the metric **cpu_wait** has been on average (function: avg),
+   superior or equal (relational_operator: >=) to 35 for the last 5 minutes
+   (window: 120)
 
    Note that these metrics are expressed in percentage.
 
-   What alarms are executed on which node depends on
-   the mapping between the alarm definition and the
-   definition of a cluster as described in the following sections.
+   What alarms are executed on which node depends on the mapping between the
+   alarm definition and the definition of a cluster as described in the
+   following sections.
 
-2. The *node_cluster_roles* section defines the mapping between
-   the internal definition of a cluster of nodes and one or
-   several Fuel roles. For example::
+#. The ``node_cluster_roles`` section defines the mapping between the internal
+   definition of a cluster of nodes and one or several Fuel roles.
+   For example::
 
      node_cluster_roles:
       controller: ['primary-controller', 'controller']
@@ -343,22 +335,19 @@ This file has four sections:
       storage: ['cinder', 'ceph-osd']
       [ ... ]
 
-   Creates a mapping between the 'primary-controller'
-   and 'controller' Fuel roles and the internal defintion of a cluster
-   of nodes called 'controller'.
-   Likewise, the internal definition of a cluster of nodes called
-   'storage' is mapped to the 'cinder' and 'ceph-osd' Fuel roles.
-   The internal definition of a cluster of nodes is used to assign
-   the alarms to the relevant category of nodes.
-   This mapping is also used to configure the **passive checks**
-   in Nagios. This is the reason why, it is criticaly important
-   to keep the exact same copy of ``/etc/hiera/override/alarming.yaml``
-   across all the nodes of the OpenStack environment including the
-   node(s) where Nagios is installed.
+   Creates a mapping between the 'primary-controller' and 'controller' Fuel
+   roles, and the internal definition of a cluster of nodes called 'controller'.
+   Likewise, the internal definition of a cluster of nodes called 'storage' is
+   mapped to the 'cinder' and 'ceph-osd' Fuel roles. The internal definition
+   of a cluster of nodes is used to assign the alarms to the relevant category
+   of nodes. This mapping is also used to configure the **passive checks**
+   in Nagios. Therefore, it is critically important to keep exactly the same
+   copy of ``/etc/hiera/override/alarming.yaml`` across all nodes of the
+   OpenStack environment including the node(s) where Nagios is installed.
 
-3. The *service_cluster_roles* section defines the mapping between
-   the internal definition of a cluster of services and one or
-   several Fuel roles. For example::
+#. The ``service_cluster_roles`` section defines the mapping between the
+   internal definition of a cluster of services and one or several Fuel roles.
+   For example::
 
     service_cluster_roles:
      rabbitmq: ['primary-controller', 'controller']
@@ -366,18 +355,17 @@ This file has four sections:
       elasticsearch: ['primary-elasticsearch_kibana', 'elasticsearch_kibana']
       [ ... ]
 
-   Creates a mapping between the 'primary-controller'
-   and 'controller' Fuel roles and the internal defintion of a cluster
-   of services called 'rabbitmq'.
+   Creates a mapping between the 'primary-controller' and 'controller' Fuel
+   roles, and the internal definition of a cluster of services called 'rabbitmq'.
    Likewise, the internal definition of a cluster of services called
-   'elasticsearch' is mapped to the 'primary-elasticsearch_kibana'
-   and 'elasticsearch_kibana' Fuel roles.
-   As for the clusters of nodes, the internal definition of a cluster
-   of services is used to assign the alarns to the relevant category of services.
+   'elasticsearch' is mapped to the 'primary-elasticsearch_kibana' and
+   'elasticsearch_kibana' Fuel roles. As for the clusters of nodes, the
+   internal definition of a cluster of services is used to assign the alarms
+   to the relevant category of services.
 
-4. The *node_cluster_alarms* section defines the mapping between
-   the internal definition of a cluster of nodes and the alarms that
-   are assigned to that category of nodes. For example::
+#. The ``node_cluster_alarms`` section defines the mapping between the
+   internal definition of a cluster of nodes and the alarms that are assigned
+   to that category of nodes. For example::
 
      node_cluster_alarms:
       controller:
@ -385,121 +373,105 @@ This file has four sections:
|
|||
root-fs: ['root-fs-critical', 'root-fs-warning']
|
||||
log-fs: ['log-fs-critical', 'log-fs-warning']
|
||||
|
||||
Creates three alarm groups for the cluster of nodes called
|
||||
'controller'.
|
||||
Creates three alarm groups for the cluster of nodes called 'controller':
|
||||
|
||||
* The *cpu* alarm group is mapped to two alarms defined in the
|
||||
*alarms* section known as the 'cpu-critical-controller' and
|
||||
'cpu-warning-controller' alarms. Those alarms monitor the
|
||||
CPU on the controller nodes. Note that the order matters
|
||||
here since the first alarm which evaluates to 'true' stops
|
||||
the evaluation. Hence, it is important to start the list
|
||||
with the most critical alarms.
|
||||
* The *root-fs* alarm group is mapped to two alarms defined
|
||||
in the *alarms* section known as the 'root-fs-critical'
|
||||
and 'root-fs-warning' alarms. Those alarms monitor the
|
||||
root file system on the controller nodes.
|
||||
* The *log-fs* alarm group is mapped to two alarms defined
|
||||
in the *alarms* section known as the 'log-fs-critical' and
|
||||
'log-fs-warning' alarms. Those alarms monitor the file
|
||||
system where the logs are created on the controller
|
||||
nodes.
|
||||
* The *cpu* alarm group is mapped to two alarms defined in the ``alarms``
|
||||
section known as the 'cpu-critical-controller' and
|
||||
'cpu-warning-controller' alarms. These alarms monitor the CPU on the
|
||||
controller nodes. The order matters here since the first alarm that
|
||||
evaluates to 'true' stops the evaluation. Therefore, it is important
|
||||
to start the list with the most critical alarms.
|
||||
* The *root-fs* alarm group is mapped to two alarms defined in the
|
||||
``alarms`` section known as the 'root-fs-critical' and 'root-fs-warning'
|
||||
alarms. These alarms monitor the root file system on the controller nodes.
|
||||
* The *log-fs* alarm group is mapped to two alarms defined in the ``alarms``
|
||||
section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These
|
||||
alarms monitor the file system where the logs are created on the
|
||||
controller nodes.
|
||||
|
||||
.. note:: An *alarm group* is a mere implementaton artifact
|
||||
(although it has several functional usefulness) that is
|
||||
primarily used to distribute the alarms evaluation workload
|
||||
across several Lua plugins. Since the Lua plugins
|
||||
runtime is sandboxed within Heka, it is preferable to run
|
||||
smaller sets of alarms in different plugins rather than a
|
||||
large set of alarms in a single plugin. This is to avoid
|
||||
having alarms evaluation plugins shutdown by Heka.
|
||||
Furthermore, the alarm groups are used to identify what is
|
||||
called a *source*. A *source* is a tuple in which we associate
|
||||
a cluster with an alarm group. For example the tuple ['controller', 'cpu']
|
||||
is a *source*. It associates a 'controller' cluster with the 'cpu'
|
||||
alarm group. The tuple ['controller', 'root-fs'] is another *source*
|
||||
example. The *source* is used by the GSE Plugins to remember the
|
||||
AFD metrics it has received. If a GSE Plugin stops receiving
|
||||
AFD metrics it used to get, then the GSE Plugin will
|
||||
infer that the health status for the cluster associated
|
||||
with the source is *Unknown*.
|
||||
.. note:: An *alarm group* is a mere implementation artifact (although it
|
||||
has functional value) that is primarily used to distribute the alarms
|
||||
evaluation workload across several Lua plugins. Since the Lua plugins
|
||||
runtime is sandboxed within Heka, it is preferable to run smaller sets
|
||||
of alarms in different plugins rather than a large set of alarms in a
|
||||
single plugin. This is to avoid having alarms evaluation plugins
|
||||
shut down by Heka. Furthermore, the alarm groups are used to identify
|
||||
what is called a *source*. A *source* is a tuple in which we associate
|
||||
a cluster with an alarm group. For example, the tuple
|
||||
['controller', 'cpu'] is a *source*. It associates a 'controller'
|
||||
cluster with the 'cpu' alarm group. The tuple ['controller', 'root-fs']
|
||||
is another *source* example. The *source* is used by the GSE Plugins to
|
||||
remember the AFD metrics it has received. If a GSE Plugin stops receiving
|
||||
AFD metrics it used to get, then the GSE Plugin infers that the health
|
||||
status of the cluster associated with the source is *Unknown*.
|
||||
|
||||
This is evaluated every *ticker-interval*. By default,
|
||||
the *ticker interval* for the GSE Plugins is set to
|
||||
10 seconds.
|
||||
This is evaluated every *ticker-interval*. By default, the
|
||||
*ticker interval* for the GSE Plugins is set to 10 seconds.

.. _aggreg_correl_config:

Aggregation and correlation configuration
-----------------------------------------

StackLight comes with a predefined set of aggregation rules and correlation
policies. However, you can create new aggregation rules and correlation
policies or modify the existing ones. To do so, modify the
``/etc/hiera/override/gse_filters.yaml`` file and apply the
:ref:`Puppet manifest <puppet_apply>` that will generate the Lua plugins known
as the GSE Plugins, which are the actuators of these aggregation rules and
correlation policies. Before you proceed, verify that you understand the
structure of that file.

.. note:: As for ``/etc/hiera/override/alarming.yaml``, it is critically
   important to keep exactly the same copy of
   ``/etc/hiera/override/gse_filters.yaml`` across all the nodes of the
   OpenStack environment, including the node(s) where Nagios is installed.

The aggregation rules and correlation policies are defined in the
``/etc/hiera/override/gse_filters.yaml`` configuration file.

This file has the following sections:

#. The ``gse_policies`` section contains the :ref:`health status correlation
   policies <gse_policies>` that apply to the node clusters and service
   clusters.
#. The ``gse_cluster_service`` section contains the :ref:`aggregation rules
   <gse_cluster_service>` for the service clusters. These aggregation rules
   are actuated by the Service Cluster GSE Plugin that runs on the Aggregator.
#. The ``gse_cluster_node`` section contains the :ref:`aggregation rules
   <gse_cluster_node>` for the node clusters. These aggregation rules are
   actuated by the Node Cluster GSE Plugin that runs on the Aggregator.
#. The ``gse_cluster_global`` section contains the :ref:`aggregation
   rules <gse_cluster_global>` for the so-called top-level clusters. A global
   cluster is a logical construct of node clusters and service clusters.
   These aggregation rules are actuated by the Global Cluster GSE Plugin that
   runs on the Aggregator.

.. _gse_policies:

Health status policies
++++++++++++++++++++++

The correlation logic implemented by the GSE plugins is policy-based. The
policies define how the GSE plugins infer the health status of a cluster.

By default, there are two policies:

* The **highest_severity** policy defines that the cluster's status depends on
  the member with the highest severity. It is typically used for a cluster of
  services.
* The **majority_of_members** policy defines that the cluster is healthy as
  long as (N+1)/2 members of the cluster are healthy. It is typically used
  for clusters managed by Pacemaker.

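As an illustration of the majority rule, the quorum check can be sketched in a
few lines of Python. This is a toy model only; the function name and the
'okay' status label are assumptions for the example, not part of StackLight:

```python
def majority_healthy(statuses):
    # A cluster of N members is deemed healthy when at least (N + 1) / 2
    # of them are healthy; integer division rounds the quorum up for
    # odd-sized clusters, as with a Pacemaker quorum.
    healthy = sum(1 for s in statuses if s == 'okay')
    return healthy >= (len(statuses) + 1) // 2

print(majority_healthy(['okay', 'okay', 'down']))  # 2 of 3 healthy: True
print(majority_healthy(['okay', 'down', 'down']))  # 1 of 3 healthy: False
```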

A policy consists of a list of rules that are evaluated against the current
status of the cluster's members. When one of the rules matches, the cluster's
status gets the value associated with the rule and the evaluation stops. The
last rule of the list is usually a catch-all rule that defines the default
status if none of the previous rules matches.

The following example shows a policy rule definition::

  # The following rule definition reads as: "the cluster's status is critical
  # if more than 50% of its members are either down or critical"
  - status: critical
    trigger:
      logical_operator: or

@@ -517,7 +489,7 @@ Where

| logical_operator
| Type: Enum('and' | '&&' | 'or' | '||')
| The conjunction relation for the condition rules

| rules
| Type: list

@@ -543,7 +515,7 @@ Where

| Type: float
| The threshold value

Consider the policy called *highest_severity*::

  gse_policies:

@@ -582,28 +554,31 @@

            threshold: 0
    - status: unknown

The policy definition reads as follows:

* The status of the cluster is ``Down`` if the status of at least one
  cluster's member is ``Down``.
* Otherwise, the status of the cluster is ``Critical`` if the status of at
  least one cluster's member is ``Critical``.
* Otherwise, the status of the cluster is ``Warning`` if the status of at
  least one cluster's member is ``Warning``.
* Otherwise, the status of the cluster is ``Okay`` if the status of at least
  one cluster's member is ``Okay``.
* Otherwise, the status of the cluster is ``Unknown``.

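To make the first-match semantics concrete, here is a minimal Python sketch of
how such a policy could be evaluated. It is an illustration only: the real
evaluation runs as Lua code inside Heka, and the rule schema is simplified
here to a status plus a percentage threshold per condition:

```python
def percent_with_status(statuses, status):
    # Share (in percent) of cluster members reporting the given status.
    return 100.0 * statuses.count(status) / len(statuses)

def evaluate_policy(rules, statuses):
    # Rules are evaluated in order; the first matching rule decides the
    # cluster status. A rule without a trigger acts as a catch-all.
    for rule in rules:
        trigger = rule.get('trigger')
        if trigger is None:
            return rule['status']
        matched = [percent_with_status(statuses, c['status']) > c['threshold']
                   for c in trigger['rules']]
        combine = any if trigger['logical_operator'] in ('or', '||') else all
        if combine(matched):
            return rule['status']
    return 'unknown'

policy = [
    {'status': 'critical',
     'trigger': {'logical_operator': 'or',
                 'rules': [{'status': 'down', 'threshold': 50},
                           {'status': 'critical', 'threshold': 50}]}},
    {'status': 'okay'},  # catch-all rule
]
print(evaluate_policy(policy, ['down', 'down', 'okay']))  # critical
```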

.. _gse_cluster_service:

Service cluster aggregation rules
+++++++++++++++++++++++++++++++++

The service cluster aggregation rules are used to designate the members of a
service cluster along with the AFD metrics that must be taken into account to
derive a health status for the service cluster. The following is an example of
the service cluster aggregation rules::

  gse_cluster_service:
    input_message_types:

@@ -673,7 +648,7 @@ Where

Service cluster definition
++++++++++++++++++++++++++

The following example shows the service clusters definition::

  gse_cluster_service:
    [...]

@@ -691,36 +666,36 @@ Where

| members
| Type: list
| The list of cluster members.
  The AFD messages are associated with the cluster when the
  ``cluster_field`` value is equal to the cluster name and the
  ``member_field`` value is in this list.

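The association rule for ``members`` can be sketched in Python. The message
layout and the default field names below are simplified assumptions for the
example, not the actual Heka message schema:

```python
def belongs_to_cluster(msg_fields, cluster_name, members,
                       cluster_field='service', member_field='source'):
    # An AFD metric is attributed to a service cluster when the value of
    # cluster_field equals the cluster name and the value of member_field
    # is one of the declared members.
    return (msg_fields.get(cluster_field) == cluster_name and
            msg_fields.get(member_field) in members)

fields = {'service': 'nova-api', 'source': 'backends'}
print(belongs_to_cluster(fields, 'nova-api',
                         ['backends', 'endpoint', 'http_errors']))  # True
```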

| group_by
| Type: Enum(member, hostname)
| This parameter defines how the incoming AFD metrics are aggregated.

| member:
| aggregation by member, irrespective of the host that emitted the AFD
| metric. This setting is typically used for AFD metrics that are not
| host-centric.

| hostname:
| aggregation by hostname, then by member. This setting is typically used
| for AFD metrics that are host-centric, such as those working on the
| file system or CPU usage metrics.

| policy:
| Type: unicode
| The policy to use for computing the service cluster status.
  See :ref:`gse_policies` for details.

If you look more closely into the example above, it defines that the Service
Cluster GSE plugin resulting from those rules will emit a
*gse_service_cluster_metric* message every 10 seconds to report the current
status of the *nova-api* cluster. This status is computed using the
*afd_service_metric* metric for which Fields[service] is 'nova-api' and
Fields[source] is one of 'backends', 'endpoint', or 'http_errors'. The
'nova-api' cluster's status is computed using the 'highest_severity' policy,
which means that it will be equal to the 'worst' status across all members.

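As a side note on the ``group_by`` parameter described above, its effect on
how incoming AFD metrics are bucketed can be sketched as follows. The field
names are simplified assumptions for the example:

```python
def aggregation_key(fields, group_by):
    # 'member' buckets metrics by member only; 'hostname' buckets them by
    # host first and then by member, so host-centric alarms stay per-host.
    if group_by == 'hostname':
        return (fields['hostname'], fields['source'])
    return (fields['source'],)

fields = {'hostname': 'node-1', 'source': 'cpu'}
print(aggregation_key(fields, 'hostname'))  # ('node-1', 'cpu')
print(aggregation_key(fields, 'member'))    # ('cpu',)
```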

.. _gse_cluster_node:

@@ -728,11 +703,10 @@ status across all members.

Node cluster aggregation rules
++++++++++++++++++++++++++++++

The node cluster aggregation rules are used to designate the members of a node
cluster along with the AFD metrics that must be taken into account to derive
a health status for the node cluster. The following is an example of the node
cluster aggregation rules::

  gse_cluster_node:
    input_message_types:

@@ -804,7 +778,7 @@ Where

Node cluster definition
+++++++++++++++++++++++

The following example shows the node clusters definition::

  gse_cluster_node:
    [...]

@@ -822,36 +796,35 @@ Where

| members
| Type: list
| The list of cluster members.
  The AFD messages are associated with the cluster when the ``cluster_field``
  value is equal to the cluster name and the ``member_field`` value is in
  this list.

| group_by
| Type: Enum(member, hostname)
| This parameter defines how the incoming AFD metrics are aggregated.

| member:
| aggregation by member, irrespective of the host that emitted the AFD
| metric. This setting is typically used for AFD metrics that are not
| host-centric.

| hostname:
| aggregation by hostname, then by member. This setting is typically used
| for AFD metrics that are host-centric, such as those working on the
| file system or CPU usage metrics.

| policy:
| Type: unicode
| The policy to use for computing the node cluster status.
  See :ref:`gse_policies` for details.

If you look more closely into the example above, it defines that the Node
Cluster GSE plugin resulting from those rules will emit a
*gse_node_cluster_metric* message every 10 seconds to report the current
status of the *controller* cluster. This status is computed using the
*afd_node_metric* metric for which Fields[node_role] is 'controller' and
Fields[source] is one of 'cpu', 'root-fs', or 'log-fs'. The 'controller'
cluster's status is computed using the 'majority_of_members' policy, which
means that it will be equal to the 'majority' status across all members.

.. _gse_cluster_global:

@@ -859,23 +832,20 @@ status across all members.

Top-level cluster aggregation rules
+++++++++++++++++++++++++++++++++++

The top-level aggregation rules aggregate GSE metrics from the Service
Cluster GSE Plugin and the Node Cluster GSE Plugin. This is the last
aggregation stage that issues health statuses for the top-level clusters.
A top-level cluster is a logical construct of service and node clustering.
By default, the health status of Nova, as a top-level cluster, depends on
the health status of several service clusters related to Nova and on the
health status of the 'controller' and 'compute' node clusters. But it can be
anything. For example, you can define a 'control-plane' top-level cluster
that excludes the health status of the 'compute' node cluster if required.
In summary, the top-level cluster aggregation rules are used to designate the
node clusters and service clusters that are members of a top-level cluster,
along with the GSE metrics that must be taken into account to derive a health
status for the top-level cluster. The following is an example of top-level
cluster aggregation rules::

  gse_cluster_global:
    input_message_types:

@@ -954,7 +924,7 @@ Where

Top-level cluster definition
++++++++++++++++++++++++++++

The following example shows the top-level clusters definition::

  gse_cluster_global:
    [...]

@@ -987,15 +957,16 @@ Where

| members
| Type: list
| The list of cluster members.
| The GSE messages are associated with the cluster when the ``member_field``
| value (the ``cluster_name``) is in this list.

| hints
| Type: list
| The list of clusters that are indirectly associated with the top-level
| cluster. The GSE messages are indirectly associated with the cluster when
| the ``member_field`` value (the ``cluster_name``) is in this list. This
| means that they are not used to derive the health status of the top-level
| cluster but as 'hints' for root cause analysis.

| group_by
| Type: Enum(member, hostname)

@@ -1004,8 +975,8 @@ Where

| policy:
| Type: unicode
| The policy to use for computing the top-level cluster status.
  See :ref:`gse_policies` for details.

.. _puppet_apply:

@@ -1015,11 +986,10 @@ Apply your configuration changes

Once you have edited and saved your changes in
``/etc/hiera/override/alarming.yaml`` and/or
``/etc/hiera/override/gse_filters.yaml``, apply the following Puppet manifest
on all the nodes of your OpenStack environment, **including the node(s) where
Nagios is installed**, for the changes to take effect::

  # puppet apply --modulepath=/etc/fuel/plugins/lma_collector-<version>/puppet/modules:\
  /etc/puppet/modules \
  /etc/fuel/plugins/lma_collector-<version>/puppet/manifests/configure_afd_filters.pp

@@ -77,6 +77,10 @@ Plugin configuration

.. _plugin_verification:

.. raw:: latex

   \pagebreak

Plugin verification
-------------------

@@ -1,108 +1,128 @@

.. _Ceph_metrics:

All Ceph metrics have a ``cluster`` field containing the name of the Ceph
cluster (*ceph* by default).

For details, see
`Cluster monitoring <http://docs.ceph.com/docs/master/rados/operations/monitoring/>`_
and `RADOS monitoring <http://docs.ceph.com/docs/master/rados/operations/monitoring-osd-pg/>`_.

Cluster
^^^^^^^

* ``ceph_health``, the health status of the entire cluster where values
  ``1``, ``2``, ``3`` represent ``OK``, ``WARNING``, and ``ERROR``,
  respectively.
* ``ceph_monitor_count``, the number of ceph-mon processes.
* ``ceph_quorum_count``, the number of ceph-mon processes participating in the
  quorum.

Pools
^^^^^

* ``ceph_pool_total_avail_bytes``, the total available size in bytes for all
  pools.
* ``ceph_pool_total_bytes``, the total number of bytes for all pools.
* ``ceph_pool_total_number``, the total number of pools.
* ``ceph_pool_total_used_bytes``, the total used size in bytes by all pools.

The following metrics have a ``pool`` field that contains the name of the
Ceph pool:

* ``ceph_pool_bytes_used``, the amount of data in bytes used by the pool.
* ``ceph_pool_max_avail``, the available size in bytes for the pool.
* ``ceph_pool_objects``, the number of objects in the pool.
* ``ceph_pool_op_per_sec``, the number of operations per second for the pool.
* ``ceph_pool_pg_num``, the number of placement groups for the pool.
* ``ceph_pool_read_bytes_sec``, the number of bytes read per second for the
  pool.
* ``ceph_pool_size``, the number of data replications for the pool.
* ``ceph_pool_write_bytes_sec``, the number of bytes written per second for
  the pool.

Placement Groups
^^^^^^^^^^^^^^^^

* ``ceph_pg_bytes_avail``, the available size in bytes.
* ``ceph_pg_bytes_total``, the cluster total size in bytes.
* ``ceph_pg_bytes_used``, the data stored size in bytes.
* ``ceph_pg_data_bytes``, the stored data size in bytes before it is
  replicated, cloned, or snapshotted.
* ``ceph_pg_state``, the number of placement groups in a given state. The
  metric contains a ``state`` field whose ``<state>`` value is a combination,
  separated by ``+``, of two or more states of this list: ``creating``,
  ``active``, ``clean``, ``down``, ``replay``, ``splitting``, ``scrubbing``,
  ``degraded``, ``inconsistent``, ``peering``, ``repair``, ``recovering``,
  ``recovery_wait``, ``backfill``, ``backfill-wait``, ``backfill_toofull``,
  ``incomplete``, ``stale``, ``remapped``.
* ``ceph_pg_total``, the total number of placement groups.

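Because the ``state`` field packs several placement group states into a single
``+``-separated string, consumers typically split it before use. A trivial
illustration (the function name is ours, not part of StackLight):

```python
def parse_pg_states(state_field):
    # 'active+clean' -> ['active', 'clean']
    return state_field.split('+')

print(parse_pg_states('active+degraded+remapped'))
```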

OSD Daemons
^^^^^^^^^^^

* ``ceph_osd_down``, the number of OSD daemons DOWN.
* ``ceph_osd_in``, the number of OSD daemons IN.
* ``ceph_osd_out``, the number of OSD daemons OUT.
* ``ceph_osd_up``, the number of OSD daemons UP.

The following metrics have an ``osd`` field that contains the OSD identifier:

* ``ceph_osd_apply_latency``, the apply latency in ms for the given OSD.
* ``ceph_osd_commit_latency``, the commit latency in ms for the given OSD.
* ``ceph_osd_total``, the total size in bytes for the given OSD.
* ``ceph_osd_used``, the data stored size in bytes for the given OSD.

OSD Performance
^^^^^^^^^^^^^^^

All the following metrics are retrieved per OSD daemon from the corresponding
``/var/run/ceph/ceph-osd.<ID>.asok`` socket by issuing the :command:`perf dump`
command.

All metrics have an ``osd`` field that contains the OSD identifier.

.. note:: These metrics are not collected when a node has both the ceph-osd
   and controller roles.

For details, see `OSD performance counters <http://ceph.com/docs/firefly/dev/perf_counters/>`_.

* ``ceph_perf_osd_op``, the number of client operations.
* ``ceph_perf_osd_op_in_bytes``, the number of bytes received from clients for
  write operations.
* ``ceph_perf_osd_op_latency``, the average latency in ms for client
  operations (including queue time).
* ``ceph_perf_osd_op_out_bytes``, the number of bytes sent to clients for read
  operations.
* ``ceph_perf_osd_op_process_latency``, the average latency in ms for client
  operations (excluding queue time).
* ``ceph_perf_osd_op_r``, the number of client read operations.
* ``ceph_perf_osd_op_r_latency``, the average latency in ms for read
  operations (including queue time).
* ``ceph_perf_osd_op_r_out_bytes``, the number of bytes sent to clients for
  read operations.
* ``ceph_perf_osd_op_r_process_latency``, the average latency in ms for read
  operations (excluding queue time).
* ``ceph_perf_osd_op_rw``, the number of client read-modify-write operations.
* ``ceph_perf_osd_op_rw_in_bytes``, the number of bytes per second received
  from clients for read-modify-write operations.
* ``ceph_perf_osd_op_rw_latency``, the average latency in ms for
  read-modify-write operations (including queue time).
* ``ceph_perf_osd_op_rw_out_bytes``, the number of bytes per second sent to
  clients for read-modify-write operations.
* ``ceph_perf_osd_op_rw_process_latency``, the average latency in ms for
  read-modify-write operations (excluding queue time).
* ``ceph_perf_osd_op_rw_rlat``, the average latency in ms for
  read-modify-write operations with readable/applied.
* ``ceph_perf_osd_op_w``, the number of client write operations.
* ``ceph_perf_osd_op_wip``, the number of replication operations currently
  being processed (primary).
* ``ceph_perf_osd_op_w_in_bytes``, the number of bytes received from clients
  for write operations.
* ``ceph_perf_osd_op_w_latency``, the average latency in ms for write
  operations (including queue time).
* ``ceph_perf_osd_op_w_process_latency``, the average latency in ms for write
  operations (excluding queue time).
* ``ceph_perf_osd_op_w_rlat``, the average latency in ms for write operations
  with readable/applied.
* ``ceph_perf_osd_recovery_ops``, the number of recovery operations in
  progress.

@ -3,24 +3,23 @@
|
|||
The cluster metrics are emitted by the GSE plugins. For details, see
:ref:`Configuring alarms <configure_alarms>`.

* ``cluster_node_status``, the status of the node cluster. The metric contains
  a ``cluster_name`` field that identifies the node cluster.

* ``cluster_service_status``, the status of the service cluster. The metric
  contains a ``cluster_name`` field that identifies the service cluster.

* ``cluster_status``, the status of the global cluster. The metric contains a
  ``cluster_name`` field that identifies the global cluster.

The supported values for these metrics are:

* ``0`` for the *Okay* status.

* ``1`` for the *Warning* status.

* ``2`` for the *Unknown* status.

* ``3`` for the *Critical* status.

* ``4`` for the *Down* status.

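Consumers of these GSE metrics, for example a dashboard or a notification
script, typically translate the numeric values back into status labels. The
following is a minimal sketch of such a translation; the helper name is
illustrative, not part of StackLight:

```python
# Hypothetical helper: map a GSE cluster status metric value to its label.
GSE_STATUS_NAMES = {
    0: "Okay",
    1: "Warning",
    2: "Unknown",
    3: "Critical",
    4: "Down",
}


def gse_status_name(value):
    # Fall back to "Unknown" for any unexpected value instead of raising.
    return GSE_STATUS_NAMES.get(int(value), "Unknown")
```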
.. _Elasticsearch:

The following metrics represent the overall health status of the cluster.
For details, see `Cluster health <https://www.elastic.co/guide/en/elasticsearch/reference/1.7/cluster-health.html>`_.

* ``elasticsearch_cluster_active_primary_shards``, the number of active primary
  shards.
* ``elasticsearch_cluster_active_shards``, the number of active shards.
* ``elasticsearch_cluster_health``, the health status of the entire cluster
  where values ``1``, ``2``, ``3`` represent ``green``, ``yellow`` and
  ``red``, respectively. The ``red`` status may also be reported when the
  Elasticsearch API returns an unexpected result, for example, a network
  failure.
* ``elasticsearch_cluster_initializing_shards``, the number of initializing
  shards.
* ``elasticsearch_cluster_number_of_nodes``, the number of nodes in the cluster.
* ``elasticsearch_cluster_number_of_pending_tasks``, the number of pending tasks.
* ``elasticsearch_cluster_relocating_shards``, the number of relocating shards.
* ``elasticsearch_cluster_unassigned_shards``, the number of unassigned shards.

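The ``elasticsearch_cluster_health`` values map back to the Elasticsearch
color codes. The following sketch, with an illustrative helper name, also
mirrors the rule that an unexpected result is reported as ``red``:

```python
# Hypothetical helper: map elasticsearch_cluster_health values to colors.
HEALTH_COLORS = {1: "green", 2: "yellow", 3: "red"}


def health_color(value):
    # Any unexpected value is treated as "red", mirroring how the collector
    # reports "red" when the Elasticsearch API returns an unexpected result.
    return HEALTH_COLORS.get(value, "red")
```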
.. _haproxy_metrics:

The ``frontend`` and ``backend`` field values can be as follows:

* cinder-api
* glance-api

Frontends
^^^^^^^^^

The following metrics have a ``frontend`` field that contains the name of the
front-end server:

* ``haproxy_frontend_bytes_in``, the number of bytes received by the frontend.
* ``haproxy_frontend_bytes_out``, the number of bytes transmitted by the frontend.
Backends
^^^^^^^^
.. _haproxy_backend_metric:

The following metrics have a ``backend`` field that contains the name of the
back-end server:

* ``haproxy_backend_bytes_in``, the number of bytes received by the back end.
* ``haproxy_backend_bytes_out``, the number of bytes transmitted by the back end.
* ``haproxy_backend_denied_requests``, the number of denied requests.
* ``haproxy_backend_denied_responses``, the number of denied responses.
* ``haproxy_backend_downtime``, the total downtime in seconds.
* ``haproxy_backend_error_connection``, the number of error connections.
* ``haproxy_backend_error_responses``, the number of error responses.
* ``haproxy_backend_queue_current``, the number of requests in queue.
* ``haproxy_backend_redistributed``, the number of times a request was
  redispatched to another server.
* ``haproxy_backend_response_1xx``, the number of HTTP responses with 1xx code.
* ``haproxy_backend_response_2xx``, the number of HTTP responses with 2xx code.
* ``haproxy_backend_response_3xx``, the number of HTTP responses with 3xx code.
* ``haproxy_backend_response_4xx``, the number of HTTP responses with 4xx code.
* ``haproxy_backend_response_5xx``, the number of HTTP responses with 5xx code.
* ``haproxy_backend_response_other``, the number of HTTP responses with other
  code.
* ``haproxy_backend_retries``, the number of times a connection to a server
  was retried.
* ``haproxy_backend_servers``, the count of servers grouped by state. This
  metric has an additional ``state`` field that contains the state of the
  back ends (either 'down' or 'up').
* ``haproxy_backend_session_current``, the number of current sessions.
* ``haproxy_backend_session_total``, the cumulative number of sessions.
* ``haproxy_backend_status``, the global back-end status where values ``0``
  and ``1`` represent, respectively, ``DOWN`` (all back ends are down) and ``UP``
  (at least one back end is up).

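The relation between the per-server states and the global
``haproxy_backend_status`` value can be sketched as follows; this illustrates
the semantics described above and is not the collector's actual code:

```python
def backend_status(server_states):
    """Return 1 (UP) if at least one back-end server is 'up', and 0 (DOWN)
    if all of them are down, mirroring haproxy_backend_status."""
    return 1 if any(state == "up" for state in server_states) else 0
```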
.. _InfluxDB:

The following metrics are extracted from the output of the :command:`show stats`
command. The values are reset to zero when InfluxDB is restarted.

cluster
^^^^^^^

The following metrics are only available if there is more than one node in the
cluster:

* ``influxdb_cluster_write_shard_points_requests``, the number of requests for
  writing time series points to a shard.
* ``influxdb_cluster_write_shard_requests``, the number of requests for writing
  to a shard.

httpd
^^^^^

* ``influxdb_httpd_failed_auths``, the number of failed authentications.
* ``influxdb_httpd_ping_requests``, the number of ping requests.
* ``influxdb_httpd_query_requests``, the number of query requests received.
* ``influxdb_httpd_query_response_bytes``, the number of bytes returned to the
  client.
* ``influxdb_httpd_requests``, the number of requests received.
* ``influxdb_httpd_write_points_ok``, the number of points successfully written.
* ``influxdb_httpd_write_request_bytes``, the number of bytes received for
  write requests.
* ``influxdb_httpd_write_requests``, the number of write requests received.

write
^^^^^

* ``influxdb_write_local_point_requests``, the number of write points requests
  from the local data node.
* ``influxdb_write_ok``, the number of successful writes of consistency level.
* ``influxdb_write_point_requests``, the number of write points requests across
  all data nodes.
* ``influxdb_write_remote_point_requests``, the number of write points requests
  to remote data nodes.
* ``influxdb_write_requests``, the number of write requests across all data
  nodes.
* ``influxdb_write_sub_ok``, the number of successful points sent to
  subscriptions.

runtime
^^^^^^^

* ``influxdb_heap_idle``, the number of bytes in idle spans.
* ``influxdb_heap_in_use``, the number of bytes in non-idle spans.
* ``influxdb_heap_objects``, the total number of allocated objects.
* ``influxdb_heap_released``, the number of bytes released to the operating
  system.
* ``influxdb_heap_system``, the number of bytes obtained from the system.
* ``influxdb_memory_alloc``, the number of bytes allocated and not yet freed.
* ``influxdb_memory_frees``, the number of free operations.
* ``influxdb_memory_lookups``, the number of pointer lookups.
* ``influxdb_memory_mallocs``, the number of malloc operations.
* ``influxdb_memory_system``, the number of bytes obtained from the system.
* ``influxdb_memory_total_alloc``, the number of bytes allocated (even if freed).

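Because these counters are reset to zero when InfluxDB is restarted, rate
calculations over them should guard against the counter going backwards. The
following is a minimal sketch of such a guard, an assumption about how a
consumer might handle resets rather than StackLight code:

```python
def counter_rate(previous, current, interval_seconds):
    """Per-second rate of a monotonically increasing counter.

    If the counter went backwards (for example, the value was reset to zero
    by an InfluxDB restart), count only the increase since the reset instead
    of producing a negative rate.
    """
    delta = current - previous
    if delta < 0:
        delta = current
    return delta / interval_seconds
```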
.. _libvirt-metrics:

Every metric contains an ``instance_id`` field, which is the UUID of the
instance for the Nova service.

CPU
^^^

Disk
^^^^

Metrics have a ``device`` field that contains the virtual disk device to which
the metric applies. For example, 'vda', 'vdb', and others.

* ``virt_disk_octets_read``, the number of octets (bytes) read per second.

Network
^^^^^^^

Metrics have an ``interface`` field that contains the interface name to which
the metric applies. For example, 'tap0dc043a6-dd', 'tap769b123a-2e', and others.

* ``virt_if_dropped_rx``, the number of dropped packets per second when
  receiving from the interface.
* ``virt_if_packets_tx``, the number of packets transmitted per second by the
  interface.

System
^^^^^^

The metrics have a ``service`` field with the name of the service it applies
to. The values can be: ``hekad``, ``collectd``, ``influxd``, ``grafana-server``,
or ``elasticsearch``.

* ``lma_components_count_processes``, the number of processes currently running.
* ``lma_components_count_threads``, the number of threads currently running.
* ``lma_components_cputime_syst``, the percentage of CPU time spent in system
  mode by the service. It can be greater than 100% when the node has more than
  one CPU.
* ``lma_components_cputime_user``, the percentage of CPU time spent in user
  mode by the service. It can be greater than 100% when the node has more than
  one CPU.
* ``lma_components_disk_bytes_read``, the number of bytes read from disk(s) per
  second.
* ``lma_components_disk_bytes_write``, the number of bytes written to disk(s)
  per second.
* ``lma_components_disk_ops_read``, the number of read operations from disk(s)
  per second.
* ``lma_components_disk_ops_write``, the number of write operations to disk(s)
  per second.
* ``lma_components_memory_code``, the physical memory devoted to executable code
  in bytes.
* ``lma_components_memory_data``, the physical memory devoted to other than
  executable code in bytes.
* ``lma_components_memory_rss``, the non-swapped physical memory used in bytes.
* ``lma_components_memory_vm``, the virtual memory size in bytes.
* ``lma_components_pagefaults_majflt``, major page faults per second.
* ``lma_components_pagefaults_minflt``, minor page faults per second.
* ``lma_components_stacksize``, the absolute value of the start address (the
  bottom) of the stack minus the address of the current stack pointer.

Heka pipeline
^^^^^^^^^^^^^

The metrics have two fields: ``name`` that contains the name of the decoder
or filter as defined by *Heka*, and ``type`` that is either *decoder* or
*filter*.

The metrics for both types are as follows:

* ``hekad_memory``, the total memory in bytes used by the Sandbox.
* ``hekad_msg_avg_duration``, the average time in nanoseconds for processing
  the message.
* ``hekad_msg_count``, the total number of messages processed by the decoder.
  This resets to ``0`` when the process is restarted.

Additional metrics for the *filter* type:

* ``hekad_timer_event_avg_duration``, the average time in nanoseconds for
  executing the *timer_event* function.
* ``hekad_timer_event_count``, the total number of executions of the
  *timer_event* function. This resets to ``0`` when the process is restarted.

Back-end checks
^^^^^^^^^^^^^^^

* ``http_check``, the API status of the back end, ``1`` if it is responsive,
  if not, then ``0``. The metric contains a ``service`` field that identifies
  the LMA back-end service being checked.

``<service>`` is one of the following values, depending on which Fuel plugins
are deployed in the environment:

* 'influxdb'

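The semantics of ``http_check`` (``1`` when the endpoint answers, ``0``
otherwise) can be sketched as follows; the URL and timeout are illustrative,
and the collector's real probe may differ:

```python
import urllib.request


def http_check(url, timeout=5):
    """Return 1 if the service answers an HTTP request, 0 otherwise."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return 1
    except Exception:
        return 0
```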
.. _memcached_metrics:

* ``memcached_command_flush``, the cumulative number of flush reqs.
* ``memcached_command_get``, the cumulative number of retrieval reqs.
* ``memcached_command_set``, the cumulative number of storage reqs.
* ``memcached_command_touch``, the cumulative number of touch reqs.
* ``memcached_connections_current``, the number of open connections.
* ``memcached_df_cache_free``, the current number of free bytes to store items.
* ``memcached_df_cache_used``, the current number of bytes used to store items.
* ``memcached_items_current``, the current number of items stored.
* ``memcached_octets_rx``, the total number of bytes read by this server from
  the network.
* ``memcached_octets_tx``, the total number of bytes sent by this server to
  the network.
* ``memcached_ops_decr_hits``, the number of successful decr reqs.
* ``memcached_ops_decr_misses``, the number of decr reqs against missing keys.
* ``memcached_ops_evictions``, the number of valid items removed from cache to
  free memory for new items.
* ``memcached_ops_hits``, the number of keys that have been requested.
* ``memcached_ops_incr_hits``, the number of successful incr reqs.
* ``memcached_ops_incr_misses``, the number of incr reqs against missing keys.
* ``memcached_ops_misses``, the number of items that have been requested and
  not found.
* ``memcached_percent_hitratio``, the percentage of get command hits (in cache).

For details, see the `Memcached documentation <https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L488>`_.

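``memcached_percent_hitratio`` is derived from the hit and miss counters
listed above; the computation is presumably the following (a sketch based on
the metric descriptions, not the collector's actual code):

```python
def percent_hitratio(hits, misses):
    """Percentage of get commands served from the cache."""
    total = hits + misses
    if total == 0:
        # No requests yet; report a 0% hit ratio rather than divide by zero.
        return 0.0
    return 100.0 * hits / total
```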
Commands
^^^^^^^^

``mysql_commands``, the number of times per second a given statement has been
executed. The metric has a ``statement`` field that contains the statement to
which it applies. The values can be as follows:

* ``change_db`` for the USE statement.
* ``commit`` for the COMMIT statement.

Handlers
^^^^^^^^

``mysql_handler``, the number of times per second a given handler has been
executed. The metric has a ``handler`` field that contains the handler
it applies to. The values can be as follows:

* ``commit`` for the internal COMMIT statements.
* ``delete`` for the internal DELETE statements.
* ``read_prev`` for the requests that read the previous row in key order.
* ``read_rnd`` for the requests that read a row based on a fixed position.
* ``read_rnd_next`` for the requests that read the next row in the data file.
* ``rollback`` for the requests that perform the rollback operation.
* ``update`` for the requests that update a row in a table.
* ``write`` for the requests that insert a row in a table.

Locks
^^^^^

* ``mysql_locks_immediate``, the number of times per second the requests for
  table locks could be granted immediately.
* ``mysql_locks_waited``, the number of times per second the requests for
  table locks had to wait.

Network
^^^^^^^

* ``mysql_octets_rx``, the number of bytes per second received by the server.
* ``mysql_octets_tx``, the number of bytes per second sent by the server.

Threads
^^^^^^^

* ``mysql_threads_cached``, the number of threads in the thread cache.
* ``mysql_threads_connected``, the number of currently open connections.
* ``mysql_threads_created``, the number of threads created per second to
  handle connections.
* ``mysql_threads_running``, the number of threads that are not sleeping.

Cluster
^^^^^^^

The following metrics are collected with the ``SHOW STATUS`` statement. For
details, see `Percona documentation <http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-status-index.html>`_.

* ``mysql_cluster_connected``, ``1`` when the node is connected to the cluster,
  if not, then ``0``.
* ``mysql_cluster_local_cert_failures``, the number of write sets that failed
  the certification test.
* ``mysql_cluster_local_commits``, the number of write sets committed on the
  node.
* ``mysql_cluster_local_recv_queue``, the number of write sets waiting to be
  applied.
* ``mysql_cluster_local_send_queue``, the number of write sets waiting to be
  sent.
* ``mysql_cluster_ready``, ``1`` when the node is ready to accept queries, if
  not, then ``0``.
* ``mysql_cluster_received``, the total number of write sets received from
  other nodes.
* ``mysql_cluster_received_bytes``, the total size in bytes of write sets
  received from other nodes.
* ``mysql_cluster_replicated``, the total number of write sets sent to other
  nodes.
* ``mysql_cluster_replicated_bytes``, the total size in bytes of write sets
  sent to other nodes.
* ``mysql_cluster_size``, the current number of nodes in the cluster.
* ``mysql_cluster_status``, ``1`` when the node is 'Primary', ``2`` if
  'Non-Primary', and ``3`` if 'Disconnected'.

Slow queries
^^^^^^^^^^^^

The following metric is collected with the statement
``SHOW STATUS where Variable_name = 'Slow_queries'``.

* ``mysql_slow_queries``, the number of queries that have taken more than X
  seconds, depending on the MySQL configuration parameter ``long_query_time``
  (10 seconds by default).

@ -4,10 +4,12 @@ Service checks
|
|||
^^^^^^^^^^^^^^
|
||||
.. _service_checks:
|
||||
|
||||
* ``openstack_check_api``, the service's API status, 1 if it is responsive, if not 0.
|
||||
The metric contains a ``service`` field that identifies the OpenStack service being checked.
|
||||
* ``openstack_check_api``, the service's API status, ``1`` if it is responsive,
|
||||
if not, then ``0``. The metric contains a ``service`` field that identifies
|
||||
the OpenStack service being checked.
|
||||
|
||||
``<service>`` is one of the following values with their respective resource checks:
|
||||
``<service>`` is one of the following values with their respective resource
|
||||
checks:
|
||||
|
||||
* 'ceilometer-api': '/v2/capabilities'
|
||||
* 'cinder-api': '/'
|
||||
|
@ -21,61 +23,75 @@ Service checks
|
|||
* 'swift-api': '/healthcheck'
|
||||
* 'swift-s3-api': '/healthcheck'
|
||||
|
||||
.. note:: All checks are performed without authentication except for Ceilometer.
|
||||
.. note:: All checks except for Ceilometer are performed without authentication.
|
||||
|
||||
Compute
|
||||
^^^^^^^
|
||||
|
||||
These metrics are emitted per compute node.
|
||||
The following metrics are emitted per compute node:

* ``openstack_nova_free_disk``, the disk space in GB available for new instances.
* ``openstack_nova_free_ram``, the memory in MB available for new instances.
* ``openstack_nova_free_vcpus``, the number of virtual CPUs available for new
  instances.
* ``openstack_nova_instance_creation_time``, the time in seconds it took to
  launch a new instance.
* ``openstack_nova_instance_state``, the number of instances which entered a
  given state (the value is always ``1``).
  The metric contains a ``state`` field.
* ``openstack_nova_running_instances``, the number of running instances.
* ``openstack_nova_running_tasks``, the number of tasks currently executed.
* ``openstack_nova_used_disk``, the disk space in GB used by the instances.
* ``openstack_nova_used_ram``, the memory in MB used by the instances.
* ``openstack_nova_used_vcpus``, the number of virtual CPUs used by the
  instances.

The following metrics are retrieved from the Nova API and represent the
aggregated values across all compute nodes:

* ``openstack_nova_total_free_disk``, the total amount of disk space in GB
  available for new instances.
* ``openstack_nova_total_free_ram``, the total amount of memory in MB available
  for new instances.
* ``openstack_nova_total_free_vcpus``, the total number of virtual CPUs
  available for new instances.
* ``openstack_nova_total_running_instances``, the total number of running
  instances.
* ``openstack_nova_total_running_tasks``, the total number of tasks currently
  executed.
* ``openstack_nova_total_used_disk``, the total amount of disk space in GB
  used by the instances.
* ``openstack_nova_total_used_ram``, the total amount of memory in MB used by
  the instances.
* ``openstack_nova_total_used_vcpus``, the total number of virtual CPUs used by
  the instances.

The following metrics are retrieved from the Nova API:

* ``openstack_nova_instances``, the total count of instances in a given state.
  The metric contains a ``state`` field which is one of 'active', 'deleted',
  'error', 'paused', 'resumed', 'rescued', 'resized', 'shelved_offloaded' or
  'suspended'.

The following metrics are retrieved from the Nova database:

.. _compute-service-state-metrics:

* ``openstack_nova_service``, the Nova service state (either ``0`` for 'up',
  ``1`` for 'down' or ``2`` for 'disabled'). The metric contains a ``service``
  field (one of 'compute', 'conductor', 'scheduler', 'cert' or 'consoleauth')
  and a ``state`` field (one of 'up', 'down' or 'disabled').

* ``openstack_nova_services``, the total count of Nova services by state. The
  metric contains a ``service`` field (one of 'compute', 'conductor',
  'scheduler', 'cert' or 'consoleauth') and a ``state`` field (one of 'up',
  'down', or 'disabled').
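
The ``0``/``1``/``2`` encoding shared by these service-state metrics can be
sketched as follows (an illustration with a hypothetical helper name, not the
collector's implementation):

```python
# State encoding documented above: 'up' -> 0, 'down' -> 1, 'disabled' -> 2.
STATE_VALUES = {'up': 0, 'down': 1, 'disabled': 2}

def service_state_sample(service, state):
    """Build the (value, fields) pair of a service-state sample."""
    return STATE_VALUES[state], {'service': service, 'state': state}
```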

Identity
^^^^^^^^

The following metrics are retrieved from the Keystone API:

* ``openstack_keystone_roles``, the total number of roles.
* ``openstack_keystone_tenants``, the number of tenants by state. The metric
  contains a ``state`` field.

Volume
^^^^^^

The following metrics are emitted per volume node:

* ``openstack_cinder_volume_creation_time``, the time in seconds it took to
  create a new volume.

.. note:: When using Ceph as the back end storage for volumes, the
   ``hostname`` value is always set to ``rbd``.

The following metrics are retrieved from the Cinder API:

* ``openstack_cinder_snapshots``, the number of snapshots by state. The metric
  contains a ``state`` field.
* ``openstack_cinder_snapshots_size``, the total size (in bytes) of snapshots
  by state. The metric contains a ``state`` field.
* ``openstack_cinder_volumes``, the number of volumes by state. The metric
  contains a ``state`` field.
* ``openstack_cinder_volumes_size``, the total size (in bytes) of volumes by
  state. The metric contains a ``state`` field.

``state`` is one of 'available', 'creating', 'attaching', 'in-use', 'deleting',
'backing-up', 'restoring-backup', 'error', 'error_deleting', 'error_restoring',
'error_extending'.

The following metrics are retrieved from the Cinder database:

.. _volume-service-state-metrics:

* ``openstack_cinder_service``, the Cinder service state (either ``0`` for
  'up', ``1`` for 'down', or ``2`` for 'disabled'). The metric contains a
  ``service`` field (one of 'volume', 'backup', 'scheduler') and a ``state``
  field (one of 'up', 'down' or 'disabled').

* ``openstack_cinder_services``, the total count of Cinder services by state.
  The metric contains a ``service`` field (one of 'volume', 'backup',
  'scheduler') and a ``state`` field (one of 'up', 'down' or 'disabled').

Image
^^^^^

The following metrics are retrieved from the Glance API:

* ``openstack_glance_images``, the number of images by state and visibility.
  The metric contains ``state`` and ``visibility`` fields.
* ``openstack_glance_images_size``, the total size (in bytes) of images by
  state and visibility. The metric contains ``state`` and ``visibility``
  fields.
* ``openstack_glance_snapshots``, the number of snapshot images by state and
  visibility. The metric contains ``state`` and ``visibility`` fields.
* ``openstack_glance_snapshots_size``, the total size (in bytes) of snapshots
  by state and visibility. The metric contains ``state`` and ``visibility``
  fields.

``state`` is one of 'queued', 'saving', 'active', 'killed', 'deleted',
'pending_delete'. ``visibility`` is either 'public' or 'private'.

Network
^^^^^^^

The following metrics are retrieved from the Neutron API:

* ``openstack_neutron_floatingips``, the total number of floating IP addresses.
* ``openstack_neutron_networks``, the number of virtual networks by state. The
  metric contains a ``state`` field.
* ``openstack_neutron_ports``, the number of virtual ports by owner and state.
  The metric contains ``owner`` and ``state`` fields.
* ``openstack_neutron_routers``, the number of virtual routers by state. The
  metric contains a ``state`` field.
* ``openstack_neutron_subnets``, the number of virtual subnets.

``<state>`` is one of 'active', 'build', 'down' or 'error'.

``<owner>`` is one of 'compute', 'dhcp', 'floatingip',
'floatingip_agent_gateway', 'router_interface', 'router_gateway',
'router_ha_interface', 'router_interface_distributed', or
'router_centralized_snat'.
|
The following metrics are retrieved from the Neutron database:

.. _network-agent-state-metrics:
.. note:: These metrics are not collected when the Contrail plugin is deployed.

* ``openstack_neutron_agent``, the Neutron agent state (either ``0`` for 'up',
  ``1`` for 'down', or ``2`` for 'disabled'). The metric contains a ``service``
  field (one of 'dhcp', 'l3', 'metadata', or 'openvswitch'), and a ``state``
  field (one of 'up', 'down' or 'disabled').

* ``openstack_neutron_agents``, the total number of Neutron agents by service
  and state. The metric contains ``service`` (one of 'dhcp', 'l3', 'metadata'
  or 'openvswitch') and ``state`` (one of 'up', 'down' or 'disabled') fields.

API response times
^^^^^^^^^^^^^^^^^^

* ``openstack_<service>_http_response_times``, HTTP response time statistics.
  The statistics are ``min``, ``max``, ``sum``, ``count``, ``upper_90``
  (90th percentile) over 10 seconds. The metric contains an ``http_method``
  field, for example, 'GET', 'POST', and others, and an ``http_status`` field,
  for example, '2xx', '4xx', and others.

``<service>`` is one of 'cinder', 'glance', 'heat', 'keystone', 'neutron' or
'nova'.

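As an illustration of how such per-window statistics can be computed (a sketch
only; the collector's exact aggregation, in particular its percentile method,
may differ):

```python
def response_time_stats(samples):
    """Aggregate one 10-second window of response times (in seconds)."""
    ordered = sorted(samples)
    return {
        'min': ordered[0],
        'max': ordered[-1],
        'sum': sum(ordered),
        'count': len(ordered),
        # upper_90: value below which 90% of the samples fall
        # (nearest-rank method, assumed here for illustration).
        'upper_90': ordered[max(0, int(0.9 * len(ordered)) - 1)],
    }
```
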
Logs
^^^^

* ``log_messages``, the number of log messages per second for the given
  service and severity level. The metric contains ``service`` and ``level``
  (one of 'debug', 'info', and others) fields.
|
Resource location
^^^^^^^^^^^^^^^^^

* ``pacemaker_resource_local_active``, ``1`` when the resource is located on
  the host reporting the metric, if not, then ``0``. The metric contains a
  ``resource`` field which is one of 'vip__public', 'vip__management',
  'vip__vrouter_pub', or 'vip__vrouter'.

Cluster
^^^^^^^

* ``rabbitmq_connections``, the total number of connections.
* ``rabbitmq_consumers``, the total number of consumers.
* ``rabbitmq_channels``, the total number of channels.
* ``rabbitmq_exchanges``, the total number of exchanges.
* ``rabbitmq_messages``, the total number of messages which are ready to be
  consumed or not yet acknowledged.
* ``rabbitmq_queues``, the total number of queues.
* ``rabbitmq_running_nodes``, the total number of running nodes in the cluster.
* ``rabbitmq_disk_free``, the free disk space.
* ``rabbitmq_disk_free_limit``, the minimum amount of free disk space for
  RabbitMQ. When ``rabbitmq_disk_free`` drops below this value, all producers
  are blocked.
* ``rabbitmq_remaining_disk``, the difference between ``rabbitmq_disk_free``
  and ``rabbitmq_disk_free_limit``.
* ``rabbitmq_used_memory``, bytes of memory used by the whole RabbitMQ process.
* ``rabbitmq_vm_memory_limit``, the maximum amount of memory allocated for
  RabbitMQ. When ``rabbitmq_used_memory`` uses more than this value, all
  producers are blocked.
* ``rabbitmq_remaining_memory``, the difference between
  ``rabbitmq_vm_memory_limit`` and ``rabbitmq_used_memory``.
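
The two ``remaining`` metrics are plain differences of the raw values; a
minimal sketch:

```python
def rabbitmq_remaining_disk(disk_free, disk_free_limit):
    """Free disk space left before producers are blocked (bytes)."""
    return disk_free - disk_free_limit

def rabbitmq_remaining_memory(vm_memory_limit, used_memory):
    """Memory left before producers are blocked (bytes)."""
    return vm_memory_limit - used_memory
```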
CPU
^^^

Metrics have a ``cpu_number`` field that contains the CPU number to which the
metric applies.

* ``cpu_idle``, the percentage of CPU time spent in the idle task.
* ``cpu_interrupt``, the percentage of CPU time spent servicing interrupts.
* ``cpu_nice``, the percentage of CPU time spent in user mode with low
  priority (nice).
* ``cpu_softirq``, the percentage of CPU time spent servicing soft interrupts.
* ``cpu_steal``, the percentage of CPU time spent in other operating systems.
* ``cpu_system``, the percentage of CPU time spent in system mode.
* ``cpu_user``, the percentage of CPU time spent in user mode.
* ``cpu_wait``, the percentage of CPU time spent waiting for I/O operations to
  complete.
Disk
^^^^

Metrics have a ``device`` field that contains the disk device name the metric
applies to. For example, 'sda', 'sdb', and others.

* ``disk_merged_read``, the number of read operations per second that could be
  merged with already queued operations.
* ``disk_merged_write``, the number of write operations per second that could
  be merged with already queued operations.
* ``disk_octets_read``, the number of octets (bytes) read per second.
* ``disk_octets_write``, the number of octets (bytes) written per second.
* ``disk_ops_read``, the number of read operations per second.
* ``disk_ops_write``, the number of write operations per second.
* ``disk_time_read``, the average time for a read operation to complete in the
  last interval.
* ``disk_time_write``, the average time for a write operation to complete in
  the last interval.

File system
^^^^^^^^^^^

Metrics have a ``fs`` field that contains the partition's mount point to which
the metric applies. For example, '/', '/var/lib', and others.

* ``fs_inodes_free``, the number of free inodes on the file system.
* ``fs_inodes_percent_free``, the percentage of free inodes on the file system.

System load
^^^^^^^^^^^

* ``load_longterm``, the system load average over the last 15 minutes.
* ``load_midterm``, the system load average over the last 5 minutes.
* ``load_shortterm``, the system load average over the last minute.
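
On Linux, these three values come from the first three fields of
``/proc/loadavg``; a sketch of the parsing (the function name is illustrative):

```python
def parse_loadavg(line):
    """Map the first three /proc/loadavg fields to the load metrics."""
    shortterm, midterm, longterm = (float(f) for f in line.split()[:3])
    return {'load_shortterm': shortterm,
            'load_midterm': midterm,
            'load_longterm': longterm}
```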
|
Memory
^^^^^^

* ``memory_buffered``, the amount of buffered memory in bytes.
* ``memory_cached``, the amount of cached memory in bytes.
* ``memory_free``, the amount of free memory in bytes.
* ``memory_used``, the amount of used memory in bytes.
Network
^^^^^^^

Metrics have an ``interface`` field that contains the interface name the
metric applies to. For example, 'eth0', 'eth1', and others.

* ``if_errors_rx``, the number of errors per second detected when receiving
  from the interface.
* ``if_errors_tx``, the number of errors per second detected when transmitting
  from the interface.
* ``if_octets_rx``, the number of octets (bytes) received per second by the
  interface.
* ``if_octets_tx``, the number of octets (bytes) transmitted per second by the
  interface.
* ``if_packets_rx``, the number of packets received per second by the
  interface.
* ``if_packets_tx``, the number of packets transmitted per second by the
  interface.
Processes
^^^^^^^^^

* ``processes_count``, the number of processes in a given state. The metric has
  a ``state`` field (one of 'blocked', 'paging', 'running', 'sleeping',
  'stopped' or 'zombies').
* ``processes_fork_rate``, the number of processes forked per second.
|
Swap
^^^^

* ``swap_cached``, the amount of cached memory (in bytes) that is in the swap.
* ``swap_free``, the amount of free memory (in bytes) that is in the swap.
* ``swap_io_in``, the number of swap pages written per second.
* ``swap_io_out``, the number of swap pages read per second.
* ``swap_used``, the amount of used memory (in bytes) that is in the swap.
|
Users
^^^^^

* ``logged_users``, the number of users currently logged in.