Merge "[docs] Edits verification, usage, and troubleshooting sections"
commit
6e33c68cc7
@@ -3,7 +3,7 @@ Welcome to the StackLight Infrastructure Alerting plugin for Fuel documentation!
|
|||
================================================================================
|
||||
|
||||
Overview
|
||||
========
|
||||
~~~~~~~~
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
@@ -16,7 +16,7 @@ Overview
|
|||
references
|
||||
|
||||
Installing and configuring StackLight Infrastructure Alerting plugin for Fuel
|
||||
=============================================================================
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
@@ -27,7 +27,7 @@ Installing and configuring StackLight Infrastructure Alerting plugin for Fuel
|
|||
verification
|
||||
|
||||
Using StackLight Infrastructure Alerting plugin for Fuel
|
||||
========================================================
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
|
|
@@ -1,64 +1,75 @@
|
|||
.. _troubleshooting:
|
||||
|
||||
.. raw:: latex
|
||||
|
||||
\pagebreak
|
||||
|
||||
Troubleshooting
|
||||
---------------
|
||||
|
||||
If you cannot access the Nagios web UI, follow these troubleshooting tips.
|
||||
If you cannot access the Nagios web UI, use the following troubleshooting tips.
|
||||
|
||||
1. Check that the StackLight Collector are able to connect to the Nagios
|
||||
VIP address on port *80*.
|
||||
#. Verify that the StackLight Collector is able to connect to the Nagios VIP
|
||||
address on port ``80``.
|
||||
|
||||
2. Check that the Nagios configuration is valid::
|
||||
#. Verify that the Nagios configuration is valid:
|
||||
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
.. code-block:: console
|
||||
|
||||
[snip]
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
|
||||
Total Warnings: 0
|
||||
Total Errors: 0
|
||||
[snip]
|
||||
|
||||
Here, things look okay. No serious problems were detected during the pre-flight check.
|
||||
Total Warnings: 0
|
||||
Total Errors: 0
|
||||
|
||||
3. Check that the Nagios server is up and running::
|
||||
No serious problems were detected during the pre-flight check.
|
||||
|
||||
[root@node-13 ~]# crm resource status nagios3
|
||||
resource nagios3 is NOT running
|
||||
#. Verify that the Nagios server is up and running:
|
||||
|
||||
4. If Nagios is not running, start it::
|
||||
.. code-block:: console
|
||||
|
||||
[root@node-13 ~]# crm resource start nagios3
|
||||
[root@node-13 ~]# crm resource status nagios3
|
||||
resource nagios3 is NOT running
|
||||
|
||||
5. Check that Apache is up and running::
|
||||
#. If Nagios is not running, start it:
|
||||
|
||||
[root@node-13 ~]# crm resource status apache2-nagios
|
||||
.. code-block:: console
|
||||
|
||||
6. If Apache is not running, start it::
|
||||
[root@node-13 ~]# crm resource start nagios3
|
||||
|
||||
[root@node-13 ~]# crm resource start apache2-nagios
|
||||
#. Verify that Apache is up and running:
|
||||
|
||||
7. Look for errors in the Nagios log file:
|
||||
.. code-block:: console
|
||||
|
||||
* ``/var/nagios/nagios.log``.
|
||||
[root@node-13 ~]# crm resource status apache2-nagios
|
||||
|
||||
8. Look for errors in the Apache log files:
|
||||
#. If Apache is not running, start it:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
[root@node-13 ~]# crm resource start apache2-nagios
|
||||
|
||||
#. Look for errors in the Nagios ``/var/nagios/nagios.log`` log file.
|
||||
|
||||
#. Look for errors in the Apache log files:
|
||||
|
||||
* ``/var/log/apache2/nagios_error.log``
|
||||
* ``/var/log/apache2/nagios_wsgi_access.log``
|
||||
* ``/var/log/apache2/nagios_wsgi_error.log``
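
The log checks in the last two steps can be partly automated. The following is a minimal sketch, assuming that problem lines contain the word "error" in any letter case. The ``count_error_lines`` function name is hypothetical and is not part of Nagios or StackLight.

```shell
# Hypothetical helper: print the number of lines in a log file that mention
# "error" (case-insensitive). Prints 0 if the file is missing or unreadable.
count_error_lines() {
    if [ -r "$1" ]; then
        # grep -c still prints 0 when nothing matches; "|| true" only hides
        # its non-zero exit status in that case.
        grep -ci 'error' "$1" || true
    else
        echo 0
    fi
}

# Usage on a node running Nagios, with the log files listed above:
#   for f in /var/nagios/nagios.log /var/log/apache2/nagios_error.log; do
#       echo "$f: $(count_error_lines "$f")"
#   done
```

A non-zero count is only a hint of where to look; read the matching lines before you draw conclusions.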
|
||||
|
||||
Finally, Nagios may report a host or service state as *UNKNOWN*.
|
||||
Two cases can be distinguished:
|
||||
Nagios may report a host or service state as *UNKNOWN*, for example:
|
||||
|
||||
* 'UNKNOWN: No datapoint have been received ever',
|
||||
* 'UNKNOWN: No datapoint have been received over the last X seconds'.
|
||||
* 'UNKNOWN: No datapoint have been received ever'
|
||||
* 'UNKNOWN: No datapoint have been received over the last X seconds'
|
||||
|
||||
Both cases indicate that Nagios doesn't receive regular passive checks from
|
||||
the StackLight Collector. This may be due to different problems:
|
||||
Both cases indicate that Nagios does not receive regular passive checks from
|
||||
the StackLight Collector. This may be due to different issues, for example:
|
||||
|
||||
* The 'hekad' process fails to communicate with Nagios,
|
||||
* The 'collectd' and/or 'hekad' process have crashed,
|
||||
* One or several alarm rules are misconfigured.
|
||||
* The 'hekad' process fails to communicate with Nagios
|
||||
* The 'collectd' and/or 'hekad' processes have crashed
|
||||
* One or several alarm rules are misconfigured
|
||||
|
||||
To remedy to the above situations, follow the `troubleshooting tips
|
||||
For solutions, see the `Troubleshooting tips
|
||||
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
|
||||
of the *StackLight Collector Plugin User Guide*.
|
|
@@ -3,73 +3,70 @@
|
|||
Using Nagios
|
||||
------------
|
||||
|
||||
The StackLight Infrastructure Alerting Plugin configures Nagios
|
||||
to display the health status of all the nodes and services running
|
||||
in the OpenStack environment. The alarms (or service checks in Nagios
|
||||
terms) are created in **passive mode** which means that the actual
|
||||
checks are not performed by Nagios itself, but by the Collector
|
||||
and Aggregator agents of the LMA toolchain.
|
||||
The StackLight Infrastructure Alerting plugin configures Nagios to display the
|
||||
health status of all the nodes and services running in the OpenStack
|
||||
environment. The alarms, or service checks in Nagios terms, are created in
|
||||
**passive mode**, which means that the actual checks are not performed by
|
||||
Nagios itself, but by the Collector and Aggregator agents of the LMA toolchain.
|
||||
|
||||
The best place to get an overview of your OpenStack environment
|
||||
is to go the **Services Dashboard**.
|
||||
If you click the *Services* link in the left panel of the
|
||||
Nagios web UI, you should see a page like this:
|
||||
**To get an overview of your OpenStack environment:**
|
||||
|
||||
.. image:: ../images/nagios_services.png
|
||||
:align: center
|
||||
:width: 800
|
||||
#. Log in to the Fuel web UI.
|
||||
#. Click :guilabel:`Dashboard`.
|
||||
#. Click :guilabel:`Nagios`.
|
||||
#. Click the :guilabel:`Services` link in the left panel of the Nagios web UI.
|
||||
You should see the following page:
|
||||
|
||||
In this dashboard, there are two 'virtual hosts' representing
|
||||
.. image:: ../images/nagios_services.png
|
||||
:width: 445pt
|
||||
|
||||
In this dashboard, there are two virtual hosts representing
|
||||
the health status of the so-called **global clusters** and
|
||||
**node clusters** entities:
|
||||
|
||||
* *00-global-clusters-env${ENVID}* is used to represent the
|
||||
aggregated health status of global clusters like 'Nova',
|
||||
'Keystone' or 'RabbiMQ' to name a few.
|
||||
* *00-global-clusters-env${ENVID}* is used to represent the aggregated
|
||||
health status of global clusters, such as 'Nova', 'Keystone', 'RabbitMQ',
|
||||
and others.
|
||||
|
||||
* *00-node-clusters-env${ENVID}* is used to represent the
|
||||
aggregated health status of node clusters like
|
||||
'Controller', 'Compute' and 'Storage'.
|
||||
* *00-node-clusters-env${ENVID}* is used to represent the aggregated health
|
||||
status of node clusters, such as 'Controller', 'Compute', and 'Storage'.
|
||||
|
||||
Following the 'virtual hosts' sections, there is a list
|
||||
of checks received for each of the nodes provisioned in the
|
||||
environment. These checks may vary depending on the role of
|
||||
the node being monitored.
|
||||
Below the virtual hosts sections, there is a list of checks received for each
of the nodes provisioned in the environment. These checks may vary depending
on the role of the node being monitored.
|
||||
|
||||
Alerting for the global cluster entities is enabled by default.
|
||||
Alerting for the nodes and clusters of nodes is disabled
|
||||
by default to avoid the alert fatigue since those alerts should
|
||||
not be representative of a critical condition affecting
|
||||
the overall health status of the global cluster entities.
|
||||
If you nonetheless want to enable those alerts, we can go
|
||||
to the service details page and click on the *Enable notifications
|
||||
for this service* link within the *Service Commands* panel as shown below.
|
||||
Alerting is enabled by default for the global cluster entities. For nodes and
clusters of nodes, alerting is disabled by default to avoid alert fatigue,
since these alerts are not expected to indicate a critical condition
affecting the overall health status of the global cluster entities.
|
||||
|
||||
.. image:: ../images/nagios_enable_notifs.png
|
||||
:align: center
|
||||
:width: 800
|
||||
**To enable alerting for nodes and clusters:**
|
||||
|
||||
Finally, you should pay attention to the fact that there is
|
||||
a direct dependency between the configuraton of the passive
|
||||
checks in Nagios and the `configuration of the alarms in
|
||||
the Collectors
|
||||
#. Click a particular service.
|
||||
#. Click the :guilabel:`Enable notifications for this service` link within
|
||||
the :guilabel:`Service Commands` panel as shown below.
|
||||
|
||||
.. image:: ../images/nagios_enable_notifs.png
|
||||
:width: 450pt
|
||||
|
||||
There is a direct dependency between the configuration of the passive checks in
|
||||
Nagios and the `configuration of the alarms in the Collectors
|
||||
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/alarms.html>`_.
|
||||
A change in ``/etc/hiera/override/alarming.yaml`` or
|
||||
``/etc/hiera/override/gse_filters.yaml`` on any of the
|
||||
nodes monitored by StackLight would require to reconfigure Nagios.
|
||||
It also implies that these two files should be maintained
|
||||
rigourously identical on all the nodes of the environment
|
||||
**including those where Nagios is installed**. Fortunately,
|
||||
StackLight provides Puppet artefacts to help you out with
|
||||
that task. To reconfigure the passive checks in Nagios
|
||||
when ``/etc/hiera/override/alarming.yaml`` or
|
||||
``/etc/hiera/override/gse_filters.yaml`` are modified
|
||||
you should run the command shown bellow on all the nodes where
|
||||
Nagios is installed::
|
||||
``/etc/hiera/override/gse_filters.yaml`` on any of the nodes monitored by
StackLight requires reconfiguring Nagios. It also implies that these two
files must be kept strictly identical on all the nodes of the
environment, **including those where Nagios is installed**. StackLight provides
Puppet artifacts to help with that task. To reconfigure the passive
checks in Nagios when ``/etc/hiera/override/alarming.yaml`` or
``/etc/hiera/override/gse_filters.yaml`` is modified,
run the following command on all the nodes where Nagios is installed:
|
||||
|
||||
# puppet apply --modulepath=/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/modules:\
|
||||
/etc/puppet/modules \
|
||||
/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
|
||||
.. code-block:: console
|
||||
|
||||
# puppet apply --modulepath=/etc/fuel/plugins/\
|
||||
lma_infrastructure_alerting-<version>/puppet/modules:/etc/puppet/modules \
|
||||
/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
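
Because the two override files must be identical on every node, it can help to compare their checksums before you run the Puppet command above. The following is a minimal sketch, assuming that you have already copied each node's file to a local path. The ``files_identical`` function name and the paths in the usage comment are hypothetical.

```shell
# Hypothetical helper: succeed only if every given file is readable and all
# of them have the same SHA-256 checksum.
files_identical() {
    [ "$#" -ge 1 ] || return 1
    for f in "$@"; do
        [ -r "$f" ] || return 1
    done
    # Count the distinct checksums across all arguments.
    distinct=$(sha256sum "$@" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Usage with copies fetched from two nodes (placeholder paths):
#   files_identical node-13/alarming.yaml node-14/alarming.yaml \
#       || echo "alarming.yaml differs between nodes"
```

The comparison is byte for byte, so files that differ only in whitespace or comments are also reported as different.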
|
||||
|
||||
Configuring service checks using the InfluxDB metrics
|
||||
-----------------------------------------------------
|
||||
|
@@ -80,68 +77,75 @@ metrics stored in InfluxDB's time-series.
|
|||
For example, you could define active checks to be notified
|
||||
when the CPU activity of a particular process is too high.
|
||||
|
||||
Let's assume the following scenario.
|
||||
Consider the following scenario:
|
||||
|
||||
* You want to monitor the Elasticsearch server
|
||||
* The CPU activity of the Elasticsearch server is captured
|
||||
in a time-series stored in InfluxDB.
|
||||
* You want to receive an alert at the 'warning' level
|
||||
when the CPU load exceeds 30% of system activity.
|
||||
* You want to receive an alert at the 'critical' level
|
||||
when the CPU load exceeds 50% of system activity.
|
||||
* You want to monitor the Elasticsearch server.
|
||||
* The CPU activity of the Elasticsearch server is captured in a time-series
|
||||
stored in InfluxDB.
|
||||
* You want to receive an alert at the 'warning' level when the CPU load
|
||||
exceeds 30% of system activity.
|
||||
* You want to receive an alert at the 'critical' level when the CPU load
|
||||
exceeds 50% of system activity.
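
The two thresholds map onto the standard Nagios plugin exit codes: 0 for OK, 1 for WARNING, and 2 for CRITICAL. The following is a minimal sketch of that decision logic only. It does not show how ``check_influx`` is implemented, and the ``classify_cpu`` function name is hypothetical.

```shell
# Hypothetical helper: classify a CPU percentage against warning and critical
# thresholds, print the state, and return the matching Nagios exit code.
classify_cpu() {
    value=$1 warn=$2 crit=$3
    # awk performs the floating-point comparison that plain sh cannot do.
    awk -v v="$value" -v w="$warn" -v c="$crit" 'BEGIN {
        if (v + 0 > c + 0) { print "CRITICAL"; exit 2 }
        if (v + 0 > w + 0) { print "WARNING";  exit 1 }
        print "OK"; exit 0
    }'
}

# Usage with the thresholds from the scenario:
#   classify_cpu 42 30 50
```

A value equal to a threshold does not trigger that level, because the scenario speaks of the load exceeding 30% or 50%.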
|
||||
|
||||
The steps to create such an alarms in Nagios would be as follow:
|
||||
The steps to create such alarms in Nagios are as follows:
|
||||
|
||||
1. Connect to each of the nodes running Nagios.
|
||||
#. Connect to each of the nodes running Nagios.
|
||||
|
||||
2. Install the Nagios plugin for querying InfluxDB::
|
||||
#. Install the Nagios plugin for querying InfluxDB:
|
||||
|
||||
[root@node-13 ~]# pip install influx-nagios-plugin
|
||||
.. code-block:: console
|
||||
|
||||
3. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file::
|
||||
[root@node-13 ~]# pip install influx-nagios-plugin
|
||||
|
||||
# Replace <INFLUXDB_HOST>, <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by
|
||||
# the appropriate values for your deployment
|
||||
define command {
|
||||
command_line /usr/local/bin/check_influx \
|
||||
-h <INFLUXDB_HOST> -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma \
|
||||
-q "select max(value) from lma_components_cputime_syst \
|
||||
where time > now() - 5m and service='$ARG1$' \
|
||||
group by time(5m) limit 1" \
|
||||
-w $ARG2$ -c $ARG3$
|
||||
command_name check_cpu_metric
|
||||
}
|
||||
#. Define the command and the service check in the
|
||||
``/etc/nagios3/conf.d/influxdb_services.conf`` file::
|
||||
|
||||
define service {
|
||||
service_description Elasticsearch system CPU
|
||||
host node-13
|
||||
check_command check_cpu_metric!elasticsearch!30!50:
|
||||
use generic-service
|
||||
}
|
||||
# Replace <INFLUXDB_HOST>, <INFLUXDB_USER> and <INFLUXDB_PASSWORD> with
|
||||
# the appropriate values for your deployment
|
||||
define command {
|
||||
command_line /usr/local/bin/check_influx \
|
||||
-h <INFLUXDB_HOST> -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma \
|
||||
-q "select max(value) from lma_components_cputime_syst \
|
||||
where time > now() - 5m and service='$ARG1$' \
|
||||
group by time(5m) limit 1" \
|
||||
-w $ARG2$ -c $ARG3$
|
||||
command_name check_cpu_metric
|
||||
}
|
||||
|
||||
4. Verify that the Nagios configuration is valid::
|
||||
define service {
|
||||
service_description Elasticsearch system CPU
|
||||
host node-13
|
||||
check_command check_cpu_metric!elasticsearch!30!50:
|
||||
use generic-service
|
||||
}
|
||||
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
#. Verify that the Nagios configuration is valid:
|
||||
|
||||
[snip]
|
||||
.. code-block:: console
|
||||
|
||||
Total Warnings: 0
|
||||
Total Errors: 0
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
|
||||
Here, things look okay. No serious problems were detected during the pre-flight check.
|
||||
[snip]
|
||||
|
||||
5. Restart the Nagios server::
|
||||
Total Warnings: 0
|
||||
Total Errors: 0
|
||||
|
||||
[root@node-13 ~]# crm resource restart nagios3
|
||||
No serious problems were detected during the pre-flight check.
|
||||
|
||||
6. Go to the Nagios Web UI to verify that the service check has been added.
|
||||
#. Restart the Nagios server:
|
||||
|
||||
You can define additional service checks for different nodes or
|
||||
node groups using the same ``check_influx`` command.
|
||||
You will just need to provide these three required arguments for defining new service checks:
|
||||
.. code-block:: console
|
||||
|
||||
[root@node-13 ~]# crm resource restart nagios3
|
||||
|
||||
#. Go to the Nagios web UI to verify that the service check has been added.
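
The validation and restart steps above can be chained so that Nagios is restarted only when the pre-flight check reports no errors. The following is a minimal sketch, assuming the ``Total Errors:`` summary line shown in the validation output. The ``preflight_ok`` function name is hypothetical.

```shell
# Hypothetical helper: read `nagios3 -v` output on stdin and succeed only if
# it contains the summary line "Total Errors: 0".
preflight_ok() {
    grep -Eq '^[[:space:]]*Total Errors:[[:space:]]+0[[:space:]]*$'
}

# Usage on a node running Nagios:
#   nagios3 -v /etc/nagios3/nagios.cfg | preflight_ok \
#       && crm resource restart nagios3
```

The sketch ignores warnings on purpose. Tighten the pattern if you also want ``Total Warnings: 0`` to be a condition for the restart.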
|
||||
|
||||
You can define additional service checks for different nodes or node groups
|
||||
using the same :command:`check_influx` command. To define new service checks,
|
||||
provide the following required arguments:
|
||||
|
||||
* A valid InfluxDB query that should return only one row with a single value.
|
||||
Check the `InfluxDB documentation <https://docs.influxdata.com/influxdb/v0.10/query_language/>`_
|
||||
See the `InfluxDB documentation <https://docs.influxdata.com/influxdb/v0.10/query_language/>`_
|
||||
to learn how to use the InfluxDB query language.
|
||||
* A range specification for the warning threshold.
|
||||
* A range specification for the critical threshold.
|
||||
|
@@ -152,19 +156,18 @@ You will just need to provide these three required arguments for defining new se
|
|||
Using an external SMTP server with STARTTLS
|
||||
-------------------------------------------
|
||||
|
||||
If your SMTP server requires STARTTLS, you need to make some
|
||||
manual adjustements to the Nagios configuration after the deployment of
|
||||
your environment.
|
||||
If your SMTP server requires STARTTLS, perform some manual adjustments to the
|
||||
Nagios configuration after the deployment of your environment.
|
||||
|
||||
.. note:: Prior to enabling STARTTLS, you need to configure the *SMTP Authentication method*
|
||||
.. note:: Prior to enabling STARTTLS, configure the *SMTP Authentication method*
|
||||
parameter in the plugin's settings to use either *Plain*, *Login* or *CRAM-MD5*.
|
||||
|
||||
1. Login to the *LMA Infrastructure Alerting* node.
|
||||
#. Log in to the *LMA Infrastructure Alerting* node.
|
||||
|
||||
2. Edit the
|
||||
#. Edit the
|
||||
``/etc/nagios3/conf.d/cmd_notify-service-by-smtp-with-long-service-output.cfg``
|
||||
file to add the ``-S smtp-use-starttls`` option to the `mail` command. For
|
||||
example::
|
||||
file to add the ``-S smtp-use-starttls`` option to the :command:`mail`
|
||||
command. For example::
|
||||
|
||||
define command{
|
||||
command_name notify-service-by-smtp-with-long-service-output
|
||||
|
@@ -184,17 +187,21 @@ your environment.
|
|||
$CONTACTEMAIL$
|
||||
}
|
||||
|
||||
.. note:: If the server certificate isn't present in the standard directory (eg
|
||||
``/etc/ssl/certs`` on Ubuntu), you can specify its location by adding the ``-S
|
||||
ssl-ca-file=<FILE>`` option.
|
||||
.. note:: If the server certificate is not present in the standard
|
||||
directory, for example, ``/etc/ssl/certs`` on Ubuntu, specify its
|
||||
location by adding the ``-S ssl-ca-file=<FILE>`` option.
|
||||
|
||||
If you want to disable the verification of the SSL/TLS server
|
||||
certificate altogether, you should add the ``-S ssl-verify=ignore`` option instead.
|
||||
To disable the verification of the SSL/TLS server certificate altogether,
|
||||
add the ``-S ssl-verify=ignore`` option instead.
|
||||
|
||||
3. Verify that the Nagios configuration is correct::
|
||||
#. Verify that the Nagios configuration is correct:
|
||||
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
.. code-block:: console
|
||||
|
||||
4. Restart the Nagios service::
|
||||
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
|
||||
|
||||
[root@node-13 ~]# crm resource restart nagios3
|
||||
#. Restart the Nagios service:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
[root@node-13 ~]# crm resource restart nagios3
|
|
@@ -1,30 +1,29 @@
|
|||
.. _verification:
|
||||
|
||||
.. raw:: latex
|
||||
|
||||
\pagebreak
|
||||
|
||||
Plugin verification
|
||||
-------------------
|
||||
|
||||
Be aware, that depending on the number of nodes and deployment setup,
|
||||
deploying a Mirantis OpenStack environment may typically take between
|
||||
20 minutes to several hours. Once your deployment is complete,
|
||||
you should see a deployment success notification message with
|
||||
a link to the Nagios web UI as shown below.
|
||||
Depending on the number of nodes and deployment setup, deploying a Mirantis
|
||||
OpenStack environment may take from 20 minutes to several hours. Once your
|
||||
deployment is complete, you should see a deployment success notification
|
||||
message with a link to the Nagios web UI as shown below.
|
||||
|
||||
.. image:: ../images/deployment_notification.png
|
||||
:align: center
|
||||
:width: 800
|
||||
:width: 460pt
|
||||
|
||||
Click on the *Nagios* link.
|
||||
|
||||
Once you are authenticated,
|
||||
you should be redirected to the **Nagios Home Page** as shown below.
|
||||
Click :guilabel:`Nagios`. Once authenticated, you should be redirected to the
|
||||
Nagios home page as shown below.
|
||||
|
||||
.. image:: ../images/nagios_homepage.png
|
||||
:align: center
|
||||
:width: 800
|
||||
:width: 470pt
|
||||
|
||||
.. note:: *username* is ``nagiosadmin`` by default, *password* is defined
|
||||
.. note:: The username is ``nagiosadmin`` by default. The password is defined
|
||||
in the settings.
|
||||
|
||||
.. note:: Be aware that if Nagios is installed on the *management network*,
|
||||
you may not have direct access to the Nagios web UI. Some extra network
|
||||
configuration may be required to create an SSH tunnel to the *management network*.
|
||||
.. note:: If Nagios is installed on the *management network*, you may not have
|
||||
direct access to the Nagios web UI. Extra network configuration may be
|
||||
required to create an SSH tunnel to the *management network*.
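
One common way to create such a tunnel is SSH local port forwarding. The following is a minimal sketch that only prints the command to run. It assumes that you have SSH access to a host that can reach the management network. The ``nagios_tunnel_cmd`` function name and the ``<NAGIOS_VIP>`` and ``<JUMP_HOST>`` placeholders are hypothetical and must be replaced with values from your deployment.

```shell
# Hypothetical helper: print an ssh command that forwards a local port to
# port 80 of the Nagios VIP through a host on the management network.
nagios_tunnel_cmd() {
    local_port=$1 nagios_vip=$2 jump_host=$3
    printf 'ssh -N -L %s:%s:80 %s\n' "$local_port" "$nagios_vip" "$jump_host"
}

# Example with placeholder values. Run the printed command, then browse to
# http://localhost:8080/ while the tunnel is open.
nagios_tunnel_cmd 8080 '<NAGIOS_VIP>' 'root@<JUMP_HOST>'
```

The ``-N`` option keeps the SSH session open without running a remote command, and ``-L`` binds the local port to the remote address.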
|