Merge "[docs] Edits verification, usage, and troubleshooting sections"

Jenkins 2016-07-21 08:13:05 +00:00 committed by Gerrit Code Review
commit 6e33c68cc7
5 changed files with 181 additions and 164 deletions

Binary file not shown (image; before: 129 KiB, after: 138 KiB).

@ -3,7 +3,7 @@ Welcome to the StackLight Infrastructure Alerting plugin for Fuel documentation!
================================================================================
Overview
========
~~~~~~~~
.. toctree::
:maxdepth: 1
@ -16,7 +16,7 @@ Overview
references
Installing and configuring StackLight Infrastructure Alerting plugin for Fuel
=============================================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. toctree::
:maxdepth: 1
@ -27,7 +27,7 @@ Installing and configuring StackLight Infrastructure Alerting plugin for Fuel
verification
Using StackLight Infrastructure Alerting plugin for Fuel
========================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. toctree::
:maxdepth: 1

@ -1,64 +1,75 @@
.. _troubleshooting:
.. raw:: latex
\pagebreak
Troubleshooting
---------------
If you cannot access the Nagios web UI, follow these troubleshooting tips.
If you cannot access the Nagios web UI, use the following troubleshooting tips.
1. Check that the StackLight Collector are able to connect to the Nagios
VIP address on port *80*.
#. Verify that the StackLight Collector is able to connect to the Nagios VIP
address on port ``80``.
2. Check that the Nagios configuration is valid::
#. Verify that the Nagios configuration is valid:
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
.. code-block:: console
[snip]
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
Total Warnings: 0
Total Errors: 0
[snip]
Here, things look okay. No serious problems were detected during the pre-flight check.
Total Warnings: 0
Total Errors: 0
3. Check that the Nagios server is up and running::
No serious problems were detected during the pre-flight check.
[root@node-13 ~]# crm resource status nagios3
resource nagios3 is NOT running
#. Verify that the Nagios server is up and running:
4. If Nagios is not running, start it::
.. code-block:: console
[root@node-13 ~]# crm resource start nagios3
[root@node-13 ~]# crm resource status nagios3
resource nagios3 is NOT running
5. Check that Apache is up and running::
#. If Nagios is not running, start it:
[root@node-13 ~]# crm resource status apache2-nagios
.. code-block:: console
6. If Apache is not running, start it::
[root@node-13 ~]# crm resource start nagios3
[root@node-13 ~]# crm resource start apache2-nagios
#. Verify that Apache is up and running:
7. Look for errors in the Nagios log file:
.. code-block:: console
* ``/var/nagios/nagios.log``.
[root@node-13 ~]# crm resource status apache2-nagios
8. Look for errors in the Apache log files:
#. If Apache is not running, start it:
.. code-block:: console
[root@node-13 ~]# crm resource start apache2-nagios
#. Look for errors in the Nagios log file, ``/var/nagios/nagios.log``.
#. Look for errors in the Apache log files:
* ``/var/log/apache2/nagios_error.log``
* ``/var/log/apache2/nagios_wsgi_access.log``
* ``/var/log/apache2/nagios_wsgi_error.log``
Finally, Nagios may report a host or service state as *UNKNOWN*.
Two cases can be distinguished:
Nagios may report a host or service state as *UNKNOWN*, for example:
* 'UNKNOWN: No datapoint have been received ever',
* 'UNKNOWN: No datapoint have been received over the last X seconds'.
* 'UNKNOWN: No datapoint have been received ever'
* 'UNKNOWN: No datapoint have been received over the last X seconds'
Both cases indicate that Nagios doesn't receive regular passive checks from
the StackLight Collector. This may be due to different problems:
Both cases indicate that Nagios does not receive regular passive checks from
the StackLight Collector. This may be due to different issues, for example:
* The 'hekad' process fails to communicate with Nagios,
* The 'collectd' and/or 'hekad' process have crashed,
* One or several alarm rules are misconfigured.
* The 'hekad' process fails to communicate with Nagios
* The 'collectd' and/or 'hekad' processes have crashed
* One or several alarm rules are misconfigured
To remedy to the above situations, follow the `troubleshooting tips
For solutions, see the `Troubleshooting tips
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
of the *StackLight Collector Plugin User Guide*.
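
The configuration check from the steps above can also be scripted. The following is a minimal sketch, not part of the plugin: the ``preflight_ok`` function name is hypothetical, and it assumes that ``nagios3 -v`` prints the ``Total Warnings`` and ``Total Errors`` summary lines shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the plugin): reads the output of
# `nagios3 -v /etc/nagios3/nagios.cfg` on stdin and succeeds only when
# both summary counters are present and equal to zero.
preflight_ok() {
    awk '
        /Total Warnings:/ { warnings = $3; seen_warnings = 1 }
        /Total Errors:/   { errors = $3;   seen_errors = 1 }
        END {
            ok = seen_warnings && seen_errors && warnings == 0 && errors == 0
            exit !ok
        }
    '
}

# Usage on a Nagios node (command taken from the steps above):
#   nagios3 -v /etc/nagios3/nagios.cfg | preflight_ok || echo "configuration has problems"
```

A non-zero exit status means that a counter is non-zero or that the summary lines were not found, so the full ``nagios3 -v`` output should then be read by hand.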

@ -3,73 +3,70 @@
Using Nagios
------------
The StackLight Infrastructure Alerting Plugin configures Nagios
to display the health status of all the nodes and services running
in the OpenStack environment. The alarms (or service checks in Nagios
terms) are created in **passive mode** which means that the actual
checks are not performed by Nagios itself, but by the Collector
and Aggregator agents of the LMA toolchain.
The StackLight Infrastructure Alerting plugin configures Nagios to display the
health status of all the nodes and services running in the OpenStack
environment. The alarms, or service checks in Nagios terms, are created in
**passive mode**, which means that the actual checks are not performed by
Nagios itself, but by the Collector and Aggregator agents of the LMA toolchain.
The best place to get an overview of your OpenStack environment
is to go the **Services Dashboard**.
If you click the *Services* link in the left panel of the
Nagios web UI, you should see a page like this:
**To get an overview of your OpenStack environment:**
.. image:: ../images/nagios_services.png
:align: center
:width: 800
#. Log in to the Fuel web UI.
#. Click :guilabel:`Dashboard`.
#. Click :guilabel:`Nagios`.
#. Click the :guilabel:`Services` link in the left panel of the Nagios web UI.
You should see the following page:
In this dashboard, there are two 'virtual hosts' representing
.. image:: ../images/nagios_services.png
:width: 445pt
In this dashboard, there are two virtual hosts representing
the health status of the so-called **global clusters** and
**node clusters** entities:
* *00-global-clusters-env${ENVID}* is used to represent the
aggregated health status of global clusters like 'Nova',
'Keystone' or 'RabbiMQ' to name a few.
* *00-global-clusters-env${ENVID}* is used to represent the aggregated
health status of global clusters, such as 'Nova', 'Keystone', 'RabbitMQ',
and others.
* *00-node-clusters-env${ENVID}* is used to represent the
aggregated health status of node clusters like
'Controller', 'Compute' and 'Storage'.
* *00-node-clusters-env${ENVID}* is used to represent the aggregated health
status of node clusters, such as 'Controller', 'Compute', and 'Storage'.
Following the 'virtual hosts' sections, there is a list
of checks received for each of the nodes provisioned in the
environment. These checks may vary depending on the role of
the node being monitored.
The virtual hosts section contains a list of checks received for each of the
nodes provisioned in the environment. These checks may vary depending on the
role of the node being monitored.
Alerting for the global cluster entities is enabled by default.
Alerting for the nodes and clusters of nodes is disabled
by default to avoid the alert fatigue since those alerts should
not be representative of a critical condition affecting
the overall health status of the global cluster entities.
If you nonetheless want to enable those alerts, we can go
to the service details page and click on the *Enable notifications
for this service* link within the *Service Commands* panel as shown below.
Alerting is enabled by default for the global cluster entities. For the nodes
and clusters of nodes, alerting is disabled by default to avoid alert
fatigue, since these alerts should not be representative of a critical
condition affecting the overall health status of the global cluster entities.
.. image:: ../images/nagios_enable_notifs.png
:align: center
:width: 800
**To enable alerting for nodes and clusters:**
Finally, you should pay attention to the fact that there is
a direct dependency between the configuraton of the passive
checks in Nagios and the `configuration of the alarms in
the Collectors
#. Click a particular service.
#. Click the :guilabel:`Enable notifications for this service` link within
the :guilabel:`Service Commands` panel as shown below.
.. image:: ../images/nagios_enable_notifs.png
:width: 450pt
There is a direct dependency between the configuration of the passive checks in
Nagios and the `configuration of the alarms in the Collectors
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/alarms.html>`_.
A change in ``/etc/hiera/override/alarming.yaml`` or
``/etc/hiera/override/gse_filters.yaml`` on any of the
nodes monitored by StackLight would require to reconfigure Nagios.
It also implies that these two files should be maintained
rigourously identical on all the nodes of the environment
**including those where Nagios is installed**. Fortunately,
StackLight provides Puppet artefacts to help you out with
that task. To reconfigure the passive checks in Nagios
when ``/etc/hiera/override/alarming.yaml`` or
``/etc/hiera/override/gse_filters.yaml`` are modified
you should run the command shown bellow on all the nodes where
Nagios is installed::
``/etc/hiera/override/gse_filters.yaml`` on any of the nodes monitored by
StackLight requires reconfiguring Nagios. It also implies that these two
files must be kept strictly identical on all the nodes of the
environment, **including those where Nagios is installed**. StackLight provides
Puppet artifacts to help you with that task. To reconfigure the passive
checks in Nagios when ``/etc/hiera/override/alarming.yaml`` or
``/etc/hiera/override/gse_filters.yaml`` is modified,
run the following command on all the nodes where Nagios is installed:
# puppet apply --modulepath=/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/modules:\
/etc/puppet/modules \
/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
.. code-block:: console
# puppet apply --modulepath=/etc/fuel/plugins/\
lma_infrastructure_alerting-<version>/puppet/modules:/etc/puppet/modules \
/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
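
Because the two override files must match on every node, it can help to compare their checksums before running Puppet. The following is a minimal sketch under stated assumptions: the ``all_same`` function is hypothetical, ``sha256sum`` is assumed to be available on the nodes, and the host names in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: reads "CHECKSUM  FILENAME" lines (the format printed
# by sha256sum) on stdin and succeeds only if every checksum is the same.
all_same() {
    count=$(awk '{ print $1 }' | sort -u | wc -l | tr -d ' ')
    [ "$count" -eq 1 ]
}

# Usage sketch; node-13 and node-14 are placeholder host names:
#   for node in node-13 node-14; do
#       ssh "$node" sha256sum /etc/hiera/override/alarming.yaml
#   done | all_same || echo "alarming.yaml differs between nodes"
```

The same loop can be repeated for ``/etc/hiera/override/gse_filters.yaml``. Empty input is treated as a failure, so an unreachable node does not pass silently when it is the only one queried.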
Configuring service checks using the InfluxDB metrics
-----------------------------------------------------
@ -80,68 +77,75 @@ metrics stored in InfluxDB's time-series.
For example, you could define active checks to be notified
when the CPU activity of a particular process is too high.
Let's assume the following scenario.
Consider the following scenario:
* You want to monitor the Elasticsearch server
* The CPU activity of the Elasticsearch server is captured
in a time-series stored in InfluxDB.
* You want to receive an alert at the 'warning' level
when the CPU load exceeds 30% of system activity.
* You want to receive an alert at the 'critical' level
when the CPU load exceeds 50% of system activity.
* You want to monitor the Elasticsearch server.
* The CPU activity of the Elasticsearch server is captured in a time-series
stored in InfluxDB.
* You want to receive an alert at the 'warning' level when the CPU load
exceeds 30% of system activity.
* You want to receive an alert at the 'critical' level when the CPU load
exceeds 50% of system activity.
The steps to create such an alarms in Nagios would be as follow:
The steps to create such alarms in Nagios are as follows:
1. Connect to each of the nodes running Nagios.
#. Connect to each of the nodes running Nagios.
2. Install the Nagios plugin for querying InfluxDB::
#. Install the Nagios plugin for querying InfluxDB:
[root@node-13 ~]# pip install influx-nagios-plugin
.. code-block:: console
3. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file::
[root@node-13 ~]# pip install influx-nagios-plugin
# Replace <INFLUXDB_HOST>, <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by
# the appropriate values for your deployment
define command {
command_line /usr/local/bin/check_influx \
-h <INFLUXDB_HOST> -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma \
-q "select max(value) from lma_components_cputime_syst \
where time > now() - 5m and service='$ARG1$' \
group by time(5m) limit 1" \
-w $ARG2$ -c $ARG3$
command_name check_cpu_metric
}
#. Define the command and the service check in the
``/etc/nagios3/conf.d/influxdb_services.conf`` file::
define service {
service_description Elasticsearch system CPU
host node-13
check_command check_cpu_metric!elasticsearch!30!50:
use generic-service
}
# Replace <INFLUXDB_HOST>, <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by
# the appropriate values for your deployment
define command {
command_line /usr/local/bin/check_influx \
-h <INFLUXDB_HOST> -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma \
-q "select max(value) from lma_components_cputime_syst \
where time > now() - 5m and service='$ARG1$' \
group by time(5m) limit 1" \
-w $ARG2$ -c $ARG3$
command_name check_cpu_metric
}
4. Verify that the Nagios configuration is valid::
define service {
service_description Elasticsearch system CPU
host node-13
check_command check_cpu_metric!elasticsearch!30!50:
use generic-service
}
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
#. Verify that the Nagios configuration is valid:
[snip]
.. code-block:: console
Total Warnings: 0
Total Errors: 0
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
Here, things look okay. No serious problems were detected during the pre-flight check.
[snip]
5. Restart the Nagios server::
Total Warnings: 0
Total Errors: 0
[root@node-13 ~]# crm resource restart nagios3
No serious problems were detected during the pre-flight check.
6. Go to the Nagios Web UI to verify that the service check has been added.
#. Restart the Nagios server:
You can define additional service checks for different nodes or
node groups using the same ``check_influx`` command.
You will just need to provide these three required arguments for defining new service checks:
.. code-block:: console
[root@node-13 ~]# crm resource restart nagios3
#. Go to the Nagios Web UI to verify that the service check has been added.
You can define additional service checks for different nodes or node groups
using the same :command:`check_influx` command. To define new service checks,
provide the following required arguments:
* A valid InfluxDB query that should return only one row with a single value.
Check the `InfluxDB documentation <https://docs.influxdata.com/influxdb/v0.10/query_language/>`_
See the `InfluxDB documentation <https://docs.influxdata.com/influxdb/v0.10/query_language/>`_
to learn how to use the InfluxDB query language.
* A range specification for the warning threshold.
* A range specification for the critical threshold.
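
The source does not spell out the range syntax. The following sketch assumes that ``check_influx`` follows the threshold range format of the standard Nagios plugin development guidelines, which should be confirmed against the plugin's own documentation. The ``range_alerts`` function is hypothetical and only illustrates how such a range is evaluated.

```shell
#!/bin/sh
# Hypothetical illustration of the standard Nagios plugin range format
# (an assumption here; check_influx may differ):
#   10      alert if value < 0 or value > 10
#   10:     alert if value < 10
#   ~:10    alert if value > 10
#   10:20   alert if value is outside 10..20
#   @10:20  alert if value is inside 10..20 (inclusive)
# Usage: range_alerts VALUE RANGE   (exit status 0 means "alert")
range_alerts() {
    awk -v v="$1" -v r="$2" 'BEGIN {
        inside = 0
        if (substr(r, 1, 1) == "@") { inside = 1; r = substr(r, 2) }
        n = index(r, ":")
        if (n == 0) {
            lo = 0; hi = r; has_hi = 1      # a bare number means 0..N
        } else {
            lo = substr(r, 1, n - 1)
            hi = substr(r, n + 1)
            has_hi = (hi != "")             # "N:" has no upper bound
        }
        has_lo = (lo != "~")                # "~" means no lower bound
        in_range = (!has_lo || v + 0 >= lo + 0) && (!has_hi || v + 0 <= hi + 0)
        alert = inside ? in_range : !in_range
        exit !alert
    }'
}

# Example: with a warning range of 30, a value of 35 raises an alert.
#   range_alerts 35 30 && echo "warning"
```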
@ -152,19 +156,18 @@ You will just need to provide these three required arguments for defining new se
Using an external SMTP server with STARTTLS
-------------------------------------------
If your SMTP server requires STARTTLS, you need to make some
manual adjustements to the Nagios configuration after the deployment of
your environment.
If your SMTP server requires STARTTLS, perform some manual adjustments to the
Nagios configuration after the deployment of your environment.
.. note:: Prior to enabling STARTTLS, you need to configure the *SMTP Authentication method*
.. note:: Prior to enabling STARTTLS, configure the *SMTP Authentication method*
parameter in the plugin's settings to use either *Plain*, *Login* or *CRAM-MD5*.
1. Login to the *LMA Infrastructure Alerting* node.
#. Log in to the *LMA Infrastructure Alerting* node.
2. Edit the
#. Edit the
``/etc/nagios3/conf.d/cmd_notify-service-by-smtp-with-long-service-output.cfg``
file to add the ``-S smtp-use-starttls`` option to the `mail` command. For
example::
file to add the ``-S smtp-use-starttls`` option to the :command:`mail`
command. For example::
define command{
command_name notify-service-by-smtp-with-long-service-output
@ -184,17 +187,21 @@ your environment.
$CONTACTEMAIL$
}
.. note:: If the server certificate isn't present in the standard directory (eg
``/etc/ssl/certs`` on Ubuntu), you can specify its location by adding the ``-S
ssl-ca-file=<FILE>`` option.
.. note:: If the server certificate is not present in the standard
directory, for example, ``/etc/ssl/certs`` on Ubuntu, specify its
location by adding the ``-S ssl-ca-file=<FILE>`` option.
If you want to disable the verification of the SSL/TLS server
certificate altogether, you should add the ``-S ssl-verify=ignore`` option instead.
To disable the verification of the SSL/TLS server certificate altogether,
add the ``-S ssl-verify=ignore`` option instead.
3. Verify that the Nagios configuration is correct::
#. Verify that the Nagios configuration is correct:
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
.. code-block:: console
4. Restart the Nagios service::
[root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
[root@node-13 ~]# crm resource restart nagios3
#. Restart the Nagios service:
.. code-block:: console
[root@node-13 ~]# crm resource restart nagios3

@ -1,30 +1,29 @@
.. _verification:
.. raw:: latex
\pagebreak
Plugin verification
-------------------
Be aware, that depending on the number of nodes and deployment setup,
deploying a Mirantis OpenStack environment may typically take between
20 minutes to several hours. Once your deployment is complete,
you should see a deployment success notification message with
a link to the Nagios web UI as shown below.
Depending on the number of nodes and deployment setup, deploying a Mirantis
OpenStack environment may take from 20 minutes to several hours. Once your
deployment is complete, you should see a deployment success notification
message with a link to the Nagios web UI as shown below.
.. image:: ../images/deployment_notification.png
:align: center
:width: 800
:width: 460pt
Click on the *Nagios* link.
Once you are authenticated,
you should be redirected to the **Nagios Home Page** as shown below.
Click :guilabel:`Nagios`. Once authenticated, you should be redirected to the
Nagios home page as shown below.
.. image:: ../images/nagios_homepage.png
:align: center
:width: 800
:width: 470pt
.. note:: *username* is ``nagiosadmin`` by default, *password* is defined
.. note:: The username is ``nagiosadmin`` by default; the password is defined
in the settings.
.. note:: Be aware that if Nagios is installed on the *management network*,
you may not have direct access to the Nagios web UI. Some extra network
configuration may be required to create an SSH tunnel to the *management network*.
.. note:: If Nagios is installed on the *management network*, you may not have
direct access to the Nagios web UI. Extra network configuration may be
required to create an SSH tunnel to the *management network*.