Merge "[docs] Edits verification, usage, and troubleshooting sections"

2016-07-21 08:13:05 +00:00 · 2016-07-21 08:13:05 +00:00 · 6e33c68cc7
parent 4373057df9 b2587570b4
commit 6e33c68cc7
5 changed files with 181 additions and 164 deletions
--- a/doc/images/nagios_homepage.png
+++ b/doc/images/nagios_homepage.png
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -3,7 +3,7 @@ Welcome to the StackLight Infrastructure Alerting plugin for Fuel documentation!
 ================================================================================

 Overview
-========
+~~~~~~~~

 .. toctree::
   :maxdepth: 1
@@ -16,7 +16,7 @@ Overview
   references

 Installing and configuring StackLight Infrastructure Alerting plugin for Fuel
-=============================================================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. toctree::
   :maxdepth: 1
@@ -27,7 +27,7 @@ Installing and configuring StackLight Infrastructure Alerting plugin for Fuel
   verification

 Using StackLight Infrastructure Alerting plugin for Fuel
-========================================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. toctree::
   :maxdepth: 1
--- a/doc/source/troubleshooting.rst
+++ b/doc/source/troubleshooting.rst
@@ -1,64 +1,75 @@
 .. _troubleshooting:

+.. raw:: latex
+
+   \pagebreak
+
 Troubleshooting
 ---------------

-If you cannot access the Nagios web UI, follow these troubleshooting tips.
+If you cannot access the Nagios web UI, use the following troubleshooting tips.

-1. Check that the StackLight Collector are able to connect to the Nagios
-   VIP address on port *80*.
+#. Verify that the StackLight Collector is able to connect to the Nagios VIP
+   address on port ``80``.

-2. Check that the Nagios configuration is valid::
+#. Verify that the Nagios configuration is valid:

-    [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
+   .. code-block:: console

-       [snip]
+      [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg

-    Total Warnings: 0
-    Total Errors:   0
+         [snip]

-  Here, things look okay. No serious problems were detected during the pre-flight check.
+      Total Warnings: 0
+      Total Errors:   0

-3. Check that the Nagios server is up and running::
+   No serious problems were detected during the pre-flight check.

-    [root@node-13 ~]# crm resource status nagios3
-    resource nagios3 is NOT running
+#. Verify that the Nagios server is up and running:

-4. If Nagios is not running, start it::
+   .. code-block:: console

-    [root@node-13 ~]# crm resource start nagios3
+      [root@node-13 ~]# crm resource status nagios3
+      resource nagios3 is NOT running

-5. Check that Apache is up and running::
+#. If Nagios is not running, start it:

-    [root@node-13 ~]# crm resource status apache2-nagios
+   .. code-block:: console

-6. If Apache is not running, start it::
+      [root@node-13 ~]# crm resource start nagios3

-    [root@node-13 ~]# crm resource start apache2-nagios
+#. Verify that Apache is up and running:

-7. Look for errors in the Nagios log file:
+   .. code-block:: console

-   * ``/var/nagios/nagios.log``.
+      [root@node-13 ~]# crm resource status apache2-nagios

-8. Look for errors in the Apache log files:
+#. If Apache is not running, start it:
+
+   .. code-block:: console
+
+      [root@node-13 ~]# crm resource start apache2-nagios
+
+#. Look for errors in the Nagios ``/var/nagios/nagios.log`` log file.
+
+#. Look for errors in the Apache log files:

   * ``/var/log/apache2/nagios_error.log``
   * ``/var/log/apache2/nagios_wsgi_access.log``
   * ``/var/log/apache2/nagios_wsgi_error.log``

-Finally, Nagios may report a host or service state as *UNKNOWN*.
-Two cases can be distinguished:
+Nagios may report a host or service state as *UNKNOWN*, for example:

-  * 'UNKNOWN: No datapoint have been received ever',
-  * 'UNKNOWN: No datapoint have been received over the last X seconds'.
+  * 'UNKNOWN: No datapoint have been received ever'
+  * 'UNKNOWN: No datapoint have been received over the last X seconds'

-Both cases indicate that Nagios doesn't receive regular passive checks from
-the StackLight Collector. This may be due to different problems:
+Both cases indicate that Nagios does not receive regular passive checks from
+the StackLight Collector. This may be due to different issues, for example:

-  * The 'hekad' process fails to communicate with Nagios,
-  * The 'collectd' and/or 'hekad' process have crashed,
-  * One or several alarm rules are misconfigured.
+  * The 'hekad' process fails to communicate with Nagios
+  * The 'collectd' and/or 'hekad' processes have crashed
+  * One or several alarm rules are misconfigured

-To remedy to the above situations, follow the `troubleshooting tips
+For solutions, see the `Troubleshooting tips
 <http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
 of the *StackLight Collector Plugin User Guide*.
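
As a rough companion to the troubleshooting steps in this file, the following shell sketch wraps two of the checks so that they can be scripted. The function names ``preflight_ok`` and ``resource_running`` are hypothetical and not part of the plugin. The ``Total Errors:`` line format is taken from the ``nagios3 -v`` output shown above; the ``is running`` wording for a started resource is an assumption about ``crm`` output, since the text only shows the ``NOT running`` case.

```shell
# Sketch only: helper names are hypothetical, not shipped with the plugin.

# Succeed only if `nagios3 -v` output on stdin contains a
# "Total Errors:" line that reports zero errors.
preflight_ok() {
  awk '/^Total Errors:/ { found = 1; errors = $3 }
       END { exit !(found && errors == 0) }'
}

# Succeed only if `crm resource status <name>` output on stdin says the
# resource is running. ASSUMPTION: a started resource is reported with the
# words "is running"; the stopped case prints "is NOT running".
resource_running() {
  grep -q 'is running'
}

# Example use on a Nagios node (left commented out so that the functions
# can be sourced without side effects):
#   nagios3 -v /etc/nagios3/nagios.cfg | preflight_ok || echo "invalid config"
#   crm resource status nagios3 | resource_running || crm resource start nagios3
```

Both helpers only read text on standard input, so they can be tried against saved command output before being used on a live node.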
--- a/doc/source/usage.rst
+++ b/doc/source/usage.rst
@@ -3,73 +3,70 @@
 Using Nagios
 ------------

-The StackLight Infrastructure Alerting Plugin configures Nagios
-to display the health status of all the nodes and services running
-in the OpenStack environment. The alarms (or service checks in Nagios
-terms) are created in **passive mode** which means that the actual
-checks are not performed by Nagios itself, but by the Collector
-and Aggregator agents of the LMA toolchain.
+The StackLight Infrastructure Alerting plugin configures Nagios to display the
+health status of all the nodes and services running in the OpenStack
+environment. The alarms, or service checks in Nagios terms, are created in
+**passive mode**, which means that the actual checks are not performed by
+Nagios itself, but by the Collector and Aggregator agents of the LMA toolchain.

-The best place to get an overview of your OpenStack environment
-is to go the **Services Dashboard**.
-If you click the *Services* link in the left panel of the
-Nagios web UI, you should see a page like this:
+**To get an overview of your OpenStack environment:**

-.. image:: ../images/nagios_services.png
-   :align: center
-   :width: 800
+#. Log in to the Fuel web UI.
+#. Click :guilabel:`Dashboard`.
+#. Click :guilabel:`Nagios`.
+#. Click the :guilabel:`Services` link in the left panel of the Nagios web UI.
+   You should see the following page:

-In this dashboard, there are two 'virtual hosts' representing
+   .. image:: ../images/nagios_services.png
+      :width: 445pt
+
+In this dashboard, there are two virtual hosts representing
 the health status of the so-called **global clusters** and
 **node clusters** entities:

-  * *00-global-clusters-env${ENVID}* is used to represent the
-    aggregated health status of global clusters like 'Nova',
-    'Keystone' or 'RabbiMQ' to name a few.
+  * *00-global-clusters-env${ENVID}* is used to represent the aggregated
+    health status of global clusters, such as 'Nova', 'Keystone', 'RabbitMQ',
+    and others.

-  * *00-node-clusters-env${ENVID}* is used to represent the
-    aggregated health status of  node clusters like
-    'Controller', 'Compute' and 'Storage'.
+  * *00-node-clusters-env${ENVID}* is used to represent the aggregated health
+    status of node clusters, such as 'Controller', 'Compute', and 'Storage'.

-Following the 'virtual hosts' sections, there is a list
-of checks received for each of the nodes provisioned in the
-environment. These checks may vary depending on the role of
-the node being monitored.
+The virtual hosts section contains a list of checks received for each of the
+nodes provisioned in the environment. These checks may vary depending on the
+role of the node being monitored.

-Alerting for the global cluster entities is enabled by default.
-Alerting for the nodes and clusters of nodes is disabled
-by default to avoid the alert fatigue since those alerts should
-not be representative of a critical condition affecting
-the overall health status of the global cluster entities.
-If you nonetheless want to enable those alerts, we can go
-to the service details page and click on the *Enable notifications
-for this service* link within the *Service Commands* panel as shown below.
+Alerting is enabled by default for the global cluster entities. For the nodes
+and clusters of nodes, alerting is disabled by default to avoid alert fatigue,
+since these alerts are not expected to indicate a critical condition
+affecting the overall health status of the global cluster entities.

-.. image:: ../images/nagios_enable_notifs.png
-   :align: center
-   :width: 800
+**To enable alerting for nodes and clusters:**

-Finally, you should pay attention to the fact that there is
-a direct dependency between the configuraton of the passive
-checks in Nagios and the `configuration of the alarms in
-the Collectors
+#. Click a particular service.
+#. Click the :guilabel:`Enable notifications for this service` link within
+   the :guilabel:`Service Commands` panel as shown below.
+
+   .. image:: ../images/nagios_enable_notifs.png
+      :width: 450pt
+
+There is a direct dependency between the configuration of the passive checks in
+Nagios and the `configuration of the alarms in the Collectors
 <http://fuel-plugin-lma-collector.readthedocs.io/en/latest/alarms.html>`_.
 A change in ``/etc/hiera/override/alarming.yaml`` or
-``/etc/hiera/override/gse_filters.yaml`` on any of the
-nodes monitored by StackLight would require to reconfigure Nagios.
-It also implies that these two files should be maintained
-rigourously identical on all the nodes of the environment
-**including those where Nagios is installed**. Fortunately,
-StackLight provides Puppet artefacts to help you out with
-that task. To reconfigure the passive checks in Nagios
-when ``/etc/hiera/override/alarming.yaml`` or
-``/etc/hiera/override/gse_filters.yaml`` are modified
-you should run the command shown bellow on all the nodes where
-Nagios is installed::
+``/etc/hiera/override/gse_filters.yaml`` on any of the nodes monitored by
+StackLight requires reconfiguring Nagios. It also implies that these two
+files must be kept strictly identical on all the nodes of the
+environment, **including those where Nagios is installed**. StackLight provides
+Puppet artifacts to help you with that task. To reconfigure the passive
+checks in Nagios when ``/etc/hiera/override/alarming.yaml`` or
+``/etc/hiera/override/gse_filters.yaml`` is modified,
+run the following command on all the nodes where Nagios is installed:

-  # puppet apply --modulepath=/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/modules:\
-  /etc/puppet/modules \
-  /etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
+.. code-block:: console
+
+   # puppet apply --modulepath=/etc/fuel/plugins/\
+   lma_infrastructure_alerting-<version>/puppet/modules:/etc/puppet/modules \
+   /etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp

 Configuring service checks using the InfluxDB metrics
 -----------------------------------------------------
@@ -80,68 +77,75 @@ metrics stored in InfluxDB's time-series.
 For example, you could define active checks to be notified
 when the CPU activity of a particular process is too high.

-Let's assume the following scenario.
+Consider the following scenario:

-  * You want to monitor the Elasticsearch server
-  * The CPU activity of the Elasticsearch server is captured
-    in a time-series stored in InfluxDB.
-  * You want to receive an alert at the 'warning' level
-    when the CPU load exceeds 30% of system activity.
-  * You want to receive an alert at the 'critical' level
-    when the CPU load exceeds 50% of system activity.
+  * You want to monitor the Elasticsearch server.
+  * The CPU activity of the Elasticsearch server is captured in a time-series
+    stored in InfluxDB.
+  * You want to receive an alert at the 'warning' level when the CPU load
+    exceeds 30% of system activity.
+  * You want to receive an alert at the 'critical' level when the CPU load
+    exceeds 50% of system activity.

-The steps to create such an alarms in Nagios would be as follow:
+The steps to create such alarms in Nagios are as follows:

-1. Connect to each of the nodes running Nagios.
+#. Connect to each of the nodes running Nagios.

-2. Install the Nagios plugin for querying InfluxDB::
+#. Install the Nagios plugin for querying InfluxDB:

-    [root@node-13 ~]# pip install influx-nagios-plugin
+   .. code-block:: console

-3. Define the command and the service check in the ``/etc/nagios3/conf.d/influxdb_services.conf`` file::
+      [root@node-13 ~]# pip install influx-nagios-plugin

-    # Replace <INFLUXDB_HOST>, <INFLUXDB_USER> and <INFLUXDB_PASSWORD> by
-    # the appropriate values for your deployment
-    define command {
-      command_line /usr/local/bin/check_influx \
-          -h <INFLUXDB_HOST> -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma \
-          -q "select max(value) from lma_components_cputime_syst \
-          where time > now() - 5m and service='$ARG1$' \
-          group by time(5m) limit 1" \
-          -w $ARG2$ -c $ARG3$
-      command_name check_cpu_metric
-    }
+#. Define the command and the service check in the
+   ``/etc/nagios3/conf.d/influxdb_services.conf`` file::

-    define service {
-      service_description Elasticsearch system CPU
-      host                node-13
-      check_command       check_cpu_metric!elasticsearch!30!50:
-      use                 generic-service
-    }
+     # Replace <INFLUXDB_HOST>, <INFLUXDB_USER> and <INFLUXDB_PASSWORD> with
+     # the appropriate values for your deployment
+     define command {
+       command_line /usr/local/bin/check_influx \
+           -h <INFLUXDB_HOST> -u <INFLUXDB_USER> -p <INFLUXDB_PASSWORD> -d lma \
+           -q "select max(value) from lma_components_cputime_syst \
+           where time > now() - 5m and service='$ARG1$' \
+           group by time(5m) limit 1" \
+           -w $ARG2$ -c $ARG3$
+       command_name check_cpu_metric
+     }

-4. Verify that the Nagios configuration is valid::
+     define service {
+       service_description Elasticsearch system CPU
+       host                node-13
+       check_command       check_cpu_metric!elasticsearch!30!50:
+       use                 generic-service
+     }

-    [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
+#. Verify that the Nagios configuration is valid:

-       [snip]
+   .. code-block:: console

-    Total Warnings: 0
-    Total Errors:   0
+      [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg

-  Here, things look okay. No serious problems were detected during the pre-flight check.
+         [snip]

-5. Restart the Nagios server::
+      Total Warnings: 0
+      Total Errors:   0

-    [root@node-13 ~]# crm resource restart nagios3
+   No serious problems were detected during the pre-flight check.

-6. Go to the Nagios Web UI to verify that the service check has been added.
+#. Restart the Nagios server:

-You can define additional service checks for different nodes or
-node groups using the same ``check_influx`` command.
-You will just need to provide these three required arguments for defining new service checks:
+   .. code-block:: console
+
+      [root@node-13 ~]# crm resource restart nagios3
+
+#. Go to the Nagios Web UI to verify that the service check has been added.
+
+You can define additional service checks for different nodes or node groups
+using the same :command:`check_influx` command. To define new service checks,
+provide the following required arguments:

  * A valid InfluxDB query that should return only one row with a single value.
-    Check the `InfluxDB documentation <https://docs.influxdata.com/influxdb/v0.10/query_language/>`_
+    See the `InfluxDB documentation <https://docs.influxdata.com/influxdb/v0.10/query_language/>`_
    to learn how to use the InfluxDB query language.
  * A range specification for the warning threshold.
  * A range specification for the critical threshold.
@@ -152,19 +156,18 @@ You will just need to provide these three required arguments for defining new se
 Using an external SMTP server with STARTTLS
 -------------------------------------------

-If your SMTP server requires STARTTLS, you need to make some
-manual adjustements to the Nagios configuration after the deployment of
-your environment.
+If your SMTP server requires STARTTLS, perform some manual adjustments to the
+Nagios configuration after the deployment of your environment.

-.. note:: Prior to enabling STARTTLS, you need to configure the *SMTP Authentication method*
+.. note:: Prior to enabling STARTTLS, configure the *SMTP Authentication method*
   parameter in the plugin's settings to use either *Plain*, *Login* or *CRAM-MD5*.

-1. Login to the *LMA Infrastructure Alerting* node.
+#. Log in to the *LMA Infrastructure Alerting* node.

-2. Edit the
+#. Edit the
   ``/etc/nagios3/conf.d/cmd_notify-service-by-smtp-with-long-service-output.cfg``
-   file to add the ``-S smtp-use-starttls`` option to the `mail` command. For
-   example::
+   file to add the ``-S smtp-use-starttls`` option to the :command:`mail`
+   command. For example::

    define command{
      command_name    notify-service-by-smtp-with-long-service-output
@@ -184,17 +187,21 @@ your environment.
        $CONTACTEMAIL$
    }

-   .. note:: If the server certificate isn't present in the standard directory (eg
-     ``/etc/ssl/certs`` on Ubuntu), you can specify its location by adding the ``-S
-     ssl-ca-file=<FILE>`` option.
+   .. note:: If the server certificate is not present in the standard
+      directory, for example, ``/etc/ssl/certs`` on Ubuntu, specify its
+      location by adding the ``-S ssl-ca-file=<FILE>`` option.

-     If you want to disable the verification of the SSL/TLS server
-     certificate altogether, you should add the ``-S ssl-verify=ignore`` option instead.
+      To disable the verification of the SSL/TLS server certificate altogether,
+      add the ``-S ssl-verify=ignore`` option instead.

-3. Verify that the Nagios configuration is correct::
+#. Verify that the Nagios configuration is correct:

-    [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg
+   .. code-block:: console

-4. Restart the Nagios service::
+      [root@node-13 ~]# nagios3 -v /etc/nagios3/nagios.cfg

-    [root@node-13 ~]# crm resource restart nagios3
+#. Restart the Nagios service:
+
+   .. code-block:: console
+
+      [root@node-13 ~]# crm resource restart nagios3
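
Because ``alarming.yaml`` and ``gse_filters.yaml`` must be identical on every node, a quick comparison before re-running Puppet may help. This is a minimal sketch, assuming the override directory of another node has already been copied to a local directory (for example with ``scp``); ``diff_overrides`` is a hypothetical helper name, not something the plugin provides.

```shell
# Sketch only: diff_overrides is a hypothetical helper.
# Print the name of each override file whose content differs between two
# directories. A missing file is also reported as different.
diff_overrides() {
  for f in alarming.yaml gse_filters.yaml; do
    cmp -s "$1/$f" "$2/$f" || echo "$f"
  done
}

# Example use (the second path is a placeholder for a copy of another
# node's /etc/hiera/override directory):
#   diff_overrides /etc/hiera/override /tmp/node-14-override
```

An empty result suggests the two nodes agree; any file name printed is a candidate for synchronization before running the ``puppet apply`` command described in this file.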
--- a/doc/source/verification.rst
+++ b/doc/source/verification.rst
@@ -1,30 +1,29 @@
 .. _verification:

+.. raw:: latex
+
+   \pagebreak
+
 Plugin verification
 -------------------

-Be aware, that depending on the number of nodes and deployment setup,
-deploying a Mirantis OpenStack environment may typically take between
-20 minutes to several hours. Once your deployment is complete,
-you should see a deployment success notification message with
-a link to the Nagios web UI as shown below.
+Depending on the number of nodes and deployment setup, deploying a Mirantis
+OpenStack environment may take from 20 minutes to several hours. Once your
+deployment is complete, you should see a deployment success notification
+message with a link to the Nagios web UI as shown below.

 .. image:: ../images/deployment_notification.png
-   :align: center
-   :width: 800
+   :width: 460pt

-Click on the *Nagios* link.
-
-Once you are authenticated,
-you should be redirected to the **Nagios Home Page** as shown below.
+Click :guilabel:`Nagios`. Once authenticated, you should be redirected to the
+Nagios home page as shown below.

 .. image:: ../images/nagios_homepage.png
-   :align: center
-   :width: 800
+   :width: 470pt

-.. note:: *username* is ``nagiosadmin`` by default, *password* is defined
+.. note:: The username is ``nagiosadmin`` by default; the password is defined
   in the settings.

-.. note:: Be aware that if Nagios is installed on the *management network*,
-   you may not have direct access to the Nagios web UI. Some extra network
-   configuration may be required to create an SSH tunnel to the *management network*.
+.. note:: If Nagios is installed on the *management network*, you may not have
+   direct access to the Nagios web UI. Extra network configuration may be
+   required to create an SSH tunnel to the *management network*.
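
One possible form of the SSH tunnel mentioned in the note, as a sketch only. It assumes a host that you can reach over SSH and that itself has access to the management network. The helper name, the local port, the ``root`` login, and the addresses in the usage comment are placeholders, not values defined by the plugin; port ``80`` is the Nagios VIP port named in the troubleshooting section.

```shell
# Sketch only: nagios_tunnel_cmd is a hypothetical helper.
# Print an ssh command that forwards a local port to port 80 of the Nagios
# VIP through a jump host. -N keeps the tunnel open without running a
# remote command; -L sets up the local port forwarding.
nagios_tunnel_cmd() {
  local_port=$1
  nagios_vip=$2
  jump_host=$3
  printf 'ssh -N -L %s:%s:80 root@%s\n' "$local_port" "$nagios_vip" "$jump_host"
}

# Example use (placeholder addresses):
#   eval "$(nagios_tunnel_cmd 8080 192.0.2.10 jump.example)"
# and then open http://localhost:8080 in a browser.
```

Printing the command rather than running it directly makes it easy to review the forwarding target before opening the tunnel.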