Updated the user section

Removed trailing spaces
Fixed build failure issue
Update grafana_nova_annot.png
Fixed various typos
Beefed up the troubleshooting section
Fixed remarks from patchset 4
Fixed remarks from patchset 5

Change-Id: Ic78f312990d76bb5461ccfa071f18a8682683818
This commit is contained in:
Patrick Petit
2016-02-12 11:29:58 +01:00
parent df05c0fd98
commit e29d26bfb0
10 changed files with 228 additions and 192 deletions

Binary file not shown. (Before: 123 KiB | After: 122 KiB)

Binary file not shown. (Before: 388 KiB | After: 98 KiB)

Binary file not shown. (new file | After: 218 KiB)

BIN doc/images/grafana_link.png (new file | After: 54 KiB)

Binary file not shown. (Before: 20 KiB | After: 44 KiB)

Binary file not shown. (Before: 177 KiB | After: 156 KiB)

Binary file not shown. (Before: 295 KiB | After: 434 KiB)

Binary file not shown. (Before: 457 KiB | After: 69 KiB)

Binary file not shown. (Before: 97 KiB | After: 142 KiB)

View File

@@ -8,228 +8,233 @@ User Guide
Plugin configuration
--------------------
To configure the plugin, you need to follow these steps:
#. `Create a new environment <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#launch-wizard-to-create-new-environment>`_
from the Fuel web user interface.
#. Click the **Settings** tab and select the **Other** category.
#. Scroll down through the settings until you find the **InfluxDB-Grafana Server
Plugin** section. You should see a page like the one below.
|
.. image:: ../images/influx_grafana_settings.png
:width: 800
:align: center
|
#. Select the **InfluxDB-Grafana Plugin** checkbox and fill in the required fields.
a. Specify the number of days of retention for your data.
b. Specify the InfluxDB admin password (called root password in the InfluxDB documentation).
c. Specify the database name (default is lma).
d. Specify the InfluxDB user name and password.
e. Specify the Grafana user name and password.
#. With the introduction of Grafana 2.6.0, the plugin now uses a MySQL database
to store its configuration, such as the dashboard templates.
a. Select **Local MySQL** if you want to create the Grafana database using the MySQL server
of the OpenStack control-plane. Otherwise, select **Remote server** and specify
the fully qualified name or IP address of the MySQL server you want to use.
b. Then, specify the MySQL database name, username and password that will be used
to access that database.
#. Scroll down to the bottom of the page and click the **Save Settings** button when
you are done with the settings.
#. Assign the *InfluxDB_Grafana* role to either one node (no HA) or three nodes if
you want to run the InfluxDB and Grafana servers in an HA cluster.
Note that installing the InfluxDB and Grafana servers on more than three nodes is currently
not possible. Similarly, installing the InfluxDB and Grafana servers on two nodes
is not recommended, to avoid split-brain situations in the Raft consensus of
the InfluxDB cluster as well as in the *Pacemaker* cluster which is responsible for
the VIP address failover.
Note also that it is possible to add or remove a node
with the *InfluxDB_Grafana* role after deployment (see the CLI sketch after this list).
.. image:: ../images/influx_grafana_role.png
:width: 800
:align: center
|
.. note:: You can see in the example above that the *InfluxDB_Grafana* role is assigned to
three different nodes along with the *Infrastructure_Alerting* role and the *Elasticsearch_Kibana*
role. This means that the three plugins of the LMA toolchain can be installed on the same nodes.
#. Click on **Apply Changes**.
#. Adjust the disk configuration for your plugin if necessary (see the `Fuel User Guide
<http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#disk-partitioning>`_
for details). By default, the InfluxDB-Grafana Plugin allocates:
- 20% of the first available disk for the operating system, honoring a range of 15 GB minimum to 50 GB maximum.
- 10 GB for */var/log*.
- At least 30 GB for the InfluxDB database in */var/lib/influxdb*.
#. `Configure your environment <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#configure-your-environment>`_
as needed.
#. `Verify the networks <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#verify-networks>`_.
#. Finally, `deploy <http://docs.mirantis.com/openstack/fuel/fuel-8.0/user-guide.html#deploy-changes>`_ your changes.
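The role assignment described above can also be performed from the command line, which is handy
if you later add a node with the *InfluxDB_Grafana* role to an already deployed environment.
This is only a minimal sketch using the Fuel CLI; the environment and node IDs are placeholders
to adapt to your own setup::
[root@fuel ~]# fuel --env <environment id> node set --node-id <node_id> --role=influxdb_grafana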
.. _plugin_install_verification:
Plugin verification
-------------------
Be aware that, depending on the number of nodes and deployment setup,
deploying a Mirantis OpenStack environment can typically take anything
from 30 minutes to several hours. Once your deployment is complete,
you should see a notification message like the one in the figure below.
|
.. image:: ../images/deployment_notification.png
:width: 800
:align: center
|
Verifying InfluxDB
~~~~~~~~~~~~~~~~~~
You should verify that the InfluxDB cluster is running properly.
To do that, you first need to retrieve the InfluxDB cluster VIP address.
Here is how to proceed.
#. On the Fuel Master node, find the IP address of a node where the InfluxDB
server is installed using the following command::
[root@fuel ~]# fuel nodes
id | status | name             | cluster | ip         | mac               | roles                 |
---|--------|------------------|---------|------------|-------------------|-----------------------|
1  | ready  | Untitled (fa:87) | 1       | 10.109.0.8 | 64:18:ef:86:fa:87 | influxdb_grafana, ... |
2  | ready  | Untitled (12:aa) | 1       | 10.109.0.3 | 64:5f:c6:88:12:aa | influxdb_grafana, ... |
3  | ready  | Untitled (4e:6e) | 1       | 10.109.0.7 | 64:ca:bf:a4:4e:6e | influxdb_grafana, ... |
#. Then `ssh` to any one of these nodes (e.g. *node-1*) and type the command::
root@node-1:~# hiera lma::influxdb::vip
10.109.1.4
This tells you that the VIP address of your InfluxDB cluster is *10.109.1.4*.
#. With that VIP address, type the command::
root@node-1:~# /usr/bin/influx -database lma -password lmapass \
--username root -host 10.109.1.4 -port 8086
Visit https://enterprise.influxdata.com to register for updates,
InfluxDB server management, and monitoring.
Connected to http://10.109.1.4:8086 version 0.10.0
InfluxDB shell 0.10.0
>
As you can see, executing */usr/bin/influx* will start an interactive CLI and automatically connect to
the InfluxDB server. Then if you type::
> show series
You should see a dump of all the time-series collected so far.
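To go one step further, you can check that the *lma* database exists and that fresh data points
are being written. This is a minimal sketch; *cpu_idle* is used here only as an example measurement
name, so pick any name returned by ``show series`` if it is not present in your deployment::
> show databases
> select * from cpu_idle where time > now() - 5m limit 5
If the second query returns recent rows, the collectors are writing data to the cluster.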
Then, if you type::
> show servers
name: data_nodes
----------------
id http_addr tcp_addr
1 node-1:8086 node-1:8088
3 node-2:8086 node-2:8088
5 node-3:8086 node-3:8088
name: meta_nodes
----------------
id http_addr tcp_addr
1 node-1:8091 node-1:8088
2 node-2:8091 node-2:8088
4 node-3:8091 node-3:8088
You should see a list of the nodes participating in the `InfluxDB cluster
<https://docs.influxdata.com/influxdb/v0.10/guides/clustering/>`_ with their roles (data or meta).
Verifying Grafana
~~~~~~~~~~~~~~~~~
From the Fuel web UI **Dashboard** view, click on the **Grafana** link as shown in the figure below.
|
.. image:: ../images/grafana_link.png
:width: 800
:align: center
|
The first time you access Grafana, you are requested to
authenticate using the credentials you have defined in the plugin's settings.
.. image:: ../images/grafana_login.png
:width: 800
:align: center
|
Once you have authenticated, you should be automatically
redirected to the Grafana **Home Page** from where you can select a dashboard as
shown below.
|
.. image:: ../images/grafana_home.png
:align: center
:width: 800
|
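If the **Grafana** link does not respond, you can probe the Grafana web service directly from one
of the nodes. This is only a sketch: it assumes that Grafana listens on port *8000* behind the
cluster VIP (as in previous releases of the plugin), with ``<VIP>`` being the address retrieved in
the InfluxDB verification section above::
root@node-1:~# curl -I http://<VIP>:8000/login
An ``HTTP/1.1 200 OK`` response indicates that the Grafana server is up; otherwise, refer to the
Troubleshooting section at the end of this guide.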
Exploring your time-series with Grafana
---------------------------------------
The InfluxDB-Grafana Plugin comes with a collection of predefined
dashboards you can use to visualize the time-series stored in InfluxDB.
Please check the LMA Collector documentation for a complete list of all the
`metrics time-series <http://fuel-plugin-lma-collector.readthedocs.org/en/latest/dev/metrics.html#list-of-metrics>`_
that are collected and stored in InfluxDB.
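If you want to relate what the dashboards display to the raw series stored in InfluxDB, you can
list them from the ``influx`` session opened in the verification section above. A small sketch;
*cpu_idle* is only an example measurement name::
> show measurements
> show tag keys from cpu_idle
The tag keys (for example *hostname* or *deployment_id*) are the dimensions you can filter on in
the dashboard drop-down lists described below.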
The Main Dashboard
~~~~~~~~~~~~~~~~~~
We suggest you start with the **Main Dashboard**, as shown
below, as an entry point to the other dashboards.
The **Main Dashboard** provides a single pane of glass from where you can visualize the
overall health state of your OpenStack services, such as Nova and Cinder,
but also HAProxy, MySQL and RabbitMQ, to name a few.
.. image:: ../images/grafana_main.png
:align: center
:width: 800
|
As you can see, the **Main Dashboard** (like most dashboards) provides
a drop-down menu list in the upper left corner of the window
from where you can pick a particular metric dimension such as
the *controller name* or the *device name* you want to select.
In the example above, the system metrics of *node-48* are
being displayed in the dashboard.
Within the **OpenStack Services** row, each of the services
represented can be assigned five different states.
.. note:: The precise determination of a service health state depends
on the correlation policies implemented for that service by a `Global Status Evaluation (GSE)
plugin <http://fuel-plugin-lma-collector.readthedocs.org/en/latest/user/alarms.html#cluster-policies>`_.
The meaning associated with a service health state is the following:
- **Down**: One or several primary functions of a
cluster are failed. For example,
all API endpoints of a service cluster like Nova
or Cinder are failed.
- **Critical**: One or several primary functions of a
service cluster are severely degraded. The quality
of service delivered to the end-user should be severely
impacted.
- **Warning**: One or several primary functions of a
service cluster are slightly degraded. The quality
of service delivered to the end-user should be slightly
impacted.
- **Unknown**: There is not enough data to infer the actual
health state of a service cluster.
- **Okay**: None of the above was found to be true.
The **Virtual Compute Resources** row provides an overview of
the amount of virtual resources being used by the compute nodes
including the number of virtual CPUs, the amount of memory
and disk space being used as well as the amount of virtual
@@ -244,78 +249,94 @@ The "Ceph" row provides an overview of the resources usage
and current health state of the Ceph cluster when it is deployed
in the OpenStack environment.
The **Main Dashboard** is also an entry point to access more detailed
dashboards for each of the OpenStack services that are monitored.
For example, if you click through the *Nova box*, the **Nova
Dashboard** should be displayed.
|
.. image:: ../images/grafana_nova.png
:align: center
:width: 800
|
The Nova Dashboard
~~~~~~~~~~~~~~~~~~
The **Nova Dashboard** provides a detailed view of the
Nova service's related metrics.
The **Service Status** row provides information about the Nova service
cluster health state as a whole including the state of the API frontend
(the HAProxy public VIP), a counter of HTTP 5xx errors,
the HTTP requests response time and status code.
The **Nova API** row provides information about the current health state of
the API backends (nova-api, ec2-api, ...).
The **Nova Services** row provides information about the current and
historical state of the Nova *workers*.
The **Instances** row provides information about the number of active
instances, the instances in error, and instance creation time statistics.
The **Resources** row provides various virtual resource usage indicators.
Self-Monitoring Dashboards
~~~~~~~~~~~~~~~~~~~~~~~~~~
The first **Self-Monitoring Dashboard** was introduced in LMA 0.8.
The intent of the self-monitoring dashboards is to bring operational
insights about how the monitoring system itself (the toolchain) performs overall.
The **Self-Monitoring Dashboard** provides information about the *hekad*
and *collectd* processes.
In particular, it gives information about the amount of system resources
consumed by these processes, the time allocated to the Lua plugins
running within *hekad*, the amount of messages being processed and
the time it takes to process those messages.
Again, it is possible to select a particular node view using the drop-down
menu list.
With LMA 0.9, we have introduced two new dashboards.
#. The **Elasticsearch Cluster Dashboard** provides information about
the overall health state of the Elasticsearch cluster including
the state of the shards, the number of pending tasks and various resource
usage metrics.
#. The **InfluxDB Cluster Dashboard** provides statistics about the InfluxDB
processes running in the InfluxDB cluster, including various resource usage metrics.
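If you ever need to cross-check what the self-monitoring dashboards report, you can look at the
monitored processes directly on a node. A minimal sketch using standard tools; the process names
are the ones referred to above::
root@node-1:~# pgrep -l hekad
root@node-1:~# pgrep -l collectd
The resource usage figures shown in the dashboards should be consistent with what ``top`` or
``ps`` report for these processes.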
The Hypervisor Dashboard
~~~~~~~~~~~~~~~~~~~~~~~~
LMA 0.9 introduces a new **Hypervisor Dashboard** which brings operational
insights about the virtual instances managed through *libvirt*.
As shown in the figure below, the **Hypervisor Dashboard** assembles a
view of various *libvirt* metrics. A drop-down menu list allows you to pick
a particular instance UUID running on a particular node. In the
example below, the metrics for the instance id *ba844a75-b9db-4c2f-9cb9-0b083fe03fb7*
running on *node-4* are displayed.
.. image:: ../images/grafana_hypervisor.png
:align: center
:width: 800
Check the LMA Collector documentation for additional information about the
`*libvirt* metrics <http://fuel-plugin-lma-collector.readthedocs.org/en/latest/dev/metrics.html#libvirt>`_
that are displayed in the **Hypervisor Dashboard**.
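As with the other dashboards, the underlying series can be queried directly from the ``influx``
CLI. The sketch below assumes a *libvirt* measurement named *virt_cpu_time* (check the metrics
list linked above for the exact names in your release) and reuses the node shown in the example
figure::
> select * from virt_cpu_time where hostname = 'node-4' limit 5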
Other Dashboards
~~~~~~~~~~~~~~~~
In total, there are 19 different dashboards you can use to
explore different time-series facets of your OpenStack environment.
Viewing Faults and Anomalies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The LMA Toolchain is capable of detecting a number of service-affecting
conditions such as the faults and anomalies that occurred in your OpenStack
environment.
Those conditions are reported in annotations that are displayed in
@@ -348,13 +369,13 @@ Nova has changed a state to *warning* because the system has detected
5xx errors and that it may be due to the fact that Neutron is *down*.
An example of what an annotation looks like is shown below.
|
.. image:: ../images/grafana_nova_annot.png
:align: center
:width: 800
|
This annotation tells us that the health state of Nova is *down*
because there is no *nova-api* service backend (viewed from HAProxy)
that is *up*.
Troubleshooting
---------------
@@ -365,28 +386,43 @@ If you get no data in Grafana, follow these troubleshooting tips.
LMA Collector troubleshooting instructions in the
`LMA Collector Fuel Plugin User Guide <http://fuel-plugin-lma-collector.readthedocs.org/>`_.
#. Check that the nodes are able to connect to the InfluxDB cluster via the VIP address
(see above for how to get the InfluxDB cluster VIP address) on port *8086*::
root@node-2:~# curl -I http://<VIP>:8086/ping
The server should return a 204 HTTP status::
HTTP/1.1 204 No Content
Request-Id: cdc3c545-d19d-11e5-b457-000000000000
X-Influxdb-Version: 0.10.0
Date: Fri, 12 Feb 2016 15:32:19 GMT
#. Check that the InfluxDB cluster VIP address is up and running::
root@node-1:~# crm resource status vip__influxdb
resource vip__influxdb is running on: node-1.test.domain.local
#. Check that the InfluxDB service is started on all nodes of the cluster::
root@node-1:~# service influxdb status
influxdb Process is running [ OK ]
#. If not, (re)start it::
root@node-1:~# service influxdb start
Starting the process influxdb [ OK ]
influxdb process was started [ OK ]
#. Check that Grafana server is running::
root@node-1:~# service grafana-server status
* grafana is running
#. If not, (re)start it::
root@node-1:~# service grafana-server start
* Starting Grafana Server
#. If none of the above solves the problem, check the logs in ``/var/log/influxdb/influxdb.log``
and ``/var/log/grafana/grafana.log`` to find out what might have gone wrong.
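A quick way to surface recent errors in those log files is to filter them with standard tools;
a sketch::
root@node-1:~# grep -iE 'error|fatal' /var/log/influxdb/influxdb.log | tail -n 20
root@node-1:~# grep -iE 'error|fatal' /var/log/grafana/grafana.log | tail -n 20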