.. _usage:

Exploring your time-series with Grafana
----------------------------------------

The InfluxDB-Grafana Plugin comes with a collection of predefined dashboards
you can use to visualize the time-series stored in InfluxDB. For a complete
list of the metric time-series that are collected and stored in InfluxDB,
see the `List of metrics` section of the
`StackLight Collector documentation`_.

The Main dashboard
++++++++++++++++++

We recommend that you start with the **Main dashboard**, as shown below, as
an entry point to other dashboards. The **Main dashboard** provides a single
pane of glass from where you can visualize the overall health status of your
OpenStack services, such as Nova, Cinder, HAProxy, MySQL, RabbitMQ, and
others.

.. image:: ../images/grafana_main.png
   :width: 450pt

The **Main dashboard**, like most dashboards, provides a drop-down menu in
the upper left corner from where you can pick a particular metric dimension,
such as the *controller name* or the *device name* you want to select. In
the example above, the dashboard displays the system metrics of *node-48*.

Within the **OpenStack Services** section, each of the services represented
can be assigned one of five different statuses.

.. note:: The precise determination of a service health status depends on
   the correlation policies implemented for that service by a Global Status
   Evaluation (GSE) plugin. See the `Configuring alarms` section in the
   `StackLight Collector documentation`_.

The service health statuses can be as follows:

* **Down**: One or several primary functions of a service cluster have
  failed. For example, all API endpoints of a service cluster like Nova or
  Cinder failed.

* **Critical**: One or several primary functions of a service cluster are
  severely degraded. The quality of service delivered to the end user is
  severely impacted.

* **Warning**: One or several primary functions of a service cluster are
  slightly degraded. The quality of service delivered to the end user is
  slightly impacted.

* **Unknown**: There is not enough data to infer the actual health status of
  a service cluster.

* **Okay**: None of the above was found to be true.

The **Virtual compute resources** section provides an overview of the
virtual resources used by the compute nodes, including the number of virtual
CPUs and the amount of memory and disk space in use, as well as the amount
of virtual resources that remains available to create new instances.

The **System** section provides an overview of the amount of physical
resources used on the control plane (the controller cluster). You can select
a specific controller using the controller's drop-down list in the left
corner of the toolbar.

The **Ceph** section provides an overview of the resource usage and current
health status of the Ceph cluster when it is deployed in the OpenStack
environment.

The **Main dashboard** is also an entry point to access more detailed
dashboards for each of the OpenStack services that are monitored. For
example, if you click the **Nova** box, the **Nova dashboard** is displayed.

.. image:: ../images/grafana_nova.png
   :width: 450pt

The Nova dashboard
++++++++++++++++++

The **Nova** dashboard provides a detailed view of the metrics related to
the Nova service and consists of the following sections:

**Service status** -- information about the overall health status of the
Nova service cluster, including the status of the API front end (the HAProxy
public VIP), a counter of HTTP 5xx errors, and the HTTP request response
times and status codes.
**Nova API** -- information about the current health status of the API back
ends, for example, nova-api, ec2-api, and others.

**Nova services** -- information about the current and historical status of
the Nova *workers*.

**Instances** -- information about the number of active instances and
instances in the error state, as well as instance creation time statistics.

**Resources** -- various virtual resource usage indicators.

Self-monitoring dashboards
++++++++++++++++++++++++++

The **Self-Monitoring** dashboard provides operational insights into the
performance of the overall monitoring system (the toolchain). It provides
information about the *hekad* and *collectd* processes. In particular, the
**Self-Monitoring** dashboard provides information about the amount of
system resources consumed by these processes, the time allocated to the Lua
plugins running within *hekad*, the number of messages being processed, and
the time it takes to process those messages. You can select a particular
node view using the drop-down menu.

Since StackLight 0.9, there are two new dashboards:

* The **Elasticsearch Cluster** dashboard provides information about the
  overall health status of the Elasticsearch cluster, including the state of
  the shards, the number of pending tasks, and various resource usage
  metrics.

* The **InfluxDB Cluster** dashboard provides statistics about the InfluxDB
  processes running in the InfluxDB cluster, including various resource
  usage metrics.

The hypervisor dashboard
++++++++++++++++++++++++

The **Hypervisor** dashboard provides operational insights into the virtual
instances managed through *libvirt*. As shown in the figure below, the
**Hypervisor** dashboard assembles a view of various *libvirt* metrics. Use
the drop-down menu to pick a particular instance UUID running on a
particular node. The example below shows the metrics for the instance ID
``ba844a75-b9db-4c2f-9cb9-0b083fe03fb7`` running on *node-4*.

.. image:: ../images/grafana_hypervisor.png
   :width: 450pt

For additional information on the *libvirt* metrics that are displayed in
the **Hypervisor** dashboard, see the `List of metrics` section of the
`StackLight Collector documentation`_.

Other dashboards
++++++++++++++++

There are 19 different dashboards in total that you can use to explore
different time-series facets of your OpenStack environment.

Viewing faults and anomalies
++++++++++++++++++++++++++++

The LMA Toolchain is capable of detecting a number of service-affecting
conditions, such as the faults and anomalies that occurred in your OpenStack
environment. These conditions are reported in annotations that are displayed
in Grafana. The Grafana annotations contain a textual representation of the
alarm (or set of alarms) that was triggered by the Collectors for a service.
In other words, the annotations contain valuable insights that you can use
to diagnose and troubleshoot issues.

Furthermore, with the Grafana annotations, the system makes a distinction
between what is estimated as a direct root cause versus what is estimated as
an indirect root cause. This is internally represented in a dependency
graph. There are first-degree dependencies used to describe situations
whereby the health status of an entity strictly depends on the health status
of another entity. For example, Nova as a service has first-degree
dependencies with the nova-api endpoints and the nova-scheduler workers.
However, there are also second-degree dependencies, whereby the health
status of an entity does not strictly depend on the health status of another
entity, although it might, depending on other operations being performed.
For example, by default, Nova is declared to have a second-degree dependency
on Neutron. As a result, the health status of Nova is not directly impacted
by the health status of Neutron, but the annotation provides a root cause
analysis hint.

Consider a situation where Nova has changed from the *okay* to the
*critical* status (because of HTTP 5xx errors) while Neutron has been in the
*down* status for a while. In this case, the Nova dashboard displays an
annotation showing that Nova has changed to the *critical* status because
the system has detected 5xx errors, and that this may be caused by the fact
that Neutron is *down*.

Below is an example of an annotation showing that the health status of Nova
is *down* because no *nova-api* service back end (as seen from HAProxy) is
*up*.

.. image:: ../images/grafana_nova_annot.png
   :width: 450pt

Hiding nodes from dashboards
++++++++++++++++++++++++++++

When you remove a node from the environment, it is still displayed in the
:guilabel:`server` and :guilabel:`controller` drop-down lists. To hide it
from these lists, edit the associated InfluxDB query in the *Templating*
section of the dashboard. For example, if you want to remove *node-1*, add
the following condition to the *where* clause::

  and hostname != 'node-1'

.. image:: ../images/remove_controllers_from_templating.png
   :width: 450pt

To hide more than one node, add more conditions. For example::

  and hostname != 'node-1' and hostname != 'node-2'

Perform these actions for all dashboards that display the deleted node, and
save them afterward.
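
For illustration, the complete edited templating query could take a form
similar to the sketch below. This is only an example: the measurement name
(``cpu_idle``) and the initial ``environment_label`` condition are
placeholders and may differ from the query actually shipped with your
dashboards, so keep the query you find in the *Templating* section as-is and
only append the exclusions at the end of its *where* clause::

  show tag values from cpu_idle with key = hostname
    where environment_label = '$environment'
      and hostname != 'node-1' and hostname != 'node-2'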