..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Fuel plugin for the OpenStack Infrastructure Alarming
=====================================================


https://blueprints.launchpad.net/fuel/+spec/lma-infra-alerting-plugin

The `LMA Infrastructure Alerting` plugin is composed of several services
running on a node (base-os role). It provides alerting functionality for the
OpenStack Infrastructure inside the `LMA toolchain` [1]_ plugins suite.


Problem description
===================

The current implementation of the `LMA toolchain` [1]_ doesn't provide the
alerting functionality.

This specification aims to address the following use cases:

* OpenStack operator(s) want to be notified when the status of a component
  within the infrastructure changes:

  * OpenStack service status has changed (for example OKAY -> FAIL)
  * Cluster (RabbitMQ, MySQL, ..)  status has changed (for example OKAY -> WARN)
  * ...

* OpenStack operators want to configure thresholds on the metrics collected by
  the LMA collector and be notified when a metric crosses its threshold.
  Operators should be able to configure alarms with their own threshold against
  any of the available metrics collected by `LMA collector`:

  * Load average is too high on a controller node.
  * File system is nearly full on a node.
  * CPU usage is too high on a controller node.
  * ...

Proposed changes
================

Implement a Fuel plugin that will install and configure the LMA infrastructure
alerting system for an OpenStack environment.

The initial implementation of this plugin plans to install and configure
Nagios [2]_ to manage alerts and send notifications to operators by email.

There are two types of alerts which are initially supported:

   * Leverage the service status determinations computed by the `LMA collector`
     plugins (OKAY, WARN, FAIL, UNKNOWN).
   * Provide the ability to configure alarms over metrics by querying the
     time series database provided by the `Influxdb-Grafana` plugin [8]_

In order to implement these features into the `LMA toolchain` it's necessary
to:

0. Configure Nagios server.

1. Plug the `LMA collector` [3]_ to this new alerting system with the native
   Hekad [4]_ NagiosOutputPlugin [5]_ with HTTP method.
   Following example shows the configuration of Heka and Nagios for the
   Nova status:

.. code::

  # Heka configuation example
  [NagiosOutput]
  url = "http://<node-nagios>/nagios3/cgi-bin/cmd.cgi"
  username = "nagiosadmin"
  password = "supersecret"
  nagios_host = openstack-services"
  nagios_service_description = "openstack.nova.status"

  # Nagios configuration
  define service {
    check_command                  return-unknown-openstack.nova.status
    check_freshness                1
    check_interval                 30
    contact_groups                 openstack-admin
    display_name                   openstack.nova.status
    host_name                      openstack-services-env9
    freshness_threshold            45
    max_check_attempts             1
    retry_interval                 30
    passive_checks_enabled         1
    active_checks_enabled          0
    process_perf_data              0
    service_description            openstack.nova.status
    use                            generic-service
  }


2. Integrate [7]_ or develop a Nagios plugin that will query metrics from the
   InfluxDB database and trigger alerts when certain thresholds are met.
   Note that this implies to declare all the nodes as hosts in the Nagios
   configuration.

   Following example is the configuration of an alert on CPU usage for
   primary controller:

.. code::

  # Nagios configuration to check CPU usage of nodes
  define command {
    command_name = check_cpu_for_host
    command_line = check_influx_for_host -H $HOSTNAME$ -m cpu -w $ARG1$ -c $ARG2$
  }

  define host {
    host_name = node-2
    display_name = primary-controller
    address = 10.109.0.4
    contact_groups = openstack-admin
    ..
  }

  # Check CPU usage with threshold set to 75% for WARNING and 95% for critical
  define service {
    service_description = CPU usage
    host_name = node-2
    contact_groups = openstack-admin
    check_command = check_cpu_for_host!75!95
    ...
  }

The resulting InfluxDB 0.8 query would be :

.. code::

  select mean(value) from merge(/node-2.cpu.\d+.user/) where time > now() - 1m group by time(1m)

With InfluxDB 0.9 the corresponding tag is used to filter per node:

.. code::

  select mean(value) from merge(/cpu.\d+.user/) where node='node-2' and time > now() - 1m group by time(1m)


Alternatives
------------

There are plenty of alerting solutions but Nagios is the dominant open
source monitoring solution. Hence Nagios brings a robust and proven solution
which matches perfectly both to our alerting use case and the integration within
a legacy infrastructure monitoring.

It may be possible to leverage other open source solutions to complete and/or
replace Nagios in future.

Writing a new alerting system would be also possible either by polling
the time serie database or by performing realtime computation of metrics.
But this would require to be scalable and would need to reinvent lots of things
that already exist.

Alert severities
----------------

The service statutes computed by the `LMA collector` are mapped with the states
defined by Nagios by this way:

+---------------+----------+
| LMA collector | Nagios   |
+===============+==========+
| OKAY          | OK       |
+---------------+----------+
| WARN          | WARNING  |
+---------------+----------+
| FAIL          | CRITICAL |
+---------------+----------+
| UNKNOWN       | UNKNOWN  |
+---------------+----------+

Contacts, Alerting and Escalation
---------------------------------

The plugin allows to configure one email address to receive notifications,
it's up to the user to select which kind of event he/she will receive:

* critical
* warning
* unknown
* recovery

There is no escalation configuration enabled by the plugin. The user still have
the possiblity to configure it manually after the deployment of the plugin.

Limitations
-----------

Adding and removing node(s) to/from the OpenStack cluster won't re-configure
the Nagios server.

This is a limitation of the Fuel Plugin Framework which doesn't trigger `task`
when those actions are performed. This limitation should be addressed by a
Fuel blueprint [9]_ in the future but might be not ready for MOS 7.0.

This limitation is leading the user to adjust manually the Nagios
configuration:

 * to not receive alert notifications about a deleted node,
 * to add the new node(s) to Nagios configuration.

A possible workaround for the 'adding case' would be to use a SSH command from
the new node(s) deployed to run the appropriate Puppet manifest on the Nagios
node. This workaround may be investigated eventually but not in the first place.

Data model impact
-----------------

None

REST API impact
---------------
None

Upgrade impact
--------------

If you want to use the LMA alerting plugin, you will have to upgrade your
LMA collector plugin too.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

The Nagios server can have several ``active checks`` which poll servers/services
and can lead to add extra workload on these targets.

This impact is minimized here by both:
 * the usage of ``passive checks`` (ie. Nagios receives status but doesn't poll)
 * Nagios doesn't poll servers to retrieve metrics but queries the time series
   database.


Other deployer impact
---------------------

New configuration options:

* email address of the operator
* SMTP gateway (optional)

Developer impact
----------------

None

Infrastructure impact
---------------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Swann Croiset <scroiset@mirantis.com> (developer)

Other contributors:
  Guillaume Thouvenin <gthouvenin@mirantis.com> (developer)
  Simon Pasquier <spasquier@mirantis.com> (feature lead, developer)

Work Items
----------

* Implement the Puppet manifests for both Ubuntu and CentOS to configure Nagios

  * Nagios server: main configuration.
  * Nagios CGI (Web interface) served by Apache [10]_ and PhP [11]_.
  * Nagios Objects configuration: Commands, Services, Hosts and Contacts.

* Add support for Nagios output plugin of the LMA collector.

* Implement or integrate [7]_ the Nagios plugin to query InfluxDB for alarm
  evaluation over metrics.

* Testing.

* Write the documentation.

Dependencies
============

* Fuel 6.1 and higher.

* LMA Collector Fuel plugin.

Testing
=======

* Prepare a test plan.

* Test the plugin by deploying environments with all Fuel deployment modes and
  the LMA toolchain configured.

* Create integration tests with the LMA toolchain

Acceptance criteria
-------------------

* The operator can login to the Nagios web interface.
* The operator must be notified by email when the state of an
  OpenStack service change (OK -> DOWN, OK -> WARN, DOWN -> OK).
* The operator can define own alerts based on InfluxDB metrics and receive
  notifications when the thresholds are reached.

Documentation Impact
====================


* Write the User Guide for this plugin: deploy and configure the solution.

* Test Plan.

* Test Report.

References
==========

.. [1] The LMA toolchain is currently composed of several Fuel plugins:

        * LMA collector plugin
        * InfluxDB-Grafana plugin
        * Elasticsearch-Kibana plugin

.. [2] http://nagios.org

.. [3] https://github.com/stackforge/fuel-plugin-lma-collector

.. [4] http://hekad.readthedocs.org/

.. [5] http://hekad.readthedocs.org/en/v0.9.2/config/outputs/nagios.html

.. [6] http://www.influxdb.com/

.. [7] https://github.com/shaharke/influx-nagios-plugin

.. [8] https://github.com/stackforge/fuel-plugin-influxdb-grafana

.. [9] https://blueprints.launchpad.net/fuel/+spec/fuel-task-notify-other-nodes

.. [10] http://httpd.apache.org

.. [11] http://php.net