..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Fuel plugin for the OpenStack Infrastructure Alarming
=====================================================

https://blueprints.launchpad.net/fuel/+spec/lma-infra-alerting-plugin

The `LMA Infrastructure Alerting` plugin is composed of several services
running on a node (base-os role). It provides alerting functionality for the
OpenStack infrastructure within the `LMA toolchain` [1]_ plugin suite.

Problem description
===================

The current implementation of the `LMA toolchain` [1]_ does not provide
alerting functionality. This specification aims to address the following use
cases:

* OpenStack operators want to be notified when the status of a component
  within the infrastructure changes:

  * An OpenStack service status has changed (for example OKAY -> FAIL)
  * A cluster (RabbitMQ, MySQL, ...) status has changed (for example
    OKAY -> WARN)
  * ...

* OpenStack operators want to configure thresholds on the metrics collected by
  the LMA collector and be notified when a metric crosses its threshold.
  Operators should be able to configure alarms with their own thresholds
  against any of the metrics collected by the `LMA collector`:

  * Load average is too high on a controller node.
  * File system is nearly full on a node.
  * CPU usage is too high on a controller node.
  * ...

Proposed changes
================

Implement a Fuel plugin that will install and configure the LMA infrastructure
alerting system for an OpenStack environment.

The initial implementation of this plugin will install and configure
Nagios [2]_ to manage alerts and send notifications to operators by email.

Two types of alerts are initially supported:

* Leverage the service status determinations computed by the `LMA collector`
  plugins (OKAY, WARN, FAIL, UNKNOWN).
* Provide the ability to configure alarms over metrics by querying the time
  series database provided by the `Influxdb-Grafana` plugin [8]_.

To implement these features in the `LMA toolchain`, it is necessary to:

0. Configure the Nagios server.

1. Plug the `LMA collector` [3]_ into this new alerting system using the native
   Hekad [4]_ NagiosOutputPlugin [5]_ with the HTTP method.

   The following example shows the configuration of Heka and Nagios for the
   Nova status:

   .. code::

      # Heka configuration example
      [NagiosOutput]
      url = "http:///nagios3/cgi-bin/cmd.cgi"
      username = "nagiosadmin"
      password = "supersecret"
      nagios_host = "openstack-services"
      nagios_service_description = "openstack.nova.status"

      # Nagios configuration
      define service {
          check_command           return-unknown-openstack.nova.status
          check_freshness         1
          check_interval          30
          contact_groups          openstack-admin
          display_name            openstack.nova.status
          host_name               openstack-services-env9
          freshness_threshold     45
          max_check_attempts      1
          retry_interval          30
          passive_checks_enabled  1
          active_checks_enabled   0
          process_perf_data       0
          service_description     openstack.nova.status
          use                     generic-service
      }

2. Integrate [7]_ or develop a Nagios plugin that queries metrics from the
   InfluxDB database and triggers alerts when certain thresholds are met.
   Note that this requires declaring all the nodes as hosts in the Nagios
   configuration.

   The following example shows the configuration of an alert on CPU usage for
   the primary controller:

   .. code::

      # Nagios configuration to check CPU usage of nodes
      define command {
          command_name  check_cpu_for_host
          command_line  check_influx_for_host -H $HOSTNAME$ -m cpu -w $ARG1$ -c $ARG2$
      }

      define host {
          host_name       node-2
          display_name    primary-controller
          address         10.109.0.4
          contact_groups  openstack-admin
          ..
      }

      # Check CPU usage with thresholds set to 75% for WARNING and 95% for CRITICAL
      define service {
          service_description  CPU usage
          host_name            node-2
          contact_groups       openstack-admin
          check_command        check_cpu_for_host!75!95
          ...
      }

   The resulting InfluxDB 0.8 query would be:

   .. code::

      select mean(value) from merge(/node-2.cpu.\d+.user/) where time > now() - 1m group by time(1m)

   With InfluxDB 0.9, the corresponding tag is used to filter per node:

   .. code::

      select mean(value) from merge(/cpu.\d+.user/) where node='node-2' and time > now() - 1m group by time(1m)

Alternatives
------------

There are many alerting solutions, but Nagios is the dominant open source
monitoring solution. It is robust and proven, and it fits both our alerting use
case and integration with a legacy infrastructure monitoring system.

It may be possible to leverage other open source solutions to complement or
replace Nagios in the future.

Writing a new alerting system would also be possible, either by polling the
time series database or by performing real-time computation of metrics.
However, such a system would have to be scalable and would reinvent much of
what already exists.

Alert severities
----------------

The service statuses computed by the `LMA collector` are mapped to the states
defined by Nagios as follows:

+---------------+----------+
| LMA collector | Nagios   |
+===============+==========+
| OKAY          | OK       |
+---------------+----------+
| WARN          | WARNING  |
+---------------+----------+
| FAIL          | CRITICAL |
+---------------+----------+
| UNKNOWN       | UNKNOWN  |
+---------------+----------+

Contacts, Alerting and Escalation
---------------------------------

The plugin allows one email address to be configured to receive notifications.
The user selects which kinds of events trigger a notification:

* critical
* warning
* unknown
* recovery

The plugin does not enable any escalation configuration. The user can still
configure escalation manually after the plugin is deployed.

Limitations
-----------

Adding nodes to or removing nodes from the OpenStack cluster won't re-configure
the Nagios server.
This is a limitation of the Fuel Plugin Framework, which doesn't trigger any
`task` when those actions are performed.
This limitation should be addressed by a Fuel blueprint [9]_ in the future but
might not be ready for MOS 7.0.

As a result, the user has to adjust the Nagios configuration manually:

* to stop receiving alert notifications about a deleted node,
* to add the new node(s) to the Nagios configuration.

A possible workaround when adding nodes would be to run an SSH command from the
newly deployed node(s) that applies the appropriate Puppet manifest on the
Nagios node. This workaround may be investigated later but is not part of the
initial implementation.

Data model impact
-----------------

None

REST API impact
---------------

None

Upgrade impact
--------------

If you want to use the LMA alerting plugin, you will have to upgrade your LMA
collector plugin too.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

The Nagios server can run several ``active checks`` that poll servers and
services, which can add extra workload to these targets.
This impact is minimized here in two ways:

* by using ``passive checks`` (i.e. Nagios receives statuses but doesn't poll),
* by querying the time series database for metrics instead of polling the
  servers.

Other deployer impact
---------------------

New configuration options:

* email address of the operator
* SMTP gateway (optional)

Developer impact
----------------

None

Infrastructure impact
---------------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Swann Croiset (developer)

Other contributors:
  Guillaume Thouvenin (developer),
  Simon Pasquier (feature lead, developer)

Work Items
----------

* Implement the Puppet manifests for both Ubuntu and CentOS to configure Nagios:

  * Nagios server: main configuration.
  * Nagios CGI (Web interface) served by Apache [10]_ and PHP [11]_.
  * Nagios Objects configuration: Commands, Services, Hosts and Contacts.

* Add support for the Nagios output plugin of the LMA collector.

* Implement or integrate [7]_ the Nagios plugin to query InfluxDB for alarm
  evaluation over metrics.

* Testing.

* Write the documentation.

Dependencies
============

* Fuel 6.1 and higher.

* LMA Collector Fuel plugin.

Testing
=======

* Prepare a test plan.

* Test the plugin by deploying environments with all Fuel deployment modes and
  the LMA toolchain configured.

* Create integration tests with the LMA toolchain.

Acceptance criteria
-------------------

* The operator can log in to the Nagios web interface.

* The operator must be notified by email when the state of an OpenStack
  service changes (OK -> DOWN, OK -> WARN, DOWN -> OK).

* The operator can define their own alerts based on InfluxDB metrics and
  receive notifications when the thresholds are reached.

Documentation Impact
====================

* Write the User Guide for this plugin: deploy and configure the solution.

* Test Plan.

* Test Report.

References
==========

.. [1] The LMA toolchain is currently composed of several Fuel plugins:

   * LMA collector plugin
   * InfluxDB-Grafana plugin
   * Elasticsearch-Kibana plugin

.. [2] http://nagios.org

.. [3] https://github.com/stackforge/fuel-plugin-lma-collector

.. [4] http://hekad.readthedocs.org/

.. [5] http://hekad.readthedocs.org/en/v0.9.2/config/outputs/nagios.html

.. [6] http://www.influxdb.com/

.. [7] https://github.com/shaharke/influx-nagios-plugin

.. [8] https://github.com/stackforge/fuel-plugin-influxdb-grafana

.. [9] https://blueprints.launchpad.net/fuel/+spec/fuel-task-notify-other-nodes

.. [10] http://httpd.apache.org

.. [11] http://php.net
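
Appendix
========

The threshold-based alerting described in the proposed changes, together with
the status mapping from the alert severities table, can be illustrated with a
minimal Python sketch. It is not the plugin's implementation: the function
names, the ``>=`` comparison and the wording of the status line are
assumptions made for illustration, and the metric value is passed in directly
instead of being queried from InfluxDB. Only the exit codes (0 to 3) follow
the standard Nagios plugin convention.

.. code:: python

   # Standard Nagios plugin exit codes.
   OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

   # LMA collector statuses mapped to Nagios states (see "Alert severities").
   LMA_TO_NAGIOS = {"OKAY": OK, "WARN": WARNING, "FAIL": CRITICAL, "UNKNOWN": UNKNOWN}

   STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


   def evaluate_threshold(value, warning, critical):
       """Return the Nagios state for a metric value (hypothetical helper).

       A value of None stands for "no data point returned by the query".
       """
       if value is None:
           return UNKNOWN
       if value >= critical:  # assumption: thresholds are inclusive
           return CRITICAL
       if value >= warning:
           return WARNING
       return OK


   def plugin_output(metric, value, warning, critical):
       """Build the (exit code, status line) pair a check would report."""
       state = evaluate_threshold(value, warning, critical)
       shown = "no data" if value is None else "%.1f" % value
       return state, "%s - %s is %s" % (STATE_NAMES[state], metric, shown)

With the thresholds from the CPU example (75 for WARNING, 95 for CRITICAL),
``plugin_output("cpu", 80.0, 75, 95)`` returns ``(1, "WARNING - cpu is 80.0")``.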