Fuel plugin for the OpenStack Infrastructure Alarming
https://blueprints.launchpad.net/fuel/+spec/lma-infra-alerting-plugin
The LMA Infrastructure Alerting plugin is composed of several services running on a node (base-os role). It provides alerting functionality for the OpenStack infrastructure within the LMA toolchain [1] suite of plugins.
Problem description
The current implementation of the LMA toolchain [2] doesn't provide alerting functionality.
This specification aims to address the following use cases:
- OpenStack operator(s) want to be notified when the status of a component within the infrastructure changes:
  - OpenStack service status has changed (for example OKAY -> FAIL)
  - Cluster (RabbitMQ, MySQL, ...) status has changed (for example OKAY -> WARN)
  - ...
- OpenStack operators want to configure thresholds on the metrics collected by the LMA collector and be notified when a metric crosses its threshold. Operators should be able to configure alarms with their own threshold against any of the available metrics collected by the `LMA collector`:
  - Load average is too high on a controller node.
  - File system is nearly full on a node.
  - CPU usage is too high on a controller node.
  - ...
Proposed changes
Implement a Fuel plugin that will install and configure the LMA infrastructure alerting system for an OpenStack environment.
The initial implementation of this plugin will install and configure Nagios to manage alerts and send notifications to operators by email.
Two types of alerts are initially supported:
- Alerts that leverage the service statuses computed by the LMA collector plugins (OKAY, WARN, FAIL, UNKNOWN).
- Alarms configured over metrics by querying the time series database provided by the InfluxDB-Grafana plugin [4].
To implement these features in the LMA toolchain, it is necessary to:
- Configure the Nagios server.
- Plug the LMA collector into this new alerting system using the native Hekad NagiosOutput plugin [7] with the HTTP method. The following example shows the configuration of Heka and Nagios for the Nova status:
# Heka configuration example
[NagiosOutput]
url = "http://<node-nagios>/nagios3/cgi-bin/cmd.cgi"
username = "nagiosadmin"
password = "supersecret"
nagios_host = "openstack-services"
nagios_service_description = "openstack.nova.status"
# Nagios configuration
define service {
check_command return-unknown-openstack.nova.status
check_freshness 1
check_interval 30
contact_groups openstack-admin
display_name openstack.nova.status
host_name openstack-services-env9
freshness_threshold 45
max_check_attempts 1
retry_interval 30
passive_checks_enabled 1
active_checks_enabled 0
process_perf_data 0
service_description openstack.nova.status
use generic-service
}
- Integrate or develop a Nagios plugin that will query metrics from the InfluxDB database and trigger alerts when certain thresholds are met. Note that this requires declaring all the nodes as hosts in the Nagios configuration.
The following example shows the configuration of an alert on CPU usage for the primary controller:
# Nagios configuration to check CPU usage of nodes
# (Nagios object directives take the form "directive value", without "=")
define command {
command_name check_cpu_for_host
command_line check_influx_for_host -H $HOSTNAME$ -m cpu -w $ARG1$ -c $ARG2$
}
define host {
host_name node-2
display_name primary-controller
address 10.109.0.4
contact_groups openstack-admin
..
}
# Check CPU usage with threshold set to 75% for WARNING and 95% for CRITICAL
define service {
service_description CPU usage
host_name node-2
contact_groups openstack-admin
check_command check_cpu_for_host!75!95
...
}
The resulting InfluxDB 0.8 query would be:
select mean(value) from merge(/node-2.cpu.\d+.user/) where time > now() - 1m group by time(1m)
With InfluxDB 0.9 the corresponding tag is used to filter per node:
select mean(value) from merge(/cpu.\d+.user/) where node='node-2' and time > now() - 1m group by time(1m)
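As an illustration of how such a check could evaluate a metric, the sketch below rebuilds the InfluxDB 0.8 query shown above for a given node and maps the returned mean value to a Nagios plugin exit code. The function names `build_query` and `evaluate` are hypothetical and not part of the plugin, and treating a value equal to a threshold as a breach is an assumption. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention. Fetching the value from InfluxDB is deliberately left out.

```python
# Hypothetical sketch of the threshold logic behind check_influx_for_host.
# Only the query string and the exit codes are modeled; no InfluxDB call is made.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_query(host, metric="cpu", window="1m"):
    """Return the InfluxDB 0.8 query used in the example above."""
    series = r"/%s.%s.\d+.user/" % (host, metric)
    return ("select mean(value) from merge(%s) "
            "where time > now() - %s group by time(%s)" % (series, window, window))


def evaluate(value, warn, crit):
    """Map a metric value to a Nagios plugin exit code."""
    if value is None:
        return UNKNOWN  # the query returned no datapoint
    if value >= crit:   # assumption: reaching the threshold counts as a breach
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

With the thresholds from the service definition above (`check_cpu_for_host!75!95`), a mean CPU usage of 80 would yield WARNING and 96 would yield CRITICAL.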
Alternatives
There are plenty of alerting solutions, but Nagios is the dominant open source monitoring solution. It is robust and proven, and it fits both our alerting use case and integration with an existing infrastructure monitoring setup.
It may be possible to leverage other open source solutions to complement and/or replace Nagios in the future.
Writing a new alerting system would also be possible, either by polling the time series database or by performing real-time computation of metrics. However, such a system would have to be scalable and would mean reinventing many things that already exist.
Alert severities
The service statuses computed by the LMA collector are mapped to the states defined by Nagios as follows:
LMA collector | Nagios |
---|---|
OKAY | OK |
WARN | WARNING |
FAIL | CRITICAL |
UNKNOWN | UNKNOWN |
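The mapping in the table can be sketched in code as follows. This is a minimal illustration, not the plugin's implementation: the helper name `passive_result` is hypothetical, and the form field names and the `cmd_typ`/`cmd_mod` values are assumptions about what a passive service check result submitted to the Nagios `cmd.cgi` endpoint looks like. The numeric states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios service states.

```python
# Hypothetical sketch: translate an LMA collector status into the form
# fields of a passive service check result sent to Nagios over HTTP.

LMA_TO_NAGIOS = {"OKAY": 0, "WARN": 1, "FAIL": 2, "UNKNOWN": 3}


def passive_result(host, service, lma_status, output=""):
    """Build the form fields for one passive service check result."""
    state = LMA_TO_NAGIOS.get(lma_status, 3)  # unmapped statuses become UNKNOWN
    return {
        "cmd_typ": "30",  # assumed: PROCESS_SERVICE_CHECK_RESULT
        "cmd_mod": "2",   # assumed: commit the command
        "host": host,
        "service": service,
        "plugin_state": str(state),
        "plugin_output": output or lma_status,
    }
```

For example, `passive_result("openstack-services", "openstack.nova.status", "FAIL")` would carry `plugin_state` set to `"2"`, which Nagios displays as CRITICAL.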
Contacts, Alerting and Escalation
The plugin allows one email address to be configured to receive notifications. It is up to the user to select which kinds of events to be notified about:
- critical
- warning
- unknown
- recovery
The plugin does not enable any escalation configuration. The user can still configure escalation manually after the plugin has been deployed.
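One way the selected event kinds could be translated into a Nagios contact setting is sketched below. The function name `notification_options` is hypothetical and the plugin may generate its configuration differently; the flags `w`, `u`, `c`, `r` and `n` are the values accepted by the Nagios `service_notification_options` contact directive.

```python
# Hypothetical sketch: turn the event kinds selected by the user into the
# value of the Nagios service_notification_options contact directive.

EVENT_FLAGS = {"critical": "c", "warning": "w", "unknown": "u", "recovery": "r"}


def notification_options(selected):
    """Return a comma-separated flag list, or "n" (none) if nothing is selected."""
    order = ("warning", "unknown", "critical", "recovery")
    flags = [EVENT_FLAGS[event] for event in order if event in selected]
    return ",".join(flags) or "n"
```

Selecting only critical and recovery events would give `c,r`, so the operator is told when a service fails and when it comes back.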
Limitations
Adding nodes to or removing nodes from the OpenStack cluster won't reconfigure the Nagios server.
This is a limitation of the Fuel Plugin Framework, which doesn't trigger tasks when those actions are performed. This limitation should be addressed by a Fuel blueprint [9] in the future but might not be ready for MOS 7.0.
As a result, the user has to adjust the Nagios configuration manually:
- to stop receiving alert notifications about a deleted node,
- to add the new node(s) to the Nagios configuration.
A possible workaround for the 'adding' case would be to run an SSH command from the newly deployed node(s) that applies the appropriate Puppet manifest on the Nagios node. This workaround may be investigated later, but it is not part of the initial implementation.
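Until such a workaround exists, the manual step of adding a node could be eased with a small helper like the one below. This is only a sketch: `render_host` is a hypothetical name, the `generic-host` template is an assumption about the Nagios installation, and the directive names are taken from the host example earlier in this document.

```python
# Hypothetical helper: render a Nagios host definition for a new node so it
# can be added to the Nagios configuration by hand.

def render_host(host_name, display_name, address, contact_group="openstack-admin"):
    """Return a Nagios 'define host' block as text."""
    directives = [
        ("host_name", host_name),
        ("display_name", display_name),
        ("address", address),
        ("contact_groups", contact_group),
        ("use", "generic-host"),  # assumed template name
    ]
    body = "\n".join("    %-16s%s" % (key, value) for key, value in directives)
    return "define host {\n%s\n}\n" % body
```

The rendered block would still have to be written to a configuration file read by Nagios, and Nagios reloaded, before the new host is taken into account.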
Data model impact
None
REST API impact
None
Upgrade impact
If you want to use the LMA alerting plugin, you will have to upgrade your LMA collector plugin too.
Security impact
None
Notifications impact
None
Other end user impact
None
Performance Impact
The Nagios server can have several active checks which poll servers/services and can add extra workload on these targets.
This impact is minimized here by both:
- the usage of passive checks (i.e. Nagios receives statuses but doesn't poll),
- Nagios doesn't poll servers to retrieve metrics but queries the time series database.
Other deployer impact
New configuration options:
- email address of the operator
- SMTP gateway (optional)
Developer impact
None
Infrastructure impact
None
Implementation
Assignee(s)
Primary assignee:
- Swann Croiset <scroiset@mirantis.com> (developer)
Other contributors:
- Guillaume Thouvenin <gthouvenin@mirantis.com> (developer)
- Simon Pasquier <spasquier@mirantis.com> (feature lead, developer)
Work Items
- Implement the Puppet manifests for both Ubuntu and CentOS to configure Nagios.
- Add support for the Nagios output plugin of the LMA collector.
- Implement or integrate the Nagios plugin to query InfluxDB for alarm evaluation over metrics.
- Testing.
- Write the documentation.
Dependencies
- Fuel 6.1 and higher.
- LMA Collector Fuel plugin.
Testing
- Prepare a test plan.
- Test the plugin by deploying environments with all Fuel deployment modes and the LMA toolchain configured.
- Create integration tests with the LMA toolchain.
Acceptance criteria
- The operator can log in to the Nagios web interface.
- The operator must be notified by email when the state of an OpenStack service changes (OK -> DOWN, OK -> WARN, DOWN -> OK).
- The operator can define their own alerts based on InfluxDB metrics and receive notifications when the thresholds are reached.
Documentation Impact
- Write the User Guide for this plugin: deploy and configure the solution.
- Test Plan.
- Test Report.
References
[1] [2] The LMA toolchain is currently composed of several Fuel plugins:
- LMA collector plugin
- InfluxDB-Grafana plugin
- Elasticsearch-Kibana plugin
[4] https://github.com/stackforge/fuel-plugin-influxdb-grafana
[7] http://hekad.readthedocs.org/en/v0.9.2/config/outputs/nagios.html
[9] https://blueprints.launchpad.net/fuel/+spec/fuel-task-notify-other-nodes