Team and repository tags
========================

.. image:: https://governance.openstack.org/tc/badges/monasca-notification.svg
    :target: https://governance.openstack.org/tc/reference/tags/index.html

.. Change things from this point on

Notification Engine
===================

This engine reads alarms from Kafka and then notifies the customer using
the configured notification method. Multiple notification and retry
engines can run in parallel, up to one per available Kafka partition.
Zookeeper is used to negotiate access to the Kafka partitions whenever a
new process joins or leaves the working set.

Architecture
============

The notification engine generates notifications using the following
steps:

1. Read alarms from Kafka, with no auto commit. -
   monasca\_common.kafka.KafkaConsumer class
2. Determine the notification type for an alarm by reading from MySQL. - AlarmProcessor class
3. Send notification. - NotificationProcessor class
4. Add successful notifications to a sent notification topic. - NotificationEngine class
5. Add failed notifications to a retry topic. - NotificationEngine class
6. Commit offset to Kafka. - KafkaConsumer class

The notification engine uses three Kafka topics:

1. alarm\_topic: Alarms inbound to the notification engine.
2. notification\_topic: Successfully sent notifications.
3. notification\_retry\_topic: Failed notifications.

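The six steps and three topics above can be sketched as a single loop. This is a minimal illustration and not the project's code: ``FakeConsumer``, ``run_engine``, ``lookup_methods`` and ``send`` are hypothetical names, and plain Python lists stand in for the Kafka topics so that the example runs without a broker.

```python
from dataclasses import dataclass


@dataclass
class FakeConsumer:
    """In-memory stand-in for a Kafka consumer with auto commit disabled."""
    messages: list
    committed: int = 0  # offset a restarted consumer would resume from

    def __iter__(self):
        for offset in range(self.committed, len(self.messages)):
            yield offset, self.messages[offset]

    def commit(self, offset):
        self.committed = offset + 1


def run_engine(consumer, lookup_methods, send, topics):
    for offset, alarm in consumer:                        # step 1: read, no auto commit
        for method in lookup_methods(alarm):              # step 2: look up notification type
            notification = {"alarm": alarm, "method": method}
            if send(notification):                        # step 3: send
                topics["notification_topic"].append(notification)        # step 4
            else:
                topics["notification_retry_topic"].append(notification)  # step 5
        consumer.commit(offset)                           # step 6: commit after processing


# Hypothetical run: every alarm has an email and a webhook method,
# and the webhook is assumed to fail.
topics = {"notification_topic": [], "notification_retry_topic": []}
consumer = FakeConsumer(messages=["alarm-1", "alarm-2"])
run_engine(
    consumer,
    lookup_methods=lambda alarm: ["email", "webhook"],
    send=lambda notification: notification["method"] == "email",
    topics=topics,
)
```

Because the commit happens only after an alarm has been fully handled, a crash between steps 3 and 6 leaves the offset uncommitted and the alarm is read again after a restart.
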
A retry engine runs in parallel with the notification engine and gives
any failed notification a configurable number of extra chances at
success.

The retry engine generates notifications using the following steps:

1. Read notification JSON data from Kafka, with no auto commit. - KafkaConsumer class
2. Rebuild the notification that failed. - RetryEngine class
3. Send notification. - NotificationProcessor class
4. Add successful notifications to a sent notification topic. - RetryEngine class
5. Add failed notifications that have not hit the retry limit back to the retry topic. -
   RetryEngine class
6. Discard failed notifications that have hit the retry limit. - RetryEngine class
7. Commit offset to Kafka. - KafkaConsumer class

The retry engine uses two Kafka topics:

1. notification\_retry\_topic: Notifications that need to be retried.
2. notification\_topic: Successfully sent notifications.

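The routing decision in steps 3 to 6 of the retry engine can be sketched as follows. The function name ``handle_retry``, the ``retry_count`` field and the ``max_retries`` parameter are assumptions made for this illustration; the real option and field names may differ.

```python
def handle_retry(notification, send, max_retries, topics):
    """Route one notification that was read from the retry topic."""
    if send(notification):
        topics["notification_topic"].append(notification)
        return "sent"

    # Count this failed attempt (hypothetical "retry_count" field).
    attempts = notification.get("retry_count", 0) + 1
    if attempts < max_retries:
        # Retry limit not reached yet: put it back on the retry topic.
        requeued = {**notification, "retry_count": attempts}
        topics["notification_retry_topic"].append(requeued)
        return "requeued"

    # Retry limit reached: drop the notification.
    return "discarded"


topics = {"notification_topic": [], "notification_retry_topic": []}
always_fail = lambda notification: False

first = handle_retry({"id": 1}, always_fail, 2, topics)
second = handle_retry(topics["notification_retry_topic"][0], always_fail, 2, topics)
```

With a limit of two retries and a sender that always fails, the first call requeues the notification and the second call discards it.
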
Fault Tolerance
---------------

When reading from the alarm topic, offsets are not committed
automatically. A commit is made only after an alarm has been processed.
This allows processing to continue even though some notifications can be
slow. In the event of a catastrophic failure, some notifications could
have been sent for alarms that have not yet been acknowledged, and those
notifications will be sent again after a restart. This is an acceptable
failure mode: it is better to send a notification twice than not at all.

The general process when a major error is encountered is to exit the
daemon, which should allow the other processes to renegotiate access to
the Kafka partitions. It is also assumed that the notification engine
will be run by a process supervisor which will restart it in case of a
failure. In this way, any errors which are not easy to recover from are
automatically handled by the service restarting and the active daemon
switching to another instance.

Although this should cover all errors, there is a risk that an alarm or
a set of alarms is processed more than once and its notifications are
sent out multiple times. To minimize this risk a number of techniques
are used:

- Timeouts are implemented for all notification types.
- An alarm TTL is utilized. Any alarm older than the TTL is not
  processed.

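The TTL check can be sketched as a simple age comparison. This is an illustration under stated assumptions: the helper name ``is_expired`` is hypothetical, and both the alarm timestamp and the TTL are assumed to be expressed in seconds.

```python
import time


def is_expired(alarm_timestamp, ttl_seconds, now=None):
    """Return True if the alarm is older than the TTL and should be skipped."""
    now = time.time() if now is None else now
    return now - alarm_timestamp > ttl_seconds


# Hypothetical values: an alarm raised at 1556703552 with a four hour TTL.
raised_at = 1556703552
fresh = is_expired(raised_at, ttl_seconds=14400, now=raised_at + 100)
stale = is_expired(raised_at, ttl_seconds=14400, now=raised_at + 14401)
```

Skipping stale alarms bounds how far back a restarted engine can replay notifications, which limits the number of duplicates after a long outage.
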
Operation
=========

``oslo.config`` is used for handling configuration options. A sample
configuration file ``etc/monasca/notification.conf.sample`` can be
generated by running:

::

    tox -e genconfig

To run the service using the default config file location
of ``/etc/monasca/notification.conf``:

::

    monasca-notification

To run the service and explicitly specify the config file:

::

    monasca-notification --config-file /etc/monasca/monasca-notification.conf

Monitoring
----------

StatsD is incorporated into the daemon and will send all stats to the
StatsD server launched by monasca-agent. The default host and port point
to **localhost:8125**.

- Counters

  - ConsumedFromKafka
  - AlarmsFailedParse
  - AlarmsNoNotification
  - NotificationsCreated
  - NotificationsSentSMTP
  - NotificationsSentWebhook
  - NotificationsSentPagerduty
  - NotificationsSentFailed
  - NotificationsInvalidType
  - AlarmsFinished
  - PublishedToKafka

- Timers

  - ConfigDBTime
  - SendNotificationTime

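For orientation, the sketch below shows how a counter and a timer look in the plain StatsD line format (``<name>:<value>|c`` for counters, ``<name>:<value>|ms`` for timers) and how such lines could be sent over UDP. It is an assumption that this plain format is sufficient here; the StatsD client used by the daemon may add a prefix or extra fields that are not described in this document. The helper names are hypothetical.

```python
import socket


def counter_line(name, value=1):
    """Format a counter increment in the plain StatsD line format."""
    return f"{name}:{value}|c"


def timer_line(name, millis):
    """Format a timing sample, in milliseconds, in the plain StatsD line format."""
    return f"{name}:{millis}|ms"


def send_lines(lines, host="localhost", port=8125):
    """Send each line as its own UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


lines = [
    counter_line("ConsumedFromKafka"),
    timer_line("SendNotificationTime", 12),
]
```

Calling ``send_lines(lines)`` would emit both samples to the default address given above.
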
Plugins
-------

The following notification plugins are available:

- Email
- HipChat
- Jira
- PagerDuty
- Slack
- Webhook

The plugins can be configured via the Monasca Notification config file. In
general you will need to follow these steps to enable a plugin:

- Make sure that the plugin is enabled in the config file
- Make sure that the plugin is configured in the config file
- Restart the Monasca Notification service

PagerDuty plugin
~~~~~~~~~~~~~~~~

The PagerDuty plugin supports the PagerDuty v1 Events API. The first step
is to `configure`_ a service in PagerDuty which uses this API. Once
configured, the service will be assigned an integration key. This key should be
used as the ``ADDRESS`` field when creating the notification type, for example:

::

    monasca notification-create pd_notification pagerduty a30d5560c5ce4239a6f52a01a15850ca

The default settings for the plugin, including the v1 Events API URL, should
be sufficient to get started, but it is worth checking that the PagerDuty
Events v1 API URL matches that provided in the example Monasca Notification
config file.

Slack plugin
~~~~~~~~~~~~

To use the Slack plugin you must first configure an incoming `webhook`_
for the Slack channel you wish to post notifications to. The notification can
then be created as follows:

::

    monasca notification-create slack_notification slack https://hooks.slack.com/services/MY/SECRET/WEBHOOK/URL

Note that whilst it is also possible to use a token instead of a webhook,
this approach is now `deprecated`_.

By default the Slack notification will dump all available information into
the alert. For example, a notification may be posted to Slack which looks
like this:

::

    {
       "metrics":[
          {
             "dimensions":{
                "hostname":"operator"
             },
             "id":null,
             "name":"cpu.user_perc"
          }
       ],
       "alarm_id":"20a54a65-44b8-4ac9-a398-1f2d888827d2",
       "state":"ALARM",
       "alarm_timestamp":1556703552,
       "tenant_id":"62f7a7a314904aa3ab137d569d6b4fde",
       "old_state":"OK",
       "alarm_description":"Dummy alarm",
       "message":"Thresholds were exceeded for the sub-alarms: count(cpu.user_perc, deterministic) >= 1.0 with the values: [1.0]",
       "alarm_definition_id":"78ce7b53-f7e6-4b51-88d0-cb741e7dc906",
       "alarm_name":"dummy_alarm"
    }

The format of the above message can be customised with a Jinja template. All fields
from the raw Slack message are available in the template. For example, you may
configure the plugin as follows:

::

    [notification_types]
    enabled = slack

    [slack_notifier]
    message_template = /etc/monasca/slack_template.j2
    timeout = 10
    ca_certs = /etc/ssl/certs/ca-bundle.crt
    insecure = False

With the following contents of ``/etc/monasca/slack_template.j2``:

::

    {{ alarm_name }} has triggered on {% for item in metrics %}host {{ item.dimensions.hostname }}{% if not loop.last %}, {% endif %}{% endfor %}.

With this configuration, the raw Slack message above would be transformed
into:

::

    dummy_alarm has triggered on host operator.

Future Considerations
=====================

- More extensive load testing is needed:

  - How fast is the MySQL database, and how much load do we put on it?
    Initially it makes most sense to read notification details for each
    alarm, but eventually that information may need to be cached.
  - How expensive are commits to Kafka for every message we read? Should
    we commit every N messages?
  - How efficient is the default Kafka consumer batch size?
  - Currently we can get ~200 notifications per second per
    NotificationEngine instance using webhooks to a local HTTP server. Is
    that fast enough?
  - Are we putting too much load on Kafka at ~200 commits per second?

.. _webhook: https://api.slack.com/incoming-webhooks

.. _deprecated: https://api.slack.com/custom-integrations/legacy-tokens

.. _configure: https://support.pagerduty.com/docs/services-and-integrations#section-events-api-v1