Notification Engine for Monasca

Doug Szumski 39a906b8fb Templates for Slack notifications This change adds an optional, user configurable template which may be used to format the text contained in Slack notifications. Story: 2001308 Task: 5859 Change-Id: Id936c3dc8b4f3e2430de20c8b69d0e703b1cf9ef		2019-05-02 09:49:42 +01:00

Notification Engine

This engine reads alarms from Kafka and then notifies the customer using the configured notification method. Multiple notification and retry engines can run in parallel, up to one per available Kafka partition. ZooKeeper is used to negotiate access to the Kafka partitions whenever a process joins or leaves the working set.

Architecture

The notification engine generates notifications using the following steps:

Read alarms from Kafka, with no auto commit. - monasca_common.kafka.KafkaConsumer class
Determine the notification type for an alarm by reading from MySQL. - AlarmProcessor class
Send notification. - NotificationProcessor class
Add successful notifications to a sent notification topic. - NotificationEngine class
Add failed notifications to a retry topic. - NotificationEngine class
Commit offset to Kafka. - KafkaConsumer class
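
The ordering of these steps, in particular committing only after an alarm has been handled, can be sketched in a few lines of Python. This is a minimal illustration and not the actual implementation: `FakeConsumer`, `run_engine`, `lookup_methods` and `send` are hypothetical names, and plain lists stand in for the Kafka topics.

```python
# Illustrative sketch only: every name here is a hypothetical stand-in,
# not part of monasca_notification or monasca_common.


class FakeConsumer:
    """In-memory stand-in for a Kafka consumer with manual offset commits."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.committed = 0  # number of messages acknowledged so far

    def __iter__(self):
        # Resume from the last committed position, as a restart would.
        return iter(self._messages[self.committed:])

    def commit(self):
        self.committed += 1


def run_engine(consumer, lookup_methods, send, sent_topic, retry_topic):
    for alarm in consumer:
        # Determine which notification methods apply to this alarm
        # (the real engine reads this from MySQL).
        for method in lookup_methods(alarm):
            notification = {"alarm_id": alarm["alarm_id"], "method": method}
            # Send, then route the notification by outcome.
            if send(notification):
                sent_topic.append(notification)
            else:
                retry_topic.append(notification)
        # Commit only once the alarm has been fully handled.
        consumer.commit()
```

Because the commit happens last, a crash between sending and committing leaves the alarm uncommitted, so it is read again on restart.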

The notification engine uses three Kafka topics:

alarm_topic: Alarms inbound to the notification engine.
notification_topic: Successfully sent notifications.
notification_retry_topic: Failed notifications.

A retry engine runs in parallel with the notification engine and gives any failed notification a configurable number of extra chances at success.

The retry engine generates notifications using the following steps:

Read notification JSON data from Kafka, with no auto commit. - KafkaConsumer class
Rebuild the notification that failed. - RetryEngine class
Send notification. - NotificationProcessor class
Add successful notifications to a sent notification topic. - RetryEngine class
Add failed notifications that have not hit the retry limit back to the retry topic. - RetryEngine class
Discard failed notifications that have hit the retry limit. - RetryEngine class
Commit offset to Kafka. - KafkaConsumer class
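
The requeue-or-discard decision in these steps might look roughly like the sketch below. The `retry_count` field, the `max_attempts` parameter and the `handle_retry` function are assumptions made for illustration; the real message format and option names may differ.

```python
# Illustrative sketch only: field and parameter names are assumptions.


def handle_retry(notification, send, sent_topic, retry_topic, max_attempts=5):
    """Retry one failed notification.

    Returns 'sent', 'requeued' or 'discarded' to describe the outcome.
    """
    # Work on a copy and record this attempt.
    notification = dict(notification)
    notification["retry_count"] = notification.get("retry_count", 0) + 1

    if send(notification):
        sent_topic.append(notification)
        return "sent"
    if notification["retry_count"] < max_attempts:
        # Still under the limit: put it back on the retry topic.
        retry_topic.append(notification)
        return "requeued"
    # Retry limit reached: drop the notification.
    return "discarded"
```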

The retry engine uses two Kafka topics:

notification_retry_topic: Notifications that need to be retried.
notification_topic: Successfully sent notifications.

Fault Tolerance

Offsets are not committed when an alarm is read from the alarm topic; they are committed only after the alarm has been processed. This allows processing to continue even when some notifications are slow. In the event of a catastrophic failure, some notifications may have been sent for alarms that have not yet been acknowledged, so those alarms will be processed again after a restart. This is an acceptable failure mode: it is better to send a notification twice than not at all.

The general process when a major error is encountered is to exit the daemon, which should allow the other processes to renegotiate access to the Kafka partitions. It is also assumed that the notification engine is run by a process supervisor that restarts it after a failure. In this way, any errors that are not easy to recover from are handled automatically: the service restarts and the active daemon switches to another instance.

Although this approach should cover all errors, it carries the risk that an alarm or a set of alarms is processed more than once and the notifications are sent out multiple times. To minimize this risk, a number of techniques are used:

Timeouts are implemented for all notification types.
An alarm TTL is utilized. Any alarm older than the TTL is not processed.
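
As a rough illustration of the TTL check, the sketch below skips any alarm whose timestamp is older than a configured age. The function name `is_expired` and the `ttl_seconds` parameter are hypothetical, and the sketch assumes `alarm_timestamp` is a Unix time in seconds, as it appears in the example payload in the Slack plugin section.

```python
# Illustrative sketch only: names and the timestamp unit are assumptions.
import time


def is_expired(alarm, ttl_seconds, now=None):
    """Return True if the alarm is older than the TTL and should be skipped."""
    now = time.time() if now is None else now
    return now - alarm["alarm_timestamp"] > ttl_seconds


alarm = {"alarm_id": "20a54a65-44b8-4ac9-a398-1f2d888827d2",
         "alarm_timestamp": 1556703552}
if is_expired(alarm, ttl_seconds=14400):
    print("skipping stale alarm", alarm["alarm_id"])
```

Dropping stale alarms bounds how far back a restarted engine can replay, which limits duplicate notifications after a long outage.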

Operation

oslo.config is used for handling configuration options. A sample configuration file etc/monasca/notification.conf.sample can be generated by running:

tox -e genconfig

To run the service using the default config file location of `/etc/monasca/notification.conf`:

monasca-notification

To run the service and explicitly specify the config file:

monasca-notification --config-file /etc/monasca/monasca-notification.conf

Monitoring

StatsD is incorporated into the daemon and sends all stats to the StatsD server launched by monasca-agent. The default host and port point to localhost:8125.

Counters
- ConsumedFromKafka
- AlarmsFailedParse
- AlarmsNoNotification
- NotificationsCreated
- NotificationsSentSMTP
- NotificationsSentWebhook
- NotificationsSentPagerduty
- NotificationsSentFailed
- NotificationsInvalidType
- AlarmsFinished
- PublishedToKafka

Timers
- ConfigDBTime
- SendNotificationTime
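
For orientation, the sketch below shows what a counter and a timer look like in the plain StatsD line protocol (`name:value|c` and `name:value|ms`) when sent over UDP to the default address. It is not the daemon's own client code, which may add a name prefix or dimensions; the helper names are hypothetical.

```python
# Illustrative sketch only: plain StatsD wire format, hypothetical helpers.
import socket


def format_counter(name, value=1):
    """Encode a counter increment in StatsD line format."""
    return f"{name}:{value}|c".encode("ascii")


def format_timer(name, millis):
    """Encode a timing sample (milliseconds) in StatsD line format."""
    return f"{name}:{millis}|ms".encode("ascii")


def send_stat(payload, host="localhost", port=8125):
    """Fire-and-forget a single stat over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling `send_stat(format_counter("ConsumedFromKafka"))` emits one increment. Since UDP is connectionless, the call does not wait for a reply from a StatsD server.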

Plugins

The following notification plugins are available:

Email
HipChat
Jira
Pagerduty
Slack
Webhook

The plugins can be configured via the Monasca Notification config file. In general you will need to follow these steps to enable a plugin:

Make sure that the plugin is enabled in the config file
Make sure that the plugin is configured in the config file
Restart the Monasca Notification service

Slack plugin

To use the Slack plugin you must first configure an incoming webhook for the Slack channel you wish to post notifications to. The notification can then be created as follows:

monasca notification-create slack_notification slack https://hooks.slack.com/services/MY/SECRET/WEBHOOK/URL

Note that whilst it is also possible to use a token instead of a webhook, this approach is now deprecated.
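
For context, a Slack incoming webhook accepts an HTTP POST with a JSON body that contains a `text` field. The sketch below builds such a request with the Python standard library. It is not the plugin's code: `build_slack_request` and `post_to_slack` are hypothetical helpers, and the URL is the placeholder used above.

```python
# Illustrative sketch only: hypothetical helpers, not the Slack plugin itself.
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build a POST request carrying a JSON payload for an incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(webhook_url, text, timeout=10):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Passing an explicit `timeout` keeps a slow or unreachable endpoint from blocking the caller indefinitely.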

By default the Slack notification will dump all available information into the alert. For example, a notification may be posted to Slack which looks like this:

{
  "metrics":[
     {
        "dimensions":{
           "hostname":"operator"
        },
        "id":null,
        "name":"cpu.user_perc"
     }
  ],
  "alarm_id":"20a54a65-44b8-4ac9-a398-1f2d888827d2",
  "state":"ALARM",
  "alarm_timestamp":1556703552,
  "tenant_id":"62f7a7a314904aa3ab137d569d6b4fde",
  "old_state":"OK",
  "alarm_description":"Dummy alarm",
  "message":"Thresholds were exceeded for the sub-alarms: count(cpu.user_perc, deterministic) >= 1.0 with the values: [1.0]",
  "alarm_definition_id":"78ce7b53-f7e6-4b51-88d0-cb741e7dc906",
  "alarm_name":"dummy_alarm"
}

The format of the above message can be customised with a Jinja template. All fields from the raw Slack message are available in the template. For example, you may configure the plugin as follows:

[notification_types]
enabled = slack

[slack_notifier]
message_template = /etc/monasca/slack_template.j2
timeout = 10
ca_certs = /etc/ssl/certs/ca-bundle.crt
insecure = False

With the following contents of `/etc/monasca/slack_template.j2`:

{{ alarm_name }} has triggered on {% for item in metrics %}host {{ item.dimensions.hostname }}{% if not loop.last %}, {% endif %}{% endfor %}.

With this configuration, the raw Slack message above would be transformed into:

dummy_alarm has triggered on host operator.

Future Considerations

More extensive load testing is needed:
- How fast is the MySQL database, and how much load do we put on it? Initially I think it makes most sense to read notification details for each alarm, but eventually I may want to cache that information.
- How expensive are commits to Kafka for every message we read? Should we commit every N messages?
- How efficient is the default Kafka consumer batch size?
- Currently we can get ~200 notifications per second per NotificationEngine instance using webhooks to a local HTTP server. Is that fast enough?
- Are we putting too much load on Kafka at ~200 commits per second?