Notification Engine for Monasca
Tomasz Trębski e1a9b9a96a Integrate with oslo.conf and oslo.log
Change upgrades the monasca-notification to leverage
the capabilities of both oslo.log and oslo.conf:

- configuration of logging separated from application settings
- ability to enforce data types for application settings
- ability to use oslo.config-generator capabilities
- automatic configuration parsing done by oslo.cfg

That change will bring it closer to the rest of monasca
components where such transition has happened already.
However, in the rest of monasca, oslo.cfg was partially
or fully implemented whereas monasca-notification has
been relying on YAML based configuration file.

Therefore backward compatibility for such format will
be kept for now.

Story: 2000959
Task: 4093
Task: 4092

Change-Id: Ia75c3b60d0fada854178f21ca5ccb9e6a880f37f
2017-10-20 09:32:11 +02:00
config-generator Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
etc/monasca Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
monasca_notification Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
tests Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
tools Remove monasca_notification_offsets 2017-01-13 09:35:56 +01:00
.coveragerc Migrate tests to ostestr 2017-01-19 06:15:40 +01:00
.gitignore Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
.gitreview Update .gitreview for new namespace 2015-10-17 22:30:54 +00:00
.stestr.conf Add .stestr.conf . 2017-09-22 14:31:45 +02:00
.testr.conf Migrate tests to ostestr 2017-01-19 06:15:40 +01:00
HACKING.rst Rename to monasca, setup for tox, removed legacy bits 2014-07-16 15:59:00 -06:00
LICENSE Added copyright header, LICENSE and HACKING.rst. 2014-05-01 12:27:06 -06:00
notification.yaml Merge "Added a field 'Grafana Url' in the email" 2017-09-01 09:19:03 +00:00
README.md Optimize the link address 2017-04-11 11:53:28 +05:30
requirements.txt Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
setup.cfg Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
setup.py Updated from global requirements 2017-03-02 11:47:07 +00:00
test-requirements.txt Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00
tox.ini Integrate with oslo.conf and oslo.log 2017-10-20 09:32:11 +02:00

Team and repository tags

Notification Engine

This engine reads alarms from Kafka and then notifies the customer using their configured notification method. Multiple notification and retry engines can run in parallel, up to one per available Kafka partition. ZooKeeper is used to negotiate access to the Kafka partitions whenever a process joins or leaves the working set.

Architecture

The notification engine generates notifications using the following steps:

  1. Read alarms from Kafka, with no auto commit. - KafkaConsumer class
  2. Determine the notification type for an alarm by reading from MySQL. - AlarmProcessor class
  3. Send Notification. - NotificationProcessor class
  4. Successful notifications are added to a sent notification topic. - NotificationEngine class
  5. Failed notifications are added to a retry topic. - NotificationEngine class
  6. Commit offset to Kafka - KafkaConsumer class
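
The six steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `InMemoryTopic`, `process_alarms`, `lookup` and `send` are hypothetical stand-ins written for this example, not the project's KafkaConsumer, AlarmProcessor or NotificationProcessor classes, and no real Kafka or MySQL connection is made.

```python
class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic: a log plus a committed offset."""

    def __init__(self, messages=()):
        self.log = list(messages)
        self.committed = 0  # offset of the next message to read after a restart

    def publish(self, message):
        self.log.append(message)


def process_alarms(alarm_topic, sent_topic, retry_topic, lookup, send):
    """Run steps 1-6 once for every alarm that has not been committed yet."""
    for offset in range(alarm_topic.committed, len(alarm_topic.log)):
        alarm = alarm_topic.log[offset]            # 1. read, no auto commit
        for notification in lookup(alarm):         # 2. determine notification type
            if send(notification):                 # 3. send
                sent_topic.publish(notification)   # 4. success: sent topic
            else:
                retry_topic.publish(notification)  # 5. failure: retry topic
        alarm_topic.committed = offset + 1         # 6. commit after processing


def lookup(alarm):
    # Stand-in for the database read; every alarm maps to one webhook.
    return [{"alarm_id": alarm["alarm_id"], "type": "webhook"}]


def send(notification):
    # Stand-in sender that fails for alarm "a2" only.
    return notification["alarm_id"] != "a2"


alarms = InMemoryTopic([{"alarm_id": "a1"}, {"alarm_id": "a2"}])
sent, retry = InMemoryTopic(), InMemoryTopic()
process_alarms(alarms, sent, retry, lookup, send)
```

After the call, `sent.log` holds the notification for `a1`, `retry.log` holds the one for `a2`, and `alarms.committed` is 2.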

The notification engine uses three Kafka topics:

  1. alarm_topic: Alarms inbound to the notification engine.
  2. notification_topic: Successfully sent notifications.
  3. notification_retry_topic: Unsuccessful notifications.

A retry engine runs in parallel with the notification engine and gives any failed notification a configurable number of extra chances at success.

The retry engine generates notifications using the following steps:

  1. Read notification JSON data from Kafka, with no auto commit. - KafkaConsumer class
  2. Rebuild the notification that failed. - RetryEngine class
  3. Send Notification. - NotificationProcessor class
  4. Successful notifications are added to a sent notification topic. - RetryEngine class
  5. Failed notifications that have not hit the retry limit are added back to the retry topic. - RetryEngine class
  6. Failed notifications that have hit the retry limit are discarded. - RetryEngine class
  7. Commit offset to Kafka - KafkaConsumer class
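
The retry decision in steps 2 to 6 can be sketched as a single function. This is a minimal illustration, not the RetryEngine class itself: the function `handle_retry`, the `retry_count` field and the `max_retries` parameter are names assumed for this example, and the sender is passed in as a plain callable.

```python
import json


def handle_retry(raw_message, send, max_retries):
    """Return (topic_name, notification) for the outcome, or None if discarded."""
    notification = json.loads(raw_message)             # rebuild from JSON
    if send(notification):                             # send
        return ("notification_topic", notification)    # success: sent topic
    # "retry_count" is a hypothetical field used to track failed attempts.
    notification["retry_count"] = notification.get("retry_count", 0) + 1
    if notification["retry_count"] < max_retries:
        return ("notification_retry_topic", notification)  # try again later
    return None                                        # limit reached: discard


def always_fail(notification):
    return False


first = handle_retry(json.dumps({"id": "n1"}), always_fail, max_retries=3)
last = handle_retry(json.dumps({"id": "n1", "retry_count": 2}), always_fail, max_retries=3)
ok = handle_retry(json.dumps({"id": "n1"}), lambda n: True, max_retries=3)
```

Here `first` goes back to the retry topic with `retry_count` set to 1, `last` is `None` because the limit of 3 has been reached, and `ok` goes to the sent notification topic.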

The retry engine uses two Kafka topics:

  1. notification_retry_topic: Notifications that need to be retried.
  2. notification_topic: Successfully sent notifications.

Fault Tolerance

Offsets are not committed when an alarm is read from the alarm topic; they are committed only after processing. This allows processing to continue even though some notifications can be slow. In the event of a catastrophic failure, some notifications may have been sent while their alarms have not yet been acknowledged, so those notifications are sent again after a restart. This is an acceptable failure mode: it is better to send a notification twice than not at all.
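
The duplicate-on-failure behaviour described above can be illustrated with a small simulation. It is a sketch only: `consume` and its `crash_after` parameter are hypothetical and exist just to model a process that dies after sending a notification but before committing the offset.

```python
def consume(log, committed, send, crash_after=None):
    """Process entries from the committed offset and return the new offset.

    crash_after is a hypothetical knob: the process "dies" right after
    sending that many notifications, before committing the last one.
    """
    sent_count = 0
    for offset in range(committed, len(log)):
        send(log[offset])
        sent_count += 1
        if crash_after is not None and sent_count == crash_after:
            return committed        # crash: the send happened, the commit did not
        committed = offset + 1      # commit only after the send completed
    return committed


delivered = []
log = ["alarm-1", "alarm-2", "alarm-3"]
offset = consume(log, 0, delivered.append, crash_after=2)  # dies on alarm-2
offset = consume(log, offset, delivered.append)            # restart and resume
```

After the restart `delivered` contains `alarm-2` twice, and no alarm is lost.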

When a major error is encountered, the general approach is to exit the daemon, which allows the other processes to renegotiate access to the Kafka partitions. It is also assumed that the notification engine is run by a process supervisor that restarts it after a failure. This way, any error that is not easy to recover from is handled automatically: the service restarts and the active daemon switches to another instance.

Though this should cover all errors, there is a risk that an alarm or set of alarms is processed and its notifications sent out multiple times. To minimize this risk, a number of techniques are used:

  • Timeouts are implemented with all notification types.
  • An alarm TTL is utilized. Any alarm older than the TTL is not processed.
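
The TTL check can be sketched as follows. The function name `is_expired`, its parameters, and the use of Unix timestamps in seconds are assumptions made for this example; the project's actual alarm fields and configuration option are not shown here.

```python
import time


def is_expired(alarm_timestamp, ttl_seconds, now=None):
    """Return True when the alarm is older than the TTL and should be skipped."""
    now = time.time() if now is None else now
    return now - alarm_timestamp > ttl_seconds


# An alarm raised at t=1000 with a 60 second TTL:
fresh = is_expired(1000, 60, now=1030)   # 30 seconds old
stale = is_expired(1000, 60, now=1100)   # 100 seconds old
```

With these inputs `fresh` is `False` and `stale` is `True`, so only the first alarm would be processed. This bounds how long a replayed alarm can keep producing duplicate notifications.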

Operation

By default the YAML config file is '/etc/monasca/notification.yaml'; a sample is included in this project.

Monitoring

statsd is incorporated into the daemon, which sends all stats to the statsd server launched by monasca-agent. The default host and port point to localhost:8125.

  • Counters
    • ConsumedFromKafka
    • AlarmsFailedParse
    • AlarmsNoNotification
    • NotificationsCreated
    • NotificationsSentSMTP
    • NotificationsSentWebhook
    • NotificationsSentPagerduty
    • NotificationsSentFailed
    • NotificationsInvalidType
    • AlarmsFinished
    • PublishedToKafka
  • Timers
    • ConfigDBTime
    • SendNotificationTime
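
For reference, counters and timers like the ones above travel to a statsd server as small UDP datagrams in the plain statsd format `name:value|type`. The sketch below only shows that wire format. The helper names are hypothetical, and the daemon's own client library and any metric name prefix or extra fields it adds are not represented here.

```python
import socket


def format_counter(name, value=1):
    """Build a statsd counter datagram, e.g. b"ConsumedFromKafka:1|c"."""
    return "{}:{}|c".format(name, value).encode("ascii")


def format_timer(name, milliseconds):
    """Build a statsd timer datagram, with the value in milliseconds."""
    return "{}:{}|ms".format(name, milliseconds).encode("ascii")


def send_stat(payload, host="localhost", port=8125):
    """Send one datagram to the statsd server (fire and forget over UDP)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()


counter_packet = format_counter("ConsumedFromKafka")
timer_packet = format_timer("SendNotificationTime", 12)
```

Calling `send_stat(counter_packet)` would deliver the counter to the default localhost:8125 address mentioned above.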

Future Considerations

  • More extensive load testing is needed
    • How fast is the MySQL database, and how much load do we put on it? Initially I think it makes most sense to read notification details for each alarm, but eventually I may want to cache that info.
    • How expensive are commits to Kafka for every message we read? Should we commit every N messages?
    • How efficient is the default Kafka consumer batch size?
    • Currently we can get ~200 notifications per second per NotificationEngine instance using webhooks to a local http server. Is that fast enough?
    • Are we putting too much load on Kafka at ~200 commits per second?

License

Copyright (c) 2014 Hewlett-Packard Development Company, L.P.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.