Team and repository tags
========================

|Team and repository tags|

.. raw:: html

   <!-- Change things from this point on -->

Notification Engine
===================

This engine reads alarms from Kafka and then notifies the customer using
the configured notification method. Multiple notification and retry
engines can run in parallel, up to one per available Kafka partition.
Zookeeper is used to negotiate access to the Kafka partitions whenever a
new process joins or leaves the working set.

Architecture
============

The notification engine generates notifications using the following
steps:

1. Read alarms from Kafka, with no auto commit. -
   monasca\_common.kafka.KafkaConsumer class
2. Determine the notification type for an alarm by reading from MySQL. - AlarmProcessor class
3. Send the notification. - NotificationProcessor class
4. Add successful notifications to a sent notification topic. - NotificationEngine class
5. Add failed notifications to a retry topic. - NotificationEngine class
6. Commit the offset to Kafka. - KafkaConsumer class

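The engine steps above can be sketched as a small consume, process, and
commit loop. This is only an illustration with in-memory stand-ins, not the
project's code: ``Message``, ``InMemoryConsumer``, ``run_engine``, and the
``lookup_methods`` and ``send`` callables are hypothetical names, and the
topics are plain Python lists rather than Kafka topics.

```python
from collections import namedtuple

# Hypothetical stand-ins; none of these names come from monasca-notification.
Message = namedtuple("Message", ["offset", "alarm"])


class InMemoryConsumer:
    """Toy consumer that yields messages and only records explicit commits."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.committed_offset = None  # nothing is committed automatically

    def __iter__(self):
        return iter(self._messages)

    def commit(self, offset):
        self.committed_offset = offset


def run_engine(consumer, lookup_methods, send, topics):
    """Process each alarm, route every result, then commit the offset."""
    for message in consumer:                                 # step 1
        for method in lookup_methods(message.alarm):         # step 2
            ok = send(method, message.alarm)                 # step 3
            target = "notification_topic" if ok else "notification_retry_topic"
            topics[target].append((method, message.alarm))   # steps 4 and 5
        consumer.commit(message.offset)                      # step 6


consumer = InMemoryConsumer([Message(0, "cpu_high"), Message(1, "disk_full")])
topics = {"notification_topic": [], "notification_retry_topic": []}
run_engine(
    consumer,
    lookup_methods=lambda alarm: ["email", "webhook"],
    send=lambda method, alarm: method == "email",  # pretend webhooks fail
    topics=topics,
)
```

Because the commit happens only after every notification for an alarm has
been routed, a crash in the middle of the loop causes the alarm to be read
again rather than lost.
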
The notification engine uses three Kafka topics:

1. alarm\_topic: Alarms inbound to the notification engine.
2. notification\_topic: Successfully sent notifications.
3. notification\_retry\_topic: Failed notifications.

A retry engine runs in parallel with the notification engine and gives
any failed notification a configurable number of extra chances at
success.

The retry engine generates notifications using the following steps:

1. Read notification JSON data from Kafka, with no auto commit. - KafkaConsumer class
2. Rebuild the notification that failed. - RetryEngine class
3. Send the notification. - NotificationProcessor class
4. Add successful notifications to a sent notification topic. - RetryEngine class
5. Add failed notifications that have not hit the retry limit back to the retry topic. -
   RetryEngine class
6. Discard failed notifications that have hit the retry limit. - RetryEngine class
7. Commit the offset to Kafka. - KafkaConsumer class

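The retry decision in the steps above can be illustrated as follows. This is
a minimal sketch under stated assumptions: the ``retry_count`` field, the
``handle_retry`` function, and the ``max_retries`` parameter are hypothetical
and are not taken from the RetryEngine class, and the topics are again plain
lists.

```python
import json


def handle_retry(raw, send, max_retries, topics):
    """Rebuild a failed notification, resend it, and route the outcome."""
    notification = json.loads(raw)                          # steps 1 and 2
    if send(notification):                                  # step 3
        topics["notification_topic"].append(notification)   # step 4
        return "sent"
    # Assumed bookkeeping: count attempts in a hypothetical retry_count field.
    notification["retry_count"] = notification.get("retry_count", 0) + 1
    if notification["retry_count"] < max_retries:
        topics["notification_retry_topic"].append(json.dumps(notification))  # step 5
        return "requeued"
    return "discarded"                                      # step 6


topics = {"notification_topic": [], "notification_retry_topic": []}
first = handle_retry('{"type": "webhook"}', lambda n: False, 2, topics)
second = handle_retry(topics["notification_retry_topic"][0], lambda n: False, 2, topics)
```

In this example the first failure puts the notification back on the retry
topic and the second failure discards it, because the assumed limit of two
attempts has been reached.
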
The retry engine uses two Kafka topics:

1. notification\_retry\_topic: Notifications that need to be retried.
2. notification\_topic: Successfully sent notifications.

Fault Tolerance
---------------

When reading from the alarm topic, offsets are not committed on read. They
are committed only after processing. This allows the processing to continue
even though some notifications can be slow. In the event of a
catastrophic failure, some notifications may already have been sent while
the corresponding alarms have not yet been acknowledged. This is an
acceptable failure mode: it is better to send a notification twice than
not at all.

The general process when a major error is encountered is to exit the
daemon, which should allow the other processes to renegotiate access to
the Kafka partitions. It is also assumed that the notification engine
will be run by a process supervisor that restarts it in case of a
failure. In this way, any errors which are not easy to recover from are
automatically handled by the service restarting and the active daemon
switching to another instance.

Though this should cover all errors, there is the risk that an alarm or
a set of alarms is processed and its notifications are sent out multiple
times. To minimize this risk, a number of techniques are used:

- Timeouts are implemented for all notification types.
- An alarm TTL is utilized. Any alarm older than the TTL is not
  processed.

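The alarm TTL check could look roughly like the sketch below. It assumes the
alarm carries a creation timestamp in seconds since the epoch;
``is_expired``, ``alarm_timestamp``, and ``ttl_seconds`` are hypothetical
names chosen for illustration, not the project's own.

```python
import time


def is_expired(alarm_timestamp, ttl_seconds, now=None):
    """Return True when the alarm is older than the TTL and should be skipped."""
    now = time.time() if now is None else now
    return now - alarm_timestamp > ttl_seconds


# An alarm created 100 seconds ago with a 60 second TTL is skipped,
# while one created 50 seconds ago is still processed.
old_alarm_skipped = is_expired(alarm_timestamp=100, ttl_seconds=60, now=200)
fresh_alarm_skipped = is_expired(alarm_timestamp=150, ttl_seconds=60, now=200)
```

Skipping expired alarms bounds how much stale work is replayed after a
restart, which limits the number of duplicate notifications.
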
Operation
=========

``oslo.config`` is used for handling configuration options. A sample
configuration file ``etc/monasca/notification.conf.sample`` can be
generated by running:

::

    tox -e genconfig

Monitoring
----------

StatsD is incorporated into the daemon and will send all stats to the
StatsD server launched by monasca-agent. The default host and port point to
**localhost:8125**.

- Counters

  - ConsumedFromKafka
  - AlarmsFailedParse
  - AlarmsNoNotification
  - NotificationsCreated
  - NotificationsSentSMTP
  - NotificationsSentWebhook
  - NotificationsSentPagerduty
  - NotificationsSentFailed
  - NotificationsInvalidType
  - AlarmsFinished
  - PublishedToKafka

- Timers

  - ConfigDBTime
  - SendNotificationTime

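For orientation, the sketch below shows what a counter and a timer look like
in the plain StatsD line format when sent over UDP. It is an assumption-laden
illustration: the daemon's actual StatsD client may add a prefix or
dimensions to metric names, and the helper functions here are hypothetical.

```python
import socket


def counter_datagram(name, value=1):
    """Format a counter increment in the plain StatsD line format."""
    return "{}:{}|c".format(name, value).encode("ascii")


def timer_datagram(name, milliseconds):
    """Format a timing sample, in milliseconds, in the StatsD line format."""
    return "{}:{}|ms".format(name, milliseconds).encode("ascii")


def send_stat(datagram, host="localhost", port=8125):
    """Send one datagram over UDP; delivery is not acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, (host, port))


consumed = counter_datagram("ConsumedFromKafka")
db_time = timer_datagram("ConfigDBTime", 12)
```

Calling ``send_stat(consumed)`` would emit the counter to the default
address. Because StatsD uses UDP, metrics are dropped silently when no
server is listening.
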
Future Considerations
=====================

- More extensive load testing is needed:

  - How fast is the MySQL database, and how much load do we put on it?
    Initially I think it makes most sense to read notification details for
    each alarm, but eventually I may want to cache that info.
  - How expensive are commits to Kafka for every message we read? Should
    we commit every N messages?
  - How efficient is the default Kafka consumer batch size?
  - Currently we can get ~200 notifications per second per
    NotificationEngine instance using webhooks to a local HTTP server. Is
    that fast enough?
  - Are we putting too much load on Kafka at ~200 commits per second?

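The "commit every N messages" question above can be made concrete with a
small sketch. It is not how the engine currently behaves, and
``consume_with_batched_commits``, ``commit``, and ``batch_size`` are
hypothetical names used only for illustration.

```python
def consume_with_batched_commits(offsets, commit, batch_size):
    """Commit the latest processed offset once per batch_size messages."""
    pending = 0
    last = None
    for offset in offsets:
        # Processing of the message at this offset would happen here.
        last = offset
        pending += 1
        if pending >= batch_size:
            commit(last)
            pending = 0
    if pending:
        commit(last)  # flush the final partial batch


commits = []
consume_with_batched_commits(range(7), commits.append, batch_size=3)
```

Seven messages with a batch size of three produce three commits instead of
seven. The trade-off is that a crash can replay up to one full batch of
already processed messages, which means more duplicate notifications.
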
.. |Team and repository tags| image:: https://governance.openstack.org/tc/badges/monasca-notification.svg
   :target: https://governance.openstack.org/tc/reference/tags/index.html