Updated documentation
Change-Id: I0ce68df229fc2a7ebfed5abf7f540c42aba1ca46
This commit is contained in:
parent
e5ede47918
commit
f6f29800f8
73
README.md
73
README.md
@ -1,32 +1,38 @@
|
|||||||
# Notification Engine
|
# Notification Engine
|
||||||
|
|
||||||
This engine reads alarms from Kafka and then notifies the customer using their configured notification method.
|
This engine reads alarms from Kafka and then notifies the customer using their configured notification method.
|
||||||
|
Multiple notification and retry engines can run in parallel up to one per available Kafka partition. Zookeeper
|
||||||
|
is used to negotiate access to the Kafka partitions whenever a new process joins or leaves the working set.
|
||||||
|
|
||||||
# Architecture
|
# Architecture
|
||||||
There are four processing steps separated by queues implemented with python multiprocessing. The steps are:
|
The notification engine generates notifications using the following steps:
|
||||||
|
|
||||||
1. Reads Alarms from Kafka, with no auto commit. - KafkaConsumer class
|
1. Reads Alarms from Kafka, with no auto commit. - KafkaConsumer class
|
||||||
2. Determine notification type for an alarm. Done by reading from mysql. - AlarmProcessor class
|
2. Determine notification type for an alarm. Done by reading from mysql. - AlarmProcessor class
|
||||||
3. Send Notification. - NotificationProcessor class
|
3. Send Notification. - NotificationProcessor class
|
||||||
4. Add sent notifications to Kafka on the notification topic. - SentNotificationProcessor class
|
4. Successful notifications are added to a sent notification topic. - NotificationEngine class
|
||||||
|
5. Failed notifications are added to a retry topic. - NotificationEngine class
|
||||||
|
6. Commit offset to Kafka - KafkaConsumer class
|
||||||
|
|
||||||
There is also a special processing step, the KafkaStateTracker, that runs in the main thread and keeps track of the
|
The notification engine uses three Kafka topics:
|
||||||
last committed message and ones available for commit, it then periodically commits all progress. This handles the
|
1. alarm_topic: Alarms inbound to the notification engine.
|
||||||
situation where alarms that are not acted on are quickly ready for commit but others which are prior to them in the
|
2. notification_topic: Successfully sent notifications.
|
||||||
kafka order are still in progress. Locking is also handled by this class using zookeeper.
|
3. notification_retry_topic: Unsuccessful notifications.
|
||||||
|
|
||||||
There are 4 internal queues:
|
A retry engine runs in parallel with the notification engine and gives any
|
||||||
|
failed notification a configurable number of extra chances at succeess.
|
||||||
|
|
||||||
1. alarms - kafka alarms are added to this queue.
|
The retry engine generates notifications using the following steps:
|
||||||
2. notifications - notifications to be sent, grouped by source alarm are added to this queue.
|
1. Reads Notification json data from Kafka, with no auto commit. - KafkaConsumer class
|
||||||
Consists of a list of Notification objects.
|
2. Rebuild the notification that failed. - RetryEngine class
|
||||||
3. sent_notifications - notifications that have been sent are added here. Consists of Notification objects.
|
3. Send Notification. - NotificationProcessor class
|
||||||
4. finished - alarms that are done with processing, either the notification is sent or there was none.
|
4. Successful notifictions are added to a sent notification topic. - RetryEngine class
|
||||||
|
5. Failed notifications that have not hit the retry limit are added back to the retry topic. - RetryEngine class
|
||||||
|
6. Failed notifications that have hit the retry limit are discarded. - RetryEngine class
|
||||||
|
6. Commit offset to Kafka - KafkaConsumer class
|
||||||
|
|
||||||
## High Availability
|
The retry engine uses two Kafka topics:
|
||||||
HA is handled by running multiple notification engines. Only one at a time is active if it dies another can take
|
1. notification_retry_topic: Notifications that need to be retried.
|
||||||
over and continue from where it left. A zookeeper lock file is used to ensure only one running daemon. If needed
|
2. notification_topic: Successfully sent notifications.
|
||||||
the code can be modified to use kafka partitions to have multiple active engines working on different alarms.
|
|
||||||
|
|
||||||
## Fault Tolerance
|
## Fault Tolerance
|
||||||
When reading from the alarm topic no committing is done. The committing is done only after processing. This allows
|
When reading from the alarm topic no committing is done. The committing is done only after processing. This allows
|
||||||
@ -34,53 +40,48 @@ the processing to continue even though some notifications can be slow. In the ev
|
|||||||
notifications could be sent but the alarms not yet acknowledged. This is an acceptable failure mode, better to send a
|
notifications could be sent but the alarms not yet acknowledged. This is an acceptable failure mode, better to send a
|
||||||
notification twice than not at all.
|
notification twice than not at all.
|
||||||
|
|
||||||
The general process when a major error is encountered is to exit the daemon which should allow another daemon to take
|
The general process when a major error is encountered is to exit the daemon which should allow the other processes to
|
||||||
over according to the HA strategy. It is also assumed the notification engine will be run by a process supervisor which
|
renegotiate access to the Kafka partitions. It is also assumed the notification engine will be run by a process
|
||||||
will restart it in case of a failure. This way any errors which are not easy to recover from are automatically handled
|
supervisor which will restart it in case of a failure. This way any errors which are not easy to recover from are
|
||||||
by the service restarting and the active daemon switching to another instance.
|
automatically handled by the service restarting and the active daemon switching to another instance.
|
||||||
|
|
||||||
Though this should cover all errors there is risk that an alarm or set of alarms can be processed and notifications
|
Though this should cover all errors there is risk that an alarm or set of alarms can be processed and notifications
|
||||||
sent out multiple times. To minimize this risk a number of techniques are used:
|
sent out multiple times. To minimize this risk a number of techniques are used:
|
||||||
|
|
||||||
- Timeouts are implemented with all notification types.
|
- Timeouts are implemented with all notification types.
|
||||||
- On shutdown uncommitted work is finished up.
|
|
||||||
- An alarm TTL is utilized. Any alarm older than the TTL is not processed.
|
- An alarm TTL is utilized. Any alarm older than the TTL is not processed.
|
||||||
- A maximum offset lag time is set. The offset is normally only updated if there is a continuous chain of finished
|
|
||||||
alarms. If there is a new offset that arrives yet still a gap it is normally held in reserve. If the maximum lag
|
|
||||||
time has been set and exceeded when a new finished alarm comes in the offset is updated regardless of gaps.
|
|
||||||
|
|
||||||
# Operation
|
# Operation
|
||||||
Yaml config file by default is in '/etc/monasca/notification.yaml', a sample is in this project.
|
Yaml config file by default is in '/etc/monasca/notification.yaml', a sample is in this project.
|
||||||
|
|
||||||
## Monitoring
|
## Monitoring
|
||||||
statsd is incorporated into the daemon and will send all stats to localhost on udp port 8125. In many cases the stats
|
statsd is incorporated into the daemon and will send all stats to localhost on udp port 8125.
|
||||||
are gathered per thread, the thread number is indicated by a -# at the end of the name.
|
|
||||||
|
|
||||||
- Counters
|
- Counters
|
||||||
- ConsumedFromKafka
|
- ConsumedFromKafka
|
||||||
- AlarmsFailedParse
|
- AlarmsFailedParse
|
||||||
- AlarmsFinished
|
|
||||||
- AlarmsNoNotification
|
- AlarmsNoNotification
|
||||||
- AlarmsOffsetUpdated
|
|
||||||
- NotificationsCreated
|
- NotificationsCreated
|
||||||
- NotificationsSentSMTP
|
- NotificationsSentSMTP
|
||||||
|
- NotificationsSentWebhook
|
||||||
|
- NotificationsSentPagerduty
|
||||||
- NotificationsSentFailed
|
- NotificationsSentFailed
|
||||||
- NotificationsInvalidType
|
- NotificationsInvalidType
|
||||||
|
- AlarmsFinished
|
||||||
- PublishedToKafka
|
- PublishedToKafka
|
||||||
- Timers
|
- Timers
|
||||||
- ConfigDBTime
|
- ConfigDBTime
|
||||||
- SMTPTime
|
- SendNotificationTime
|
||||||
- OffsetCommitTime
|
|
||||||
|
|
||||||
# Future Considerations
|
# Future Considerations
|
||||||
- Currently I lock the topic rather than the partitions. This effectively means there is only one active notification
|
|
||||||
engine at any given time. In the future to share the load among multiple daemons we could lock by partition.
|
|
||||||
- More extensive load testing is needed
|
- More extensive load testing is needed
|
||||||
- How fast is the mysql db? How much load do we put on it. Initially I think it makes most sense to read notification
|
- How fast is the mysql db? How much load do we put on it. Initially I think it makes most sense to read notification
|
||||||
details for each alarm but eventually I may want to cache that info.
|
details for each alarm but eventually I may want to cache that info.
|
||||||
- I am starting with a single KafkaConsumer and a single SentNotificationProcessor depending on load this may need
|
- How expensive are commits to Kafka for every message we read? Should we commit every N messages?
|
||||||
to scale.
|
- How efficient is the default Kafka consumer batch size?
|
||||||
- How fast is the state tracker? Do I need to scale or speed that up at all?
|
- Currently we can get ~200 notifications per second per NotificationEngine instance using webhooks to a local
|
||||||
|
http server. Is that fast enough?
|
||||||
|
- Are we putting too much load on Kafka at ~200 commits per second?
|
||||||
|
|
||||||
# License
|
# License
|
||||||
|
|
||||||
|
Loading…
Reference in New Issue
Block a user