# Notification Engine

This engine reads alarms from Kafka and then notifies the customer using their configured notification method.
Multiple notification and retry engines can run in parallel, up to one per available Kafka partition. Zookeeper
is used to negotiate access to the Kafka partitions whenever a process joins or leaves the working set.

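This document does not name the Zookeeper client or the lock layout, so the following is only a minimal sketch of
per-partition negotiation, assuming the kazoo library and a hypothetical `/notification/partitions` lock path:

```python
from kazoo.client import KazooClient

# Hypothetical sketch: claim the first unclaimed Kafka partition via a
# Zookeeper lock. The lock path and partition count are illustrative.
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

my_partition = None
for partition in range(4):  # assume the alarm topic has 4 partitions
    lock = zk.Lock('/notification/partitions/%d' % partition)
    if lock.acquire(blocking=False):  # non-blocking: skip claimed partitions
        my_partition = partition
        break

if my_partition is None:
    raise SystemExit('all partitions are already claimed by other engines')
```
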
# Architecture

The notification engine generates notifications using the following steps (a sketch of the loop follows the list):

1. Read alarms from Kafka, with no auto commit. - KafkaConsumer class
2. Determine the notification type for an alarm by reading from mysql. - AlarmProcessor class
3. Send the notification. - NotificationProcessor class
4. Add successful notifications to a sent notification topic. - NotificationEngine class
5. Add failed notifications to a retry topic. - NotificationEngine class
6. Commit the offset to Kafka. - KafkaConsumer class
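
A minimal sketch of that loop, assuming hypothetical method names (`consume`, `to_notifications`, `send`,
`publish`, `commit`); the real class interfaces are not spelled out in this document:

```python
# Illustrative only: the method names below are assumptions, not the real API.
def run_notification_engine(consumer, alarm_processor, notification_processor, publisher):
    for alarm in consumer.consume():  # 1. read alarms, auto commit disabled
        for notification in alarm_processor.to_notifications(alarm):  # 2. mysql lookup
            if notification_processor.send(notification):  # 3. send it
                publisher.publish('notification_topic', notification)  # 4. success
            else:
                publisher.publish('notification_retry_topic', notification)  # 5. failure
        consumer.commit(alarm.offset)  # 6. commit the offset back to Kafka
```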
The notification engine uses three Kafka topics:

1. alarm_topic: Alarms inbound to the notification engine.
2. notification_topic: Successfully sent notifications.
3. notification_retry_topic: Unsuccessful notifications.

A retry engine runs in parallel with the notification engine and gives any
failed notification a configurable number of extra chances at success.

The retry engine generates notifications using the following steps (a sketch follows the list):

1. Read notification json data from Kafka, with no auto commit. - KafkaConsumer class
2. Rebuild the notification that failed. - RetryEngine class
3. Send the notification. - NotificationProcessor class
4. Add successful notifications to the sent notification topic. - RetryEngine class
5. Add failed notifications that have not hit the retry limit back to the retry topic. - RetryEngine class
6. Discard failed notifications that have hit the retry limit. - RetryEngine class
7. Commit the offset to Kafka. - KafkaConsumer class
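
A matching sketch of the retry loop; the `rebuild` callable, the `retry_count` field, and the retry limit are
assumptions, since only the class names above come from this document:

```python
MAX_RETRIES = 5  # illustrative; the real limit is configurable


# Illustrative only: the method and field names below are assumptions.
def run_retry_engine(consumer, rebuild, notification_processor, publisher):
    for message in consumer.consume():  # 1. read notification json, no auto commit
        notification = rebuild(message.value)  # 2. rebuild the failed notification
        if notification_processor.send(notification):  # 3. send it
            publisher.publish('notification_topic', notification)  # 4. success
        elif notification.retry_count < MAX_RETRIES:
            notification.retry_count += 1
            publisher.publish('notification_retry_topic', notification)  # 5. retry later
        # 6. otherwise the retry limit was hit and the notification is discarded
        consumer.commit(message.offset)  # 7. commit the offset back to Kafka
```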

The retry engine uses two Kafka topics:

1. notification_retry_topic: Notifications that need to be retried.
2. notification_topic: Successfully sent notifications.

## Fault Tolerance

When reading from the alarm topic no committing is done; committing happens only after processing. This allows
the processing to continue even though some notifications can be slow. In the event of a failure,
notifications could be sent but the alarms not yet acknowledged. This is an acceptable failure mode: it is better to
send a notification twice than not at all.

The general process when a major error is encountered is to exit the daemon, which should allow the other processes to
renegotiate access to the Kafka partitions. It is also assumed the notification engine will be run by a process
supervisor which will restart it in case of a failure. This way any errors which are not easy to recover from are
automatically handled by the service restarting and the active daemon switching to another instance.
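
That fail-fast behavior amounts to the pattern below; `run_engine` is a stand-in for the real processing loop:

```python
import logging
import sys

log = logging.getLogger(__name__)


def run_engine():
    raise NotImplementedError('stand-in for the real processing loop')


def main():
    try:
        run_engine()
    except Exception:
        # Fail fast: exiting releases this engine's Kafka partitions so the
        # others can renegotiate, and the process supervisor restarts it.
        log.exception('Unrecoverable error, exiting')
        sys.exit(1)
```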

Though this should cover all errors, there is a risk that an alarm or set of alarms is processed and notifications
are sent out multiple times. To minimize this risk a number of techniques are used (the offset rule is sketched
below):

- Timeouts are implemented with all notification types.
- On shutdown, uncommitted work is finished up.
- An alarm TTL is utilized. Any alarm older than the TTL is not processed.
- A maximum offset lag time is set. The offset is normally only updated when there is a continuous chain of finished
  alarms; a new finished offset that still leaves a gap is held in reserve. If the maximum lag time has been exceeded
  when a new finished alarm comes in, the offset is updated regardless of gaps.
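
A minimal sketch of that offset rule; the class and method names are assumptions, not the project's real tracker:

```python
import time


class OffsetTracker(object):
    """Commit only the highest contiguous finished offset, unless lag forces it."""

    def __init__(self, commit, max_lag_seconds=600):
        self._commit = commit  # callable that commits an offset to Kafka
        self._max_lag = max_lag_seconds
        self._committed = -1  # highest offset committed so far
        self._reserve = set()  # finished offsets held back by a gap
        self._last_commit = time.time()

    def finish(self, offset):
        self._reserve.add(offset)
        # Advance through any continuous chain of finished offsets.
        while self._committed + 1 in self._reserve:
            self._committed += 1
            self._reserve.remove(self._committed)
            self._do_commit(self._committed)
        # Lag override: if commits have stalled too long, skip past the gap.
        if self._reserve and time.time() - self._last_commit > self._max_lag:
            self._committed = max(self._reserve)
            self._reserve.clear()
            self._do_commit(self._committed)

    def _do_commit(self, offset):
        self._commit(offset)
        self._last_commit = time.time()
```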

# Operation

The yaml config file is read from '/etc/monasca/notification.yaml' by default; a sample is in this project.

## Monitoring

statsd is incorporated into the daemon and will send all stats to localhost on udp port 8125 (the wire format is
sketched after the metric list below).

- Counters
    - ConsumedFromKafka
    - AlarmsFailedParse
    - AlarmsFinished
    - AlarmsNoNotification
    - AlarmsOffsetUpdated
    - NotificationsCreated
    - NotificationsSentSMTP
    - NotificationsSentWebhook
    - NotificationsSentPagerduty
    - NotificationsSentFailed
    - NotificationsInvalidType
    - PublishedToKafka
- Timers
    - ConfigDBTime
    - SMTPTime
    - OffsetCommitTime
    - SendNotificationTime
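
statsd metrics are plain UDP datagrams, so a counter increment and a timer sample look like this on the wire
(metric names from the list above; the daemon presumably uses a statsd client library rather than raw sockets):

```python
import socket

# Sketch of the statsd wire format sent to udp port 8125.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b'ConsumedFromKafka:1|c', ('127.0.0.1', 8125))  # counter increment
sock.sendto(b'SMTPTime:42|ms', ('127.0.0.1', 8125))  # timer sample, milliseconds
```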

# Future Considerations

- More extensive load testing is needed.
- How fast is the mysql db? How much load do we put on it? Initially I think it makes most sense to read notification
  details for each alarm, but eventually I may want to cache that info.
- How expensive are commits to Kafka for every message we read? Should we commit every N messages?
- How efficient is the default Kafka consumer batch size?
- Currently we can get ~200 notifications per second per NotificationEngine instance using webhooks to a local
  http server. Is that fast enough?
- Are we putting too much load on Kafka at ~200 commits per second?

# License