Add spec for event alarm evaluator

This blueprint proposes a new alarm evaluator for alarms on events passed from other OpenStack services. It provides event-driven alarm evaluation, adding a new sequence in Ceilometer that handles alarms on events separately from the other alarm types handled by the existing polling-based Alarm Evaluator, and delivers alarm notifications to end users immediately.

APIImpact
DocImpact
blueprint: event-alarm-evaluator
Change-Id: Ie5ff0698a611eecd3f94a9d369795d685c0247ae
2015-04-13 21:03:40 +09:00 · 2015-04-13 21:03:40 +09:00 · 12b9f884e6
commit 12b9f884e6
parent b54b2e2e52
1 changed files with 320 additions and 0 deletions
--- a/specs/liberty/event-alarm-evaluator.rst
+++ b/specs/liberty/event-alarm-evaluator.rst
@@ -0,0 +1,320 @@
 ..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.
 http://creativecommons.org/licenses/by/3.0/legalcode
 =====================
 Event Alarm Evaluator
 =====================
 https://blueprints.launchpad.net/ceilometer/+spec/event-alarm-evaluator
 This blueprint proposes a new alarm evaluator for alarms on events passed from
 other OpenStack services. It provides event-driven alarm evaluation, adding a
 new sequence in Ceilometer that handles alarms on events separately from the
 other alarm types handled by the existing polling-based Alarm Evaluator, and
 delivers alarm notifications to end users immediately.
 Problem description
 ===================
 As an end user, I need to receive an alarm notification as soon as Ceilometer
 captures an event that would fire an alarm, so that I can perform recovery
 actions promptly and shorten the downtime of my service.
 The typical use case is that an end user sets an alarm on
 "compute.instance.update" in order to trigger recovery actions once the
 instance status has changed to 'shutdown' or 'error'. It should be possible for
 an end user to receive a notification within 1 second of a fault being
 observed, as other health-check mechanisms can in some cases.
 The existing Alarm Evaluator periodically polls the databases in order to
 check all alarms independently from other processes. This is a good approach
 for evaluating an alarm on samples stored over a certain period. However, it
 is not efficient for evaluating an alarm on events, which other OpenStack
 services emit only once in a while.
 The periodic evaluation delays the alarm notification to users. The default
 evaluation cycle is 60 seconds. It is recommended that an operator set an
 interval longer than the configured pipeline interval for the underlying
 metrics, and long enough to evaluate all defined alarms within one cycle,
 taking into account the number of resources, users and alarms.
 Proposed change
 ===============
 The proposal is to add a new event-driven alarm evaluator which receives
 messages from the Notification Agent, finds the related alarms, and then
 evaluates each alarm:
 * The new alarm evaluator receives event notifications from the Notification
  Agent, to which a dedicated notifier is added to publish events on a new
  topic. The topic name is 'alarm.all'.
 * When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID written in the event notification.
  To reduce the load on the Ceilometer API, this alarm evaluator caches the
  alarm definitions queried by Project ID for a certain period.
 * The alarms found are evaluated against the event notification.
 * Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier, in the same way as the existing Alarm Evaluator does.
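 The steps above can be sketched roughly as follows. This is only an
 illustration, not Ceilometer code: ``ProjectAlarmCache``, ``handle_event`` and
 the ``fetch_alarms``, ``matches`` and ``notify`` callables are hypothetical
 names, and events and alarms are assumed to be plain dictionaries.

```python
import time


class ProjectAlarmCache:
    """Keeps the alarm definitions of each project for a limited time."""

    def __init__(self, fetch_alarms, ttl_seconds=60.0, clock=time.monotonic):
        # fetch_alarms is a hypothetical callable: project_id -> list of alarms.
        self._fetch_alarms = fetch_alarms
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # project_id -> (expiry time, alarms)

    def get(self, project_id):
        now = self._clock()
        entry = self._entries.get(project_id)
        if entry is None or entry[0] <= now:
            # Cache miss or expired entry: query the alarm API/DB once.
            entry = (now + self._ttl, self._fetch_alarms(project_id))
            self._entries[project_id] = entry
        return entry[1]


def handle_event(event, cache, matches, notify):
    """Evaluate every cached alarm of the event's project against one event."""
    fired = []
    for alarm in cache.get(event["project_id"]):
        if matches(alarm, event):
            notify(alarm, event)  # stands in for the Alarm Notifier
            fired.append(alarm["name"])
    return fired
```

 With a cache lifetime of 60 seconds, a burst of events from one project causes
 at most one alarm query per minute for that project; the price is that a newly
 created or changed alarm may be ignored until the entry expires.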
 This proposal also adds a new alarm type "event" and a new rule "event_rule".
 These enable users to create alarms on events. The separation from other alarm
 types (such as the "threshold" type) is intended to reflect the different
 timing of evaluation and the different format of the condition: the new
 evaluator checks each event notification as soon as it is received, whereas a
 "threshold" alarm can evaluate the average of values over a certain period,
 calculated from multiple samples.
 The new alarm evaluator handles "event" type alarms, so the existing alarm
 evaluator has to be changed to exclude "event" type alarms from its evaluation
 targets.
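 The exclusion in the existing evaluator amounts to a filter on the alarm type.
 A minimal sketch, assuming alarms are dictionaries with a "type" key; the
 function name is hypothetical:

```python
def alarms_for_polling_evaluator(alarms):
    """Return the alarms the polling-based evaluator should still handle."""
    return [alarm for alarm in alarms if alarm.get("type") != "event"]
```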
 This alarm evaluator retrieves the project ID of an event from the traits named
 'project_id' and 'tenant_id', since there is no common field to hold it.
 Not all events have a project ID: events of virtual resources such as VM
 instances should carry the project ID they belong to, but events of physical
 resources such as hosts do not have one. Since those raw infra events have to
 be hidden from end users, they will be treated as labeled with PROJECT_NONE
 ('' in the API), which only an admin can set in an alarm definition.
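 The lookup described above could look like the sketch below. It assumes, for
 illustration only, that the traits of an event are available as a dictionary
 under a "traits" key and that 'project_id' takes precedence over 'tenant_id';
 the spec does not state either.

```python
PROJECT_NONE = ""  # label for events without a project ('' in the API)


def project_id_of(event):
    """Return the project ID carried in the event's traits, or PROJECT_NONE."""
    traits = event.get("traits", {})
    for name in ("project_id", "tenant_id"):
        value = traits.get(name)
        if value:
            return value
    return PROJECT_NONE
```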
 Alternatives
 ------------
 There was a similar blueprint proposal, "Alarm type based on notification", but
 its approach is different. The old proposal was to add a new step (alarm
 evaluation) in the Notification Agent every time it received an event from
 other OpenStack services, whereas this proposal executes alarm evaluation in a
 separate component, which minimizes the impact on existing pipeline processing.
 Another approach is to enhance the existing alarm evaluator by adding a
 notification listener. However, there are two issues: 1) this approach could
 stall periodic evaluations when the evaluator receives a burst of
 notifications, and 2) it could break alarm partitioning, i.e. when an alarm
 evaluator receives a notification, it might have to evaluate alarms that are
 not assigned to it.
 Caching all alarm definitions can reduce the number of queries to the API/DB in
 each period and reduce the evaluation time in other projects, but it may lead
 to a longer first query and a larger cache footprint, since there can be a
 large number of projects whereas the alarms in a project can be limited by
 quota.
 Event ID can be used instead of Project ID in the alarm query, but this cannot
 bound the number of alarms retrieved from the DB, and it makes it difficult to
 get all alarms when users are allowed to use a wildcard in the "event_type" of
 an alarm definition, as proposed in this spec.
 Resource ID could be added to the Alarm data model as an optional attribute.
 This would help the new alarm evaluator filter out unrelated alarms while
 querying alarms; otherwise it has to evaluate all alarms in the project.
 To focus on creating the basic framework of event alarms, we decided not to
 handle Resource ID in this blueprint.
 Data model impact
 -----------------
 None
 REST API impact
 ---------------
 The Alarm API will be extended as follows:
 * Add the "event" type to the alarm type list
 * Add "event_rule" to "alarm"
  * "event_rule" has "event_type", which can include "*", to indicate which
     type(s) of event are evaluated for the alarm
  * "event_rule" has "query", which is a list of conditions combined by AND, as
     in AlarmThresholdRule
 Sample data of an event-type alarm::
  {
      "alarm_actions": [
          "http://site:8000/alarm"
      ],
      "alarm_id": null,
      "description": "An event alarm",
      "enabled": true,
      "insufficient_data_actions": [
          "http://site:8000/nodata"
      ],
      "name": "InstanceStatusAlarm",
      "event_rule": {
          "event_type": "compute.instance.update",
          "query" : [
              {
                  "field" : "traits.instance_id",
                  "type" : "string",
                  "value" : "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
                  "op" : "eq"
              },
              {
                  "field" : "traits.state",
                  "type" : "string",
                  "value" : "error",
                  "op" : "eq"
              }
          ]
      },
      "ok_actions": [],
      "project_id": "c96c887c216949acbdfbd8b494863567",
      "repeat_actions": false,
      "severity": "moderate",
      "state": "ok",
      "state_timestamp": "2015-04-03T17:49:38.406845",
      "timestamp": "2015-04-03T17:49:38.406839",
      "type": "event",
      "user_id": "c96c887c216949acbdfbd8b494863567"
  }
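 How such a rule might be checked against a single event is sketched below.
 This is an assumption-laden illustration rather than the proposed
 implementation: events are taken to be nested dictionaries, the "type" of a
 condition is ignored, the operator set is assumed, and ``fnmatch`` is used
 for the wildcard although it also accepts ``?`` and ``[...]``, which the spec
 does not mention.

```python
import fnmatch
import operator

# Assumed operator names; only "eq" appears in the sample above.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
}


def lookup(event, field):
    """Resolve a dotted field such as 'traits.state' in a nested dict."""
    value = event
    for key in field.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def rule_matches(event_rule, event):
    """True if the event type matches and every query condition holds (AND)."""
    if not fnmatch.fnmatchcase(event["event_type"], event_rule["event_type"]):
        return False
    for condition in event_rule.get("query", []):
        actual = lookup(event, condition["field"])
        if actual is None:
            return False  # a missing field never satisfies a condition
        if not OPERATORS[condition["op"]](actual, condition["value"]):
            return False
    return True
```

 With the sample rule above, an event of type "compute.instance.update" whose
 traits carry the given instance_id and state "error" satisfies the rule, while
 the same event with state "active" does not.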
 Security impact
 ---------------
 Since default event notifications may include raw infra information, the
 operator/administrator should configure event definitions carefully when end
 users are allowed to use this event alarm API and to receive event alarm
 notifications.
 Pipeline impact
 ---------------
 This change requires adding a new notifier to the event pipeline in order to
 pass events to the new alarm evaluator.
 Other end user impact
 ---------------------
 None
 Performance/Scalability Impacts
 -------------------------------
 When Ceilometer receives a large number of events from other OpenStack services
 in a short period, this alarm evaluator can keep working since events are
 queued in a messaging queue system, but this can delay alarm notifications to
 users and increase the number of read and write accesses to the alarm database.
 All event alarms defined in the project are evaluated every time this
 evaluator receives an event. The number of alarms to be evaluated can be
 reduced by adding a new parameter (e.g. Resource ID) to the Alarm data model
 and setting a filter in the alarm query.
 Other deployer impact
 ---------------------
 The Notification Agent has to be configured to publish events to a new topic,
 'alarm.all', which is dedicated to this event alarm evaluator and separate
 from the current messaging used to store events (i.e. add
 'notifier://?topic=alarm.all' in event_pipeline.yaml). Note that this
 configuration should be applied only when the new event alarm evaluator runs
 in the deployment; otherwise the unconsumed events may fill up the queue.
 A new service process (this alarm evaluator) has to be run.
 A deployer can run multiple evaluators in order to scale out the event alarm
 evaluation process. Events will be dispatched across all evaluators listening
 on the topic in a round-robin fashion.
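 The round-robin dispatch is performed by the messaging layer, not by the
 evaluators themselves. The sketch below only models the resulting distribution
 so that the scale-out behaviour is easy to see; the function name and the
 evaluator names are hypothetical.

```python
import itertools


def dispatch_round_robin(events, evaluators):
    """Hand each event to exactly one evaluator, cycling through them in order."""
    assignment = {name: [] for name in evaluators}
    for event, name in zip(events, itertools.cycle(evaluators)):
        assignment[name].append(event)
    return assignment
```

 Because each event reaches only one evaluator, any evaluator must be able to
 evaluate alarms of any project, which is why each one keeps its own cache of
 alarm definitions.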
 Developer impact
 ----------------
 Developers should be aware that events could be notified to end users, and
 should avoid passing raw infra information to end users when defining events
 and traits. All events related to virtual resources should carry a proper
 project ID and user ID.
 Implementation
 ==============
 Assignee(s)
 -----------
 Primary assignee:
  r-mibu
 Other contributors:
  lianhao-lu
  edwin-zhai
 Ongoing maintainer:
  None
 Work Items
 ----------
 * Add new alarm type "event" as well as AlarmEventRule
 * Modify existing alarm evaluator to filter out "event" alarms
 * New event-driven alarm evaluator
 * Make the new evaluator cache alarm definitions
 Future lifecycle
 ================
 This proposal is a key feature for providing information on cloud resources to
 end users in real time, which enables efficient integration with a user-side
 manager or Orchestrator, whereas such information is currently considered to
 be consumed by admin-side tools or services.
 Based on this change, we will explore orchestration scenarios including fault
 recovery, and add useful event definitions as well as additional traits.
 This feature will or can be enhanced in the following ways in the future:
 * Enabling this evaluator to get the delta of alarm definitions in the last
  period, in order to reduce the traffic for updating the cache. This requires
  the alarm storage to hold deleted alarms, and an alarm API amendment.
 * Adopting a coordination process similar to the one the notification agent
  has, for the efficiency of this evaluator, e.g. reducing the cache of alarm
  definitions. The key for partitioning might be 'project_id', 'event_type' or
  'resource_id'. For this coordination, all topic names will have the same
  prefix 'alarm.', so that an evaluator can use topic='alarm.*' to listen to
  all messages for event alarm evaluation. This is the reason why we use
  'alarm.all' in this spec.
 * A mechanism to update the cache of alarm definitions promptly, e.g.
  invalidating the cache on the evaluator whose assigned alarm definitions
  have been updated.
 * Filtering out events of no interest in the notification agent by leveraging
  the mechanism of graceful pipeline update and reflecting the alarm
  definitions configured by end users.
 We can refactor the function that retrieves the project ID from the event
 object after creating a common field in the event object, which requires DB
 and API changes.
 Dependencies
 ============
 None
 Testing
 =======
 New unit/scenario tests are required for this change.
 Documentation Impact
 ====================
 * The Administrator Guide and Installation Guide in the OpenStack Manuals
  should be updated to describe the new alarm type and rule as well as all
  deployer impacts.
 * The proposed evaluator will be described in the developer documentation.
 References
 ==========
 * OPNFV Doctor project: https://wiki.opnfv.org/doctor
 * Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification
 * Liberty Summit Note: https://etherpad.openstack.org/p/event_alarm