Add spec for event alarm evaluator
This blueprint proposes a new alarm evaluator for alarms on events passed from other OpenStack services. It provides event-driven alarm evaluation, creating a new sequence in Ceilometer that handles event alarms separately from the other alarm types handled by the existing polling-based Alarm Evaluator, and it enables immediate alarm notification to end users.

APIImpact
DocImpact
blueprint: event-alarm-evaluator
Change-Id: Ie5ff0698a611eecd3f94a9d369795d685c0247ae
parent
b54b2e2e52
commit
12b9f884e6
specs/liberty/event-alarm-evaluator.rst
@ -0,0 +1,320 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================
Event Alarm Evaluator
=====================

https://blueprints.launchpad.net/ceilometer/+spec/event-alarm-evaluator

This blueprint proposes a new alarm evaluator for alarms on events passed from
other OpenStack services. It provides event-driven alarm evaluation, creating
a new sequence in Ceilometer that handles event alarms separately from the
other alarm types handled by the existing polling-based Alarm Evaluator, and
it enables immediate alarm notification to end users.
|
||||
|
||||
Problem description
===================

As an end user, I need to receive an alarm notification immediately once
Ceilometer captures an event that would cause an alarm to fire, so that I can
perform recovery actions promptly and shorten the downtime of my service.
The typical use case is that an end user sets an alarm on
"compute.instance.update" in order to trigger recovery actions once the
instance status has changed to 'shutdown' or 'error'. It should be possible
for an end user to receive a notification within 1 second of a fault being
observed, as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically queries (polls) the databases in
order to check all alarms independently of other processes. This is a good
approach for evaluating an alarm on samples stored over a certain period.
However, it is not an efficient way to evaluate an alarm on events, which are
emitted by other OpenStack services only once in a while.

Periodic evaluation delays alarm notifications to users. The default
evaluation cycle is 60 seconds. It is recommended that an operator set an
interval longer than the configured pipeline interval for the underlying
metrics, and long enough to evaluate all defined alarms within that period,
taking into account the number of resources, users and alarms.
|
||||
|
||||
Proposed change
===============

The proposal is to add a new event-driven alarm evaluator which receives
messages from the Notification Agent, finds the related alarms, and then
evaluates each of them:

* The new alarm evaluator receives event notifications from the Notification
  Agent, to which a dedicated notifier is added to publish events on a new
  topic. The topic name is 'alarm.all'.

* When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID written in the event notification.
  To reduce the load on the Ceilometer API, this alarm evaluator caches the
  alarm definitions queried by Project ID for a certain period.

* The alarms found are evaluated against the event notification.

* Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier in the same way as the existing Alarm Evaluator does.
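The caching step in the second bullet could be sketched as follows. This is a
minimal illustration, not the proposed implementation: the class name
``ProjectAlarmCache``, the ``fetch_alarms`` callable standing in for the real
alarm query, and the 60-second default time-to-live are all assumptions made
for this example.

```python
import time


class ProjectAlarmCache:
    """Cache alarm definitions per project for a fixed time-to-live."""

    def __init__(self, fetch_alarms, ttl_seconds=60.0, clock=time.monotonic):
        # fetch_alarms(project_id) -> list of alarm dicts. Hypothetical
        # stand-in for the real query against the alarm API or database.
        self._fetch_alarms = fetch_alarms
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # project_id -> (expiry_time, alarms)

    def get(self, project_id):
        """Return cached alarms, querying again once the entry expired."""
        now = self._clock()
        entry = self._entries.get(project_id)
        if entry is not None and now < entry[0]:
            return entry[1]
        alarms = self._fetch_alarms(project_id)
        self._entries[project_id] = (now + self._ttl, alarms)
        return alarms
```

With such a cache, a burst of events for one project results in a single alarm
query per time-to-live window, at the cost of alarm definition changes taking
up to that long to become visible to the evaluator.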
|
||||
|
||||
This proposal also adds a new alarm type "event" and a new rule "event_rule".
This enables users to create alarms on events. The separation from other alarm
types (such as the "threshold" type) is intended to reflect the different
timing of evaluation and the different format of the condition: the new
evaluator checks each event notification as soon as it is received, whereas a
"threshold" alarm can evaluate, for example, the average of values calculated
from multiple samples over a certain period.

The new alarm evaluator handles "event" type alarms, so the existing alarm
evaluator has to be changed to exclude "event" type alarms from its evaluation
targets.

The Project ID of an event is retrieved by this alarm evaluator from the
traits named 'project_id' and 'tenant_id', since there is no common field to
hold it. Not all events have a project ID: all events of virtual resources
such as VM instances should have the ID of the project they belong to, but
events of physical resources such as hosts do not. Since those raw infra
events have to be hidden from end users, they will be treated as labeled with
PROJECT_NONE ('' in the API), which can be set in an alarm definition only by
an admin.
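The project ID lookup described above could be sketched as follows. The
function name ``get_project_id`` and the simplified event shape (a dict whose
'traits' entry maps trait names to values) are assumptions for illustration;
the actual event object in Ceilometer is structured differently.

```python
PROJECT_NONE = ''  # label for events without a project, '' in the API


def get_project_id(event):
    """Return the project ID of an event, or PROJECT_NONE if it has none.

    The event is assumed to be a dict with a 'traits' dict mapping trait
    names to values (a simplification made for this sketch).
    """
    traits = event.get('traits', {})
    # Check 'project_id' first, then fall back to 'tenant_id'.
    for name in ('project_id', 'tenant_id'):
        value = traits.get(name)
        if value:
            return value
    return PROJECT_NONE
```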
|
||||
|
||||
Alternatives
------------

There was a similar blueprint proposal, "Alarm type based on notification",
but the approach is different. The old proposal was to add a new step (alarm
evaluation) in the Notification Agent every time it received an event from
other OpenStack services, whereas this proposal intends to execute alarm
evaluation in another component, which minimizes the impact on existing
pipeline processing.

Another approach is to enhance the existing alarm evaluator by adding a
notification listener. However, there are two issues: 1) this approach could
stall periodic evaluations when the evaluator receives a burst of
notifications, and 2) it could break alarm partitioning, i.e. when an alarm
evaluator receives a notification, it might have to evaluate some alarms which
are not assigned to it.

Caching all alarm definitions can reduce the number of queries to the API/DB
in each period and reduce evaluation time in other projects, but it may lead
to a longer first query and a larger cache footprint, since there can be a
large number of projects, whereas the alarms in a project can be limited by
quota.

Event ID could be used instead of Project ID in the alarm query, but this
cannot bound the number of alarms retrieved from the DB, and it makes it
difficult to get all alarms when users are allowed to use wildcards in the
"event_type" of an alarm definition, as proposed in this spec.

Resource ID could be added to the Alarm data model as an optional attribute.
This would help the new alarm evaluator filter out unrelated alarms while
querying alarms; otherwise it has to evaluate all alarms in the project.
To focus on creating the basic framework of event alarms, we decided not to
handle Resource ID in this blueprint.
|
||||
|
||||
Data model impact
-----------------

None

REST API impact
---------------

The Alarm API will be extended as follows:

* Add "event" type to the alarm type list

* Add "event_rule" to "alarm"

* "event_rule" has "event_type", which can include "*", to indicate which
  type(s) of event are to be evaluated for the alarm

* "event_rule" has "query", which is a list of conditions combined by AND, the
  same as in AlarmThresholdRule
|
||||
|
||||
Sample data of an event-type alarm::

    {
        "alarm_actions": [
            "http://site:8000/alarm"
        ],
        "alarm_id": null,
        "description": "An event alarm",
        "enabled": true,
        "insufficient_data_actions": [
            "http://site:8000/nodata"
        ],
        "name": "InstanceStatusAlarm",
        "event_rule": {
            "event_type": "compute.instance.update",
            "query": [
                {
                    "field": "traits.instance_id",
                    "type": "string",
                    "value": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
                    "op": "eq"
                },
                {
                    "field": "traits.state",
                    "type": "string",
                    "value": "error",
                    "op": "eq"
                }
            ]
        },
        "ok_actions": [],
        "project_id": "c96c887c216949acbdfbd8b494863567",
        "repeat_actions": false,
        "severity": "moderate",
        "state": "ok",
        "state_timestamp": "2015-04-03T17:49:38.406845",
        "timestamp": "2015-04-03T17:49:38.406839",
        "type": "event",
        "user_id": "c96c887c216949acbdfbd8b494863567"
    }
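To illustrate how an "event_rule" of this shape could be matched against an
incoming event, here is a minimal sketch. The function ``rule_matches``, the
set of supported operators, and the simplified event shape (a nested dict
with 'event_type' and 'traits') are assumptions for this example. The "type"
field of each condition is ignored here, and ``fnmatch`` accepts more
wildcard forms than the single "*" mentioned in this spec.

```python
import fnmatch
import operator

# Comparison operators assumed for this sketch; the spec only shows "eq".
_OPS = {
    'eq': operator.eq, 'ne': operator.ne,
    'lt': operator.lt, 'le': operator.le,
    'gt': operator.gt, 'ge': operator.ge,
}


def _lookup(event, field):
    """Resolve a dotted field such as 'traits.state' in a nested dict."""
    value = event
    for part in field.split('.'):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def rule_matches(event_rule, event):
    """Return True if the event satisfies the event type and all conditions."""
    pattern = event_rule.get('event_type', '*')
    if not fnmatch.fnmatchcase(event.get('event_type', ''), pattern):
        return False
    # Conditions are combined by AND: every one of them has to hold.
    for cond in event_rule.get('query', []):
        actual = _lookup(event, cond['field'])
        if actual is None:
            return False
        compare = _OPS[cond.get('op', 'eq')]
        if not compare(actual, cond['value']):
            return False
    return True
```

For the sample above, an event of type "compute.instance.update" whose
'instance_id' trait equals the given UUID and whose 'state' trait is "error"
would match, while the same event with state "active" would not.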
|
||||
|
||||
Security impact
---------------

Since default event notifications may include raw infra information, the
operator/administrator should configure event definitions carefully when end
users are allowed to use this event alarm API and can receive event alarm
notifications.

Pipeline impact
---------------

This change requires adding a new notifier to the event pipeline in order to
pass events to the new alarm evaluator.

Other end user impact
---------------------

None

Performance/Scalability Impacts
-------------------------------

When Ceilometer receives a large number of events from other OpenStack
services in a short period, this alarm evaluator can keep working since events
are queued in a messaging queue system, but this can delay alarm notifications
to users and increase the number of read and write accesses to the alarm
database.

All event alarms defined in the project will be evaluated every time this
evaluator receives an event. The number of alarms to be evaluated can be
reduced by adding a new parameter (e.g. Resource ID) to the Alarm data model
and setting a filter when querying alarms.
|
||||
|
||||
Other deployer impact
---------------------

The Notification Agent has to be configured to publish events to a new topic,
'alarm.all', which is dedicated to this event alarm evaluator and separate
from the current messaging used to store events (i.e. add
'notifier://?topic=alarm.all' to event_pipeline.yaml). Note that this
configuration should be done only when the new event alarm evaluator runs in
the deployment; otherwise the queue may fill up.
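As an illustration only, an event_pipeline.yaml carrying the additional
publisher might look like the fragment below. The source and sink names and
the existing 'notifier://' publisher are assumptions modeled on a typical
default configuration; only the 'notifier://?topic=alarm.all' entry comes
from this spec.

```yaml
---
sources:
    - name: event_source
      events:
          - "*"
      sinks:
          - event_sink
sinks:
    - name: event_sink
      transformers:
      publishers:
          # assumed existing publisher used to store events
          - notifier://
          # new publisher feeding the event alarm evaluator
          - notifier://?topic=alarm.all
```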
|
||||
|
||||
A new service process (this alarm evaluator) has to be run.

A deployer can run multiple evaluators in order to scale out the event alarm
evaluation process. Events will be distributed among the evaluators listening
on the topic in a round-robin fashion.

Developer impact
----------------

Developers should be aware that events could be notified to end users, and
should avoid passing raw infra information to end users when defining events
and traits.

All events related to virtual resources should properly carry a project ID and
a user ID.
|
||||
|
||||
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  r-mibu

Other contributors:
  lianhao-lu
  edwin-zhai

Ongoing maintainer:
  None

Work Items
----------

* Add new alarm type "event" as well as AlarmEventRule

* Modify existing alarm evaluator to filter out "event" alarms

* New event-driven alarm evaluator

* Make the new evaluator cache alarm definitions
|
||||
|
||||
Future lifecycle
================

This proposal is a key feature for providing information about cloud resources
to end users in real time, which enables efficient integration with a
user-side manager or orchestrator, whereas currently such information is
considered to be consumed by admin-side tools or services.
Based on this change, we will look for orchestration scenarios, including
fault recovery, and add useful event definitions as well as additional traits.

This feature will or can be enhanced by the following in the future:

* Enabling this evaluator to get the delta of alarm definitions since the last
  period, in order to reduce the traffic for updating the cache. This requires
  the alarm storage to hold deleted alarms, and an alarm API amendment.

* Adopting a coordination process similar to the one the notification agent
  has, for the efficiency of this evaluator, e.g. reducing the cache of alarm
  definitions.
  The key for partitioning might be 'project_id', 'event_type' or
  'resource_id'.
  For this coordination, all topic names will have the same prefix 'alarm.',
  so that an evaluator can use topic='alarm.*' to listen to all messages for
  event alarm evaluation. This is the reason why we use 'alarm.all' in this
  spec.

* A mechanism to update the cache of alarm definitions promptly: invalidating
  (poisoning) the cache on the evaluator whose assigned alarm definition has
  been updated, etc.

* Filtering out events of no interest in the notification agent by leveraging
  the mechanism of graceful pipeline update and reflecting the alarm
  definitions configured by end users.

We can refactor the function that retrieves the project ID from the event
object after creating a common field in the event object, which requires DB
and API changes.
|
||||
|
||||
Dependencies
============

None

Testing
=======

New unit/scenario tests are required for this change.

Documentation Impact
====================

* The Administrator Guide and Installation Guide in OpenStack Manuals should
  be updated to describe the new alarm type and rule as well as all deployer
  impacts.

* The proposed evaluator will be described in the developer documentation.


References
==========

* OPNFV Doctor project: https://wiki.opnfv.org/doctor

* Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification

* Liberty Summit Note: https://etherpad.openstack.org/p/event_alarm