Add spec for event alarm evaluator

This blueprint proposes a new alarm evaluator for alarms on events passed from
other OpenStack services. It provides event-driven alarm evaluation, which
introduces a new sequence in Ceilometer for handling event alarms separately
from the other alarm types handled by the existing polling-based Alarm
Evaluator, and it enables immediate alarm notification to end users.

APIImpact
DocImpact
blueprint: event-alarm-evaluator
Change-Id: Ie5ff0698a611eecd3f94a9d369795d685c0247ae

parent b54b2e2e52
commit 12b9f884e6

specs/liberty/event-alarm-evaluator.rst
@ -0,0 +1,320 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================
Event Alarm Evaluator
=====================

https://blueprints.launchpad.net/ceilometer/+spec/event-alarm-evaluator

This blueprint proposes a new alarm evaluator for alarms on events passed from
other OpenStack services. It provides event-driven alarm evaluation, which
introduces a new sequence in Ceilometer for handling event alarms separately
from the other alarm types handled by the existing polling-based Alarm
Evaluator, and it enables immediate alarm notification to end users.

Problem description
===================

As an end user, I need to receive an alarm notification immediately once
Ceilometer captures an event that would fire an alarm, so that I can perform
recovery actions promptly and shorten the downtime of my service.
The typical use case is that an end user sets an alarm on
"compute.instance.update" in order to trigger recovery actions once the
instance status has changed to 'shutdown' or 'error'. It should be possible for
an end user to receive a notification within 1 second of a fault being
observed, as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically polls the databases in order to
check all alarms independently of other processes. This is a good approach for
evaluating an alarm on samples stored over a certain period. However, it is
not an efficient way to evaluate an alarm on events, which are emitted by other
OpenStack services only once in a while.

Periodic evaluation delays alarm notifications to users. The default period of
the evaluation cycle is 60 seconds. It is recommended that an operator set an
interval longer than the configured pipeline interval for the underlying
metrics, and also long enough to evaluate all defined alarms within one period,
taking into account the number of resources, users and alarms.

Proposed change
===============

The proposal is to add a new event-driven alarm evaluator which receives
messages from the Notification Agent, finds the related alarms, and then
evaluates each of them:

* The new alarm evaluator receives event notifications from the Notification
  Agent, to which a dedicated notifier is added to publish events on a new
  topic. The topic name is 'alarm.all'.

* When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID written in the event notification.
  To avoid putting a heavy load on the Ceilometer API, this alarm evaluator
  caches the alarm definitions queried by Project ID for a certain period.

* The alarms found are evaluated against the event notification.

* Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier in the same way as the existing Alarm Evaluator does.
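The flow in the list above could be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not Ceilometer code:
``AlarmCache``, ``handle_event``, the ``fetch_alarms``, ``matches`` and
``notify`` callables, and the 60-second default time-to-live are all
hypothetical names and values chosen for the example.

```python
import time


class AlarmCache:
    """Per-project cache of alarm definitions with a fixed time-to-live.

    fetch_alarms is a hypothetical callable (project_id -> list of alarms)
    standing in for the query to the alarm API or database.
    """

    def __init__(self, fetch_alarms, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch_alarms = fetch_alarms
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # project_id -> (expiry time, alarms)

    def get(self, project_id):
        now = self._clock()
        entry = self._entries.get(project_id)
        # Query again only when nothing is cached or the entry has expired.
        if entry is None or entry[0] <= now:
            entry = (now + self._ttl, self._fetch_alarms(project_id))
            self._entries[project_id] = entry
        return entry[1]


def handle_event(project_id, event, cache, matches, notify):
    """Evaluate one event against the cached alarms of its project.

    matches(alarm, event) decides whether the alarm fires; notify(alarm,
    event) stands in for handing the alarm to the Alarm Notifier.
    Returns the names of the alarms that fired.
    """
    fired = []
    for alarm in cache.get(project_id):
        if matches(alarm, event):
            notify(alarm, event)
            fired.append(alarm["name"])
    return fired
```

With this shape, a burst of events for one project triggers at most one alarm
query per time-to-live window, at the cost of alarm updates becoming visible
to the evaluator only after the cached entry expires.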

This proposal also adds a new alarm type "event" and a new rule "event_rule",
which enable users to create alarms on events. The separation from other alarm
types (such as the "threshold" type) is intended to reflect the different
timing of evaluation and the different format of the condition: the new
evaluator checks each event notification as soon as it is received, whereas a
"threshold" alarm can evaluate the average of values over a certain period
calculated from multiple samples.

The new alarm evaluator handles "event" type alarms, so we have to change the
existing alarm evaluator to exclude "event" type alarms from its evaluation
targets.

This alarm evaluator retrieves the Project ID of an event from the traits named
'project_id' and 'tenant_id', since there is no common field to hold it.
Not all events have a project ID: every event of a virtual resource such as a
VM instance should carry the ID of the project it belongs to, but events of
physical resources such as hosts do not have one. Since those raw infra events
have to be hidden from end users, they are treated as labeled with
PROJECT_NONE ('' in the API), which can be set in an alarm definition only by
an admin.
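As an illustration of the lookup described above, the sketch below reads the
project ID from an event's traits and falls back to PROJECT_NONE. It assumes
the traits are available as a flat mapping of trait name to value, and that
'project_id' is checked before 'tenant_id'; both points, and the function
name ``get_project_id``, are assumptions made for this example.

```python
PROJECT_NONE = ""  # the spec's label for events without a project ('' in the API)


def get_project_id(traits):
    """Return the project ID found in the traits, or PROJECT_NONE.

    traits is assumed to be a dict of trait name -> value. The lookup order
    ('project_id' first, then 'tenant_id') is an assumption of this sketch.
    """
    for name in ("project_id", "tenant_id"):
        value = traits.get(name)
        if value:
            return value
    return PROJECT_NONE
```

An event from a physical host that carries neither trait would then be
evaluated only against alarms whose project is PROJECT_NONE, which the spec
reserves for admins.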

Alternatives
------------

There was a similar blueprint proposal, "Alarm type based on notification", but
its approach is different. The old proposal was to add a new step (alarm
evaluation) to the Notification Agent, run every time it receives an event from
other OpenStack services, whereas this proposal executes alarm evaluation in a
separate component, which minimizes the impact on existing pipeline processing.

Another approach is to enhance the existing alarm evaluator by adding a
notification listener. However, there are two issues: 1) this approach could
stall periodic evaluations when the evaluator receives a burst of
notifications, and 2) it could break alarm partitioning, i.e. when an alarm
evaluator receives a notification, it might have to evaluate some alarms which
are not assigned to it.

Caching all alarm definitions can reduce the number of queries to the API/DB
in each period and reduce the evaluation time for other projects, but it may
lengthen the first query and enlarge the cache footprint, since there can be a
large number of projects, whereas the number of alarms in a project can be
limited by quota.

Event ID can be used instead of Project ID in the alarm query, but it gives no
guarantee on the number of alarms retrieved from the DB, and it makes it
difficult to get all relevant alarms when we allow users to use a wildcard in
the "event_type" of an alarm definition, as proposed in this spec.

Resource ID could be added to the Alarm data model as an optional attribute.
This would help the new alarm evaluator filter out unrelated alarms while
querying alarms; otherwise it has to evaluate all alarms in the project.
To focus on creating the basic framework of event alarms, we decided not to
handle Resource ID in this blueprint.

Data model impact
-----------------

None

REST API impact
---------------

The Alarm API will be extended as follows:

* Add "event" type to the alarm type list

* Add "event_rule" to "alarm"

* "event_rule" has "event_type", which can include "*", to indicate which
  type(s) of event are to be evaluated for the alarm

* "event_rule" has "query", which is a list of conditions combined by AND, the
  same as in AlarmThresholdRule
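To make the rule semantics above concrete, here is a small sketch of how an
"event_rule" might be checked against a single event. It is an assumption-laden
illustration, not the proposed implementation: the event is modelled as a
nested dict, only the "eq" operator is handled, a missing "event_type" is
treated as "*", and the wildcard is matched with Python's ``fnmatchcase``
(which also gives ``?`` and ``[...]`` special meaning, something the spec does
not mention). The names ``get_field`` and ``rule_matches`` are hypothetical.

```python
from fnmatch import fnmatchcase


def get_field(event, field):
    """Resolve a dotted field name such as 'traits.state' in a nested dict."""
    value = event
    for key in field.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def rule_matches(event_rule, event):
    """Return True if the event satisfies the rule.

    The event type must match the (possibly wildcarded) "event_type", and
    every condition in "query" must hold, i.e. conditions are ANDed.
    """
    pattern = event_rule.get("event_type", "*")  # default is an assumption
    if not fnmatchcase(event["event_type"], pattern):
        return False
    # Only "eq" is sketched; a condition with any other operator never matches.
    return all(
        cond["op"] == "eq" and get_field(event, cond["field"]) == cond["value"]
        for cond in event_rule.get("query", [])
    )
```

For example, a rule with "event_type" set to "compute.instance.*" and a single
condition on "traits.state" equal to "error" would match a
"compute.instance.update" event whose state trait is "error", and reject the
same event type with any other state.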

Sample data of an event-type alarm::

    {
        "alarm_actions": [
            "http://site:8000/alarm"
        ],
        "alarm_id": null,
        "description": "An event alarm",
        "enabled": true,
        "insufficient_data_actions": [
            "http://site:8000/nodata"
        ],
        "name": "InstanceStatusAlarm",
        "event_rule": {
            "event_type": "compute.instance.update",
            "query" : [
                {
                    "field" : "traits.instance_id",
                    "type" : "string",
                    "value" : "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
                    "op" : "eq"
                },
                {
                    "field" : "traits.state",
                    "type" : "string",
                    "value" : "error",
                    "op" : "eq"
                }
            ]
        },
        "ok_actions": [],
        "project_id": "c96c887c216949acbdfbd8b494863567",
        "repeat_actions": false,
        "severity": "moderate",
        "state": "ok",
        "state_timestamp": "2015-04-03T17:49:38.406845",
        "timestamp": "2015-04-03T17:49:38.406839",
        "type": "event",
        "user_id": "c96c887c216949acbdfbd8b494863567"
    }

Security impact
---------------

Since default event notifications may include raw infra information, the
operator/administrator should configure event definitions carefully when end
users are allowed to use this event alarm API and can receive event alarm
notifications.

Pipeline impact
---------------

This change needs a new notifier to be added to the event pipeline in order to
pass events to the new alarm evaluator.

Other end user impact
---------------------

None

Performance/Scalability Impacts
-------------------------------

When Ceilometer receives a large number of events from other OpenStack services
in a short period, this alarm evaluator can keep working since events are
queued in the messaging queue system, but the burst can delay alarm
notifications to users and increase the number of read and write accesses to
the alarm database.

All event alarms defined in the project are evaluated every time this
evaluator receives an event. The number of alarms to be evaluated can be
reduced by adding a new parameter (e.g. Resource ID) to the Alarm data model
and filtering on it when querying alarms.

Other deployer impact
---------------------

The Notification Agent has to be configured to publish events to a new topic,
'alarm.all', which is dedicated to this event alarm evaluator and separate from
the current messaging used to store events (i.e. add
'notifier://?topic=alarm.all' in event_pipeline.yaml). Note that this
configuration should be applied only when the new event alarm evaluator runs in
the deployment; otherwise the events may fill up the queue.
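As a rough illustration of that configuration, the fragment below adds the new
publisher next to an existing one. Only the publisher URL
'notifier://?topic=alarm.all' comes from this spec; the overall file layout
and the source and sink names (``event_source``, ``event_sink``) are
assumptions and may differ in an actual deployment.

```yaml
# event_pipeline.yaml (sketch; layout and names are assumed)
sources:
    - name: event_source
      events:
          - "*"
      sinks:
          - event_sink
sinks:
    - name: event_sink
      publishers:
          # existing publisher used to store events
          - notifier://
          # additional publisher feeding the event alarm evaluator
          - notifier://?topic=alarm.all
```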

A new service process (this alarm evaluator) has to be run.

A deployer can run multiple evaluators in order to scale out the event alarm
evaluation process. Events are distributed across all evaluators listening on
the topic in a round-robin fashion.

Developer impact
----------------

While defining events and traits, developers should be aware that events could
be delivered to end users, and should avoid passing raw infra information to
them.

All events related to virtual resources should carry a proper project ID and
user ID.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  r-mibu

Other contributors:
  lianhao-lu
  edwin-zhai

Ongoing maintainer:
  None

Work Items
----------

* Add new alarm type "event" as well as AlarmEventRule

* Modify existing alarm evaluator to filter out "event" alarms

* New event-driven alarm evaluator

* Make the new evaluator cache alarm definitions

Future lifecycle
================

This proposal is a key feature for providing information about cloud resources
to end users in real time, which enables efficient integration with a user-side
manager or Orchestrator, whereas currently that information is considered to
be consumed by admin-side tools or services.
Based on this change, we will explore orchestration scenarios including fault
recovery and add useful event definitions as well as additional traits.

This feature will or can be enhanced by the following in the future:

* Enabling this evaluator to get the delta of alarm definitions since the last
  period, in order to reduce the traffic for updating the cache. This requires
  the alarm storage to hold deleted alarms, and an amendment to the alarm API.

* Adopting a coordination process similar to the one the notification agent
  has, for the efficiency of this evaluator, e.g. reducing the cache of alarm
  definitions.
  The key for partitioning might be 'project_id', 'event_type' or 'resource_id'.
  For this coordination, all topic names will have the same prefix 'alarm.',
  so that an evaluator can use topic='alarm.*' to listen to all messages for
  event alarm evaluation. This is the reason why we use 'alarm.all' in this
  spec.

* A mechanism to update the cache of alarm definitions promptly, e.g.
  invalidating (poisoning) the cache on the evaluator to which an updated
  alarm definition is assigned.

* Filtering out events of no interest in the notification agent by leveraging
  the mechanism of graceful pipeline update and reflecting the alarm
  definitions configured by end users.

We can refactor the function that retrieves the project ID from an event
object after creating a common field in the event object, which requires DB and
API changes.

Dependencies
============

None

Testing
=======

New unit/scenario tests are required for this change.

Documentation Impact
====================

* The Administrator Guide and Installation Guide in OpenStack Manuals should
  be updated to describe the new alarm type and rule as well as all deployer
  impacts.

* The proposed evaluator will be described in the developer documentation.

References
==========

* OPNFV Doctor project: https://wiki.opnfv.org/doctor

* Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification

* Liberty Summit Note: https://etherpad.openstack.org/p/event_alarm