Added a document with entity equivalence use cases

This document should define the functional requirements for blueprints
like merge alarms and merge resources.

Change-Id: Ie65b140607b38c880d0a10b3d63c9d58b49d8b1d
Ifat Afek 2018-03-07 17:08:54 +00:00
parent 7fc2d5e2a0
commit 3a4344deeb
2 changed files with 500 additions and 0 deletions


@ -0,0 +1,499 @@
============================
Entity Equivalence Use Cases
============================
Background
==========
There are several use cases that require support for either alarm equivalence
or resource equivalence. The design of these features is in progress, and is
not trivial. The purpose of this document is to define the basic requirements
and use cases that should be supported, regardless of the implementation that
will be selected later on.
The term "equivalence" is used to denote resources or alarms that are "equal"
although they are reported by different datasources and some of their
properties might conflict. Alternative terms could be equality, merge,
overlapping, etc.
Basic Equivalence Requirements
==============================
Resource Equivalence
--------------------
We currently have two use cases for resource equivalence.
#. K8s datasource reports VMs that are also reported by Nova
#. Vitrage discovery agent (TBD) reports hosts that are also reported by Nova
Both cases might be solvable with hard-coded logic in the datasources
themselves. This option should be checked against the use cases.
Alarm Equivalence
-----------------
We should support the following use cases:
#. Equivalent alarms from different monitors, e.g. Zabbix and Nagios
#. Non-equivalent alarms from different monitors, e.g. Zabbix and Nagios
(meaning the alarms are similar but not the same)
#. Equivalence between a monitored alarm and a Vitrage deduced alarm
Equivalence Definition
----------------------
In order to support these use cases, we **must** define a way for the user to
determine which entities are equivalent.
For resources we should define:
* Which properties determine the equivalence. E.g. Nova instance UUID equals
k8s vm externalID
* Optional: what property should be used in case of conflict (could it be done
arbitrarily or hard-coded?)
For alarms we should define:
* Which properties determine the equivalence. E.g. Zabbix ALARM name "HIGH CPU"
equals Prometheus alarm name "high cpu".
* Hidden assumption: equivalent alarms are always "on" the same resource.
Equivalence should be transitive. If the user defines two equivalences with a
common entity, then all entities should be equivalent to one another.
For example:
* Zabbix high_cpu ~ Nagios HIGH_CPU
* Nagios HIGH_CPU ~ Prometheus High CPU
Vitrage will handle Zabbix, Nagios and Prometheus CPU alarms as all equivalent
to one another.
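
The transitivity requirement can be illustrated with a small union-find
structure. This is only a sketch: the class name ``EquivalenceClasses`` and the
``(datasource, alarm name)`` tuples are assumptions made for the example, not
part of Vitrage.

```python
class EquivalenceClasses:
    """Union-find over alarm identifiers (hypothetical helper, not Vitrage code)."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        self._parent.setdefault(item, item)
        # Walk up to the representative, shortening the path as we go.
        while self._parent[item] != item:
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def add_equivalence(self, a, b):
        """Record a user-defined equivalence between two entities."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_equivalent(self, a, b):
        return self._find(a) == self._find(b)


classes = EquivalenceClasses()
classes.add_equivalence(("zabbix", "high_cpu"), ("nagios", "HIGH_CPU"))
classes.add_equivalence(("nagios", "HIGH_CPU"), ("prometheus", "High CPU"))
```

With these two definitions, the Zabbix and Prometheus alarms end up in the same
class even though the user never related them directly.
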
**Note**: We must support both hard-coded and user-defined equivalence
definitions.
* Hard-coded equivalence: k8s vms always map to Nova vms by the same strategy.
We can't let the user change it.
* User-defined equivalence: the end user may decide that two alarms are, or are
not, equivalent. The user should be able to change this definition at any
time. The equivalence definition should be tenant-specific (see the section
about multi tenancy).
Merge Strategy
--------------
There are different approaches to what information the user should see when
there is a conflict between two datasources. The user should be able to choose
the desired "merge strategy" out of the following options:
#. last_update: Use the properties from the last update.
#. most_credible: Use the properties from the most credible datasource.
A 'credibility' property should be added to each datasource. By default,
most datasources will have 'medium' credibility, except for Vitrage, which
will have 'low' credibility. The user will be able to change it in the
vitrage.conf options.
If the equivalent datasources have the same credibility, last_update merge
strategy will be used.
#. worst_state: In case of state/severity calculation: Use the worst state of
all.
The default, which is the current behavior, will be worst_state.
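
The three strategies could be sketched as follows. The ``Report`` class, the
numeric orderings and the 'high' credibility level are assumptions made for
illustration; the document only mentions 'medium' and 'low'.

```python
from dataclasses import dataclass

# Assumed orderings for illustration; the real scales are not defined here.
SEVERITY_ORDER = {"OK": 0, "WARNING": 1, "CRITICAL": 2}
CREDIBILITY_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Report:
    datasource: str
    severity: str
    credibility: str
    timestamp: int  # larger means more recent


def merge_severity(reports, strategy="worst_state"):
    """Return the aggregated severity of a list of equivalent reports."""
    if strategy == "last_update":
        return max(reports, key=lambda r: r.timestamp).severity
    if strategy == "most_credible":
        # Equal credibility falls back to the most recent update.
        best = max(
            reports,
            key=lambda r: (CREDIBILITY_ORDER[r.credibility], r.timestamp),
        )
        return best.severity
    if strategy == "worst_state":
        return max(reports, key=lambda r: SEVERITY_ORDER[r.severity]).severity
    raise ValueError("unknown merge strategy: %s" % strategy)


# A monitored alarm followed by a Vitrage deduced alarm.
reports = [
    Report("nagios", "WARNING", "medium", timestamp=1),
    Report("vitrage", "CRITICAL", "low", timestamp=2),
]
```

For these two reports, ``last_update`` and ``worst_state`` yield CRITICAL, while
``most_credible`` yields WARNING because Nagios is more credible than Vitrage.
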
Equivalence Use Cases
=====================
1. Two datasources report the same resource
-------------------------------------------
1.1. Nova reports first, then Vitrage discovery agent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nova host datasource asks to create nova.host entity
#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
Expected behavior: Vitrage API returns a single host
1.2. Vitrage discovery agent reports first
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Similar to 1.1, but the Vitrage discovery agent reports first.
1.3. Nova reports again on the next get_all
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. An entity in the graph already exists for the host, with properties from
both datasources
#. Nova host datasource reports the same host again
Expected behavior: There should be no change in what the API returns
1.4. Conflict in the host state
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nova host datasource asks to create nova.host entity with state ERROR
#. Vitrage discovery agent datasource asks to create host entity with state
ACTIVE
Expected behavior: Vitrage API returns a single host with a state that depends
on the merge strategy.

+----------------+------------------+
| Merge Strategy | Aggregated state |
+================+==================+
| last_update    | ACTIVE           |
+----------------+------------------+
| most_credible  | ERROR            |
+----------------+------------------+
| worst_state    | ERROR            |
+----------------+------------------+

1.5. Nova and K8s have different vm names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nova instance datasource asks to create nova.instance entity named 'vm1'
#. K8s datasource asks to create instance entity named 'VM_1'
Both vms are equivalent by the Nova UUID.
Expected behavior: Vitrage API will return a single instance. Its name will
be determined by one of the datasources in a consistent way (meaning it will
be either always the K8s name or always the Nova name).
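
One way to get a consistent name is a fixed datasource precedence. The sketch
below assumes hypothetical property names (``id`` for Nova, ``externalID`` for
K8s) and a hypothetical precedence list; none of these are confirmed Vitrage
identifiers.

```python
# Hypothetical mapping: which property identifies a resource per datasource.
KEY_PROPERTY = {"nova.instance": "id", "k8s": "externalID"}
# Hypothetical fixed precedence used to resolve conflicting properties.
PRECEDENCE = ["nova.instance", "k8s"]


def equivalence_key(datasource, properties):
    """Return the value that decides whether two resources are equivalent."""
    return properties[KEY_PROPERTY[datasource]]


def merged_name(reports):
    """Pick the name from the highest-precedence datasource that reported."""
    by_source = {source: props for source, props in reports}
    for source in PRECEDENCE:
        if source in by_source:
            return by_source[source]["name"]
    raise ValueError("no known datasource reported this resource")


nova = ("nova.instance", {"id": "uuid-1", "name": "vm1"})
k8s = ("k8s", {"externalID": "uuid-1", "name": "VM_1"})
```

Because the precedence is fixed, the returned name does not depend on which
datasource reported first.
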
1.6. One datasource stops reporting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nova host datasource asks to create nova.host entity
#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
#. ...
#. Vitrage discovery agent crashes and stops reporting
#. In the next get_all, Vitrage discovery agent reports nothing
Expected behavior:
* The host is not deleted
* The data that was provided by Nova is returned
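
The expected behavior above suggests keeping each datasource's contribution
separately, so that one datasource can disappear without deleting the entity.
A minimal sketch, in which the ``MergedEntity`` class and the ``kernel``
property are invented for the example:

```python
class MergedEntity:
    """Keeps each datasource's properties separately (illustrative only)."""

    def __init__(self):
        self._by_source = {}

    def update(self, datasource, properties):
        self._by_source[datasource] = dict(properties)

    def remove_source(self, datasource):
        self._by_source.pop(datasource, None)

    @property
    def exists(self):
        # The entity lives as long as at least one datasource reports it.
        return bool(self._by_source)

    def view(self):
        """Return the properties of all datasources that still report."""
        merged = {}
        for properties in self._by_source.values():
            merged.update(properties)
        return merged


host = MergedEntity()
host.update("nova.host", {"name": "host-1", "state": "ACTIVE"})
host.update("discovery_agent", {"name": "host-1", "kernel": "4.15"})
host.remove_source("discovery_agent")  # the agent reported nothing in get_all
```

After the discovery agent's contribution is removed, the host still exists and
exposes only the data provided by Nova.
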
2. Two monitors report the same alarm (e.g. Zabbix and Prometheus)
------------------------------------------------------------------
2.1. Zabbix reports CRITICAL, Nagios reports WARNING
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
#. Nagios datasource asks to create a Nagios alarm with severity WARNING
Expected behavior: Vitrage API returns a single alarm with a severity that
depends on the merge strategy.

+----------------+---------------------+
| Merge Strategy | Aggregated severity |
+================+=====================+
| last_update    | WARNING             |
+----------------+---------------------+
| most_credible  | CRITICAL            |
+----------------+---------------------+
| worst_state    | CRITICAL            |
+----------------+---------------------+

2.2. Zabbix reports CRITICAL, Nagios reports WARNING, Zabbix reports OK
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nagios datasource asks to create a Nagios alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
#. Zabbix datasource changes the severity to OK
Expected behavior: depends on the merge strategy.

+----------------+---------------------------+
| Merge Strategy | Aggregated severity       |
+================+===========================+
| last_update    | OK (the alarm is deleted) |
+----------------+---------------------------+
| most_credible  | WARNING                   |
+----------------+---------------------------+
| worst_state    | WARNING                   |
+----------------+---------------------------+

2.3. Zabbix, Nagios and Prometheus report the same alarm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume that the merge strategy is worst_state.
#. Prometheus datasource asks to create Prometheus alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
3. Two monitors report similar yet different alarms
---------------------------------------------------
#. Nagios datasource asks to create a Nagios "high CPU" alarm
#. Zabbix datasource asks to create a Zabbix "extremely high CPU" alarm
Expected behavior: Vitrage API returns two alarms
4. A monitor reports the same alarm as a Vitrage deduced alarm
--------------------------------------------------------------
This use case is also detailed in https://review.openstack.org/#/c/547931/
4.1. Nagios reports first
^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nagios datasource asks to create a Nagios alarm with severity WARNING
#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
Expected behavior: Vitrage API returns a single alarm with a severity that
depends on the merge strategy.

+----------------+---------------------+
| Merge Strategy | Aggregated severity |
+================+=====================+
| last_update    | CRITICAL            |
+----------------+---------------------+
| most_credible  | WARNING             |
+----------------+---------------------+
| worst_state    | CRITICAL            |
+----------------+---------------------+

4.2. Nagios reports alarm, Vitrage deduced alarm, Nagios reports OK
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nagios datasource asks to create a Nagios alarm
#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
#. Nagios datasource asks to delete the Nagios alarm
Expected behavior: depends on the merge strategy.

+----------------+---------------------------+
| Merge Strategy | Aggregated severity       |
+================+===========================+
| last_update    | OK (the alarm is deleted) |
+----------------+---------------------------+
| most_credible  | OK (the alarm is deleted) |
+----------------+---------------------------+
| worst_state    | WARNING                   |
+----------------+---------------------------+

The behavior for worst_state strategy:
* The alarm is not deleted (Vitrage still identifies a problem, let's not
ignore it)
* The alarm contains all Vitrage properties
* A diagnose action is executed, if such an action is defined
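
The worst_state behavior for this use case can be sketched by tracking the
severity last reported by each source and treating a deleted alarm as OK. The
severity ordering and the initial Nagios severity below are assumptions; the
use case does not state them.

```python
SEVERITY_ORDER = {"OK": 0, "WARNING": 1, "CRITICAL": 2}  # assumed ordering


def worst_state(severity_by_source):
    """Return the worst severity still reported, or OK if none is active."""
    active = [s for s in severity_by_source.values() if s != "OK"]
    if not active:
        return "OK"
    return max(active, key=SEVERITY_ORDER.__getitem__)


# Nagios severity is not given in the use case; WARNING is assumed here.
sources = {"nagios": "WARNING", "vitrage": "WARNING"}
sources["nagios"] = "OK"  # Nagios asks to delete its alarm
```

The aggregated alarm stays at WARNING as long as the deduced alarm is active,
and becomes OK only once every source has cleared it.
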
4.3. Nagios, Zabbix and Vitrage report an alarm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Nagios datasource asks to create a Nagios alarm with severity WARNING
#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
Expected behavior: Vitrage API returns a single alarm with properties from
Nagios, Zabbix and Vitrage and severity that depends on the merge strategy.

+----------------+---------------------+
| Merge Strategy | Aggregated severity |
+================+=====================+
| last_update    | WARNING             |
+----------------+---------------------+
| most_credible  | WARNING             |
+----------------+---------------------+
| worst_state    | CRITICAL            |
+----------------+---------------------+

5. The user changes the alarm equivalence definition
----------------------------------------------------
5.1. Nagios, Zabbix and Vitrage are equivalent, then the user changes it
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume that the merge strategy is last_update.
#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
#. Vitrage API returns a single alarm with severity CRITICAL
#. The user changes the equivalence definition so Vitrage and Zabbix are
equivalent to each other but Nagios is not equivalent to them
Expected behavior: Vitrage API returns two alarms:
* Zabbix+Vitrage alarm with severity WARNING
* Nagios alarm with severity CRITICAL
**Note:** Since in Rocky we are going to implement vitrage-graph start-up from
the database, it makes no real difference whether or not the user restarts the
graph after changing the equivalence definition.
5.2. Zabbix and Vitrage are equivalent, then the makes Nagios equivalent too
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume that the merge strategy is last_update.
#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
#. Vitrage API returns two alarms:
* Zabbix+Vitrage alarm with severity WARNING
* Nagios alarm with severity CRITICAL
#. The user changes the equivalence definition so Vitrage, Zabbix and Nagios
are equivalent to each other
Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
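
Use cases 5.1 and 5.2 both amount to regrouping the same reported alarms under
a new definition. A sketch under the last_update strategy, where the function
name, the timestamps and the assumption that equivalence sets are disjoint are
all illustrative:

```python
def group_alarms(alarms, equivalence_sets):
    """Group alarm sources into merged alarms according to the definition.

    `alarms` maps a source name to (severity, timestamp). `equivalence_sets`
    is a list of disjoint sets of source names; a source that appears in no
    set stays an alarm of its own.
    """
    groups = [set(s) & set(alarms) for s in equivalence_sets]
    grouped = set().union(*groups)
    groups += [{source} for source in alarms if source not in grouped]
    result = {}
    for group in groups:
        if not group:
            continue
        latest = max(group, key=lambda source: alarms[source][1])
        result[frozenset(group)] = alarms[latest][0]  # last_update strategy
    return result


alarms = {
    "vitrage": ("WARNING", 1),
    "zabbix": ("WARNING", 2),
    "nagios": ("CRITICAL", 3),
}
all_equivalent = [{"vitrage", "zabbix", "nagios"}]
nagios_separate = [{"vitrage", "zabbix"}]
```

Calling ``group_alarms`` with ``all_equivalent`` gives one CRITICAL alarm, and
calling it with ``nagios_separate`` gives a WARNING Zabbix+Vitrage alarm plus a
separate CRITICAL Nagios alarm.
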
6. Template on one datasource should apply to another datasource
----------------------------------------------------------------
6.1. Simple alarm equivalence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
Template example:
::

    definitions:
        entities:
            - entity:
                category: ALARM
                rawtext: high_cpu
                type: zabbix
                template_id: zabbix_alarm
    scenarios:
        - scenario:
            condition: zabbix_alarm_on_host
            actions:
                - ...

#. Nagios datasource asks to create a Nagios HIGH_CPU alarm
#. Zabbix datasource DOES NOT ask to create a Zabbix high_cpu alarm (yet)
Expected behavior: the actions in the scenario are executed as a result of the
Nagios alarm.
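
The matching rule implied here could look like the following sketch. The
``(type, name)`` tuples and the ``EQUIVALENT_ALARMS`` structure are assumptions
made for the example and do not describe the actual template evaluator.

```python
# Hypothetical equivalence definition: each set lists (type, name) pairs.
EQUIVALENT_ALARMS = [{("zabbix", "high_cpu"), ("nagios", "HIGH_CPU")}]


def matches_template_entity(alarm, template_entity, equivalences=EQUIVALENT_ALARMS):
    """Return True if `alarm` is the template entity or equivalent to it."""
    if alarm == template_entity:
        return True
    return any(
        alarm in group and template_entity in group for group in equivalences
    )


template_entity = ("zabbix", "high_cpu")  # taken from the template definitions
```

A Nagios HIGH_CPU alarm matches the Zabbix template entity, so the scenario
actions would run even though Zabbix itself has not reported anything.
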
6.2. Simple resource equivalence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume that Nova host is equivalent to Vitrage discovery agent host.
Template example:
::

    definitions:
        entities:
            - entity:
                category: RESOURCE
                type: nova.host
                template_id: nova_host
            - entity:
                category: RESOURCE
                type: discovery_host (???)
                template_id: discovery_host
    scenarios:
        - scenario:
            condition: discovery_host and discovery_host_contains_instance
            actions:
                - ...

Expected behavior: the scenario will work if the host contains an instance, no
matter if the host is defined by Nova or by Vitrage discovery agent.
6.3. Alarm equivalence + resource equivalence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm
**and** Nova host is equivalent to Vitrage discovery agent host.
Template example:
::

    scenarios:
        - scenario:
            condition: discovery_host and discovery_host_contains_instance and
                zabbix_alarm_on_discovery_host
            actions:
                - ...

Expected behavior: the scenario will work if the host contains an instance, no
matter if the host is defined by Nova or by Vitrage discovery agent; and if
either a Zabbix alarm or a Nagios alarm was raised on the host.
7. Template on one datasource should **not** apply to another datasource
------------------------------------------------------------------------
Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
Template example:
::

    definitions:
        entities:
            - entity:
                category: ALARM
                rawtext: high_cpu
                type: zabbix
                severity: warning
                template_id: zabbix_alarm
            - entity:
                category: ALARM
                name: HIGH_CPU
                type: nagios
                template_id: nagios_alarm
    scenarios:
        - scenario:
            condition: zabbix_alarm_on_host
            actions:
                - ...

This use case is the same as 6.1, with one exception: the template entity
zabbix_alarm is defined only for the case that the severity is warning. What
will happen if a Nagios alarm is raised with severity warning? And what if it
is raised with a different severity?
8. Overlapping templates
------------------------
Is the overlapping templates mechanism somehow related to the equivalence use
cases?
9. Multi Tenancy
----------------
Per-tenant equivalence
^^^^^^^^^^^^^^^^^^^^^^
Entity equivalence should be defined for a specific tenant. One tenant may want
to see Nagios and Zabbix alarms as one alarm, while the other tenant may want
to see them separated.
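
Per-tenant definitions could be modelled as a lookup keyed by tenant. In this
sketch the tenant names, the ``TENANT_EQUIVALENCES`` table and the function are
hypothetical and only illustrate the requirement.

```python
# Hypothetical per-tenant definitions; names are for illustration only.
TENANT_EQUIVALENCES = {
    "tenant_1": [{"zabbix", "nagios"}],  # sees the two alarms as one
    "tenant_2": [],                      # sees them separately
}


def visible_alarm_count(tenant, reporting_sources):
    """Count the alarms a tenant sees after merging equivalent sources."""
    remaining = set(reporting_sources)
    count = 0
    for group in TENANT_EQUIVALENCES.get(tenant, []):
        if remaining & group:
            count += 1
            remaining -= group
    return count + len(remaining)
```

With both monitors reporting, ``tenant_1`` sees one alarm while ``tenant_2``
sees two.
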
Cross-tenant equivalence
^^^^^^^^^^^^^^^^^^^^^^^^
Is it possible that equivalent resources will be reported on different tenants?
#. Nova instance datasource asks to create nova.instance for tenant_1
#. k8s datasource asks to create instance (nova.instance?) with the same UUID
for tenant_2
What do we do in such a case?


@ -93,3 +93,4 @@ Design Documents
contributor/templates-loading
contributor/vitrage-ha-and-history-vision
contributor/datasource-snmp-parsing-support
contributor/entity_equivalence_use_cases