Added a document with entity equivalence use cases
This document should define the functional requirements for blueprints like merge alarms and merge resources.

Change-Id: Ie65b140607b38c880d0a10b3d63c9d58b49d8b1d
parent
7fc2d5e2a0
commit
3a4344deeb
499
doc/source/contributor/entity_equivalence_use_cases.rst
Normal file
@ -0,0 +1,499 @@
============================
Entity Equivalence Use Cases
============================

Background
==========

There are several use cases that require support for either alarm equivalence
or resource equivalence. The design of these features is in progress and is
not trivial. The purpose of this document is to define the basic requirements
and use cases that should be supported, regardless of the implementation that
is selected later on.

The term "equivalence" denotes resources or alarms that are "equal"
although they are reported by different datasources and some of their
properties might conflict. Alternative terms could be equality, merge,
overlapping, etc.


Basic Equivalence Requirements
==============================

Resource Equivalence
--------------------

We currently have two use cases for resource equivalence:

#. K8s datasource reports VMs that are also reported by Nova
#. Vitrage discovery agent (TBD) reports hosts that are also reported by Nova

Both cases could perhaps be solved by hard-coding the equivalence in the
datasources themselves. This option should be checked against the use cases.

Alarm Equivalence
-----------------

We should support the following use cases:

#. Equivalent alarms from different monitors, e.g. Zabbix and Nagios
#. Non-equivalent alarms from different monitors, e.g. Zabbix and Nagios
   (meaning the alarms are similar but not the same)
#. Equivalence between a monitored alarm and a Vitrage deduced alarm

Equivalence Definition
----------------------

In order to support these use cases, we **must** define a way for the user to
determine which entities are equivalent.

For resources we should define:

* Which properties determine the equivalence. E.g. Nova instance UUID equals
  k8s vm externalID
* Optional: which property should be used in case of conflict (could it be
  chosen arbitrarily or hard-coded?)

For alarms we should define:

* Which properties determine the equivalence. E.g. Zabbix alarm name "HIGH CPU"
  equals Prometheus alarm name "high cpu".
* Hidden assumption: equivalent alarms are always "on" the same resource.

Equivalence should be transitive. If the user defines two equivalences with a
common entity, then all entities should be equivalent to one another.

For example:

* Zabbix high_cpu ~ Nagios HIGH_CPU
* Nagios HIGH_CPU ~ Prometheus High CPU

Vitrage will handle Zabbix, Nagios and Prometheus CPU alarms as all equivalent
to one another.
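
One possible way to obtain the transitive behavior described above is a
union-find (disjoint-set) structure over alarm keys. The sketch below is only
an illustration and not part of Vitrage: the ``EquivalenceClasses`` class and
the ``alarm_key`` normalization (datasource plus lower-cased name) are
assumptions made for the example.

```python
class EquivalenceClasses:
    """Groups alarm keys into equivalence classes using union-find."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Register unseen keys as their own class, then walk up to the root.
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            # Path halving keeps later lookups short.
            self._parent[key] = self._parent[self._parent[key]]
            key = self._parent[key]
        return key

    def declare_equivalent(self, key_a, key_b):
        """Record one hard-coded or user-defined equivalence."""
        root_a, root_b = self._find(key_a), self._find(key_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_equivalent(self, key_a, key_b):
        return self._find(key_a) == self._find(key_b)


def alarm_key(datasource, name):
    # Hypothetical normalization: datasource plus a lower-cased alarm name.
    return (datasource.lower(), name.strip().lower())


classes = EquivalenceClasses()
classes.declare_equivalent(alarm_key("zabbix", "high_cpu"),
                           alarm_key("nagios", "HIGH_CPU"))
classes.declare_equivalent(alarm_key("nagios", "HIGH_CPU"),
                           alarm_key("prometheus", "High CPU"))
```

With only these two declarations, the Zabbix and Prometheus alarms end up in
the same class even though the user never paired them directly.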

**Note**: We must support both hard-coded and user-defined equivalence
definitions.

* Hard-coded equivalence: k8s vms always map to Nova vms by the same strategy.
  We can't let the user change it.
* User-defined equivalence: the end user may decide that two alarms are, or are
  not, equivalent. The user should be able to change this definition at any
  time. The equivalence definition should be tenant-specific (see the section
  about multi tenancy).

Merge Strategy
--------------

There are different approaches for what information the user should see in case
there is a conflict between two datasources. The user should be able to select
the desired "merge strategy" out of the following options:

#. last_update: Use the properties from the last update.
#. most_credible: Use the properties from the most credible datasource.
   A 'credibility' property should be added to each datasource. By default,
   most datasources will have 'medium' credibility, except for Vitrage, which
   will have 'low' credibility. The user will be able to change it in
   vitrage.conf options.
   If the equivalent datasources have the same credibility, the last_update
   merge strategy will be used.
#. worst_state: In case of state/severity calculation, use the worst state of
   all.

The default, which is the current behavior, will be worst_state.
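
The three options could be sketched in Python roughly as follows. This is a
minimal illustration under stated assumptions, not Vitrage code: the
``Report`` class, the ``STATE_RANK`` and ``CREDIBILITY_RANK`` orderings, the
'high' credibility level, and the idea that the discovery agent inherits
Vitrage's 'low' credibility are all made up for the example.

```python
from dataclasses import dataclass

# Assumed orderings; the text names the levels but does not rank them.
CREDIBILITY_RANK = {"low": 0, "medium": 1, "high": 2}
STATE_RANK = {"OK": 0, "ACTIVE": 0, "WARNING": 1, "CRITICAL": 2, "ERROR": 2}


@dataclass
class Report:
    datasource: str
    state: str
    credibility: str
    timestamp: float


def last_update(reports):
    return max(reports, key=lambda r: r.timestamp)


def most_credible(reports):
    top = max(CREDIBILITY_RANK[r.credibility] for r in reports)
    candidates = [r for r in reports
                  if CREDIBILITY_RANK[r.credibility] == top]
    # Equal credibility falls back to last_update, as stated above.
    return last_update(candidates)


def worst_state(reports):
    return max(reports, key=lambda r: STATE_RANK[r.state])


STRATEGIES = {
    "last_update": last_update,
    "most_credible": most_credible,
    "worst_state": worst_state,
}


def aggregated_state(reports, strategy="worst_state"):
    """Return the state chosen by the given merge strategy."""
    return STRATEGIES[strategy](reports).state


# A host reported as ERROR by Nova and later as ACTIVE by the discovery agent.
reports = [
    Report("nova.host", "ERROR", "medium", timestamp=1.0),
    Report("vitrage.discovery", "ACTIVE", "low", timestamp=2.0),
]
```

For these two reports, ``last_update`` selects ACTIVE, while ``most_credible``
and ``worst_state`` both select ERROR.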

Equivalence Use Cases
=====================

1. Two datasources report the same resource
-------------------------------------------

1.1. Nova reports first, then Vitrage discovery agent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nova host datasource asks to create nova.host entity
#. Vitrage discovery agent datasource asks to create host (nova.host?) entity

Expected behavior: Vitrage API returns a single host

1.2. Vitrage discovery agent reports first
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to 1.1, but the discovery agent reports first

1.3. Nova reports again on the next get_all
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. An entity in the graph already exists for the host, with properties from
   both datasources
#. Nova host datasource reports the same host again

Expected behavior: There should be no change in what the API returns

1.4. Conflict in the host state
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nova host datasource asks to create nova.host entity with state ERROR
#. Vitrage discovery agent datasource asks to create host entity with state
   ACTIVE

Expected behavior: Vitrage API returns a single host with a state that depends
on the merge strategy.

+----------------+------------------+
| Merge Strategy | Aggregated state |
+================+==================+
| last_update    | ACTIVE           |
+----------------+------------------+
| most_credible  | ERROR            |
+----------------+------------------+
| worst_state    | ERROR            |
+----------------+------------------+

1.5. Nova and K8s have different vm names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nova instance datasource asks to create nova.instance entity named 'vm1'
#. K8s datasource asks to create instance entity named 'VM_1'

Both vms are equivalent by the Nova UUID.

Expected behavior: Vitrage API will return a single instance. Its name will
be determined by one of the datasources in a consistent way (meaning it will
be either always the K8s name or always the Nova name).

1.6. One datasource stops reporting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nova host datasource asks to create nova.host entity
#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
#. ...
#. Vitrage discovery agent crashes and stops reporting
#. In the next get_all, Vitrage discovery agent reports nothing

Expected behavior:

* The host is not deleted
* The data that was provided by Nova is returned
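
One way to satisfy this expectation is to remember which datasource
contributed which properties, so that a missing report removes only that
datasource's contribution. The sketch below is hypothetical: the
``MergedEntity`` class and the ``kernel`` property are invented for the
example, and conflict resolution is reduced to "later-registered datasources
overwrite earlier ones" instead of a real merge strategy.

```python
class MergedEntity:
    """Keeps the properties of one logical entity, per reporting datasource."""

    def __init__(self):
        self._by_datasource = {}

    def report(self, datasource, properties):
        # Store a copy so later changes by the caller do not leak in.
        self._by_datasource[datasource] = dict(properties)

    def withdraw(self, datasource):
        # Called when a datasource no longer reports the entity.
        self._by_datasource.pop(datasource, None)

    @property
    def exists(self):
        # The entity lives as long as at least one datasource reports it.
        return bool(self._by_datasource)

    def properties(self):
        merged = {}
        for props in self._by_datasource.values():
            merged.update(props)
        return merged


host = MergedEntity()
host.report("nova.host", {"name": "host-1", "state": "ACTIVE"})
host.report("vitrage.discovery", {"name": "host-1", "kernel": "4.15"})
host.withdraw("vitrage.discovery")  # the discovery agent reports nothing
```

After the discovery agent is withdrawn, the host still exists and exposes only
the properties that Nova provided.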

2. Two monitors report the same alarm (e.g. Zabbix and Prometheus)
------------------------------------------------------------------

2.1. Zabbix reports CRITICAL, Nagios reports WARNING
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
#. Nagios datasource asks to create a Nagios alarm with severity WARNING

Expected behavior: Vitrage API returns a single alarm with a severity that
depends on the merge strategy.

+----------------+---------------------+
| Merge Strategy | Aggregated severity |
+================+=====================+
| last_update    | WARNING             |
+----------------+---------------------+
| most_credible  | CRITICAL            |
+----------------+---------------------+
| worst_state    | CRITICAL            |
+----------------+---------------------+

2.2. Zabbix reports CRITICAL, Nagios reports WARNING, Zabbix reports OK
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nagios datasource asks to create a Nagios alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
#. Zabbix datasource changes the severity to OK

Expected behavior: depends on the merge strategy.

+----------------+---------------------------+
| Merge Strategy | Aggregated severity       |
+================+===========================+
| last_update    | OK (the alarm is deleted) |
+----------------+---------------------------+
| most_credible  | WARNING                   |
+----------------+---------------------------+
| worst_state    | WARNING                   |
+----------------+---------------------------+

2.3. Zabbix, Nagios and Prometheus report the same alarm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume that the merge strategy is worst_state.

#. Prometheus datasource asks to create Prometheus alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL

Expected behavior: Vitrage API returns a single alarm with severity CRITICAL

3. Two monitors report similar yet different alarms
---------------------------------------------------

#. Nagios datasource asks to create a Nagios "high CPU" alarm
#. Zabbix datasource asks to create a Zabbix "extremely high CPU" alarm

Expected behavior: Vitrage API returns two alarms

4. A monitor reports the same alarm as a Vitrage deduced alarm
--------------------------------------------------------------

This use case is also detailed in https://review.openstack.org/#/c/547931/

4.1. Nagios reports first
^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nagios datasource asks to create a Nagios alarm with severity WARNING
#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL

Expected behavior: Vitrage API returns a single alarm with a severity that
depends on the merge strategy.

+----------------+---------------------+
| Merge Strategy | Aggregated severity |
+================+=====================+
| last_update    | CRITICAL            |
+----------------+---------------------+
| most_credible  | WARNING             |
+----------------+---------------------+
| worst_state    | CRITICAL            |
+----------------+---------------------+

4.2. Nagios reports alarm, Vitrage deduced alarm, Nagios reports OK
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nagios datasource asks to create a Nagios alarm
#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
#. Nagios datasource asks to delete the Nagios alarm

Expected behavior: depends on the merge strategy.

+----------------+---------------------------+
| Merge Strategy | Aggregated severity       |
+================+===========================+
| last_update    | OK (the alarm is deleted) |
+----------------+---------------------------+
| most_credible  | OK (the alarm is deleted) |
+----------------+---------------------------+
| worst_state    | WARNING                   |
+----------------+---------------------------+

The behavior for worst_state strategy:

* The alarm is not deleted (Vitrage still identifies a problem, let's not
  ignore it)
* The alarm contains all Vitrage properties
* A diagnose action is executed, if such an action is defined

4.3. Nagios, Zabbix and Vitrage report an alarm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Nagios datasource asks to create a Nagios alarm with severity WARNING
#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING

Expected behavior: Vitrage API returns a single alarm with properties from
Nagios, Zabbix and Vitrage and severity that depends on the merge strategy.

+----------------+---------------------+
| Merge Strategy | Aggregated severity |
+================+=====================+
| last_update    | WARNING             |
+----------------+---------------------+
| most_credible  | WARNING             |
+----------------+---------------------+
| worst_state    | CRITICAL            |
+----------------+---------------------+

5. The user changes the alarm equivalence definition
----------------------------------------------------

5.1. Nagios, Zabbix and Vitrage are equivalent, then the user changes it
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume that the merge strategy is last_update.

#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
#. Vitrage API returns a single alarm with severity CRITICAL
#. The user changes the equivalence definition so Vitrage and Zabbix are
   equivalent to each other but Nagios is not equivalent to them

Expected behavior: Vitrage API returns two alarms:

* Zabbix+Vitrage alarm with severity WARNING
* Nagios alarm with severity CRITICAL

**Note:** Since in Rocky we are going to implement vitrage-graph start-up from
the database, there is no real difference whether or not the user restarts the
graph after changing the equivalence definition.

5.2. Zabbix and Vitrage are equivalent, then the user makes Nagios equivalent too
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume that the merge strategy is last_update.

#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
#. Vitrage API returns two alarms:

   * Zabbix+Vitrage alarm with severity WARNING
   * Nagios alarm with severity CRITICAL

#. The user changes the equivalence definition so Vitrage, Zabbix and Nagios
   are equivalent to each other

Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
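
Both variants of use case 5 can be viewed as recomputing the alarm groups from
the stored reports whenever the equivalence definition changes. The sketch
below illustrates that idea only; the ``group_alarms`` function, the dict-based
alarm records and the integer timestamps are assumptions made for the example.

```python
def group_alarms(alarms, equivalent_pairs):
    """Return {frozenset(sources): severity}, one entry per equivalence class."""
    # Start with every source in its own class, then merge the declared pairs.
    group_of = {alarm["source"]: {alarm["source"]} for alarm in alarms}
    for a, b in equivalent_pairs:
        if a in group_of and b in group_of and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for source in merged:
                group_of[source] = merged

    result = {}
    for alarm in sorted(alarms, key=lambda item: item["timestamp"]):
        members = frozenset(group_of[alarm["source"]])
        # last_update: a later report overwrites the severity of its class.
        result[members] = alarm["severity"]
    return result


alarms = [
    {"source": "vitrage", "severity": "WARNING", "timestamp": 1},
    {"source": "zabbix", "severity": "WARNING", "timestamp": 2},
    {"source": "nagios", "severity": "CRITICAL", "timestamp": 3},
]

# Equivalence definitions before and after the user's change.
all_equivalent = [("vitrage", "zabbix"), ("zabbix", "nagios")]
nagios_separate = [("vitrage", "zabbix")]

merged_view = group_alarms(alarms, all_equivalent)
split_view = group_alarms(alarms, nagios_separate)
```

Here ``merged_view`` holds a single CRITICAL alarm, while ``split_view`` holds
a Zabbix+Vitrage alarm with severity WARNING and a separate Nagios alarm with
severity CRITICAL.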

6. Template on one datasource should apply to another datasource
----------------------------------------------------------------

6.1. Simple alarm equivalence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.

Template example:

::

    definitions:
        entities:
            - entity:
                category: ALARM
                rawtext: high_cpu
                type: zabbix
                template_id: zabbix_alarm

    scenarios:
        - scenario:
            condition: zabbix_alarm_on_host
            actions:
                - ...

#. Nagios datasource asks to create a Nagios HIGH_CPU alarm
#. Zabbix datasource DOES NOT ask to create a Zabbix high_cpu alarm (yet)

Expected behavior: the actions in the scenario are executed as a result of the
Nagios alarm.
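
A rough way to picture this behavior is that a template entity matches any
alarm that belongs to its equivalence class, not only alarms of its own type.
The sketch below is purely illustrative: ``EQUIVALENCE_CLASSES``,
``equivalents_of`` and ``template_entity_matches`` are hypothetical names, and
alarm names are compared case-insensitively as a simplifying assumption.

```python
# Hypothetical equivalence definition: each set is one equivalence class.
EQUIVALENCE_CLASSES = [
    {("zabbix", "high_cpu"), ("nagios", "high_cpu")},
]


def equivalents_of(key):
    """Return the equivalence class of a key, or just the key itself."""
    for members in EQUIVALENCE_CLASSES:
        if key in members:
            return members
    return {key}


def template_entity_matches(template_entity, alarm):
    """Check whether an alarm satisfies a template entity via equivalence."""
    template_key = (template_entity["type"],
                    template_entity["rawtext"].lower())
    alarm_key = (alarm["type"], alarm["name"].lower())
    return alarm_key in equivalents_of(template_key)


zabbix_alarm = {"category": "ALARM", "rawtext": "high_cpu", "type": "zabbix"}
nagios_report = {"type": "nagios", "name": "HIGH_CPU"}
unrelated_report = {"type": "nagios", "name": "DISK_FULL"}
```

Under these assumptions the Nagios HIGH_CPU report satisfies the
``zabbix_alarm`` template entity, while the unrelated Nagios alarm does not.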

6.2. Simple resource equivalence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume that Nova host is equivalent to Vitrage discovery agent host.

Template example:

::

    definitions:
        entities:
            - entity:
                category: RESOURCE
                type: nova.host
                template_id: nova_host
            - entity:
                category: RESOURCE
                type: discovery_host (???)
                template_id: discovery_host

    scenarios:
        - scenario:
            condition: discovery_host and discovery_host_contains_instance
            actions:
                - ...

Expected behavior: the scenario will work if the host contains an instance, no
matter if the host is defined by Nova or by Vitrage discovery agent.

6.3. Alarm equivalence + resource equivalence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm
**and** Nova host is equivalent to Vitrage discovery agent host.

Template example:

::

    scenarios:
        - scenario:
            condition: discovery_host and discovery_host_contains_instance and
                       zabbix_alarm_on_discovery_host
            actions:
                - ...

Expected behavior: the scenario will work if the host contains an instance, no
matter if the host is defined by Nova or by Vitrage discovery agent; and if
either a Zabbix alarm or a Nagios alarm was raised on the host.

7. Template on one datasource should **not** apply to another datasource
------------------------------------------------------------------------

Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.

Template example:

::

    definitions:
        entities:
            - entity:
                category: ALARM
                rawtext: high_cpu
                type: zabbix
                severity: warning
                template_id: zabbix_alarm
            - entity:
                category: ALARM
                name: HIGH_CPU
                type: nagios
                template_id: nagios_alarm

    scenarios:
        - scenario:
            condition: zabbix_alarm_on_host
            actions:
                - ...

This use case is the same as 6.1, with one exception: the template entity
zabbix_alarm is defined only for the case that the severity is warning. What
will happen if a Nagios alarm is raised with severity warning? And what if it
is raised with a different severity?

8. Overlapping templates
------------------------

Is the overlapping templates mechanism somehow related to the equivalence use
cases?

9. Multi Tenancy
----------------

Per-tenant equivalence
^^^^^^^^^^^^^^^^^^^^^^

Entity equivalence should be defined for a specific tenant. One tenant may want
to see Nagios and Zabbix alarms as one alarm, while another tenant may want
to see them separately.

Cross-tenant equivalence
^^^^^^^^^^^^^^^^^^^^^^^^

Is it possible that equivalent resources will be reported on different tenants?

#. Nova instance datasource asks to create nova.instance for tenant_1
#. k8s datasource asks to create instance (nova.instance?) with the same UUID
   for tenant_2

What do we do in such a case?
@ -93,3 +93,4 @@ Design Documents
contributor/templates-loading
contributor/vitrage-ha-and-history-vision
contributor/datasource-snmp-parsing-support
contributor/entity_equivalence_use_cases