Added a document with entity equivalence use cases

This document should define the functional requirements for blueprints like
merge alarms and merge resources.

Change-Id: Ie65b140607b38c880d0a10b3d63c9d58b49d8b1d
2018-03-07 17:08:54 +00:00
commit 3a4344deeb
parent 7fc2d5e2a0
2 changed files with 500 additions and 0 deletions
--- a/doc/source/contributor/entity_equivalence_use_cases.rst
+++ b/doc/source/contributor/entity_equivalence_use_cases.rst
@@ -0,0 +1,499 @@
+============================
+Entity Equivalence Use Cases
+============================
+
+Background
+==========
+
+There are several use cases that require support for either alarm equivalence
+or resource equivalence. The design of these features is in progress, and is
+not trivial. The purpose of this document is to define the basic requirements
+and use cases that should be supported, regardless of the implementation that
+will be selected later on.
+
+The term "equivalence" denotes resources or alarms that are "equal" even
+though they are reported by different datasources and some of their
+properties might conflict. Alternative terms could be equality, merge,
+overlapping, etc.
+
+
+Basic Equivalence Requirements
+==============================
+
+Resource Equivalence
+--------------------
+
+We currently have two use cases for resource equivalence.
+
+#. K8s datasource reports VMs that are also reported by Nova
+#. Vitrage discovery agent (TBD) reports hosts that are also reported by Nova
+
+Both cases could perhaps be handled by hard-coded logic in the datasources
+themselves. This option should be checked against the use cases.
+
+Alarm Equivalence
+-----------------
+
+We should support the following use cases:
+
+#. Equivalent alarms from different monitors, e.g. Zabbix and Nagios
+#. Non-equivalent alarms from different monitors, e.g. Zabbix and Nagios
+   (meaning the alarms are similar but not the same)
+#. Equivalence between a monitored alarm and a Vitrage deduced alarm
+
+Equivalence Definition
+----------------------
+
+In order to support these use cases, we **must** define a way for the user to
+determine which entities are equivalent.
+
+For resources we should define:
+
+* Which properties determine the equivalence. E.g. Nova instance UUID equals
+  k8s vm externalID
+* Optional: which property should be used in case of a conflict (could this
+  be decided arbitrarily, or hard-coded?)
+
+For alarms we should define:
+
+* Which properties determine the equivalence. E.g. Zabbix alarm name "HIGH CPU"
+  equals Prometheus alarm name "high cpu".
+* Hidden assumption: equivalent alarms are always "on" the same resource.
+
+Equivalence should be transitive. If the user defines two equivalences with a
+common entity, then all entities should be equivalent to one another.
+
+For example:
+
+* Zabbix high_cpu ~ Nagios HIGH_CPU
+* Nagios HIGH_CPU ~ Prometheus High CPU
+
+Vitrage will handle Zabbix, Nagios and Prometheus CPU alarms as all equivalent
+to one another.
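
The transitivity requirement above can be sketched with a union-find (disjoint-set) structure. This is only an illustration under stated assumptions, not the selected implementation: the class name `EquivalenceClasses`, its methods, and the `(datasource, alarm name)` tuple keys are all hypothetical.

```python
class EquivalenceClasses:
    """Union-find over hashable entity keys (hypothetical helper).

    Each key here is a (datasource, alarm name) tuple, an assumed
    representation that is not taken from the Vitrage code base.
    """

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        # Register unseen items as their own class.
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[item] != root:
            self._parent[item], item = root, self._parent[item]
        return root

    def declare_equivalent(self, a, b):
        """Record a user-defined (or hard-coded) equivalence a ~ b."""
        self._parent[self._find(a)] = self._find(b)

    def are_equivalent(self, a, b):
        """True if a and b are linked by any chain of declarations."""
        return self._find(a) == self._find(b)


eq = EquivalenceClasses()
eq.declare_equivalent(("zabbix", "high_cpu"), ("nagios", "HIGH_CPU"))
eq.declare_equivalent(("nagios", "HIGH_CPU"), ("prometheus", "High CPU"))
```

With the two declarations from the example, `eq.are_equivalent(("zabbix", "high_cpu"), ("prometheus", "High CPU"))` holds even though that pair was never declared directly. Removing a declaration (use case 5.1) is not covered by this sketch; a plain union-find would have to be rebuilt from the remaining definitions.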
+
+**Note**: We must support both hard-coded and user-defined equivalence
+definitions.
+
+* Hard-coded equivalence: k8s vms always map to Nova vms by the same strategy.
+  We can't let the user change it.
+* User-defined equivalence: the end user may decide that two alarms are, or are
+  not, equivalent. The user should be able to change this definition at any
+  time. The equivalence definition should be tenant-specific (see the section
+  about multi tenancy).
+
+Merge Strategy
+--------------
+
+There are different approaches to what information the user should see when
+two datasources conflict. The user should be able to choose the desired
+"merge strategy" from the following options:
+
+#. last_update: Use the properties from the last update.
+#. most_credible: Use the properties from the most credible datasource.
+   A 'credibility' property should be added to each datasource. By default,
+   most datasources will have 'medium' credibility, except for Vitrage, which
+   will have 'low' credibility. The user will be able to change it in the
+   vitrage.conf options.
+   If the equivalent datasources have the same credibility, the last_update
+   merge strategy will be used.
+#. worst_state: For state/severity calculation, use the worst state among all
+   equivalent entities.
+
+The default, which is the current behavior, will be worst_state.
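
The three strategies can be sketched as a single function. This is a minimal illustration under stated assumptions, not the Vitrage implementation: `aggregate_severity`, the severity ordering, the datasource keys, and the 'high' credibility level are hypothetical (the text above only names 'low' and 'medium'). Deleted alarms are not modelled.

```python
# Assumed orderings, from best to worst / least to most credible.
# 'high' is a hypothetical level; the text only names 'low' and 'medium'.
SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]
CREDIBILITY_ORDER = ["low", "medium", "high"]
DEFAULT_CREDIBILITY = {"vitrage": "low"}  # every other datasource: 'medium'


def aggregate_severity(reports, strategy="worst_state", credibility=None):
    """Pick the severity shown for a group of equivalent alarms.

    reports: list of (datasource, severity) tuples in arrival order,
    holding the latest report of each datasource.
    """
    if not reports:
        raise ValueError("at least one report is required")
    if credibility is None:
        credibility = DEFAULT_CREDIBILITY

    if strategy == "worst_state":
        return max((sev for _, sev in reports), key=SEVERITY_ORDER.index)

    if strategy == "most_credible":
        def rank(datasource):
            level = credibility.get(datasource, "medium")
            return CREDIBILITY_ORDER.index(level)

        best = max(rank(ds) for ds, _ in reports)
        # Keep only the most credible reports; ties fall back to last_update.
        reports = [r for r in reports if rank(r[0]) == best]
    elif strategy != "last_update":
        raise ValueError("unknown merge strategy: %s" % strategy)

    # last_update: the most recent remaining report wins.
    return reports[-1][1]
```

For example, with `[("nagios", "WARNING"), ("vitrage", "CRITICAL")]` the sketch yields CRITICAL for last_update, WARNING for most_credible (Vitrage defaults to 'low'), and CRITICAL for worst_state.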
+
+Equivalence Use Cases
+=====================
+
+1. Two datasources report the same resource
+-------------------------------------------
+
+1.1. Nova reports first, then Vitrage discovery agent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova host datasource asks to create nova.host entity
+#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
+
+Expected behavior: Vitrage API returns a single host
+
+1.2. Vitrage discovery agent reports first
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Similar to 1.1, but the discovery agent reports first.
+
+1.3. Nova reports again on the next get_all
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. An entity in the graph already exists for the host, with properties from
+   both datasources
+#. Nova host datasource reports the same host again
+
+Expected behavior: There should be no change in what the API returns
+
+1.4. Conflict in the host state
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova host datasource asks to create nova.host entity with state ERROR
+#. Vitrage discovery agent datasource asks to create host entity with state
+   ACTIVE
+
+Expected behavior: Vitrage API returns a single host with a state that depends
+on the merge strategy.
+
+----------------+------------------+
+| Merge Strategy | Aggregated state |
+================+==================+
+| last_update    | ACTIVE           |
+----------------+------------------+
+| most_credible  | ERROR            |
+----------------+------------------+
+| worst_state    | ERROR            |
+----------------+------------------+
+
+1.5. Nova and K8s have different vm names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova instance datasource asks to create nova.instance entity named 'vm1'
+#. K8s datasource asks to create instance entity named 'VM_1'
+
+Both vms are equivalent by the Nova UUID.
+
+Expected behavior: Vitrage API will return a single instance. Its name will
+be determined by one of the datasources in a consistent way (meaning it will
+be either always the K8s name or always the Nova name).
+
+1.6. One datasource stops reporting
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova host datasource asks to create nova.host entity
+#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
+#. ...
+#. Vitrage discovery agent crashes and stops reporting
+#. In the next get_all, Vitrage discovery agent reports nothing
+
+Expected behavior:
+
+* The host is not deleted
+* The data that was provided by Nova is returned
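
One way to get this behavior is to keep each datasource's contribution separately and delete the merged entity only when no datasource reports it any more. The sketch below assumes that design; `MergedEntity`, its methods, and the property names are hypothetical and not part of Vitrage.

```python
class MergedEntity:
    """One graph entity backed by several datasources (hypothetical)."""

    def __init__(self):
        # datasource name -> properties; dict order tracks update order.
        self._by_source = {}

    def report(self, datasource, properties):
        # Re-insert so the most recently updated datasource comes last.
        self._by_source.pop(datasource, None)
        self._by_source[datasource] = dict(properties)

    def withdraw(self, datasource):
        """Called when a datasource no longer reports the entity."""
        self._by_source.pop(datasource, None)

    @property
    def exists(self):
        # The entity survives as long as any datasource still reports it.
        return bool(self._by_source)

    def view(self):
        """Merged properties; later updates override earlier ones."""
        merged = {}
        for properties in self._by_source.values():
            merged.update(properties)
        return merged


host = MergedEntity()
host.report("nova.host", {"name": "host-1", "state": "ACTIVE"})
host.report("discovery_agent", {"name": "host-1", "cpu_model": "unknown"})
host.withdraw("discovery_agent")  # the agent reports nothing in get_all
```

After the withdrawal, `host.exists` is still true and `host.view()` returns only the properties that Nova provided.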
+
+2. Two monitors report the same alarm (e.g. Zabbix and Prometheus)
+------------------------------------------------------------------
+
+2.1. Zabbix reports CRITICAL, Nagios reports WARNING
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+
+Expected behavior: Vitrage API returns a single alarm with a severity that
+depends on the merge strategy.
+
+----------------+---------------------+
+| Merge Strategy | Aggregated severity |
+================+=====================+
+| last_update    | WARNING             |
+----------------+---------------------+
+| most_credible  | CRITICAL            |
+----------------+---------------------+
+| worst_state    | CRITICAL            |
+----------------+---------------------+
+
+2.2. Zabbix reports CRITICAL, Nagios reports WARNING, Zabbix reports OK
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
+#. Zabbix datasource changes the severity to OK
+
+
+Expected behavior: depends on the merge strategy.
+
+----------------+---------------------------+
+| Merge Strategy | Aggregated severity       |
+================+===========================+
+| last_update    | OK (the alarm is deleted) |
+----------------+---------------------------+
+| most_credible  | WARNING                   |
+----------------+---------------------------+
+| worst_state    | WARNING                   |
+----------------+---------------------------+
+
+2.3. Zabbix, Nagios and Prometheus report the same alarm
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Assume that the merge strategy is worst_state.
+
+#. Prometheus datasource asks to create Prometheus alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
+#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
+
+Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
+
+3. Two monitors report similar yet different alarms
+---------------------------------------------------
+
+#. Nagios datasource asks to create a Nagios "high CPU" alarm
+#. Zabbix datasource asks to create a Zabbix "extremely high CPU" alarm
+
+Expected behavior: Vitrage API returns two alarms
+
+4. A monitor reports the same alarm as a Vitrage deduced alarm
+--------------------------------------------------------------
+
+This use case is also detailed in https://review.openstack.org/#/c/547931/
+
+4.1. Nagios reports first
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
+
+Expected behavior: Vitrage API returns a single alarm with severity that
+depends on the merge strategy.
+
+----------------+---------------------+
+| Merge Strategy | Aggregated severity |
+================+=====================+
+| last_update    | CRITICAL            |
+----------------+---------------------+
+| most_credible  | WARNING             |
+----------------+---------------------+
+| worst_state    | CRITICAL            |
+----------------+---------------------+
+
+4.2. Nagios reports alarm, Vitrage deduced alarm, Nagios reports OK
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm
+#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
+#. Nagios datasource asks to delete the Nagios alarm
+
+Expected behavior: depends on the merge strategy.
+
+----------------+---------------------------+
+| Merge Strategy | Aggregated severity       |
+================+===========================+
+| last_update    | OK (the alarm is deleted) |
+----------------+---------------------------+
+| most_credible  | OK (the alarm is deleted) |
+----------------+---------------------------+
+| worst_state    | WARNING                   |
+----------------+---------------------------+
+
+The behavior for worst_state strategy:
+
+* The alarm is not deleted (Vitrage still identifies a problem, let's not
+  ignore it)
+* The alarm contains all Vitrage properties
+* A diagnose action is executed, if such an action is defined
+
+
+4.3. Nagios, Zabbix and Vitrage report an alarm
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
+#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
+
+Expected behavior: Vitrage API returns a single alarm with properties from
+Nagios, Zabbix and Vitrage and severity that depends on the merge strategy.
+
+----------------+---------------------+
+| Merge Strategy | Aggregated severity |
+================+=====================+
+| last_update    | WARNING             |
+----------------+---------------------+
+| most_credible  | WARNING             |
+----------------+---------------------+
+| worst_state    | CRITICAL            |
+----------------+---------------------+
+
+5. The user changes the alarm equivalence definition
+----------------------------------------------------
+
+5.1. Nagios, Zabbix and Vitrage are equivalent, then the user changes it
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Assume that the merge strategy is last_update.
+
+#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
+#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
+#. Vitrage API returns a single alarm with severity CRITICAL
+#. The user changes the equivalence definition so Vitrage and Zabbix are
+   equivalent to each other but Nagios is not equivalent to them
+
+Expected behavior: Vitrage API returns two alarms:
+
+* Zabbix+Vitrage alarm with severity WARNING
+* Nagios alarm with severity CRITICAL
+
+**Note:** Since in Rocky we are going to implement vitrage-graph start-up from
+the database, it makes no real difference whether or not the user restarts the
+graph after changing the equivalence definition.
+
+5.2. Zabbix and Vitrage are equivalent, then the makes Nagios equivalent too
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Assume that the merge strategy is last_update.
+
+#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
+#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
+#. Vitrage API returns two alarms:
+
+   * Zabbix+Vitrage alarm with severity WARNING
+   * Nagios alarm with severity CRITICAL
+
+#. The user changes the equivalence definition so Vitrage, Zabbix and Nagios
+   are equivalent to each other
+
+Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
+
+6. Template on one datasource should apply to another datasource
+----------------------------------------------------------------
+
+6.1. Simple alarm equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
+
+
+Template example:
+
+ ::
+
+  definitions:
+   entities:
+    - entity:
+       category: ALARM
+       rawtext: high_cpu
+       type: zabbix
+       template_id: zabbix_alarm
+
+  scenarios:
+   - scenario:
+      condition: zabbix_alarm_on_host
+      actions:
+       - ...
+
+
+
+#. Nagios datasource asks to create a Nagios HIGH_CPU alarm
+#. Zabbix datasource DOES NOT ask to create a Zabbix high_cpu alarm (yet)
+
+Expected behavior: the actions in the scenario are executed as a result of the
+Nagios alarm.
+
+
+6.2. Simple resource equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assume that Nova host is equivalent to Vitrage discovery agent host.
+
+
+Template example:
+
+ ::
+
+  definitions:
+   entities:
+    - entity:
+       category: RESOURCE
+       type: nova.host
+       template_id: nova_host
+    - entity:
+       category: RESOURCE
+       type: discovery_host (???)
+       template_id: discovery_host
+
+  scenarios:
+   - scenario:
+      condition: discovery_host and discovery_host_contains_instance
+      actions:
+       - ...
+
+
+Expected behavior: the scenario will work if the host contains an instance, no
+matter if the host is defined by Nova or by Vitrage discovery agent.
+
+
+6.3. Alarm equivalence + resource equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm
+**and** Nova host is equivalent to Vitrage discovery agent host.
+
+
+Template example:
+
+ ::
+
+  scenarios:
+   - scenario:
+      condition: discovery_host and discovery_host_contains_instance and
+                 zabbix_alarm_on_discovery_host
+      actions:
+       - ...
+
+
+Expected behavior: the scenario will work if the host contains an instance, no
+matter if the host is defined by Nova or by Vitrage discovery agent; and if
+either a Zabbix alarm or a Nagios alarm was raised on the host.
+
+
+7. Template on one datasource should **not** apply to another datasource
+------------------------------------------------------------------------
+
+Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
+
+Template example:
+
+ ::
+
+  definitions:
+   entities:
+    - entity:
+       category: ALARM
+       rawtext: high_cpu
+       type: zabbix
+       severity: warning
+       template_id: zabbix_alarm
+    - entity:
+       category: ALARM
+       name: HIGH_CPU
+       type: nagios
+       template_id: nagios_alarm
+
+  scenarios:
+   - scenario:
+      condition: zabbix_alarm_on_host
+      actions:
+       - ...
+
+This use case is the same as 6.1, with one exception: the template entity
+zabbix_alarm is defined only for the case that the severity is warning. What
+will happen if a Nagios alarm is raised with severity warning? And what if it
+is raised with a different severity?
+
+8. Overlapping templates
+------------------------
+
+Is the overlapping templates mechanism somehow related to the equivalence use
+cases?
+
+9. Multi Tenancy
+----------------
+
+Per-tenant equivalence
+^^^^^^^^^^^^^^^^^^^^^^
+
+Entity equivalence should be defined for a specific tenant. One tenant may want
+to see Nagios and Zabbix alarms as one alarm, while another tenant may want
+to see them separately.
+
+Cross-tenant equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Is it possible that equivalent resources will be reported for different tenants?
+
+#. Nova instance datasource asks to create nova.instance for tenant_1
+#. k8s datasource asks to create instance (nova.instance?) with the same UUID
+   for tenant_2
+
+What do we do in such a case?
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -93,3 +93,4 @@ Design Documents
   contributor/templates-loading
   contributor/vitrage-ha-and-history-vision
   contributor/datasource-snmp-parsing-support
+   contributor/entity_equivalence_use_cases