From 3a4344deeb3f111b4ea9cae7e404d72a21adbc72 Mon Sep 17 00:00:00 2001
From: Ifat Afek
Date: Wed, 7 Mar 2018 17:08:54 +0000
Subject: [PATCH] Added a document with entity equivalence use cases

This document should define the functional requirements for blueprints
like merge alarms and merge resources.

Change-Id: Ie65b140607b38c880d0a10b3d63c9d58b49d8b1d
---
 .../entity_equivalence_use_cases.rst          | 499 ++++++++++++++++++
 doc/source/index.rst                          |   1 +
 2 files changed, 500 insertions(+)
 create mode 100644 doc/source/contributor/entity_equivalence_use_cases.rst

diff --git a/doc/source/contributor/entity_equivalence_use_cases.rst b/doc/source/contributor/entity_equivalence_use_cases.rst
new file mode 100644
index 000000000..0041bcb1a
--- /dev/null
+++ b/doc/source/contributor/entity_equivalence_use_cases.rst
@@ -0,0 +1,499 @@
+============================
+Entity Equivalence Use Cases
+============================
+
+Background
+==========
+
+There are several use cases that require support for either alarm equivalence
+or resource equivalence. The design of these features is in progress, and is
+not trivial. The purpose of this document is to define the basic requirements
+and use cases that should be supported, regardless of the implementation that
+will be selected later on.
+
+The term "equivalence" is used to denote resources or alarms that are "equal"
+although they are reported by different datasources and some of their
+properties might conflict. Alternative terms could be equality, merge,
+overlapping, etc.
+
+
+Basic Equivalence Requirements
+==============================
+
+Resource Equivalence
+--------------------
+
+We currently have two use cases for resource equivalence.
+
+#. K8s datasource reports VMs that are also reported by Nova
+#. Vitrage discovery agent (TBD) reports hosts that are also reported by Nova
+
+Both cases might be solvable by hard-coding the equivalence in the datasources
+themselves. This option should be checked against the use cases.
+
+Alarm Equivalence
+-----------------
+
+We should support the following use cases:
+
+#. Equivalent alarms from different monitors, e.g. Zabbix and Nagios
+#. Non-equivalent alarms from different monitors, e.g. Zabbix and Nagios
+   (meaning the alarms are similar but not the same)
+#. Equivalence between a monitored alarm and a Vitrage deduced alarm
+
+Equivalence Definition
+----------------------
+
+In order to support these use cases, we **must** define a way for the user to
+determine which entities are equivalent.
+
+For resources we should define:
+
+* Which properties determine the equivalence. E.g. Nova instance UUID equals
+  k8s vm externalID
+* Optional: what property should be used in case of conflict (could it be done
+  arbitrarily or hard-coded?)
+
+For alarms we should define:
+
+* Which properties determine the equivalence. E.g. Zabbix alarm name "HIGH CPU"
+  equals Prometheus alarm name "high cpu".
+* Hidden assumption: equivalent alarms are always "on" the same resource.
+
+Equivalence should be transitive. If the user defines two equivalences with a
+common entity, then all entities should be equivalent to one another.
+
+For example:
+
+* Zabbix high_cpu ~ Nagios HIGH_CPU
+* Nagios HIGH_CPU ~ Prometheus High CPU
+
+Vitrage will handle Zabbix, Nagios and Prometheus CPU alarms as all equivalent
+to one another.
+
+**Note**: We must support both hard-coded and user-defined equivalence
+definitions.
+
+* Hard-coded equivalence: k8s vms always map to Nova vms by the same strategy.
+  We can't let the user change it.
+* User-defined equivalence: the end user may decide that two alarms are, or are
+  not, equivalent. The user should be able to change this definition at any
+  time. The equivalence definition should be tenant-specific (see the section
+  about multi tenancy).
+
+Merge Strategy
+--------------
+
+There are different approaches for what information the user should see in case
+there is a conflict between two datasources. The user should be able to define
+the desired "merge strategy" out of the following options:
+
+#. last_update: Use the properties from the last update.
+#. most_credible: Use the properties from the most credible datasource.
+   A 'credibility' property should be added to each datasource. By default,
+   most datasources will have 'medium' credibility, except for Vitrage, which
+   will have 'low' credibility. The user will be able to change it in
+   vitrage.conf options.
+   If the equivalent datasources have the same credibility, last_update merge
+   strategy will be used.
+#. worst_state: In case of state/severity calculation: Use the worst state of
+   all.
+
+The default, which is the current behavior, will be worst_state.
+
+Equivalence Use Cases
+=====================
+
+1. Two datasources report the same resource
+-------------------------------------------
+
+1.1. Nova reports first, then Vitrage discovery agent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova host datasource asks to create nova.host entity
+#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
+
+Expected behavior: Vitrage API returns a single host
+
+1.2. Vitrage discovery agent reports first
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Similar to 1.1, but the discovery agent reports first
+
+1.3. Nova reports again on the next get_all
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. An entity in the graph already exists for the host, with properties from
+   both datasources
+#. Nova host datasource reports the same host again
+
+Expected behavior: There should be no change in what the API returns
+
+1.4. Conflict in the host state
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova host datasource asks to create nova.host entity with state ERROR
+#. Vitrage discovery agent datasource asks to create host entity with state
+   ACTIVE
+
+Expected behavior: Vitrage API returns a single host with a state that depends
+on the merge strategy.
+
++----------------+------------------+
+| Merge Strategy | Aggregated state |
++================+==================+
+| last_update    | ACTIVE           |
++----------------+------------------+
+| most_credible  | ERROR            |
++----------------+------------------+
+| worst_state    | ERROR            |
++----------------+------------------+
+
+1.5. Nova and K8s have different vm names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova instance datasource asks to create nova.instance entity named 'vm1'
+#. K8s datasource asks to create instance entity named 'VM_1'
+
+Both vms are equivalent by the Nova UUID.
+
+Expected behavior: Vitrage API will return a single instance. Its name will
+be determined by one of the datasources in a consistent way (meaning it will
+be either always the K8s name or always the Nova name).
+
+1.6. One datasource stops reporting
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nova host datasource asks to create nova.host entity
+#. Vitrage discovery agent datasource asks to create host (nova.host?) entity
+#. ...
+#. Vitrage discovery agent crashes and stops reporting
+#. In the next get_all, Vitrage discovery agent reports nothing
+
+Expected behavior:
+
+* The host is not deleted
+* The data that was provided by Nova is returned
+
+2. Two monitors report the same alarm (e.g. Zabbix and Prometheus)
+------------------------------------------------------------------
+
+2.1. Zabbix reports CRITICAL, Nagios reports WARNING
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+
+Expected behavior: Vitrage API returns a single alarm with a severity that
+depends on the merge strategy.
+
++----------------+---------------------+
+| Merge Strategy | Aggregated severity |
++================+=====================+
+| last_update    | WARNING             |
++----------------+---------------------+
+| most_credible  | CRITICAL            |
++----------------+---------------------+
+| worst_state    | CRITICAL            |
++----------------+---------------------+
+
+2.2. Zabbix reports CRITICAL, Nagios reports WARNING, Zabbix reports OK
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
+#. Zabbix datasource changes the severity to OK
+
+
+Expected behavior: depends on the merge strategy.
+
++----------------+---------------------------+
+| Merge Strategy | Aggregated severity       |
++================+===========================+
+| last_update    | OK (the alarm is deleted) |
++----------------+---------------------------+
+| most_credible  | WARNING                   |
++----------------+---------------------------+
+| worst_state    | WARNING                   |
++----------------+---------------------------+
+
+2.3. Zabbix, Nagios and Prometheus report the same alarm
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Assume that the merge strategy is worst_state.
+
+#. Prometheus datasource asks to create Prometheus alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL
+#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
+
+Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
+
+3. Two monitors report similar yet different alarms
+---------------------------------------------------
+
+#. Nagios datasource asks to create a Nagios "high CPU" alarm
+#. Zabbix datasource asks to create a Zabbix "extremely high CPU" alarm
+
+Expected behavior: Vitrage API returns two alarms
+
+4. A monitor reports the same alarm as a Vitrage deduced alarm
+--------------------------------------------------------------
+
+This use case is also detailed in https://review.openstack.org/#/c/547931/
+
+4.1. Nagios reports first
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
+
+Expected behavior: Vitrage API returns a single alarm with a severity that
+depends on the merge strategy.
+
++----------------+---------------------+
+| Merge Strategy | Aggregated severity |
++================+=====================+
+| last_update    | CRITICAL            |
++----------------+---------------------+
+| most_credible  | WARNING             |
++----------------+---------------------+
+| worst_state    | CRITICAL            |
++----------------+---------------------+
+
+4.2. Nagios reports alarm, Vitrage deduced alarm, Nagios reports OK
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm
+#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
+#. Nagios datasource asks to delete the Nagios alarm
+
+Expected behavior: depends on the merge strategy.
+
++----------------+---------------------------+
+| Merge Strategy | Aggregated severity       |
++================+===========================+
+| last_update    | OK (the alarm is deleted) |
++----------------+---------------------------+
+| most_credible  | OK (the alarm is deleted) |
++----------------+---------------------------+
+| worst_state    | WARNING                   |
++----------------+---------------------------+
+
+The behavior for worst_state strategy:
+
+* The alarm is not deleted (Vitrage still identifies a problem, let's not
+  ignore it)
+* The alarm contains all Vitrage properties
+* A diagnose action is executed, if such an action is defined
+
+
+4.3. Nagios, Zabbix and Vitrage report an alarm
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Nagios datasource asks to create a Nagios alarm with severity WARNING
+#. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL
+#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
+
+Expected behavior: Vitrage API returns a single alarm with properties from
+Nagios, Zabbix and Vitrage and a severity that depends on the merge strategy.
+
++----------------+---------------------+
+| Merge Strategy | Aggregated severity |
++================+=====================+
+| last_update    | WARNING             |
++----------------+---------------------+
+| most_credible  | WARNING             |
++----------------+---------------------+
+| worst_state    | CRITICAL            |
++----------------+---------------------+
+
+5. The user changes the alarm equivalence definition
+----------------------------------------------------
+
+5.1. Nagios, Zabbix and Vitrage are equivalent, then the user changes it
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Assume that the merge strategy is last_update.
+
+#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
+#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
+#. Vitrage API returns a single alarm with severity CRITICAL
+#. The user changes the equivalence definition so Vitrage and Zabbix are
+   equivalent to each other but Nagios is not equivalent to them
+
+Expected behavior: Vitrage API returns two alarms:
+
+* Zabbix+Vitrage alarm with severity WARNING
+* Nagios alarm with severity CRITICAL
+
+**Note:** Since in Rocky we are going to implement vitrage-graph start-up from
+the database, it makes no real difference whether or not the user restarts the
+graph after changing the equivalence definition.
+
+5.2. Zabbix and Vitrage are equivalent, then the user makes Nagios equivalent too
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Assume that the merge strategy is last_update.
+
+#. Vitrage evaluator asks to create a deduced alarm with severity WARNING
+#. Zabbix datasource asks to create a Zabbix alarm with severity WARNING
+#. Nagios datasource asks to create a Nagios alarm with severity CRITICAL
+#. Vitrage API returns two alarms:
+
+   * Zabbix+Vitrage alarm with severity WARNING
+   * Nagios alarm with severity CRITICAL
+#. The user changes the equivalence definition so Vitrage, Zabbix and Nagios
+   are equivalent to each other
+
+Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
+
+6. Template on one datasource should apply to another datasource
+----------------------------------------------------------------
+
+6.1. Simple alarm equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
+
+
+Template example:
+
+ ::
+
+    definitions:
+        entities:
+            - entity:
+                category: ALARM
+                rawtext: high_cpu
+                type: zabbix
+                template_id: zabbix_alarm
+
+    scenarios:
+        - scenario:
+            condition: zabbix_alarm_on_host
+            actions:
+                - ...
+
+
+
+#. Nagios datasource asks to create a Nagios HIGH_CPU alarm
+#. Zabbix datasource DOES NOT ask to create a Zabbix high_cpu alarm (yet)
+
+Expected behavior: the actions in the scenario are executed as a result of the
+Nagios alarm.
+
+
+6.2. Simple resource equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assume that Nova host is equivalent to Vitrage discovery agent host.
+
+
+Template example:
+
+ ::
+
+    definitions:
+        entities:
+            - entity:
+                category: RESOURCE
+                type: nova.host
+                template_id: nova_host
+            - entity:
+                category: RESOURCE
+                type: discovery_host (???)
+                template_id: discovery_host
+
+    scenarios:
+        - scenario:
+            condition: discovery_host and discovery_host_contains_instance
+            actions:
+                - ...
+
+
+Expected behavior: the scenario will work if the host contains an instance, no
+matter if the host is defined by Nova or by Vitrage discovery agent.
+
+
+6.3. Alarm equivalence + resource equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm
+**and** Nova host is equivalent to Vitrage discovery agent host.
+
+
+Template example:
+
+ ::
+
+    scenarios:
+        - scenario:
+            condition: discovery_host and discovery_host_contains_instance and
+                       zabbix_alarm_on_discovery_host
+            actions:
+                - ...
+
+
+Expected behavior: the scenario will work if the host contains an instance, no
+matter if the host is defined by Nova or by Vitrage discovery agent; and if
+either Zabbix alarm or Nagios alarm was raised on the host.
+
+
+7. Template on one datasource should **not** apply to another datasource
+------------------------------------------------------------------------
+
+Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
+
+Template example:
+
+ ::
+
+    definitions:
+        entities:
+            - entity:
+                category: ALARM
+                rawtext: high_cpu
+                type: zabbix
+                severity: warning
+                template_id: zabbix_alarm
+            - entity:
+                category: ALARM
+                name: HIGH_CPU
+                type: nagios
+                template_id: nagios_alarm
+
+    scenarios:
+        - scenario:
+            condition: zabbix_alarm_on_host
+            actions:
+                - ...
+
+This use case is the same as 6.1, with one exception: the template entity
+zabbix_alarm is defined only for the case that the severity is warning. What
+will happen if a Nagios alarm is raised with severity warning? And what if it
+is raised with a different severity?
+
+8. Overlapping templates
+------------------------
+
+Is the overlapping templates mechanism somehow related to the equivalence use
+cases?
+
+9. Multi Tenancy
+----------------
+
+Per-tenant equivalence
+^^^^^^^^^^^^^^^^^^^^^^
+
+Entity equivalence should be defined for a specific tenant. One tenant may want
+to see Nagios and Zabbix alarms as one alarm, while the other tenant may want
+to see them separated.
+
+Cross-tenant equivalence
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Is it possible that equivalent resources will be reported on different tenants?
+
+#. Nova instance datasource asks to create nova.instance for tenant_1
+#. k8s datasource asks to create instance (nova.instance?) with the same UUID
+   for tenant_2
+
+What do we do in such a case?
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 4fbcbc4e8..8275caf9f 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -93,3 +93,4 @@ Design Documents
    contributor/templates-loading
    contributor/vitrage-ha-and-history-vision
    contributor/datasource-snmp-parsing-support
+   contributor/entity_equivalence_use_cases
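
The three merge strategies and the transitivity rule described in the new document could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not Vitrage code: `Report`, `merge_severity`, `equivalence_classes`, the two rank tables and the `order` field are hypothetical names introduced only for illustration, and the severity ordering OK < WARNING < CRITICAL is assumed. The credibility defaults ('medium' for most datasources, 'low' for Vitrage) and the fallback from most_credible to last_update on a tie follow the text of the document.

```python
from dataclasses import dataclass

# Assumed severity ordering; the document does not define one explicitly.
SEVERITY_RANK = {"OK": 0, "WARNING": 1, "CRITICAL": 2}

# Assumed numeric encoding of the credibility levels named in the document.
CREDIBILITY_RANK = {"low": 0, "medium": 1, "high": 2}

# Per the document: 'medium' by default, 'low' for Vitrage deduced alarms.
CREDIBILITY_OVERRIDES = {"vitrage": "low"}


@dataclass
class Report:
    """One datasource's view of an alarm (hypothetical structure)."""
    datasource: str
    severity: str
    order: int  # position in the event sequence; larger means more recent


def credibility(datasource):
    """Return the numeric credibility of a datasource."""
    return CREDIBILITY_RANK[CREDIBILITY_OVERRIDES.get(datasource, "medium")]


def merge_severity(reports, strategy="worst_state"):
    """Aggregate the severity of equivalent alarms under a merge strategy."""
    if strategy == "last_update":
        return max(reports, key=lambda r: r.order).severity
    if strategy == "most_credible":
        top = max(credibility(r.datasource) for r in reports)
        candidates = [r for r in reports if credibility(r.datasource) == top]
        # Equal credibility falls back to last_update, as the document states.
        return max(candidates, key=lambda r: r.order).severity
    if strategy == "worst_state":
        return max(reports, key=lambda r: SEVERITY_RANK[r.severity]).severity
    raise ValueError("unknown merge strategy: %s" % strategy)


def equivalence_classes(pairs):
    """Group entities into classes, treating equivalence as transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for entity in list(parent):
        groups.setdefault(find(entity), set()).add(entity)
    return list(groups.values())


if __name__ == "__main__":
    # Use case 4.3: Nagios WARNING, Vitrage CRITICAL, Zabbix WARNING.
    case_4_3 = [
        Report("nagios", "WARNING", 1),
        Report("vitrage", "CRITICAL", 2),
        Report("zabbix", "WARNING", 3),
    ]
    for name in ("last_update", "most_credible", "worst_state"):
        print(name, merge_severity(case_4_3, name))
```

Run against use case 4.3, the sketch yields WARNING, WARNING and CRITICAL for last_update, most_credible and worst_state respectively, which matches the table for that use case. Passing the two pairs from the transitivity example (Zabbix high_cpu ~ Nagios HIGH_CPU, Nagios HIGH_CPU ~ Prometheus High CPU) to `equivalence_classes` returns a single class containing all three alarms.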