From 6161217a882a4706992e502adeaf780d1e472e25 Mon Sep 17 00:00:00 2001 From: Ifat Afek Date: Wed, 17 Oct 2018 12:52:07 +0000 Subject: [PATCH] Adjust the Vitrage & Mistral use case to the new template format Change-Id: Iae1bb26e3c6061f63ef4fac58e354c56cb32e91b --- doc/source/use-cases.rst | 2 +- .../nic-failure-affects-instance-and-app.rst | 101 +++++++++++++++ use-cases/vitrage-mistral-integration.rst | 119 ------------------ 3 files changed, 102 insertions(+), 120 deletions(-) create mode 100644 use-cases/nic-failure-affects-instance-and-app.rst delete mode 100644 use-cases/vitrage-mistral-integration.rst diff --git a/doc/source/use-cases.rst b/doc/source/use-cases.rst index 82daf4f..76e6814 100644 --- a/doc/source/use-cases.rst +++ b/doc/source/use-cases.rst @@ -8,4 +8,4 @@ a starting point. :glob: :maxdepth: 1 - use-cases/vitrage-mistral-integration.rst + use-cases/nic-failure-affects-instance-and-app.rst diff --git a/use-cases/nic-failure-affects-instance-and-app.rst b/use-cases/nic-failure-affects-instance-and-app.rst new file mode 100644 index 0000000..c47a9dd --- /dev/null +++ b/use-cases/nic-failure-affects-instance-and-app.rst @@ -0,0 +1,101 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +============================================== +NIC failure affects instances and applications +============================================== + +As a cloud operator, whenever one of my cloud's compute nodes has a NIC +failure, I want to be notified of all affected resources including instances +and applications. Moreover, I want the failed instances to be migrated away to +another hardware so my applications will continue to function. + + +Problem description +=================== + +A NIC failure may cause the host, as well as all instances running on it, to +become unreachable. 
This may also affect applications that are using these +instances and lose their high-availability. + + +Fault class +=========== + +Network failure + + +OpenStack projects used +======================= + +* Zabbix (or any other 3rd party monitor) +* Vitrage +* Mistral + + +Remediation class +================= + +Reactive + + +Fault detection +=============== + +There is no OpenStack component that detects a NIC failure, so it has to be +done using a 3rd party monitor like Zabbix. + + +Inputs and decision-making +========================== + +Based on the NIC failure detection, the cloud operator should understand which +resources and applications are affected. + + +Remediation +=========== + +Instances that became unreachable due to the network failure should be +migrated to another host, so the applications should continue to function. + + +Existing implementation(s) +========================== + +To identify the failed resources, the cloud operator can use Vitrage. Vitrage +will be notified by the external monitor (such as Zabbix) about the failed NIC. +Based on its cloud topology awareness, Vitrage will raise additional alarms on +the host, instances and affected applications. + +An affected application will most likely be running in HA mode, so it will +perform a fail-over to the standby instance. However, it will lose its +high-availability. + +The cloud operator can see this information in Vitrage Entity Graph, locate +a failed instance that affects an application, and ask to execute a +VM-migration Mistral workflow on that instance. + +Alternatively, Vitrage can **automatically** execute a Mistral workflow that +will migrate the failed instance to a different host, so the application will +get back to a fully-operational state. + +.. 
figure:: ./vitrage_and_mistral.png + :scale: 100 % + :align: center + :alt: alternate text + + +Future work +=========== + +None (supported from OpenStack Queens and on) + + +Dependencies +============ + +None diff --git a/use-cases/vitrage-mistral-integration.rst b/use-cases/vitrage-mistral-integration.rst deleted file mode 100644 index 2493085..0000000 --- a/use-cases/vitrage-mistral-integration.rst +++ /dev/null @@ -1,119 +0,0 @@ -=============================== -Vitrage and Mistral Integration -=============================== - -Overview -======== - -Self-healing and fast recovery in real world cloud systems is challenging... - -* Failures happen in real distributed systems -* A single failure may affect many resources -* We can see symptoms but it’s hard to find the root cause -* Recovery might be complicated - -The integration of Vitrage and Mistral can help identifying the root cause and -taking corrective actions, in an end-to-end self-healing scenario. - -Vitrage is the OpenStack Root Cause Analysis service for organizing, analyzing -and visualizing OpenStack and external alarms. It is used to provide insights -about the root cause of problems and deduce their existence before they are -directly reported. - -Mistral is the OpenStack workflow service. It aims to provide a mechanism to -define tasks and workflows without writing code, manage and execute them in the -cloud environment. - - -Use Cases -========= - -The integration of Vitrage with Mistral supports two kinds of use cases: - -* Automatic workflow execution, based on predefined conditions -* Manual workflow execution from Vitrage Entity Graph (WIP in Rocky) - - -Use Case 1: NIC failure causes automatic instance migration ------------------------------------------------------------ - -*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC -failure, I want to be notified of all affected resources including instances -and applications. 
Moreover, I want the failed instances to be migrated away to -another hardware."* - -In a complex system, a failure in one resource can have a wide effect on other -resources. One example is a NIC failure, that may cause the host, as well as -all instances running on it, to become unreachable. This may also affect -applications that are using these instances and lose their high-availability. - -To identify the failed resources, the cloud operator can use Vitrage. Vitrage -will be notified by an external monitor (such as Zabbix) about the failed NIC. -Based on its cloud topology awareness, Vitrage will raise additional alarms on -the host, instances and affected applications. - -An affected application will most likely be running in HA mode, so it will -perform a fail-over to the standby instance. However, it will lose its -high-availability. In order to fix it, Vitrage can execute a Mistral workflow -that will migrate the failed instance to a different host, so the application -will get back to a fully-operational state. - -.. figure:: ./vitrage_and_mistral.png - :scale: 100 % - :align: center - :alt: alternate text - -Use Case 2: NIC failure with an optional manual instance migration ------------------------------------------------------------------- - -*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC -failure, I want to be notified of all affected resources including instances -and applications. I then want an easy way to manually migrate a failed -instance to another compute and track its state."* - -This is currently WIP in Rocky. - -The use case is similar to use case 1, but in this use case the cloud operator -did not pre-configured Vitrage to execute a Mistral workflow when an -application is affected by an instance being unreachable. - -As a result of a NIC failure, Vitrage raises alarms on the host, its instances -and the applications that are using them. 
The cloud operator can see this -information in Vitrage Entity Graph, locate a failed instance that affects an -application, and ask to execute a VM-migration Mistral workflow on that -instance. - - -Technical Details -================= - -Vitrage ``evaluator templates`` define the business logic and the way that -Vitrage handles alarms and resource states. A template contains ``scenarios``, -where each scenario is made of ``condition`` and ``actions``. - -Among other actions (like raise an alarm or modify the state of a resource), -the cloud operator can ask to execute a Mistral workflow with certain -parameters. For example, the cloud operator can define this scenario: - -* ``condition:`` an application contains an instance that is unreachable -* ``action:`` execute a Mistral VM-Migration workflow on that instance - -More details about Vitrage template definitions can be found here_ - -.. _here: https://docs.openstack.org/vitrage/latest/contributor/vitrage-template-format.html - - -Note that Vitrage could call Nova evacuate directly for the failed instance, -but using a Mistral workflow is a much more robust option. Mistral can track -the Nova evacuation process, check its status and verify that everything worked -as expected. - - -References -========== - -- https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral - -- https://wiki.openstack.org/wiki/Vitrage - -- https://docs.openstack.org/mistral/latest/