Added documentation about the Vitrage+Mistral integration
Change-Id: Ibd3ab03103a0ccbfec5f5795f73cdff30e5826ce Story: 2001430 Task: 6513
This commit is contained in:
parent
57b2e7d730
commit
d2c304bdd7
@ -4,7 +4,7 @@
|
||||
contain the root `toctree` directive.
|
||||
|
||||
self-healing-sig Design Specifications
|
||||
==================================================
|
||||
======================================
|
||||
|
||||
.. toctree::
|
||||
:glob:
|
||||
@ -14,7 +14,7 @@ self-healing-sig Design Specifications
|
||||
|
||||
|
||||
self-healing-sig Repository Information
|
||||
===================================================
|
||||
=======================================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
119
specs/vitrage-mistral-integration.rst
Normal file
119
specs/vitrage-mistral-integration.rst
Normal file
@ -0,0 +1,119 @@
|
||||
===============================
|
||||
Vitrage and Mistral Integration
|
||||
===============================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
Self-healing and fast recovery in real world cloud systems is challenging...
|
||||
|
||||
* Failures happen in real distributed systems
|
||||
* A single failure may affect many resources
|
||||
* We can see symptoms but it’s hard to find the root cause
|
||||
* Recovery might be complicated
|
||||
|
||||
The integration of Vitrage and Mistral can help identifying the root cause and
|
||||
taking corrective actions, in an end-to-end self-healing scenario.
|
||||
|
||||
Vitrage is the OpenStack Root Cause Analysis service for organizing, analyzing
|
||||
and visualizing OpenStack and external alarms. It is used to provide insights
|
||||
about the root cause of problems and deduce their existence before they are
|
||||
directly reported.
|
||||
|
||||
Mistral is the OpenStack workflow service. It aims to provide a mechanism to
|
||||
define tasks and workflows without writing code, manage and execute them in the
|
||||
cloud environment.
|
||||
|
||||
|
||||
Use Cases
|
||||
=========
|
||||
|
||||
The integration of Vitrage with Mistral supports two kinds of use cases:
|
||||
|
||||
* Automatic workflow execution, based on predefined conditions
|
||||
* Manual workflow execution from Vitrage Entity Graph (WIP in Rocky)
|
||||
|
||||
|
||||
Use Case 1: NIC failure causes automatic instance migration
|
||||
-----------------------------------------------------------
|
||||
|
||||
*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC
|
||||
failure, I want to be notified of all affected resources including instances
|
||||
and applications. Moreover, I want the failed instances to be migrated away to
|
||||
another hardware."*
|
||||
|
||||
In a complex system, a failure in one resource can have a wide effect on other
|
||||
resources. One example is a NIC failure, that may cause the host, as well as
|
||||
all instances running on it, to become unreachable. This may also affect
|
||||
applications that are using these instances and lose their high-availability.
|
||||
|
||||
To identify the failed resources, the cloud operator can use Vitrage. Vitrage
|
||||
will be notified by an external monitor (such as Zabbix) about the failed NIC.
|
||||
Based on its cloud topology awareness, Vitrage will raise additional alarms on
|
||||
the host, instances and affected applications.
|
||||
|
||||
An affected application will most likely be running in HA mode, so it will
|
||||
perform a fail-over to the standby instance. However, it will lose its
|
||||
high-availability. In order to fix it, Vitrage can execute a Mistral workflow
|
||||
that will migrate the failed instance to a different host, so the application
|
||||
will get back to a fully-operational state.
|
||||
|
||||
.. figure:: ./vitrage_and_mistral.png
|
||||
:scale: 100 %
|
||||
:align: center
|
||||
:alt: alternate text
|
||||
|
||||
Use Case 2: NIC failure with an optional manual instance migration
|
||||
------------------------------------------------------------------
|
||||
|
||||
*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC
|
||||
failure, I want to be notified of all affected resources including instances
|
||||
and applications. I then want an easy way to manually migrate a failed
|
||||
instance to another compute and track its state."*
|
||||
|
||||
This is currently WIP in Rocky.
|
||||
|
||||
The use case is similar to use case 1, but in this use case the cloud operator
|
||||
did not pre-configured Vitrage to execute a Mistral workflow when an
|
||||
application is affected by an instance being unreachable.
|
||||
|
||||
As a result of a NIC failure, Vitrage raises alarms on the host, its instances
|
||||
and the applications that are using them. The cloud operator can see this
|
||||
information in Vitrage Entity Graph, locate a failed instance that affects an
|
||||
application, and ask to execute a VM-migration Mistral workflow on that
|
||||
instance.
|
||||
|
||||
|
||||
Technical Details
|
||||
=================
|
||||
|
||||
Vitrage ``evaluator templates`` define the business logic and the way that
|
||||
Vitrage handles alarms and resource states. A template contains ``scenarios``,
|
||||
where each scenario is made of ``condition`` and ``actions``.
|
||||
|
||||
Among other actions (like raise an alarm or modify the state of a resource),
|
||||
the cloud operator can ask to execute a Mistral workflow with certain
|
||||
parameters. For example, the cloud operator can define this scenario:
|
||||
|
||||
* ``condition:`` an application contains an instance that is unreachable
|
||||
* ``action:`` execute a Mistral VM-Migration workflow on that instance
|
||||
|
||||
More details about Vitrage template definitions can be found here_
|
||||
|
||||
.. _here: https://docs.openstack.org/vitrage/latest/contributor/vitrage-template-format.html
|
||||
|
||||
|
||||
Note that Vitrage could call Nova evacuate directly for the failed instance,
|
||||
but using a Mistral workflow is a much more robust option. Mistral can track
|
||||
the Nova evacuation process, check its status and verify that everything worked
|
||||
as expected.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
- https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral
|
||||
|
||||
- https://wiki.openstack.org/wiki/Vitrage
|
||||
|
||||
- https://docs.openstack.org/mistral/latest/
|
BIN
specs/vitrage_and_mistral.png
Normal file
BIN
specs/vitrage_and_mistral.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 74 KiB |
Loading…
Reference in New Issue
Block a user