VM Recovery

The purpose of this spec is to describe a method for recovering VMs from VM failures. Change-Id: I3648aacc2cfefe2bb5981f694415ceab17b2dfb8
2016-10-17 17:42:27 +09:00 · 8a4a70db74
parent e243a2c545
commit 8a4a70db74
1 changed files with 200 additions and 0 deletions
--- a/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
+++ b/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
@@ -0,0 +1,200 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+VM Recovery
+==========================================
+
+The purpose of this spec is to describe a method to recover
+individual virtual machines that are marked as failed by
+the VM monitoring component.
+
+Problem description
+===================
+
+A VM failure can be detected by the VM monitoring method discussed in
+the `vm monitoring spec`__.
+
+__ https://review.openstack.org/#/c/352217/
+
+When a VM failure event is detected, appropriate recovery actions must
+be taken. These recovery actions should be decided using configurable
+policies based on inputs such as the state of storage (shared or
+otherwise), the status of the VM, and the cause of the VM failure.
+
+Use Cases
+---------
+
+As a cloud operator, I would like to provide my users with highly
+available VMs to meet high SLA requirements. There are several types
+of VM failure events that can occur in OpenStack clouds.
+We need to make sure the system can detect and recover from such
+events. Possible VM failure events include:
+
+- VM crashes
+
+- VM hangs
+
+Possible recovery methods include:
+
+- VM restart (stop and start)
+
+- VM restart on a different host
+
+Scope
+-----
+
+This spec only addresses recovery from isolated failures of individual
+VMs. Monitoring of the VMs, and detection of and recovery from wider
+failures, such as failure of a whole compute host, will be covered by
+separate specs, and are therefore out of scope for this spec.
+
+This spec has the following goals:
+
+1. Encourage all implementations of VM recovery, whether upstream or
+   downstream, to receive failure notifications in a standardized
+   manner. This will allow cloud vendors and operators to implement
+   HA of the compute plane via a collection of compatible components
+   (of which one is compute node monitoring), whilst not being tied to
+   any one implementation.
+
+2. Suggest appropriate actions which can be taken for each failure
+   case.
+
+3. Provide details of and recommend a specific implementation which
+   for the most part already exists and is proven to work.
+
+4. Identify gaps with that implementation and corresponding future
+   work required.
+
+Proposed change
+===============
+
+VM monitors send failure events to a recovery workflow service. This
+workflow service can analyze the content of the failure event message
+and execute the appropriate recovery action. It could also handle
+advanced recovery options such as enforcing a maximum restart
+threshold, falling through to the next recovery action, or executing
+multiple workflows.
+
+If a VM crashes, the first approach to recovery is to stop and start
+the VM via nova-api. The maximum restart threshold should be
+configurable. A threshold of 0 means do not restart the VM in place
+and go straight to the next recovery method. If the restart fails, or
+the threshold is 0, the service should try to restart the VM on a
+different host. The threshold could even be -1, to indicate an
+infinite number of retries on this host, preventing the VM from ever
+being restarted on a different host. This might be desirable in
+certain configurations where there is no shared storage for ephemeral
+disks, and rebuilding a disk from a glance image during
+``nova evacuate`` is undesirable.
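
The threshold semantics described above can be sketched in Python as follows. This is a minimal illustration, not part of the spec or of any existing service. ``recover_vm`` and the two callables are hypothetical names, and each callable is assumed to return ``True`` when its restart succeeds. Reading a threshold of ``N > 0`` as "up to N in-place attempts" is also an assumption, since the spec only spells out the meaning of 0 and -1.

```python
from typing import Callable


def recover_vm(max_restarts: int,
               restart_in_place: Callable[[], bool],
               restart_elsewhere: Callable[[], bool]) -> str:
    """Apply the restart threshold, then fall back to another host.

    Hypothetical helper: both callables return True on success.
    """
    if max_restarts < -1:
        raise ValueError("max_restarts must be -1, 0, or a positive integer")

    attempts = 0
    # -1 means retry on this host forever, so the loop never falls through
    # to the different-host step. 0 skips the loop entirely.
    while max_restarts == -1 or attempts < max_restarts:
        attempts += 1
        if restart_in_place():
            return "restarted_in_place"

    # Threshold exhausted (or 0): try a different host.
    if restart_elsewhere():
        return "restarted_elsewhere"
    return "failed"
```

A real service would also wait between attempts and record each outcome. Both are omitted here to keep the control flow visible.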
+
+If a VM hangs due to an I/O error, the recovery service may be
+required to automatically disable the ``nova-compute`` service on that
+host and restart the VM on a different host. It could also migrate
+other VMs from the host, in order to preempt further I/O errors.
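
One way to picture how a workflow service might turn a failure event into an ordered list of actions is sketched below. The ``FailureEvent`` fields, the ``kind`` and ``cause`` values, and ``plan_actions`` are all hypothetical, because the spec does not define an event format. The fallback used for events other than an I/O-error hang is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureEvent:
    """Hypothetical event shape; the spec does not define one."""
    instance_id: str
    host: str
    kind: str        # assumed values: "crash" or "hang"
    cause: str = ""  # assumed value: "io_error"


def plan_actions(event: FailureEvent,
                 migrate_neighbours: bool = False) -> List[str]:
    """Return an ordered list of action descriptions for a failure event."""
    if event.kind == "hang" and event.cause == "io_error":
        actions = [
            f"disable nova-compute on {event.host}",
            f"restart {event.instance_id} on a different host",
        ]
        if migrate_neighbours:
            # Optional step: move the remaining VMs off the suspect host.
            actions.append(f"migrate other VMs from {event.host}")
        return actions
    # Assumption: every other event tries an in-place restart first.
    return [
        f"restart {event.instance_id} in place",
        f"restart {event.instance_id} on a different host",
    ]
```

Returning plain descriptions keeps the policy decision separate from the calls that would carry it out, so the same plan could be handed to whichever workflow engine is chosen.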
+
+Implementation
+==============
+
+There are at least three possible ways to implement the proposed
+change:
+
+1. Use Masakari as the recovery workflow service
+
+   VM monitors send the failure events to Masakari using Masakari's
+   notification API. Masakari will execute pre-defined recovery actions.
+
+2. Use Mistral as the recovery workflow service
+
+   VM monitors call a Mistral workflow to execute the appropriate
+   recovery actions.
+
+3. Use Masakari as the recovery engine and Mistral as the workflow service
+
+   VM monitors send the failure events to Masakari. Masakari will
+   analyze the content of the failure event message and call a Mistral
+   workflow to execute recovery actions.
+
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+The HTTP API of the VM recovery workflow service needs to be able to
+receive events in the format in which the VM monitor sends them.
+
+Security impact
+---------------
+
+Ideally it should be possible for the VM monitor to send instance
+event data securely to the recovery workflow service (e.g. via TLS),
+without relying on the security of the admin network over which the
+data is sent.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+
+Developer impact
+----------------
+
+Documentation Impact
+--------------------
+
+The service should be documented in the |ha-guide|_.
+
+.. |ha-guide| replace:: OpenStack High Availability Guide
+.. _ha-guide: http://docs.openstack.org/ha-guide/
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  <launchpad-id or None>
+
+Other contributors:
+  <launchpad-id or None>
+
+
+Work Items
+==========
+
+ WIP
+
+Dependencies
+============
+
+
+Testing
+=======
+
+
+Documentation Impact
+====================
+
+
+
+References
+==========
+
+
+
+History
+=======