From e243a2c5450a80a6916c1a4c6e1ea7349900e684 Mon Sep 17 00:00:00 2001 From: Dawid Deja Date: Fri, 14 Oct 2016 13:57:09 +0200 Subject: [PATCH] Spec for host recovery Change-Id: Ifa2583901cd2dff0b450d81fd7de96b27e9c315a --- .../newton-instance-ha-host-recovery.rst | 161 ++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 specs/newton/approved/newton-instance-ha-host-recovery.rst diff --git a/specs/newton/approved/newton-instance-ha-host-recovery.rst b/specs/newton/approved/newton-instance-ha-host-recovery.rst new file mode 100644 index 0000000..0163459 --- /dev/null +++ b/specs/newton/approved/newton-instance-ha-host-recovery.rst @@ -0,0 +1,161 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +============= +Host Recovery +============= + +The purpose of this spec is to describe a method to recover all virtual +machines that are on the host after its failure. + +Problem description +=================== + +In case of whole compute node failure, recovering of instances is crucial for +providing the high availability for the virtual machines. On the other hand, +automatic recovery of some instances may cause even more problems than the fact, +that they were suddenly turned off. + +When taking both arguments into account it seems obvious that there is a need +for automatic recovery that has to be configurable, on both instance and host +level. This spec is to describe what are possible actions in case of compute +node failure and to describe the configuration. Automatic recovery of +particular instances is out of scope of this spec and would be described in +another document. + +Use Cases +--------- + +* As a cloud operator, I would like to provide my users with highly +available VMs to meet high SLA requirements. Therefore, I need some of my VMs +to automatically resurrect after compute node failure. + +Proposed change +=============== + +VMs recovery can be perform on the control plane of OpenStack cloud. It would be +done using mistral workflow service and pacemaker resource agent. The resource +agent would be responsible for starting the workflow, whereas mistral would +be responsible for performing *nova_evacuate* for each VM and for observing the +state of each evacuated VM. Usage of mistral would ensure that evacuation +workflow will end, even if some of the controllers dies during the process. + +Alternatives +------------ + +1. We may not use mistral workflow at all and do all *nova_evacuate* related +stuff in the pacemaker resource agents. But this means that we would have to +implement all the HA mechanism in it, which would be difficult. + +2. We may try to implement real *host-evacuate* in nova. Right now +*host-evacuate* iterate over all instances from given host on the client side. +We can try to change it and implement it in nova, but nova cores were against +this change in the past. + +Data model impact +----------------- + +None + +API impact +---------- + +None + +Security impact +--------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +There would be extra amount of RAM and CPU needed on each controller node to +run both pacemaker and mistral services. If they are already present on the +control plane, there would be no performance impact. + +Other deployer impact +--------------------- + +Distributions need to package and deploy an extra services on each +controller node. Those services are mistral service and pacemaker resource +agent. + +Developer impact +---------------- + +Nothing other than the listed work items below. + +Implementation +============== + +Resource agent would receive information from host monitor, that given host +is down. Then it would send a request to mistral to start recovery workflow. +Request needs to have below input parameters: + +.. code-block:: json + { + "search_opts": { + "host": COMPUTE_NAME + }, + "on_shared_storage": [true|false] + } + +Assignee(s) +----------- + +Primary assignee: + + +Other contributors: + + +Work Items +---------- + +* Prepare resource agent that would trigger mistral +* Prepare mistral workflow +* Document changes in HA guide + +Dependencies +============ + +Host monitor + +Testing +======= + +Documentation Impact +==================== + +The service should be documented in the ha-guide. + +References +========== + +- `Instance HA etherpad started at Newton Design Summit in Austin + `_ + +- `"High Availability for Virtual Machines" user story + `_ + +- `video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin + `_ + +- `automatic-evacuation etherpad + `_ + +- `Instance auto-evacuation cross project spec (WIP) + `_ + + +History +=======