Instance HA Specification

Change-Id: I431ddb209e7a13c39b2a9645d39e122db2d9dd30
Michele Baldessari 2016-10-20 11:57:54 +02:00 committed by Alex Schultz
parent 3c7dc4f847
commit abc0d5a3dd
1 changed files with 145 additions and 0 deletions

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Instance High Availability
==========================
Include the URL of your launchpad blueprint:
https://blueprints.launchpad.net/tripleo/+spec/instance-ha
Operators and customers frequently ask for a way to automatically resurrect
VMs that were running on a compute node that failed (due to hardware failures,
networking issues or general server problems).
Currently we have a downstream-only procedure which consists of many manual
steps to configure Instance HA:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview
What we would like to implement here is an opt-in, automated deployment of a
cloud that has Instance HA support.
Problem Description
===================
Currently, if a compute node has a hardware failure or a kernel panic, all the
instances that were running on the node are lost, and manual intervention is
needed to resurrect them on another compute node.
Proposed Change
===============
Overview
--------
The proposed change is to add a few puppet-tripleo profiles that configure the
pacemaker resources needed for Instance HA. Unlike in previous iterations, we
won't need to move nova-compute resources under pacemaker's management. We
managed to achieve the same result without touching the compute nodes (except
for setting up pacemaker_remote on the computes, but that support already
exists).
Alternatives
------------
There are a few specs that are modeling host recovery:
Host Recovery - https://review.openstack.org/#/c/386554/
Instances auto evacuation - https://review.openstack.org/#/c/257809
The first spec uses pacemaker in a very similar way but is too new
and too high level to comment on at this point.
The second one has been stalled for a long time and there appears to be
no consensus yet on the approaches needed. The long-term goal is
to morph the Instance HA deployment into whichever spec gets accepted.
We are actively working on both specs as well. In any case, we have
discussed this with SUSE and NTT and agreed on a long-term plan,
of which this spec is the first step for TripleO.
Security Impact
---------------
No additional security impact.
Other End User Impact
---------------------
End users are not impacted except for the fact that VMs can be resurrected
automatically on a non-failed compute node.
Performance Impact
------------------
There are no performance-related impacts compared to a current deployment.
Other Deployer Impact
---------------------
This change does not affect default deployments. It adds a boolean and some
additional profiles so that a deployer can have a cloud configured with
Instance HA support out of the box.
* One top-level parameter to enable the Instance HA deployment
* Although fencing configuration is already supported by TripleO, we will need
  to improve bits and pieces so that no extra command is needed to generate the
  fencing parameters.
* Upgrades are impacted in the sense that we will need to make sure to test
  them with Instance HA enabled.
Developer Impact
----------------
No developer impact is planned.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
michele
Other contributors:
cmsj, abeekhof
Work Items
----------
* Make the fencing configuration fully automated (this is mostly done already;
  we still need oooq integration and some optimization)
* Add the logic and needed resources on the control-plane
* Test the upgrade path when Instance HA is configured
Testing
=======
Testing this manually is fairly simple:
* Deploy with Instance HA configured and two compute nodes
* Spawn a test VM
* Crash the compute node where the VM is running
* Observe the VM being resurrected on the other compute node
Testing this in CI is doable but might be a bit more challenging due to resource constraints.
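The last step of the manual procedure (observing the VM come back on the other
compute node) could be automated with a small polling helper. The following is
only a sketch under stated assumptions: ``wait_for_resurrection`` and the
``get_server`` callable are hypothetical names that are not part of this spec or
of TripleO, and ``get_server`` is assumed to return the instance's current
``status`` and ``host`` however the test harness chooses to obtain them.

```python
import time
from typing import Callable, Mapping


def wait_for_resurrection(
    get_server: Callable[[], Mapping[str, str]],
    failed_host: str,
    timeout: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the VM is ACTIVE on a host other than ``failed_host``.

    ``get_server`` is a hypothetical callable supplied by the test harness;
    it must return a mapping with "status" and "host" keys.
    Returns the name of the new host, or raises TimeoutError.
    """
    deadline = clock() + timeout
    while True:
        server = get_server()
        host = server.get("host")
        # Success: the instance is running again, and not on the crashed node.
        if server.get("status") == "ACTIVE" and host and host != failed_host:
            return host
        if clock() >= deadline:
            raise TimeoutError(
                f"VM was not resurrected away from {failed_host} "
                f"within {timeout} seconds"
            )
        sleep(interval)
```

In a real test, ``get_server`` might read the ``status`` field and the
admin-only ``OS-EXT-SRV-ATTR:host`` field of the Nova server record. ``sleep``
and ``clock`` are injectable so that the helper itself can be unit tested
without waiting.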
Documentation Impact
====================
A section under advanced configuration is needed explaining the deployment of
a cloud that supports Instance HA.
References
==========
* https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/