Instance HA Specification
Change-Id: I431ddb209e7a13c39b2a9645d39e122db2d9dd30
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Instance High Availability
==========================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/instance-ha

Operators and customers frequently request the ability to automatically
resurrect VMs that were running on a compute node that failed (whether due to
hardware failures, networking issues or general server problems).
Currently we have a downstream-only procedure which consists of many manual
steps to configure Instance HA:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview

What we would like to implement here is an optional, opt-in automated
deployment of a cloud that has Instance HA support.

Problem Description
===================

Currently, if a compute node has a hardware failure or a kernel panic, all the
instances that were running on that node are lost, and manual intervention
is needed to resurrect these instances on another compute node.

Proposed Change
===============

Overview
--------

The proposed change is to add a few additional puppet-tripleo profiles that
configure the pacemaker resources needed for Instance HA. Unlike in previous
iterations, we won't need to move nova-compute resources under pacemaker's
management. We managed to achieve the same result without touching the compute
nodes (except for setting up pacemaker_remote on the computes, but that support
exists already).
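
As a purely illustrative toy model of the outcome this change automates
(instances from a failed compute node ending up on surviving nodes), the sketch
below drains a failed host in plain Python. It is not the pacemaker or Nova
implementation: the function name ``plan_evacuation`` is hypothetical, and the
"least-loaded survivor" rule is an assumption standing in for whatever
placement decision the real scheduler makes.

```python
from collections import defaultdict


def plan_evacuation(placement, failed_host, hosts):
    """Return a new instance -> host mapping with the failed host drained.

    placement:   dict mapping instance name -> compute host name
    failed_host: host that is considered dead
    hosts:       iterable of all compute host names

    Toy model only: the target choice (least-loaded survivor) is an
    assumption, not what the real deployment does.
    """
    survivors = sorted(h for h in hosts if h != failed_host)
    if not survivors:
        raise RuntimeError("no surviving compute node to evacuate to")

    # Count how many instances each surviving host already runs.
    load = defaultdict(int)
    for host in placement.values():
        if host != failed_host:
            load[host] += 1

    new_placement = dict(placement)
    for instance in sorted(placement):
        if placement[instance] != failed_host:
            continue
        # Pick the least-loaded survivor; ties are broken by host name.
        target = min(survivors, key=lambda h: (load[h], h))
        new_placement[instance] = target
        load[target] += 1
    return new_placement
```

With two instances on a failed ``compute-0`` and one on ``compute-1``, the
model spreads the two orphaned instances across ``compute-1`` and
``compute-2`` and leaves the unaffected instance where it is.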

Alternatives
------------

There are a few specs that model host recovery:

* Host Recovery - https://review.openstack.org/#/c/386554/
* Instances auto evacuation - https://review.openstack.org/#/c/257809

The first spec uses pacemaker in a very similar way but is too new
and too high-level for us to comment on in detail at this point in time.
The second one has been stalled for a long time and it looks like there
is no consensus yet on the approaches needed. The long-term goal is
to morph the Instance HA deployment into the spec that gets accepted.
We are actively working on both specs as well. In any case, we have
discussed this with SUSE and NTT and agreed on a long-term plan, of
which this spec is the first step for TripleO.

Security Impact
---------------

No additional security impact.

Other End User Impact
---------------------

End users are not impacted, except that VMs can be resurrected
automatically on a non-failed compute node.

Performance Impact
------------------

There is no performance impact compared to a current deployment.

Other Deployer Impact
---------------------

This change does not affect default deployments. It adds a boolean parameter
and some additional profiles so that a deployer can have a cloud configured
with Instance HA support out of the box.

* One top-level parameter to enable the Instance HA deployment

* Although fencing configuration is already supported by TripleO, we will need
  to improve bits and pieces so that we won't need an extra command to generate
  the fencing parameters.

* Upgrades will be impacted by this change in the sense that we will need to
  make sure to test them when Instance HA is enabled.
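
To make the "one top-level parameter" idea concrete, here is a minimal sketch
that builds a Heat environment with a single boolean under
``parameter_defaults``. The parameter name ``EnableInstanceHA`` and the helper
names are hypothetical; this spec only states that one boolean will exist, not
what it will be called.

```python
import json


def instance_ha_environment(enabled=True):
    """Build a Heat environment dict that toggles Instance HA.

    "EnableInstanceHA" is a hypothetical parameter name used for
    illustration; the spec does not name the parameter.
    """
    return {"parameter_defaults": {"EnableInstanceHA": bool(enabled)}}


def render(environment):
    """Serialize the environment as JSON.

    JSON documents are generally accepted by YAML parsers, so the output
    can be written to an environment file without a YAML dependency.
    """
    return json.dumps(environment, indent=2, sort_keys=True)
```

A deployer would then write ``render(instance_ha_environment())`` to a file and
pass it to the deployment as an extra environment file; leaving the file out
keeps the default (non Instance HA) deployment unchanged.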

Developer Impact
----------------

No developer impact is planned.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  michele

Other contributors:
  cmsj, abeekhof

Work Items
----------

* Make the fencing configuration fully automated (this is mostly done already;
  we still need oooq (tripleo-quickstart) integration and some optimization)

* Add the logic and needed resources on the control plane

* Test the upgrade path when Instance HA is configured


Testing
=======

Testing this manually is fairly simple:

* Deploy with Instance HA configured and two compute nodes

* Spawn a test VM

* Crash the compute node where the VM is running

* Observe the VM being resurrected on the other compute node

Testing this in CI is doable but might be more challenging due to resource
constraints.
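
The final "observe" step of the manual procedure above could be scripted
roughly as follows. This is a sketch under stated assumptions, not an existing
test helper: ``wait_for_evacuation`` is a hypothetical name, and ``get_host`` is
a caller-supplied callable that returns the compute host currently reported for
the test VM (how that value is obtained is deliberately left to the caller).

```python
import time


def wait_for_evacuation(get_host, original_host, timeout=600.0, interval=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the instance reports a host other than original_host.

    get_host is a caller-supplied callable returning the instance's current
    compute host, or None while the host is unknown. Returns the new host
    name, or raises TimeoutError if the instance has not moved once
    `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        host = get_host()
        if host and host != original_host:
            return host
        if clock() >= deadline:
            raise TimeoutError(
                "instance still on %r after %.0f seconds"
                % (original_host, timeout))
        sleep(interval)
```

The ``sleep`` and ``clock`` arguments exist only so the helper can be exercised
quickly with fakes; a real run would keep the defaults and tune ``timeout`` to
how long fencing and recovery take in the environment under test.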

Documentation Impact
====================

A section under advanced configuration is needed explaining the deployment of
a cloud that supports Instance HA.

References
==========

* https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/