..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Instance High Availability
==========================

Launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/instance-ha

Operators and customers frequently request the ability to automatically
resurrect VMs that were running on a compute node that failed (due to hardware
failures, networking issues or general server problems).
Currently we have a downstream-only procedure which consists of many manual
steps to configure Instance HA:

https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview

What we would like to implement here is an opt-in, automated deployment of a
cloud that has Instance HA support.

Problem Description
===================

Currently, if a compute node has a hardware failure or a kernel panic, all the
instances that were running on the node are lost, and manual intervention is
needed to resurrect these instances on another compute node.

Proposed Change
===============

Overview
--------

The proposed change is to add a few additional puppet-tripleo profiles that
help us configure the pacemaker resources needed for Instance HA.
Unlike in previous iterations, we won't need to move nova-compute resources
under pacemaker's management.
We managed to achieve the same result without touching the compute nodes
(except for setting up pacemaker_remote on the computes, but that support
already exists).

Alternatives
------------

There are a few specs that model host recovery:

Host Recovery - https://review.openstack.org/#/c/386554/

Instances auto evacuation - https://review.openstack.org/#/c/257809

The first spec uses pacemaker in a very similar way but is too new and too
high level to comment on at this point in time. The second one has been
stalled for a long time and it looks like there is no consensus yet on the
approaches needed. The long-term goal is to morph the Instance HA deployment
into the spec that gets accepted. We are actively working on both specs as
well. In any case, we have discussed the long-term plan with SUSE and NTT, and
we agreed that this spec is the first step of that plan for TripleO.

Security Impact
---------------

No additional security impact.

Other End User Impact
---------------------

End users are not impacted, except that VMs can be resurrected automatically
on a non-failed compute node.

Performance Impact
------------------

There are no performance-related impacts compared to a current deployment.

Other Deployer Impact
---------------------

This change does not affect default deployments. It adds a boolean and some
additional profiles so that a deployer can have a cloud configured with
Instance HA support out of the box.

* One top-level parameter to enable the Instance HA deployment

* Although fencing configuration is already supported by TripleO, we will
  need to improve bits and pieces so that we won't need an extra command to
  generate the fencing parameters.

* Upgrades will be impacted by this change in the sense that we will need to
  make sure to test them when Instance HA is enabled.

Developer Impact
----------------

No developer impact is planned.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  michele

Other contributors:
  cmsj, abeekhof

Work Items
----------

* Make the fencing configuration fully automated (this is mostly done already,
  we need oooq integration and some optimization)

* Add the logic and needed resources on the control-plane

* Test the upgrade path when Instance HA is configured

Testing
=======

Testing this manually is fairly simple:

* Deploy with Instance HA configured and two compute nodes
* Spawn a test VM
* Crash the compute node where the VM is running
* Observe the VM being resurrected on the other compute node

Testing this in CI is doable but might be a bit more challenging due to
resource constraints.

Documentation Impact
====================

A section under advanced configuration is needed explaining the deployment of
a cloud that supports Instance HA.

References
==========

* https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/
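
The outcome expected from the manual test in the Testing section can be
summarized with a toy model. This is only an illustrative sketch in plain
Python: ``ComputeNode`` and ``resurrect_instances`` are hypothetical names, the
"least loaded survivor" placement rule is an assumption made for the example,
and nothing here calls pacemaker, nova or TripleO.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Hypothetical stand-in for a compute node; not a TripleO or nova object."""
    name: str
    alive: bool = True
    instances: list = field(default_factory=list)


def resurrect_instances(nodes, failed_name):
    """Mark one node as failed and move its instances to surviving nodes.

    Returns a dict mapping each moved instance to its new node name.
    Raises RuntimeError if no surviving node is available.
    """
    by_name = {node.name: node for node in nodes}
    failed = by_name[failed_name]
    failed.alive = False

    survivors = [node for node in nodes if node.alive]
    if not survivors:
        raise RuntimeError("no surviving compute node to host the instances")

    placement = {}
    # Take the instances off the failed node, then place each one on the
    # surviving node that currently hosts the fewest instances (assumed rule).
    orphaned, failed.instances = failed.instances, []
    for instance in orphaned:
        target = min(survivors, key=lambda node: len(node.instances))
        target.instances.append(instance)
        placement[instance] = target.name
    return placement


# Mirror the manual test: two compute nodes, one test VM, crash its host.
nodes = [ComputeNode("compute-0"), ComputeNode("compute-1")]
nodes[0].instances.append("test-vm")
placement = resurrect_instances(nodes, "compute-0")
print(placement)
```

In a real deployment the fencing and recovery steps are performed by the
pacemaker resources on the control plane; the model above only captures the
end state that the manual test is meant to observe.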