Host Recovery

The purpose of this spec is to describe a method for recovering all virtual machines that were running on a host after that host fails.

Problem description

If a whole compute node fails, recovering its instances is crucial to providing high availability for the virtual machines. On the other hand, automatically recovering some instances may cause more problems than their sudden shutdown did.

Taking both points into account, automatic recovery needs to be configurable at both the instance and the host level. This spec describes the possible actions in case of compute node failure and their configuration. Automatic recovery of individual instances is out of scope for this spec and will be described in another document.

Use Cases

* As a cloud operator, I would like to provide my users with highly available VMs to meet high SLA requirements. Therefore, I need some of my VMs to be recovered automatically after a compute node failure.

Proposed change

VM recovery can be performed on the control plane of the OpenStack cloud, using the mistral workflow service and a pacemaker resource agent. The resource agent would be responsible for starting the workflow, whereas mistral would be responsible for performing nova_evacuate for each VM and for observing the state of each evacuated VM. Using mistral would ensure that the evacuation workflow completes even if some of the controllers die during the process.
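The per-VM logic that the workflow would carry out can be sketched in plain Python. This is only an illustration of the evacuate-then-observe loop, not a mistral workflow definition and not the final implementation. The function name `recover_host`, the injected callables (`list_instances`, `evacuate`, `get_instance`), the result strings, and the simplified `status`/`host` dictionary keys are all assumptions made for this sketch; a real workflow would call the nova API instead.

```python
import time
from typing import Callable, Dict, Iterable


def recover_host(
    failed_host: str,
    list_instances: Callable[[str], Iterable[str]],
    evacuate: Callable[[str], None],
    get_instance: Callable[[str], Dict[str, str]],
    timeout: float = 300.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Evacuate every instance from failed_host and report an outcome per instance.

    All callables are hypothetical stand-ins for nova calls:
      list_instances(host) -> instance IDs that were running on host
      evacuate(instance_id) -> request evacuation of one instance
      get_instance(instance_id) -> {"status": ..., "host": ...}
    """
    results: Dict[str, str] = {}
    pending = []

    # Step 1: request evacuation for each VM; one failure must not stop the rest.
    for instance_id in list_instances(failed_host):
        try:
            evacuate(instance_id)
            pending.append(instance_id)
        except Exception as exc:
            results[instance_id] = "evacuate_failed: %s" % exc

    # Step 2: observe each evacuated VM until it settles or the deadline passes.
    deadline = clock() + timeout
    while pending:
        still_pending = []
        for instance_id in pending:
            info = get_instance(instance_id)
            if info["status"] == "ACTIVE" and info["host"] != failed_host:
                results[instance_id] = "recovered"
            elif info["status"] == "ERROR":
                results[instance_id] = "error"
            else:
                still_pending.append(instance_id)
        pending = still_pending
        if not pending:
            break
        if clock() >= deadline:
            for instance_id in pending:
                results[instance_id] = "timeout"
            break
        sleep(poll_interval)

    return results
```

Passing the nova calls in as parameters keeps the sketch independent of any particular client library. The loop itself keeps its progress in local variables, so it would not survive a controller failure on its own; that durability is what running these steps inside mistral is expected to provide.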

Alternatives

1. We could skip the mistral workflow entirely and do all nova_evacuate related work in the pacemaker resource agents. This would mean implementing the whole HA mechanism in the agents, which would be difficult.

2. We could try to implement a real host-evacuate in nova. Right now host-evacuate iterates over all instances of the given host on the client side. We could try to move this logic into nova, but nova cores have been against this change in the past.

Data model impact

None

API impact

None

Security impact

None

Other end user impact

None

Performance Impact

Extra RAM and CPU would be needed on each controller node to run both the pacemaker and mistral services. If they are already present on the control plane, there would be no performance impact.

Other deployer impact

Distributions need to package and deploy extra services on each controller node: the mistral service and the pacemaker resource agent.

Developer impact

Nothing other than the work items listed below.

Implementation

The resource agent would receive information from the host monitor that a given host is down. It would then send a request to mistral to start the recovery workflow. The request needs the following input parameters:

Assignee(s)

Primary assignee:

<launchpad-id or None>

Other contributors:

<launchpad-id or None>

Work Items

  • Prepare a resource agent that would trigger mistral
  • Prepare the mistral workflow
  • Document the changes in the HA guide

Dependencies

Host monitor

Testing

Documentation Impact

The service should be documented in the ha-guide.

References

History