
VM Recovery

The purpose of this spec is to describe a method to recover individual virtual machines that are marked as failed by the VM monitoring component.

Problem description

VM failures can be detected by the VM monitoring methods discussed in the VM monitoring spec.

When a VM failure event is detected, appropriate recovery actions must be taken. Those recovery actions should be decided using configurable policies based on inputs such as the state of storage (shared or otherwise), the status of the VM, and the cause of the failure.
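
As a purely illustrative sketch (the policy keys, action names, and structure below are assumptions rather than part of any existing service), such a policy could be expressed as a mapping from failure characteristics to an ordered list of recovery actions:

    # Hypothetical recovery policy: maps (failure cause, storage type) to an
    # ordered list of recovery actions to attempt. All names are illustrative.
    RECOVERY_POLICY = {
        ('crash', 'shared'):    ['restart_in_place', 'restart_on_other_host'],
        ('crash', 'local'):     ['restart_in_place'],
        ('io_error', 'shared'): ['disable_compute_service', 'restart_on_other_host'],
    }

    def select_actions(failure_cause, storage_type):
        """Return the ordered recovery actions for a detected failure."""
        return RECOVERY_POLICY.get((failure_cause, storage_type), [])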

Use Cases

As a cloud operator, I would like to provide my users with highly available VMs in order to meet high SLA requirements. There are several types of VM failure events that can occur in OpenStack clouds, and we need to make sure that such events can be detected and recovered from by the system. Possible VM failure events include:

  • VM crashes
  • VM hangs

Possible recovery methods include:

  • VM restart (stop and start)
  • VM restart on different host

Scope

This spec only addresses recovery from isolated failures of individual VMs. Monitoring of the VMs, and detection and recovery from wider failures, such as failure of a whole compute host, will be covered by separate specs, and are therefore out of scope for this spec.

This spec has the following goals:

  1. Encourage all implementations of VM recovery, whether upstream or downstream, to receive failure notifications in a standardized manner. This will allow cloud vendors and operators to implement HA of the compute plane via a collection of compatible components (of which one is compute node monitoring), whilst not being tied to any one implementation.
  2. Suggest appropriate actions which can be taken for each failure case.
  3. Provide details of and recommend a specific implementation which for the most part already exists and is proven to work.
  4. Identify gaps with that implementation and corresponding future work required.

Proposed change

VM monitors send failure events to a recovery workflow service. This workflow service analyzes the content of the failure event message and executes the appropriate recovery action. It could also handle advanced recovery options, such as enforcing a maximum restart threshold, falling back to the next recovery action, or executing multiple workflows.
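
As a minimal sketch, the failure event message consumed by such a service might look like the following; the field names here are assumptions for illustration, since the actual wire format is defined by the chosen monitor and workflow service:

    # Illustrative failure event as a VM monitor might emit it; the field
    # names are assumptions, not a defined wire format.
    failure_event = {
        'type': 'VM',
        'hostname': 'compute-01',
        'generated_time': '2016-07-01T12:00:00Z',
        'payload': {
            'instance_uuid': '<instance-uuid>',  # the failed VM
            'event': 'LIFECYCLE',
            'detail': 'crashed',                 # e.g. crashed / hung
        },
    }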

If a VM crashes, the first recovery approach is to stop and start the VM via nova-api. The maximum restart threshold should be configurable; a value of 0 means do not restart in place and move straight to the next recovery method. If the restart fails, or the threshold is 0, the service should try to restart the VM on a different host. The threshold could also be -1, indicating an infinite number of retries on the same host and preventing the VM from ever being restarted on a different host. This might be desirable in configurations where there is no shared storage for ephemeral disks and rebuilding a disk from a Glance image during nova evacuate is undesirable.
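
The threshold semantics described above could be implemented roughly as follows. This is only a sketch: the actual stop/start and evacuate operations are injected as callables that would wrap the corresponding nova-api calls:

    def recover_crashed_vm(restart_in_place, evacuate, max_restarts):
        """Sketch of the restart-threshold logic described above.

        restart_in_place, evacuate -- callables wrapping the nova-api
        stop/start and evacuate operations; each returns True on success.
        max_restarts -- 0 skips in-place restart, -1 retries it forever.
        """
        attempts = 0
        while max_restarts == -1 or attempts < max_restarts:
            attempts += 1
            if restart_in_place():
                return True  # VM recovered on the same host
        # Threshold exhausted (or set to 0): fall back to another host.
        return evacuate()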

If a VM hangs due to an I/O error, the recovery service may need to automatically disable the nova-compute service on that host and restart the VM on a different host. It could also migrate other VMs off the host in order to preempt further I/O errors.
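
As an illustration only, assuming python-novaclient and a pre-authenticated Keystone session (exact call signatures vary with the compute API microversion in use), the reaction to an I/O error could look roughly like this:

    from novaclient import client

    def handle_io_error(session, failed_host, instance_uuid):
        """Sketch: fence off a host seeing I/O errors and evacuate one VM."""
        nova = client.Client('2.1', session=session)
        # Stop the scheduler from placing new VMs on the unhealthy host.
        nova.services.disable(failed_host, 'nova-compute')
        # Rebuild the failed VM on another host chosen by the scheduler.
        nova.servers.evacuate(instance_uuid, host=None)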

Implementation

There are at least three possible ways to implement the proposed change:

  1. Use Masakari as recovery workflow service

    VM monitors send failure events to Masakari using Masakari's notification API, and Masakari executes pre-defined recovery actions (see the illustrative notification sketch after this list).

  2. Use Mistral as recovery workflow service

    VM monitors call a Mistral workflow directly to execute the appropriate recovery actions.

  3. Use Masakari as recovery engine and Mistral as workflow service

    VM monitors send failure events to Masakari, which analyzes the content of the failure event message and calls a Mistral workflow to execute the recovery actions.
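
For option 1, a VM monitor would send something like the following to Masakari's notification API. This is an illustration only: the endpoint, port, and payload fields shown here follow the conventions used by masakari and masakari-monitors at the time of writing and should be verified against the Masakari API reference:

    import datetime
    import requests

    MASAKARI_URL = 'http://controller:15868/v1/notifications'  # assumed endpoint

    notification = {
        'notification': {
            'type': 'VM',
            'hostname': 'compute-01',
            'generated_time': datetime.datetime.utcnow().isoformat(),
            'payload': {
                'instance_uuid': '<instance-uuid>',
                'event': 'LIFECYCLE',
                'vir_domain_event': 'STOPPED_FAILED',
            },
        },
    }

    # X-Auth-Token obtained from Keystone beforehand.
    requests.post(MASAKARI_URL, json=notification,
                  headers={'X-Auth-Token': '<keystone-token>'})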

Data model impact

None

REST API impact

The HTTP API of the VM recovery workflow service needs to be able to receive events in the format they are sent by the VM monitor.

Security impact

Ideally it should be possible for the VM monitor to send instance event data securely to the recovery workflow service (e.g. via TLS), without relying on the security of the admin network over which the data is sent.
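
A minimal sketch of such a transport, assuming a requests-based sender and a deployment-provided CA bundle and client certificate (all paths and the endpoint below are hypothetical):

    import requests

    # Hypothetical endpoint and certificate paths supplied by the deployment.
    RECOVERY_SERVICE_URL = 'https://recovery.example.com/v1/notifications'

    requests.post(
        RECOVERY_SERVICE_URL,
        json={'type': 'VM', 'payload': {}},       # event body as sketched above
        verify='/etc/ssl/certs/recovery-ca.pem',  # verify the server via the CA
        cert=('/etc/monitor/client.crt',          # mutual TLS: client cert/key
              '/etc/monitor/client.key'),
    )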

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Documentation Impact

The service should be documented in the OpenStack High Availability Guide.

Assignee(s)

Primary assignee:

<launchpad-id or None>

Other contributors:

<launchpad-id or None>

Work Items

WIP

Dependencies

Testing

References

History