
Host Monitoring

The purpose of this spec is to describe a method for monitoring the health of OpenStack compute nodes.

Problem description

Monitoring compute node health is essential for providing high availability for VMs. A health monitor must be able to detect crashes, freezes, network connectivity issues, and any other OS-level errors on the compute node which prevent it from being able to run the necessary services in order to host existing or new VMs.

Use Cases

As a cloud operator, I would like to provide my users with highly available VMs to meet high SLA requirements. Therefore, I need my compute nodes automatically monitored for hardware failure, kernel crashes and hangs, and other failures at the operating system level. Any failure event detected needs to be passed to a compute host recovery workflow service which can then take the appropriate remedial action.

For example, if a compute host fails (or appears to fail to the extent that the monitor can detect), the recovery service will typically identify all VMs which were running on this compute host, and may take any of the following possible actions:

  • Fence the host (STONITH) to eliminate the risk of a still-running instance being resurrected elsewhere (see the next step) and simultaneously running in two places as a result, which could cause data corruption.
  • Resurrect some or all of the VMs on other compute hosts.
  • Notify the cloud operator.
  • Notify affected users.
  • Make the failure and recovery events available to telemetry / auditing systems.

Scope

This spec only addresses monitoring the health of the compute node hardware and basic operating system functions, and notifying appropriate recovery components in the case of any failure.

Monitoring the health of nova-compute and other processes it depends on, such as libvirtd and anything else at or above the hypervisor layer, including individual VMs, will be covered by separate specs, and are therefore out of scope for this spec.

Any kind of recovery workflow is also out of scope and will be covered by separate specs.

This spec has the following goals:

  1. Encourage all implementations of compute node monitoring, whether upstream or downstream, to output failure notifications in a standardized manner. This will allow cloud vendors and operators to implement HA of the compute plane via a collection of compatible components (of which one is compute node monitoring), whilst not being tied to any one implementation.
  2. Provide details of and recommend a specific implementation which for the most part already exists and is proven to work.
  3. Identify gaps with that implementation and corresponding future work required.

Acceptance criteria

Here the words "must", "should" etc. are used with the strict meanings defined in RFC 2119.

  • Compute nodes must be automatically monitored for hardware failure, kernel crashes and hangs, and other failures at the operating system level.
  • The solution must scale to hundreds of compute hosts.
  • Any failure event detected must cause the component responsible for alerting to send a notification to a configurable endpoint so that it can be consumed by the cloud operator's choice of compute node recovery workflow controller.
  • If a failure notification is not accepted by the recovery component, it should be persisted within the monitoring/alerting components, and sending of the notification should be retried periodically until it succeeds. This will ensure that remediation of failures is never dropped due to temporary failure or other unavailability of any component.
  • The alerting component must be extensible in order to allow communication with multiple types of recovery workflow controller, via a driver abstraction layer, and drivers for each type. At least one driver must be implemented initially.
  • One of the drivers should send notifications to an HTTP endpoint using a standardized JSON format as the payload.
  • Another driver should send notifications to the masakari API server.

Implementation

The implementation described here was presented at OpenStack Day Israel in June 2017; the architecture diagram from that presentation may assist in understanding the description below.

Running a pacemaker_remote service on each compute host allows it to be monitored by a central Pacemaker cluster via a straightforward TCP connection. This is an ideal solution to this problem for the following reasons:

  • Pacemaker can scale to handling a very large number of remote nodes.
  • pacemaker_remote can be simultaneously used for monitoring and managing services on each compute host.
  • pacemaker_remote is a very lightweight service which will not significantly increase the load on each compute host.
  • Pacemaker has excellent support for fencing for a wide range of STONITH devices, and it is easy to extend support to other devices, as shown by the fence_agents repository.
  • Pacemaker is easily extensible via OCF Resource Agents, which allow custom design of monitoring and of the automated reaction when those monitors fail.
  • Many clouds will already be running one or more Pacemaker clusters on the control plane, as recommended by the OpenStack High Availability Guide_, so deployment complexity is not significantly increased.
  • This architecture is already implemented and proven via the commercially supported enterprise products RHEL OpenStack Platform and SUSE OpenStack Cloud, and via masakari which is used by production deployments at NTT.

Since many different tools are currently in use for deployment of OpenStack with HA, configuration of Pacemaker is currently out of scope for upstream projects, so the exact details will be left as the responsibility of each individual deployer. Nevertheless, examples of partial configurations for Pacemaker are given below.
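For illustration, a compute host can be integrated as a Pacemaker remote node via the ocf:pacemaker:remote resource agent, using a crmsh fragment along the following lines (the node name and address are of course placeholders, and the interval values are only reasonable defaults):

```
primitive compute1 ocf:pacemaker:remote \
    params server=compute1.my.cloud.com reconnect_interval=60 \
    op monitor interval=20s
```

The reconnect_interval parameter controls how soon the cluster retries the TCP connection to the remote node after it is lost, which is the heartbeat mechanism relied upon below.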

Fencing

Fencing is technically outside the scope of this spec, in order to allow any cloud operator to choose their own clustering technology whilst remaining compliant and hence compatible with the notification standard described here. However, Pacemaker's fencing support is so convenient, and is also used to trigger the failure notification, that it is described here in full.

Pacemaker already implements effective heartbeat monitoring of its remote nodes via the TCP connection with pacemaker_remote, so it only remains to ensure that the correct steps are taken when the monitor detects failure:

  1. Firstly, the compute host must be fenced via an appropriate STONITH agent, for the reasons stated above.
  2. Once the host has been fenced, the monitor must mark the host as needing remediation in a manner which is persisted to disk (in case of changes in cluster state during handling of the failure) and read/write-accessible by a separate alerting component which can hand over responsibility of processing the failure to a recovery workflow controller, by sending it the appropriate notification.

These steps should be implemented using two features of Pacemaker. Firstly, its fencing_topology configuration directive can be used to implement the second step as a custom fencing agent which is triggered after the first step is complete. For example, the custom fencing agent might be set up via a Pacemaker primitive resource such as:

primitive fence-nova stonith:fence_compute \
    params auth-url="http://cluster.my.cloud.com:5000/v3/" \
           domain=my.cloud.com \
           tenant-name=admin \
           endpoint-type=internalURL \
           login=admin \
           passwd=s3kr1t \
    op monitor interval=10m

and then it could be configured as the second device in the fencing sequence:

fencing_topology compute1: stonith-compute1,fence-nova

Secondly, the fence_compute agent here should persist the marking of the fenced compute host via attrd, so that a separate alerting component can transfer ownership of this host's failure to a recovery workflow controller by sending it the appropriate notification message.
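In the existing fence_compute implementation, this marking is typically a per-node attribute (conventionally named evacuate) set via attrd_updater, which the alerting component can later query to discover hosts awaiting recovery. The sketch below parses the assumed output format of a query such as `attrd_updater -Q -A -n evacuate`; both the attribute name and the output format are assumptions that should be verified against the deployed Pacemaker version:

```python
import re

# Assumed output line format of `attrd_updater -Q -A -n evacuate`
# (verify against your Pacemaker version):
#   name="evacuate" host="compute1" value="yes"
ATTR_RE = re.compile(
    r'name="(?P<name>[^"]*)"\s+host="(?P<host>[^"]*)"\s+value="(?P<value>[^"]*)"'
)

def hosts_needing_recovery(attrd_output: str) -> list[str]:
    """Return hosts whose 'evacuate' attribute marks them as failed."""
    hosts = []
    for line in attrd_output.splitlines():
        m = ATTR_RE.search(line)
        if m and m.group("name") == "evacuate" and m.group("value") == "yes":
            hosts.append(m.group("host"))
    return hosts
```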

It is worth noting that the fence_compute fencing agent already exists as part of an earlier architecture, so it is strongly recommended to reuse and adapt the existing implementation rather than writing a new one from scratch.

Sending failure notifications to a host recovery workflow controller

There must be a highly available service responsible for taking host failures marked in attrd, notifying a recovery workflow controller, and updating attrd accordingly once appropriate action has been taken. A suggested name for this service is nova-host-alerter.

It should be easy to ensure this alerter service is highly available by placing it under management of the existing Pacemaker cluster. It could be written as an OCF resource agent, or as a Python daemon which is controlled by an OCF / LSB / systemd resource agent.
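For instance, if the alerter shipped with a systemd unit, it could be placed under cluster control with a crmsh resource definition along the following lines (the unit name nova-host-alerter is hypothetical, matching the suggested service name above):

```
primitive nova-host-alerter systemd:nova-host-alerter \
    op monitor interval=30s
```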

The alerter service must contain an extensible driver-based architecture, so that it is capable of sending notifications to a number of different recovery workflow controllers.

In particular it must have a driver for sending notifications via the masakari API. If the service is implemented as a shell script, this could be achieved by invoking masakari's notification-create CLI, or if in Python, via the python-masakariclient library.
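A minimal sketch of such a driver abstraction in Python might look like the following. The class and method names are illustrative only, and the masakari driver is omitted here since the exact client API is out of scope:

```python
import abc
import json
import urllib.request

class NotificationDriver(abc.ABC):
    """Base class for recovery workflow controller drivers."""

    @abc.abstractmethod
    def send(self, notification: dict) -> None:
        """Deliver a host failure notification; raise on failure so
        the alerter can persist the notification and retry later."""

class HttpJsonDriver(NotificationDriver):
    """POSTs the standardized JSON payload to a configurable endpoint."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def send(self, notification: dict) -> None:
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(notification).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # raises on HTTP or network errors

# Simple registry so the driver can be selected from configuration.
DRIVERS = {
    "http-json": HttpJsonDriver,
}

def get_driver(name: str, **kwargs) -> NotificationDriver:
    return DRIVERS[name](**kwargs)
```

A masakari driver would then be a second entry in the registry, implementing send() via the masakari API.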

Ideally it should also have a driver for sending HTTP POST messages to a configurable endpoint with JSON data formatted in the following form:

{
    "id": UUID,
    "event_type": "host failure",
    "version": "1.0",
    "generated_time": TIMESTAMP,
    "payload": {
        "hostname": COMPUTE_NAME,
        "on_shared_storage": [true|false],
        "failure_time": TIMESTAMP
    }
}

COMPUTE_NAME refers to the FQDN of the compute node on which the failures have occurred. on_shared_storage is true if and only if the compute host's instances are backed by shared storage. failure_time provides a timestamp (in seconds since the UNIX epoch) for when the failure occurred.
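As an illustration, such a payload could be constructed as follows. The helper name is hypothetical, and generated_time is assumed to use seconds since the UNIX epoch, matching failure_time:

```python
import json
import time
import uuid

def build_host_failure_notification(hostname: str,
                                    on_shared_storage: bool,
                                    failure_time: float) -> dict:
    """Build a host failure notification in the standardized format."""
    return {
        "id": str(uuid.uuid4()),
        "event_type": "host failure",
        "version": "1.0",
        "generated_time": time.time(),
        "payload": {
            "hostname": hostname,
            "on_shared_storage": on_shared_storage,
            "failure_time": failure_time,
        },
    }

# The dict serializes directly into the JSON body of the HTTP POST, e.g.:
#   json.dumps(build_host_failure_notification(
#       "compute1.my.cloud.com", True, 1497000000.0))
```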

This is already implemented as fence_evacuate.py, although the message sent by that script is currently specifically formatted to be consumed by Mistral.

Alternatives

No alternatives to the overall architecture are obviously apparent at this point. However it is possible that attrd (which is functional but not comprehensively documented) could be replaced by some other highly available key/value attribute store, such as etcd.

Impact assessment

Data model impact

None

API impact

The HTTP API of the host recovery workflow service needs to be able to receive events in the format they are sent by this host monitor.

Security impact

Ideally it should be possible for the host monitor to send host event data securely to the recovery workflow service (e.g. via TLS), without relying on the security of the admin network over which the data is sent.

Other end user impact

None

Performance Impact

There will be a small amount of extra RAM and CPU required on each compute node to run the pacemaker_remote service. However, it is a relatively simple service, so this should not have a significant impact on the node.

Other deployer impact

Distributions need to package pacemaker_remote; however this is already done for many distributions including SLES, openSUSE, RHEL, CentOS, Fedora, Ubuntu, and Debian.

Automated deployment solutions need to deploy and configure the pacemaker_remote service on each compute node; however this is a relatively simple task.

Developer impact

Nothing other than the listed work items below.

Documentation Impact

The service should be documented in the OpenStack High Availability Guide_.

Assignee(s)

Primary assignee:

  • Adam Spiers

Other contributors:

  • Sampath Priyankara
  • Andrew Beekhof
  • Dawid Deja

Work Items

  • Implement nova-host-alerter (TODO: choose owner for this)
  • If appropriate, move the existing fence_evacuate.py to a more suitable long-term home (TODO: choose owner for this)
  • Add SSL support (TODO: choose owner for this)
  • Add documentation to the OpenStack High Availability Guide_ (aspiers / beekhof)

Dependencies

Testing

Cloud99 could possibly be used for testing.

References

History

Revisions

Release Name   Description
Pike           Updated to have alerting mechanism decoupled from fencing process
Newton         First introduced