..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Host Monitoring
==========================================

The purpose of this spec is to describe a method for monitoring the
health of OpenStack compute nodes.

Problem description
===================

Monitoring compute node health is essential for providing high
availability for VMs. A health monitor must be able to detect crashes,
freezes, network connectivity issues, and any other OS-level errors on
the compute node which prevent it from running the services necessary
to host existing or new VMs.

Use Cases
---------

As a cloud operator, I would like to provide my users with highly
available VMs to meet high SLA requirements. Therefore, I need my
compute nodes automatically monitored for hardware failure, kernel
crashes and hangs, and other failures at the operating system level.
Any failure event detected needs to be passed to a compute host
recovery workflow service which can then take the appropriate remedial
action.

For example, if a compute host fails (or appears to fail to the extent
that the monitor can detect), the recovery service will typically
identify all VMs which were running on this compute host, and may take
any of the following possible actions:

- Fence the host (STONITH) to eliminate the risk of a still-running
  instance being resurrected elsewhere (see the next step) and
  simultaneously running in two places as a result, which could cause
  data corruption.

- Resurrect some or all of the VMs on other compute hosts.

- Notify the cloud operator.

- Notify affected users.

- Make the failure and recovery events available to telemetry /
  auditing systems.

Scope
-----

This spec only addresses monitoring the health of the compute node
hardware and basic operating system functions, and notifying
appropriate recovery components in the case of any failure.

Monitoring the health of ``nova-compute`` and other processes it
depends on, such as ``libvirtd`` and anything else at or above the
hypervisor layer, including individual VMs, will be covered by
separate specs, and is therefore out of scope for this spec.

Any kind of recovery workflow is also out of scope and will be covered
by separate specs.

This spec has the following goals:

1. Encourage all implementations of compute node monitoring, whether
   upstream or downstream, to output failure notifications in a
   standardized manner. This will allow cloud vendors and operators
   to implement HA of the compute plane via a collection of compatible
   components (of which one is compute node monitoring), whilst not
   being tied to any one implementation.

2. Provide details of and recommend a specific implementation which
   for the most part already exists and is proven to work.

3. Identify gaps in that implementation and the corresponding future
   work required.

Acceptance criteria
===================

The words "must", "should" etc. are used below with the strict meanings
defined in `RFC 2119 `_.

- Compute nodes must be automatically monitored for hardware failure,
  kernel crashes and hangs, and other failures at the operating system
  level.

- The solution must scale to hundreds of compute hosts.

- Any failure event detected must cause the component responsible for
  alerting to send a notification to a configurable endpoint so that
  it can be consumed by the cloud operator's choice of compute node
  recovery workflow controller.

- If a failure notification is not accepted by the recovery component,
  it should be persisted within the monitoring/alerting components,
  and sending of the notification should be retried periodically until
  it succeeds. This will ensure that remediation of failures is never
  dropped due to temporary failure or other unavailability of any
  component.

- The alerting component must be extensible so that it can communicate
  with multiple types of recovery workflow controller, via a driver
  abstraction layer with a driver for each type. At least one driver
  must be implemented initially.

- One of the drivers should send notifications to an HTTP endpoint
  using a standardized JSON format as the payload.

- Another driver should send notifications to the
  `masakari API server `_.

Implementation
==============

The implementation described here was presented at OpenStack Day
Israel, June 2017; `this diagram `_ from that presentation should
assist in understanding the description below.

Running a `pacemaker_remote `_ service on each compute host allows
it to be monitored by a central Pacemaker cluster via a
straightforward TCP connection. This is an ideal solution to the
problem for the following reasons:

- Pacemaker can scale to handle a very large number of remote nodes.

- ``pacemaker_remote`` can be used simultaneously for monitoring and
  managing services on each compute host.

- ``pacemaker_remote`` is a very lightweight service which will not
  cause any significant increase in load on each compute host.

- Pacemaker has excellent support for fencing via a wide range of
  STONITH devices, and it is easy to extend support to other devices,
  as shown by the `fence_agents repository `_.

- Pacemaker is easily extensible via OCF Resource Agents, which allow
  custom design of monitoring and of the automated reaction when those
  monitors fail.

- Many clouds will already be running one or more Pacemaker clusters
  on the control plane, as recommended by the |ha-guide|_, so
  deployment complexity is not significantly increased.

- This architecture is already implemented and proven via the
  commercially supported enterprise products RHEL OpenStack Platform
  and SUSE OpenStack Cloud, and via `masakari `_, which is used in
  production deployments at NTT.

Since many different tools are currently in use for deploying
OpenStack with HA, configuration of Pacemaker is currently out of
scope for upstream projects, so the exact details are left as the
responsibility of each individual deployer. Nevertheless, examples of
partial configurations for Pacemaker are given below.
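
As a first illustration (host names, addresses, credentials, the
choice of the ``fence_ipmilan`` agent and all intervals below are
purely examples, not part of this spec), a compute host could be
registered as a remote node and given its own power fencing device:

.. code::

    primitive compute1 ocf:pacemaker:remote \
        params server=compute1.my.cloud.com reconnect_interval=60 \
        op monitor interval=20s timeout=30s

    primitive stonith-compute1 stonith:fence_ipmilan \
        params pcmk_host_list=compute1 \
          ipaddr=compute1-ipmi.my.cloud.com \
          login=admin \
          passwd=s3kr1t \
          lanplus=true \
        op monitor interval=10m

Here ``stonith-compute1`` is the per-host power fencing device which
is referenced again by the ``fencing_topology`` example in the next
section; any fencing agent appropriate to the deployment's hardware
could be used in its place.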

Fencing
-------

Fencing is technically outside the scope of this spec, in order to
allow any cloud operator to choose their own clustering technology
whilst remaining compliant with, and hence compatible with, the
notification standard described here. However, Pacemaker offers such
a convenient solution to fencing, one which is also used to trigger
the failure notification, that it is described here in full.

Pacemaker already implements effective heartbeat monitoring of its
remote nodes via the TCP connection with ``pacemaker_remote``, so it
only remains to ensure that the correct steps are taken when the
monitor detects a failure:

1. Firstly, the compute host must be fenced via an appropriate STONITH
   agent, for the reasons stated above.

2. Once the host has been fenced, the monitor must mark the host as
   needing remediation in a manner which is persisted to disk (in case
   the cluster state changes while the failure is being handled) and
   which is read/write-accessible by a separate alerting component.
   That component can then hand over responsibility for processing the
   failure to a recovery workflow controller by sending it the
   appropriate notification.

These steps should be implemented using two features of Pacemaker.
Firstly, its ``fencing_topology`` configuration directive is used to
implement the second step as a custom fencing agent which is triggered
after the first step is complete. For example, the custom fencing
agent might be set up via a Pacemaker ``primitive`` resource such as:

.. code::

    primitive fence-nova stonith:fence_compute \
        params auth-url="http://cluster.my.cloud.com:5000/v3/" \
          domain=my.cloud.com \
          tenant-name=admin \
          endpoint-type=internalURL \
          login=admin \
          passwd=s3kr1t \
        op monitor interval=10m

and then it could be configured as the second device in the fencing
sequence:

.. code::

    fencing_topology compute1: stonith-compute1,fence-nova

Secondly, the ``fence_compute`` agent here should persist the marking
of the fenced compute host via `attrd `_, so that a separate
alerting component can transfer ownership of this host's failure to a
recovery workflow controller by sending it the appropriate
notification message.

It is worth noting that the ``fence_compute`` fencing agent `already
exists `_ as part of an earlier architecture, so it is strongly
recommended to reuse and adapt the existing implementation rather than
write a new one from scratch.

Sending failure notifications to a host recovery workflow controller
----------------------------------------------------------------------

There must be a highly available service responsible for taking host
failures marked in ``attrd``, notifying a recovery workflow
controller, and updating ``attrd`` accordingly once appropriate action
has been taken. A suggested name for this service is
``nova-host-alerter``.

It should be easy to make this alerter service highly available by
placing it under the management of the existing Pacemaker cluster. It
could be written as an `OCF resource agent `_, or as a Python
daemon which is controlled by an OCF / LSB / ``systemd`` resource
agent.
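
For example, assuming the alerter is eventually packaged with a
``systemd`` unit called ``nova-host-alerter`` (a name used here purely
for illustration), it could be placed under cluster management with a
resource along these lines:

.. code::

    primitive nova-host-alerter systemd:nova-host-alerter \
        op monitor interval=30s timeout=90s

Pacemaker would then restart the daemon if it fails, and start it on
another node in the cluster if its current node goes down, which is
all the high availability the alerter itself requires.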

The alerter service must contain an extensible driver-based
architecture, so that it is capable of sending notifications to a
number of different recovery workflow controllers.

In particular it must have a driver for sending notifications via the
`masakari API `_. If the service is implemented as a shell script,
this could be achieved by invoking masakari's ``notification-create``
CLI, or if in Python, via the `python-masakariclient library `_.

Ideally it should also have a driver for sending HTTP POST messages to
a configurable endpoint with JSON data formatted in the following
form:

.. code-block:: json

    {
        "id": UUID,
        "event_type": "host failure",
        "version": "1.0",
        "generated_time": TIMESTAMP,
        "payload": {
            "hostname": COMPUTE_NAME,
            "on_shared_storage": [true|false],
            "failure_time": TIMESTAMP
        }
    }

``COMPUTE_NAME`` refers to the FQDN of the compute node on which the
failure occurred. ``on_shared_storage`` is ``true`` if and only if the
compute host's instances are backed by shared storage.
``failure_time`` provides a timestamp (in seconds since the UNIX
epoch) for when the failure occurred.

This is already implemented as `fence_evacuate.py `_, although the
message sent by that script is currently formatted specifically to be
consumed by Mistral.

Alternatives
============

No alternatives to the overall architecture are apparent at this
point. However it is possible that `attrd `_ (which is functional
but not comprehensively documented) could be replaced by some other
highly available key/value attribute store, such as `etcd `_.

Impact assessment
=================

Data model impact
-----------------

None

API impact
----------

The HTTP API of the host recovery workflow service needs to be able to
receive events in the format in which they are sent by this host
monitor.

Security impact
---------------

Ideally it should be possible for the host monitor to send failure
event data securely to the recovery workflow service (e.g. via TLS),
without relying on the security of the admin network over which the
data is sent.

Other end user impact
---------------------

None

Performance Impact
------------------

A small amount of extra RAM and CPU will be required on each compute
node to run the ``pacemaker_remote`` service. However, it is a
relatively simple service, so this should not have a significant
impact on the node.

Other deployer impact
---------------------

Distributions need to package ``pacemaker_remote``; however this is
already done for many distributions, including SLES, openSUSE, RHEL,
CentOS, Fedora, Ubuntu, and Debian.

Automated deployment solutions need to deploy and configure the
``pacemaker_remote`` service on each compute node; however this is a
relatively simple task.

Developer impact
----------------

Nothing other than the work items listed below.

Documentation Impact
--------------------

The service should be documented in the |ha-guide|_.

Assignee(s)
===========

Primary assignee:

- Adam Spiers

Other contributors:

- Sampath Priyankara
- Andrew Beekhof
- Dawid Deja

Work Items
==========

- Implement ``nova-host-alerter`` (**TODO**: choose owner for this)

- If appropriate, move the existing `fence_evacuate.py `_ to a more
  suitable long-term home (**TODO**: choose owner for this)

- Add SSL support (**TODO**: choose owner for this)

- Add documentation to the |ha-guide|_ (``aspiers`` / ``beekhof``)

.. |ha-guide| replace:: OpenStack High Availability Guide
.. _ha-guide: http://docs.openstack.org/ha-guide/

Dependencies
============

- `Pacemaker `_

Testing
=======

`Cloud99 `_ could possibly be used for testing. A rough manual smoke
test is also sketched below.
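
As that manual smoke test of the whole pipeline (assuming a disposable
compute node, here the illustrative ``compute1`` from the earlier
examples), it should be enough to make the node appear dead to the
cluster and then verify that the configured endpoint receives a
notification:

.. code::

    # On the compute node: simulate an OS-level failure by killing the
    # remote daemon (blocking TCP port 3121 or powering the node off
    # works equally well).
    pkill -9 pacemaker_remoted

    # Alternatively, from any core cluster node: fence the host
    # directly, which exercises the fencing_topology and hence the
    # notification path.
    crm --force node fence compute1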

References
==========

- `Architecture diagram presented at OpenStack Day Israel, June 2017
  `_ (see also `the video of the talk `_)

- `"High Availability for Virtual Machines" user story `_

- `Video of "High Availability for Instances: Moving to a Converged
  Upstream Solution" presentation at OpenStack conference in Boston,
  May 2017 `_

- `Instance HA etherpad started at Newton Design Summit in Austin,
  April 2016 `_

- `Video of "HA for Pets and Hypervisors" presentation at OpenStack
  conference in Austin, April 2016 `_

- `automatic-evacuation etherpad `_

- Existing `fence agent `_ which sends failure notification payload
  as JSON over HTTP.

- `Instance auto-evacuation cross project spec (WIP) `_


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Pike
     - Updated to have alerting mechanism decoupled from fencing process
   * - Newton
     - First introduced