Update docs and examples for health policy v1.1

The documentation and example health policies need to be updated to
reflect health policy v1.1 schema changes.

Change-Id: Ie447397b001fba9798025370cdb0087e679fdfe4
Closes-Bug: #1834516
Duc Truong 2019-06-28 20:33:04 +00:00
parent bc12179412
commit ebefa20ca2
5 changed files with 109 additions and 44 deletions

View File

@ -13,7 +13,7 @@
==================
Health Policy V1.0
Health Policy V1.1
==================
The health policy is designed to automate the failure detection and recovery
@ -74,7 +74,10 @@ The current vision is that the health policy will support following types
of failure detection:
* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
and check if there are nodes inactive.
and checks if there are nodes inactive.
* ``NODE_STATUS_POLL_URL``: the *health manager* periodically polls a URL
and checks if a node is considered healthy based on the response.
* ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
sent by the backend service (e.g. nova-compute).
@ -104,27 +107,19 @@ periodically check the resource status by querying the backend service and see
if the resource is active. Below is a sample configuration::
type: senlin.policy.health
version: 1.0
version: 1.1
properties:
detection:
type: NODE_STATUS_POLLING
options:
internal: 120
interval: 120
detection_modes:
- type: NODE_STATUS_POLLING
...
**NOTE**: The current polling logic is only about checking with the backend
service whether a resource is in "ACTIVE" status. However, in future, this may
get extended to having the *health manager* ping the IP address of a nova
server or posting a "GET" request to a specific URL. We believe such
extensions can better reveal whether a specific node is operating.
Once such a policy object is attached to a cluster, Senlin registers the
cluster to the *health manager* engine for failure detection, i.e., node
health checking. A thread is created to issue a ``CLUSTER_CHECK`` RPC request
to the ``senlin-engine`` periodically at the specified interval. The
``CLUSTER_CHECK`` action only refreshes the status of each and every node in
the cluster.
health checking. A thread is created to periodically call Nova to check the
status of the node. If the server status is ERROR, SHUTOFF or DELETED, the node
is considered unhealthy.
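The status rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this document, not Senlin's implementation; the function and constant names are hypothetical.

```python
# Server statuses that mark a node as unhealthy, as listed in the text above.
# The constant and function names are hypothetical, chosen for illustration.
UNHEALTHY_SERVER_STATUSES = frozenset({"ERROR", "SHUTOFF", "DELETED"})


def is_node_healthy(server_status: str) -> bool:
    """Return True unless the reported server status marks the node unhealthy."""
    return server_status.upper() not in UNHEALTHY_SERVER_STATUSES
```

Any status outside the three listed values, such as ``ACTIVE``, is treated as healthy in this sketch.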
When one of the ``senlin-engine`` services is restarted, a new *health manager*
engine will be launched. This new engine will check the database and see if
@ -133,6 +128,54 @@ health status maintained by a *health manager* that is no longer alive. The
new *health manager* will pick up these clusters for health management.
Polling Node URL
----------------
The health check for a node can also be configured to periodically query a
URL with the ``NODE_STATUS_POLL_URL`` detection type. The URL can optionally
contain expansion parameters. Expansion parameters are strings enclosed in {}
that will be substituted with the node specific value by Senlin prior to
querying the URL. The only valid expansion parameter at this point is
``{nodename}``. This expansion parameter will be replaced with the name of the
Senlin node. Below is a sample configuration::
type: senlin.policy.health
version: 1.1
properties:
detection:
interval: 120
detection_modes:
- type: NODE_STATUS_POLL_URL
options:
poll_url: "http://{nodename}/healthstatus"
poll_url_healthy_response: "passing"
poll_url_conn_error_as_unhealthy: true
poll_url_retry_limit: 3
poll_url_retry_interval: 2
...
.. note::
``{nodename}`` can be used to query a URL implemented by an
application running on each node. This requires that the OpenStack cloud
is set up to automatically register the name of new server instances with
the DNS service. In the future, support for a new expansion parameter for
node IP addresses may be added.
Once such a policy object is attached to a cluster, Senlin registers the
cluster to the *health manager* engine for failure detection, i.e., node
health checking. A thread is created to periodically make a GET request on the
specified URL. ``poll_url_conn_error_as_unhealthy`` specifies the behavior if
the URL is unreachable. A node is considered healthy if the response to the GET
request includes the string specified by ``poll_url_healthy_response``. If it
does not, Senlin will retry the URL health check for the number of times
specified by ``poll_url_retry_limit`` while waiting the number of seconds in
``poll_url_retry_interval`` between each retry. If the URL response still does
not contain the expected string after the retries, the node is considered
unhealthy.
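The URL polling flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Senlin's code: all function names are hypothetical, the retry limit is read as "one initial check plus that many retries", and an unreachable URL is treated either as an immediately healthy result or as a failed attempt depending on the connection-error flag.

```python
import time
import urllib.request
from typing import Callable


def expand_poll_url(poll_url: str, node_name: str) -> str:
    """Substitute the only supported expansion parameter, {nodename}."""
    return poll_url.replace("{nodename}", node_name)


def http_get(url: str, timeout: float = 10.0) -> str:
    """Issue a GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def poll_url_is_healthy(
    url: str,
    healthy_response: str,
    retry_limit: int,
    retry_interval: float,
    conn_error_as_unhealthy: bool = True,
    fetch: Callable[[str], str] = http_get,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the response contains the expected string.

    Assumption: one initial check is followed by up to retry_limit retries.
    """
    attempts = 1 + retry_limit
    for attempt in range(attempts):
        try:
            body = fetch(url)
        except OSError:
            # urllib reports unreachable hosts and HTTP error statuses as
            # URLError, which is a subclass of OSError.
            if not conn_error_as_unhealthy:
                return True
            body = None
        if body is not None and healthy_response in body:
            return True
        if attempt < attempts - 1:
            sleep(retry_interval)
    # The expected string never appeared: the node is unhealthy.
    return False
```

The ``fetch`` and ``sleep`` parameters are injectable so the logic can be exercised without a network or real delays, for example ``poll_url_is_healthy(expand_poll_url("http://{nodename}/healthstatus", "node-1"), "passing", 3, 2)``.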
Listening to Event Notifications
--------------------------------
@ -151,7 +194,7 @@ to listen to event notifications, users can attach their cluster(s) a health
policy which looks like the following example::
type: senlin.policy.health
version: 1.0
version: 1.1
properties:
detection:
type: LIFECYCLE_EVENTS

View File

@ -41,24 +41,37 @@ A typical spec for a health policy looks like the following example:
.. literalinclude :: /../../examples/policies/health_policy_poll.yaml
:language: yaml
There are two groups of properties (``detection`` and ``recovery``), each of
There are two groups of properties (``detection`` and ``recovery``), each of
which provides information related to the failure detection and the failure
recovery aspect respectively.
For failure detection, you can specify one of the following two values:
For failure detection, you can specify a detection mode that can be one of the
following three values:
- ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
manager service) is expected to poll every node periodically to
find out if they are "alive" or not.
- ``NODE_STATUS_POLL_URL``: Senlin engine (more specifically, the health
manager service) is expected to poll the specified URL periodically to
find out if a node is considered healthy or not.
- ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
message queue when configured. Senlin engine is expected to listen to these
events and react to them appropriately.
Both detection types can carry an optional map of ``options``. When the
detection type is set to "``NODE_STATUS_POLLING``", for example, you can
specify a value for ``interval`` property to customize the frequency at which
your cluster nodes are polled.
It is possible to combine ``NODE_STATUS_POLLING`` and ``NODE_STATUS_POLL_URL``
detections by specifying multiple detection modes. In the case of multiple
detection modes, Senlin engine tries each detection type in the order
specified. The behavior of a failed health check in the case of multiple
detection modes is specified using ``recovery_conditional``.
``LIFECYCLE_EVENTS`` cannot be combined with any other detection type.
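How the results of several detection modes might be combined can be sketched as below. The function name is hypothetical, and the two conditional values ``ANY_FAILED`` and ``ALL_FAILED`` are assumptions made for illustration; check the policy schema for the values ``recovery_conditional`` actually accepts.

```python
from typing import Sequence


def needs_recovery(
    check_results: Sequence[bool],
    recovery_conditional: str = "ANY_FAILED",
) -> bool:
    """Decide whether a node should be recovered.

    check_results holds one entry per detection mode, in the order the modes
    were specified: True for a passed health check, False for a failed one.
    The conditional values used here are assumed, not taken from the source.
    """
    failed = [not healthy for healthy in check_results]
    if recovery_conditional == "ANY_FAILED":
        # A single failed detection mode is enough to trigger recovery.
        return any(failed)
    if recovery_conditional == "ALL_FAILED":
        # Every detection mode must fail before the node is recovered.
        return bool(failed) and all(failed)
    raise ValueError(f"unsupported recovery_conditional: {recovery_conditional}")
```

For a node that passes status polling but fails the URL check, ``needs_recovery([True, False], "ANY_FAILED")`` requests recovery while ``needs_recovery([True, False], "ALL_FAILED")`` does not.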
All detection types can carry an optional map of ``options``. When the
detection type is set to "``NODE_STATUS_POLL_URL``", for example, you can
set the ``poll_url`` property to the URL to be used for
health checking.
As the policy type implementation stabilizes, more options may be added later.
@ -106,3 +119,10 @@ Snapshots
There have been some requirements to take snapshots of a node before recovery
so that the recovered node(s) will resume from where they failed. This feature
is also on the TODO list for the development team.
References
~~~~~~~~~~
For more detailed information on how the health policy works, please check
:doc:`Health Policy V1.1 <../../contributor/policies/health_v1>`.

View File

@ -1,12 +1,13 @@
# Sample health policy based on VM lifecycle events
type: senlin.policy.health
version: 1.0
version: 1.1
description: A policy for maintaining node health from a cluster.
properties:
detection:
# Type for health checking, valid values include:
# NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
type: LIFECYCLE_EVENTS
detection_modes:
# Type for health checking, valid values include:
# NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
- type: LIFECYCLE_EVENTS
recovery:
# Action that can be retried on a failed node, will improve to

View File

@ -1,16 +1,16 @@
# Sample health policy based on node health checking
type: senlin.policy.health
version: 1.0
version: 1.1
description: A policy for maintaining node health from a cluster.
properties:
detection:
# Type for health checking, valid values include:
# NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
type: NODE_STATUS_POLLING
# Number of seconds between two adjacent checking
interval: 600
options:
# Number of seconds between two adjacent checking
interval: 600
detection_modes:
# Type for health checking, valid values include:
# NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
- type: NODE_STATUS_POLLING
recovery:
# Action that can be retried on a failed node, will improve to

View File

@ -1,16 +1,17 @@
type: senlin.policy.health
version: 1.0
version: 1.1
description: A policy for maintaining node health by polling a URL
properties:
detection:
type: NODE_STATUS_POLL_URL
options:
interval: 120
poll_url: "http://myhealthservice/health/node/{nodename}"
poll_url_healthy_response: "passing"
poll_url_retry_limit: 3
poll_url_retry_interval: 2
node_update_timeout: 240
interval: 120
node_update_timeout: 240
detection_modes:
- type: NODE_STATUS_POLL_URL
options:
poll_url: "http://myhealthservice/health/node/{nodename}"
poll_url_healthy_response: "passing"
poll_url_retry_limit: 3
poll_url_retry_interval: 2
recovery:
actions:
- name: RECREATE