Update docs and examples for health policy v1.1

The documentation and example health policies need to be updated to
reflect health policy v1.1 schema changes.

Change-Id: Ie447397b001fba9798025370cdb0087e679fdfe4
Closes-Bug: #1834516
Duc Truong 2019-06-28 20:33:04 +00:00
parent bc12179412
commit ebefa20ca2
5 changed files with 109 additions and 44 deletions


@@ -13,7 +13,7 @@
 ==================
-Health Policy V1.0
+Health Policy V1.1
 ==================
 The health policy is designed to automate the failure detection and recovery
@@ -74,7 +74,10 @@ The current vision is that the health policy will support following types
 of failure detection:

-* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
-  and check if there are nodes inactive.
+* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
+  and checks if there are nodes inactive.
+* ``NODE_STATUS_POLL_URL``: the *health manager* periodically polls a URL
+  and checks if a node is considered healthy based on the response.
 * ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
   sent by the backend service (e.g. nova-compute).
@@ -104,27 +107,19 @@ periodically check the resource status by querying the backend service and see
 if the resource is active. Below is a sample configuration::

   type: senlin.policy.health
-  version: 1.0
+  version: 1.1
   properties:
     detection:
-      type: NODE_STATUS_POLLING
-      options:
-        internal: 120
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLLING
   ...

-**NOTE**: The current polling logic is only about checking with the backend
-service whether a resource is in "ACTIVE" status. However, in future, this may
-get extended to having the *health manager* ping the IP address of a nova
-server or posting a "GET" request to a specific URL. We believe such
-extensions can better reveal whether a specific node is operating.
-
 Once such a policy object is attached to a cluster, Senlin registers the
 cluster to the *health manager* engine for failure detection, i.e., node
-health checking. A thread is created to issue a ``CLUSTER_CHECK`` RPC request
-to the ``senlin-engine`` periodically at the specified interval. The
-``CLUSTER_CHECK`` action only refreshes the status of each and every node in
-the cluster.
+health checking. A thread is created to periodically call Nova to check the
+status of the node. If the server status is ERROR, SHUTOFF or DELETED, the
+node is considered unhealthy.

 When one of the ``senlin-engine`` services is restarted, a new *health manager*
 engine will be launched. This new engine will check the database and see if
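The status classification introduced in this hunk (ERROR, SHUTOFF or DELETED means unhealthy) can be sketched as a small Python helper. The function name and structure are illustrative only, not Senlin's actual implementation:

```python
# Sketch of the NODE_STATUS_POLLING decision described above: a node backed
# by a Nova server is unhealthy when the server status is one of these.
UNHEALTHY_STATUSES = {"ERROR", "SHUTOFF", "DELETED"}


def node_is_healthy(server_status: str) -> bool:
    """Return True unless the Nova server status marks the node unhealthy."""
    return server_status.upper() not in UNHEALTHY_STATUSES
```

In the real engine the status would come from a Nova API call made by the health manager thread at each ``interval``; here it is simply a parameter.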
@@ -133,6 +128,54 @@ health status maintained by a *health manager* that is no longer alive. The
 new *health manager* will pick up these clusters for health management.

+Polling Node URL
+----------------
+
+The health check for a node can also be configured to periodically query a
+URL with the ``NODE_STATUS_POLL_URL`` detection type. The URL can optionally
+contain expansion parameters. Expansion parameters are strings enclosed in
+``{}`` that are substituted with a node-specific value by Senlin prior to
+querying the URL. The only valid expansion parameter at this point is
+``{nodename}``, which is replaced with the name of the Senlin node. Below is
+a sample configuration::
+
+  type: senlin.policy.health
+  version: 1.1
+  properties:
+    detection:
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLL_URL
+          options:
+            poll_url: "http://{nodename}/healthstatus"
+            poll_url_healthy_response: "passing"
+            poll_url_conn_error_as_unhealthy: true
+            poll_url_retry_limit: 3
+            poll_url_retry_interval: 2
+  ...
+
+.. note::
+
+   ``{nodename}`` can be used to query a URL implemented by an application
+   running on each node. This requires that the OpenStack cloud is set up to
+   automatically register the names of new server instances with the DNS
+   service. In the future, support for a new expansion parameter for node IP
+   addresses may be added.
+
+Once such a policy object is attached to a cluster, Senlin registers the
+cluster to the *health manager* engine for failure detection, i.e., node
+health checking. A thread is created to periodically make a GET request on
+the specified URL. ``poll_url_conn_error_as_unhealthy`` specifies the
+behavior if the URL is unreachable. A node is considered healthy if the
+response to the GET request includes the string specified by
+``poll_url_healthy_response``. If it does not, Senlin retries the health
+check the number of times specified by ``poll_url_retry_limit``, waiting
+``poll_url_retry_interval`` seconds between retries. If the URL response
+still does not contain the expected string after the retries, the node is
+considered unhealthy.

 Listening to Event Notifications
 --------------------------------
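The retry behavior of the URL health check described above can be sketched in Python. Here ``fetch`` is an injected stand-in for the HTTP GET so the logic stays self-contained; all names are illustrative, not Senlin's actual code:

```python
import time


def url_health_check(node_name, poll_url, healthy_response,
                     retry_limit, retry_interval, fetch,
                     conn_error_as_unhealthy=True):
    """Sketch of NODE_STATUS_POLL_URL: expand {nodename}, GET the URL via
    `fetch`, and retry before declaring the node unhealthy."""
    url = poll_url.replace("{nodename}", node_name)
    for attempt in range(retry_limit):
        try:
            body = fetch(url)
        except ConnectionError:
            if not conn_error_as_unhealthy:
                # Unreachable URL is not treated as a failure (assumption:
                # the check is then considered passing).
                return True
            body = ""
        if healthy_response in body:
            return True
        if attempt < retry_limit - 1:
            time.sleep(retry_interval)
    return False
```

A real implementation would use an HTTP client (e.g. ``requests``) for ``fetch``; injecting it keeps the retry logic testable.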
@@ -151,7 +194,7 @@ to listen to event notifications, users can attach their cluster(s) a health
 policy which looks like the following example::

   type: senlin.policy.health
-  version: 1.0
+  version: 1.1
   properties:
     detection:
       type: LIFECYCLE_EVENTS


@@ -41,24 +41,37 @@ A typical spec for a health policy looks like the following example:
 .. literalinclude :: /../../examples/policies/health_policy_poll.yaml
    :language: yaml

 There are two groups of properties (``detection`` and ``recovery``), each of
 which provides information related to the failure detection and the failure
 recovery aspect respectively.

-For failure detection, you can specify one of the following two values:
+For failure detection, you can specify a detection mode that can be one of
+the following values:

 - ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
   manager service) is expected to poll each and every node periodically to
   find out if they are "alive" or not.
+- ``NODE_STATUS_POLL_URL``: Senlin engine (more specifically, the health
+  manager service) is expected to poll the specified URL periodically to
+  find out if a node is considered healthy or not.
 - ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
   message queue when configured. Senlin engine is expected to listen to these
   events and react to them appropriately.

-Both detection types can carry an optional map of ``options``. When the
-detection type is set to "``NODE_STATUS_POLLING``", for example, you can
-specify a value for ``interval`` property to customize the frequency at which
-your cluster nodes are polled.
+It is possible to combine ``NODE_STATUS_POLLING`` and ``NODE_STATUS_POLL_URL``
+detections by specifying multiple detection modes. In the case of multiple
+detection modes, Senlin engine tries each detection type in the order
+specified. The behavior of a failed health check in the case of multiple
+detection modes is specified using ``recovery_conditional``.
+
+``LIFECYCLE_EVENTS`` cannot be combined with any other detection type.
+
+All detection types can carry an optional map of ``options``. When the
+detection type is set to "``NODE_STATUS_POLL_URL``", for example, you can
+specify a value for the ``poll_url`` property to specify the URL to be used
+for health checking.

 As the policy type implementation stabilizes, more options may be added later.
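To illustrate combining the two polling modes, the fragment below sketches what a multi-mode spec might look like. It assumes ``recovery_conditional`` lives under ``detection`` and accepts a value such as ``ANY_FAILED`` (recover when any one mode reports failure); verify against the v1.1 schema before use:

```yaml
type: senlin.policy.health
version: 1.1
properties:
  detection:
    interval: 120
    # Assumed v1.1 option: trigger recovery when any one mode fails.
    recovery_conditional: ANY_FAILED
    detection_modes:
      - type: NODE_STATUS_POLLING
      - type: NODE_STATUS_POLL_URL
        options:
          poll_url: "http://myhealthservice/health/node/{nodename}"
          poll_url_healthy_response: "passing"
  recovery:
    actions:
      - name: RECREATE
```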
@@ -106,3 +119,10 @@ Snapshots
 There have been some requirements to take snapshots of a node before recovery
 so that the recovered node(s) will resume from where they failed. This feature
 is also on the TODO list for the development team.
+
+References
+~~~~~~~~~~
+
+For more detailed information on how the health policy works, please check
+:doc:`Health Policy V1.1 <../../contributor/policies/health_v1>`.


@@ -1,12 +1,13 @@
 # Sample health policy based on VM lifecycle events
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
   detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: LIFECYCLE_EVENTS
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: LIFECYCLE_EVENTS
   recovery:
     # Action that can be retried on a failed node, will improve to


@@ -1,16 +1,16 @@
 # Sample health policy based on node health checking
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
   detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: NODE_STATUS_POLLING
-    options:
-      # Number of seconds between two adjacent checking
-      interval: 600
+    # Number of seconds between two adjacent checks
+    interval: 600
+
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: NODE_STATUS_POLLING
   recovery:
     # Action that can be retried on a failed node, will improve to


@@ -1,16 +1,17 @@
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health by polling a URL
 properties:
   detection:
-    type: NODE_STATUS_POLL_URL
-    options:
-      interval: 120
-      poll_url: "http://myhealthservice/health/node/{nodename}"
-      poll_url_healthy_response: "passing"
-      poll_url_retry_limit: 3
-      poll_url_retry_interval: 2
-      node_update_timeout: 240
+    interval: 120
+    node_update_timeout: 240
+    detection_modes:
+      - type: NODE_STATUS_POLL_URL
+        options:
+          poll_url: "http://myhealthservice/health/node/{nodename}"
+          poll_url_healthy_response: "passing"
+          poll_url_retry_limit: 3
+          poll_url_retry_interval: 2
   recovery:
     actions:
       - name: RECREATE