Update docs and examples for health policy v1.1
The documentation and example health policies need to be updated to reflect
health policy v1.1 schema changes.

Change-Id: Ie447397b001fba9798025370cdb0087e679fdfe4
Closes-Bug: #1834516
parent bc12179412
commit ebefa20ca2
@@ -13,7 +13,7 @@


 ==================
-Health Policy V1.0
+Health Policy V1.1
 ==================

 The health policy is designed to automate the failure detection and recovery
@@ -74,7 +74,10 @@ The current vision is that the health policy will support following types
 of failure detection:

 * ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
-  and check if there are nodes inactive.
+  and checks if there are nodes inactive.

+* ``NODE_STATUS_POLL_URL``: the *health manager* periodically polls a URL
+  and checks if a node is considered healthy based on the response.
+
 * ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
   sent by the backend service (e.g. nova-compute).
@@ -104,27 +107,19 @@ periodically check the resource status by querying the backend service and see
 if the resource is active. Below is a sample configuration::

   type: senlin.policy.health
-  version: 1.0
+  version: 1.1
   properties:
     detection:
-      type: NODE_STATUS_POLLING
-      options:
-        internal: 120
-
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLLING
     ...

-**NOTE**: The current polling logic is only about checking with the backend
-service whether a resource is in "ACTIVE" status. However, in future, this may
-get extended to having the *health manager* ping the IP address of a nova
-server or posting a "GET" request to a specific URL. We believe such
-extensions can better reveal whether a specific node is operating.
-
 Once such a policy object is attached to a cluster, Senlin registers the
 cluster to the *health manager* engine for failure detection, i.e., node
-health checking. A thread is created to issue a ``CLUSTER_CHECK`` RPC request
-to the ``senlin-engine`` periodically at the specified interval. The
-``CLUSTER_CHECK`` action only refreshes the status of each and every node in
-the cluster.
+health checking. A thread is created to periodically call Nova to check the
+status of the node. If the server status is ERROR, SHUTOFF or DELETED, the node
+is considered unhealthy.

 When one of the ``senlin-engine`` services is restarted, a new *health manager*
 engine will be launched. This new engine will check the database and see if
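The status rule added in the hunk above (a node is unhealthy when its server status is ERROR, SHUTOFF or DELETED) can be illustrated with a minimal sketch. The function names are hypothetical and are not taken from the Senlin code base; the real check queries Nova, whereas this sketch takes the status strings as plain input.

```python
# Server statuses that mark a node as unhealthy, as listed in the updated text.
UNHEALTHY_STATUSES = frozenset({"ERROR", "SHUTOFF", "DELETED"})


def is_node_healthy(server_status):
    """Return False when the server status is one of the unhealthy values."""
    return server_status not in UNHEALTHY_STATUSES


def find_unhealthy_nodes(statuses_by_node):
    """Given {node_name: server_status}, return the unhealthy node names, sorted."""
    return sorted(
        name
        for name, status in statuses_by_node.items()
        if not is_node_healthy(status)
    )
```

A periodic poller would call something like `find_unhealthy_nodes` once per `interval` seconds and hand the result to the recovery logic.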
@@ -133,6 +128,54 @@ health status maintained by a *health manager* that is no longer alive. The
 new *health manager* will pick up these clusters for health management.


+Polling Node URL
+----------------
+
+The health check for a node can also be configured to periodically query a
+URL with the ``NODE_STATUS_POLL_URL`` detection type. The URL can optionally
+contain expansion parameters. Expansion parameters are strings enclosed in {}
+that will be substituted with the node-specific value by Senlin prior to
+querying the URL. The only valid expansion parameter at this point is
+``{nodename}``. This expansion parameter will be replaced with the name of the
+Senlin node. Below is a sample configuration::
+
+
+  type: senlin.policy.health
+  version: 1.1
+  properties:
+    detection:
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLL_URL
+          options:
+            poll_url: "http://{nodename}/healthstatus"
+            poll_url_healthy_response: "passing"
+            poll_url_conn_error_as_unhealthy: true
+            poll_url_retry_limit: 3
+            poll_url_retry_interval: 2
+    ...
+
+
+.. note::
+    ``{nodename}`` can be used to query a URL implemented by an
+    application running on each node. This requires that the OpenStack cloud
+    is set up to automatically register the name of new server instances with
+    the DNS service. In the future, support for a new expansion parameter for
+    node IP addresses may be added.
+
+Once such a policy object is attached to a cluster, Senlin registers the
+cluster to the *health manager* engine for failure detection, i.e., node
+health checking. A thread is created to periodically make a GET request on the
+specified URL. ``poll_url_conn_error_as_unhealthy`` specifies the behavior if
+the URL is unreachable. A node is considered healthy if the response to the GET
+request includes the string specified by ``poll_url_healthy_response``. If it
+does not, Senlin will retry the URL health check for the number of times
+specified by ``poll_url_retry_limit`` while waiting the number of seconds in
+``poll_url_retry_interval`` between each retry. If the URL response still does
+not contain the expected string after the retries, the node is considered
+unhealthy.
+
+
 Listening to Event Notifications
 --------------------------------

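The URL polling behavior described in the hunk above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Senlin implementation: `expand_poll_url`, `check_url_health` and the injected `fetch` callable are hypothetical names, the first request is counted separately from the retries, and a connection error is decided immediately by the `poll_url_conn_error_as_unhealthy` setting rather than retried.

```python
import time


def expand_poll_url(poll_url, node_name):
    """Substitute the {nodename} expansion parameter with the node's name."""
    return poll_url.replace("{nodename}", node_name)


def check_url_health(fetch, url, healthy_response, retry_limit, retry_interval,
                     conn_error_as_unhealthy=True, sleep=time.sleep):
    """Return True if the node behind ``url`` should be treated as healthy.

    ``fetch`` is a hypothetical callable that performs the GET request and
    returns the response body as text; it raises ConnectionError when the
    URL is unreachable.
    """
    retries = 0
    while True:
        try:
            body = fetch(url)
        except ConnectionError:
            # Assumption: an unreachable URL is decided by the option alone.
            return not conn_error_as_unhealthy
        if healthy_response in body:
            return True
        if retries >= retry_limit:
            # The expected string never appeared, so the node is unhealthy.
            return False
        retries += 1
        sleep(retry_interval)
```

With the sample options above, `expand_poll_url("http://{nodename}/healthstatus", "node-1")` gives the URL to poll, and `check_url_health` would be called with `healthy_response="passing"`, `retry_limit=3` and `retry_interval=2`. Passing `fetch` and `sleep` as parameters keeps the sketch testable without a network.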
@@ -151,7 +194,7 @@ to listen to event notifications, users can attach their cluster(s) a health
 policy which looks like the following example::

   type: senlin.policy.health
-  version: 1.0
+  version: 1.1
   properties:
     detection:
       type: LIFECYCLE_EVENTS

@@ -41,24 +41,37 @@ A typical spec for a health policy looks like the following example:
 .. literalinclude :: /../../examples/policies/health_policy_poll.yaml
   :language: yaml

-There are two groups of properties (``detection`` and ``recovery``), each of
+There are two groups of properties (``detection`` and ``recovery``), each of
 which provides information related to the failure detection and the failure
 recovery aspect respectively.

-For failure detection, you can specify one of the following two values:
+For failure detection, you can specify a detection mode that can be one of the
+following three values:

 - ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
   manager service) is expected to poll each and every node periodically to
   find out if they are "alive" or not.

+- ``NODE_STATUS_POLL_URL``: Senlin engine (more specifically, the health
+  manager service) is expected to poll the specified URL periodically to
+  find out if a node is considered healthy or not.
+
 - ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
   message queue when configured. Senlin engine is expected to listen to these
   events and react to them appropriately.

-Both detection types can carry an optional map of ``options``. When the
-detection type is set to "``NODE_STATUS_POLLING``", for example, you can
-specify a value for ``interval`` property to customize the frequency at which
-your cluster nodes are polled.
+It is possible to combine ``NODE_STATUS_POLLING`` and ``NODE_STATUS_POLL_URL``
+detections by specifying multiple detection modes. In the case of multiple
+detection modes, Senlin engine tries each detection type in the order
+specified. The behavior of a failed health check in the case of multiple
+detection modes is specified using ``recovery_conditional``.
+
+``LIFECYCLE_EVENTS`` cannot be combined with any other detection type.
+
+All detection types can carry an optional map of ``options``. When the
+detection type is set to "``NODE_STATUS_POLL_URL``", for example, you can
+specify a value for the ``poll_url`` property to set the URL to be used for
+health checking.

 As the policy type implementation stabilizes, more options may be added later.

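The combination rules in the hunk above can be sketched as two small helpers. Both function names are hypothetical, and the two ``recovery_conditional`` values used here (``ANY_FAILED`` and ``ALL_FAILED``) are an assumption, since the text names the property but not its values.

```python
KNOWN_TYPES = {"NODE_STATUS_POLLING", "NODE_STATUS_POLL_URL", "LIFECYCLE_EVENTS"}


def validate_detection_modes(modes):
    """Return the detection types in order, or raise ValueError on a bad mix."""
    types = [mode["type"] for mode in modes]
    unknown = [t for t in types if t not in KNOWN_TYPES]
    if unknown:
        raise ValueError("unknown detection type(s): " + ", ".join(unknown))
    if "LIFECYCLE_EVENTS" in types and len(types) > 1:
        raise ValueError(
            "LIFECYCLE_EVENTS cannot be combined with any other detection type")
    return types


def needs_recovery(check_results, recovery_conditional="ANY_FAILED"):
    """Decide on recovery from per-mode results (True means the check passed).

    ``check_results`` holds one boolean per detection mode, in the order the
    modes were specified. The condition names are assumed, not from the text.
    """
    failed = [not passed for passed in check_results]
    if recovery_conditional == "ANY_FAILED":
        return any(failed)
    if recovery_conditional == "ALL_FAILED":
        return all(failed)
    raise ValueError("unsupported recovery_conditional: " + recovery_conditional)
```

For example, with a status poll that passes and a URL poll that fails, `needs_recovery([True, False])` triggers recovery under the assumed ``ANY_FAILED`` condition but not under ``ALL_FAILED``.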
@@ -106,3 +119,10 @@ Snapshots
 There have been some requirements to take snapshots of a node before recovery
 so that the recovered node(s) will resume from where they failed. This feature
 is also on the TODO list for the development team.
+
+
+References
+~~~~~~~~~~
+
+For more detailed information on how the health policy works, please check
+:doc:`Health Policy V1.1 <../../contributor/policies/health_v1>`

@@ -1,12 +1,13 @@
 # Sample health policy based on VM lifecycle events
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
   detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: LIFECYCLE_EVENTS
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: LIFECYCLE_EVENTS

   recovery:
     # Action that can be retried on a failed node, will improve to

@@ -1,16 +1,16 @@
 # Sample health policy based on node health checking
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
   detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: NODE_STATUS_POLLING
+    # Number of seconds between two consecutive checks
+    interval: 600

-    options:
-      # Number of seconds between two adjacent checking
-      interval: 600
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: NODE_STATUS_POLLING

   recovery:
     # Action that can be retried on a failed node, will improve to

@@ -1,16 +1,17 @@
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health by polling a URL
 properties:
   detection:
-    type: NODE_STATUS_POLL_URL
-    options:
-      interval: 120
-      poll_url: "http://myhealthservice/health/node/{nodename}"
-      poll_url_healthy_response: "passing"
-      poll_url_retry_limit: 3
-      poll_url_retry_interval: 2
-      node_update_timeout: 240
+    interval: 120
+    node_update_timeout: 240
+    detection_modes:
+      - type: NODE_STATUS_POLL_URL
+        options:
+          poll_url: "http://myhealthservice/health/node/{nodename}"
+          poll_url_healthy_response: "passing"
+          poll_url_retry_limit: 3
+          poll_url_retry_interval: 2
   recovery:
     actions:
       - name: RECREATE