Update docs and examples for health policy v1.1
The documentation and example health policies need to be updated to reflect the health policy v1.1 schema changes.

Change-Id: Ie447397b001fba9798025370cdb0087e679fdfe4
Closes-Bug: #1834516
parent bc12179412
commit ebefa20ca2
|
@ -13,7 +13,7 @@
|
||||||
|
|
||||||
|
|
||||||
==================
|
==================
|
||||||
Health Policy V1.0
|
Health Policy V1.1
|
||||||
==================
|
==================
|
||||||
|
|
||||||
The health policy is designed to automate the failure detection and recovery
|
The health policy is designed to automate the failure detection and recovery
|
||||||
|
@ -74,7 +74,10 @@ The current vision is that the health policy will support following types
|
||||||
of failure detection:
|
of failure detection:
|
||||||
|
|
||||||
* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
|
* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
|
||||||
and check if there are nodes inactive.
|
and checks if there are nodes inactive.
|
||||||
|
|
||||||
|
* ``NODE_STATUS_POLL_URL``: the *health manager* periodically polls a URL
|
||||||
|
and checks if a node is considered healthy based on the response.
|
||||||
|
|
||||||
* ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
|
* ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
|
||||||
sent by the backend service (e.g. nova-compute).
|
sent by the backend service (e.g. nova-compute).
|
||||||
|
@ -104,27 +107,19 @@ periodically check the resource status by querying the backend service and see
|
||||||
if the resource is active. Below is a sample configuration::
|
if the resource is active. Below is a sample configuration::
|
||||||
|
|
||||||
type: senlin.policy.health
|
type: senlin.policy.health
|
||||||
version: 1.0
|
version: 1.1
|
||||||
properties:
|
properties:
|
||||||
detection:
|
detection:
|
||||||
type: NODE_STATUS_POLLING
|
interval: 120
|
||||||
options:
|
detection_modes:
|
||||||
internal: 120
|
- type: NODE_STATUS_POLLING
|
||||||
|
|
||||||
...
|
...
|
||||||
|
|
||||||
**NOTE**: The current polling logic is only about checking with the backend
|
|
||||||
service whether a resource is in "ACTIVE" status. However, in future, this may
|
|
||||||
get extended to having the *health manager* ping the IP address of a nova
|
|
||||||
server or posting a "GET" request to a specific URL. We believe such
|
|
||||||
extensions can better reveal whether a specific node is operating.
|
|
||||||
|
|
||||||
Once such a policy object is attached to a cluster, Senlin registers the
|
Once such a policy object is attached to a cluster, Senlin registers the
|
||||||
cluster to the *health manager* engine for failure detection, i.e., node
|
cluster to the *health manager* engine for failure detection, i.e., node
|
||||||
health checking. A thread is created to issue a ``CLUSTER_CHECK`` RPC request
|
health checking. A thread is created to periodically call Nova to check the
|
||||||
to the ``senlin-engine`` periodically at the specified interval. The
|
status of the node. If the server status is ERROR, SHUTOFF or DELETED, the node
|
||||||
``CLUSTER_CHECK`` action only refreshes the status of each and every node in
|
is considered unhealthy.
|
||||||
the cluster.
|
|
||||||
|
|
||||||
When one of the ``senlin-engine`` services is restarted, a new *health manager*
|
When one of the ``senlin-engine`` services is restarted, a new *health manager*
|
||||||
engine will be launched. This new engine will check the database and see if
|
engine will be launched. This new engine will check the database and see if
|
||||||
|
@ -133,6 +128,54 @@ health status maintained by a *health manager* that is no longer alive. The
|
||||||
new *health manager* will pick up these clusters for health management.
|
new *health manager* will pick up these clusters for health management.
|
||||||
|
|
||||||
|
|
||||||
|
Polling Node URL
|
||||||
|
----------------
|
||||||
|
|
||||||
|
The health check for a node can also be configured to periodically query a
|
||||||
|
URL with the ``NODE_STATUS_POLL_URL`` detection type. The URL can optionally
|
||||||
|
contain expansion parameters. Expansion parameters are strings enclosed in {}
|
||||||
|
that will be substituted with the node specific value by Senlin prior to
|
||||||
|
querying the URL. The only valid expansion parameter at this point is
|
||||||
|
``{nodename}``. This expansion parameter will be replaced with the name of the
|
||||||
|
Senlin node. Below is a sample configuration::
|
||||||
|
|
||||||
|
|
||||||
|
type: senlin.policy.health
|
||||||
|
version: 1.1
|
||||||
|
properties:
|
||||||
|
detection:
|
||||||
|
interval: 120
|
||||||
|
detection_modes:
|
||||||
|
- type: NODE_STATUS_POLL_URL
|
||||||
|
options:
|
||||||
|
poll_url: "http://{nodename}/healthstatus"
|
||||||
|
poll_url_healthy_response: "passing"
|
||||||
|
poll_url_conn_error_as_unhealthy: true
|
||||||
|
poll_url_retry_limit: 3
|
||||||
|
poll_url_retry_interval: 2
|
||||||
|
...
|
||||||
|
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
``{nodename}`` can be used to query a URL implemented by an
|
||||||
|
application running on each node. This requires that the OpenStack cloud
|
||||||
|
is set up to automatically register the name of new server instances with
|
||||||
|
the DNS service. In the future support for a new expansion parameter for
|
||||||
|
node IP addresses may be added.
|
||||||
|
|
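The ``{nodename}`` substitution described above can be illustrated with a minimal Python sketch. The function name ``expand_poll_url`` and the node name ``node-001`` are hypothetical and only serve this illustration; they are not taken from the Senlin code base.

```python
def expand_poll_url(poll_url: str, node_name: str) -> str:
    """Substitute the {nodename} expansion parameter in a poll URL.

    Hypothetical helper for illustration only, not Senlin's implementation.
    """
    # Plain string replacement leaves any other braces in the URL untouched.
    return poll_url.replace("{nodename}", node_name)


print(expand_poll_url("http://{nodename}/healthstatus", "node-001"))
# http://node-001/healthstatus
```

Using ``str.replace`` rather than ``str.format`` is a deliberate choice in this sketch: a URL that contains other braces would otherwise raise an error.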
||||||
|
Once such a policy object is attached to a cluster, Senlin registers the
|
||||||
|
cluster to the *health manager* engine for failure detection, i.e., node
|
||||||
|
health checking. A thread is created to periodically make a GET request on the
|
||||||
|
specified URL. ``poll_url_conn_error_as_unhealthy`` specifies the behavior if
|
||||||
|
the URL is unreachable. A node is considered healthy if the response to the GET
|
||||||
|
request includes the string specified by ``poll_url_healthy_response``. If it
|
||||||
|
does not, Senlin will retry the URL health check for the number of times
|
||||||
|
specified by ``poll_url_retry_limit`` while waiting the number of seconds in
|
||||||
|
``poll_url_retry_interval`` between each retry. If the URL response still does
|
||||||
|
not contain the expected string after the retries, the node is considered
|
||||||
|
unhealthy.
|
||||||
|
|
||||||
|
|
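The URL polling behavior described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the Senlin implementation: the function names are hypothetical, ``retry_limit`` is treated as the total number of attempts, and a connection error is assumed to end the check immediately with the outcome chosen by ``conn_error_as_unhealthy``.

```python
import time
import urllib.request
from typing import Callable


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Issue a GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def node_is_healthy(
    url: str,
    healthy_response: str,
    retry_limit: int = 3,
    retry_interval: float = 2.0,
    conn_error_as_unhealthy: bool = True,
    fetch: Callable[[str], str] = fetch_body,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the node behind ``url`` is considered healthy.

    Hypothetical sketch; parameter names mirror the policy options.
    """
    for attempt in range(retry_limit):
        try:
            body = fetch(url)
        except OSError:
            # urllib.error.URLError is a subclass of OSError.
            # Assumption: an unreachable URL ends the check right away.
            return not conn_error_as_unhealthy
        if healthy_response in body:
            return True
        # Wait before the next attempt, but not after the last one.
        if attempt < retry_limit - 1:
            sleep(retry_interval)
    # The expected string never appeared: the node is unhealthy.
    return False
```

The ``fetch`` and ``sleep`` parameters exist only so the sketch can be exercised without a network or real delays, for example ``node_is_healthy("http://node-001/healthstatus", "passing", fetch=lambda u: "passing")``.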
||||||
Listening to Event Notifications
|
Listening to Event Notifications
|
||||||
--------------------------------
|
--------------------------------
|
||||||
|
|
||||||
|
@ -151,7 +194,7 @@ to listen to event notifications, users can attach their cluster(s) a health
|
||||||
policy which looks like the following example::
|
policy which looks like the following example::
|
||||||
|
|
||||||
type: senlin.policy.health
|
type: senlin.policy.health
|
||||||
version: 1.0
|
version: 1.1
|
||||||
properties:
|
properties:
|
||||||
detection:
|
detection:
|
||||||
type: LIFECYCLE_EVENTS
|
type: LIFECYCLE_EVENTS
|
||||||
|
|
|
@ -41,24 +41,37 @@ A typical spec for a health policy looks like the following example:
|
||||||
.. literalinclude :: /../../examples/policies/health_policy_poll.yaml
|
.. literalinclude :: /../../examples/policies/health_policy_poll.yaml
|
||||||
:language: yaml
|
:language: yaml
|
||||||
|
|
||||||
There are two groups of properties (``detection`` and ``recovery``), each of
|
There are two groups of properties (``detection`` and ``recovery``), each of
|
||||||
which provides information related to the failure detection and the failure
|
which provides information related to the failure detection and the failure
|
||||||
recovery aspect respectively.
|
recovery aspect respectively.
|
||||||
|
|
||||||
For failure detection, you can specify one of the following two values:
|
For failure detection, you can specify a detection mode that can be one of the
|
||||||
|
following three values:
|
||||||
|
|
||||||
- ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
|
- ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
|
||||||
manager service) is expected to poll each and every node periodically to
|
manager service) is expected to poll each and every node periodically to
|
||||||
find out if they are "alive" or not.
|
find out if they are "alive" or not.
|
||||||
|
|
||||||
|
- ``NODE_STATUS_POLL_URL``: Senlin engine (more specifically, the health
|
||||||
|
manager service) is expected to poll the specified URL periodically to
|
||||||
|
find out if a node is considered healthy or not.
|
||||||
|
|
||||||
- ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
|
- ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
|
||||||
message queue when configured. Senlin engine is expected to listen to these
|
message queue when configured. Senlin engine is expected to listen to these
|
||||||
events and react to them appropriately.
|
events and react to them appropriately.
|
||||||
|
|
||||||
Both detection types can carry an optional map of ``options``. When the
|
It is possible to combine ``NODE_STATUS_POLLING`` and ``NODE_STATUS_POLL_URL``
|
||||||
detection type is set to "``NODE_STATUS_POLLING``", for example, you can
|
detections by specifying multiple detection modes. In the case of multiple
|
||||||
specify a value for ``interval`` property to customize the frequency at which
|
detection modes, Senlin engine tries each detection type in the order
|
||||||
your cluster nodes are polled.
|
specified. The behavior of a failed health check in the case of multiple
|
||||||
|
detection modes is specified using ``recovery_conditional``.
|
||||||
|
|
||||||
|
``LIFECYCLE_EVENTS`` cannot be combined with any other detection type.
|
||||||
|
|
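How several detection modes might combine into a single recovery decision can be sketched in Python as below. This is an assumption-laden illustration: the function name ``needs_recovery`` is hypothetical, and the ``recovery_conditional`` values ``ANY_FAILED`` and ``ALL_FAILED`` are assumed here rather than stated in the text above.

```python
from typing import Callable, Sequence


def needs_recovery(
    checks: Sequence[Callable[[], bool]],
    recovery_conditional: str = "ANY_FAILED",
) -> bool:
    """Decide whether a node should be recovered.

    Each entry in ``checks`` stands for one detection mode and returns
    True when the node looks healthy. Hypothetical sketch only.
    """
    # Run the checks in the order in which they were specified.
    failed = [not check() for check in checks]
    if recovery_conditional == "ANY_FAILED":
        return any(failed)
    if recovery_conditional == "ALL_FAILED":
        # With no checks configured there is nothing to recover from.
        return bool(failed) and all(failed)
    raise ValueError(f"unknown recovery_conditional: {recovery_conditional}")
```

For example, with one passing status poll and one failing URL poll, ``needs_recovery([lambda: True, lambda: False], "ANY_FAILED")`` evaluates to ``True``, while the same checks under ``"ALL_FAILED"`` evaluate to ``False``.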
||||||
|
All detection types can carry an optional map of ``options``. When the
|
||||||
|
detection type is set to "``NODE_STATUS_POLL_URL``", for example, you can
|
||||||
|
set the ``poll_url`` property to specify the URL to be used for
|
||||||
|
health checking.
|
||||||
|
|
||||||
As the policy type implementation stabilizes, more options may be added later.
|
As the policy type implementation stabilizes, more options may be added later.
|
||||||
|
|
||||||
|
@ -106,3 +119,10 @@ Snapshots
|
||||||
There have been some requirements to take snapshots of a node before recovery
|
There have been some requirements to take snapshots of a node before recovery
|
||||||
so that the recovered node(s) will resume from where they failed. This feature
|
so that the recovered node(s) will resume from where they failed. This feature
|
||||||
is also on the TODO list for the development team.
|
is also on the TODO list for the development team.
|
||||||
|
|
||||||
|
|
||||||
|
References
|
||||||
|
~~~~~~~~~~
|
||||||
|
|
||||||
|
For more detailed information on how the health policy works, please check
|
||||||
|
:doc:`Health Policy V1.1 <../../contributor/policies/health_v1>`
|
|
@ -1,12 +1,13 @@
|
||||||
# Sample health policy based on VM lifecycle events
|
# Sample health policy based on VM lifecycle events
|
||||||
type: senlin.policy.health
|
type: senlin.policy.health
|
||||||
version: 1.0
|
version: 1.1
|
||||||
description: A policy for maintaining node health from a cluster.
|
description: A policy for maintaining node health from a cluster.
|
||||||
properties:
|
properties:
|
||||||
detection:
|
detection:
|
||||||
# Type for health checking, valid values include:
|
detection_modes:
|
||||||
# NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
|
# Type for health checking, valid values include:
|
||||||
type: LIFECYCLE_EVENTS
|
# NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
|
||||||
|
- type: LIFECYCLE_EVENTS
|
||||||
|
|
||||||
recovery:
|
recovery:
|
||||||
# Action that can be retried on a failed node, will improve to
|
# Action that can be retried on a failed node, will improve to
|
||||||
|
|
|
@ -1,16 +1,16 @@
|
||||||
# Sample health policy based on node health checking
|
# Sample health policy based on node health checking
|
||||||
type: senlin.policy.health
|
type: senlin.policy.health
|
||||||
version: 1.0
|
version: 1.1
|
||||||
description: A policy for maintaining node health from a cluster.
|
description: A policy for maintaining node health from a cluster.
|
||||||
properties:
|
properties:
|
||||||
detection:
|
detection:
|
||||||
# Type for health checking, valid values include:
|
# Number of seconds between two adjacent checking
|
||||||
# NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
|
interval: 600
|
||||||
type: NODE_STATUS_POLLING
|
|
||||||
|
|
||||||
options:
|
detection_modes:
|
||||||
# Number of seconds between two adjacent checking
|
# Type for health checking, valid values include:
|
||||||
interval: 600
|
# NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
|
||||||
|
- type: NODE_STATUS_POLLING
|
||||||
|
|
||||||
recovery:
|
recovery:
|
||||||
# Action that can be retried on a failed node, will improve to
|
# Action that can be retried on a failed node, will improve to
|
||||||
|
|
|
@ -1,16 +1,17 @@
|
||||||
type: senlin.policy.health
|
type: senlin.policy.health
|
||||||
version: 1.0
|
version: 1.1
|
||||||
description: A policy for maintaining node health by polling a URL
|
description: A policy for maintaining node health by polling a URL
|
||||||
properties:
|
properties:
|
||||||
detection:
|
detection:
|
||||||
type: NODE_STATUS_POLL_URL
|
interval: 120
|
||||||
options:
|
node_update_timeout: 240
|
||||||
interval: 120
|
detection_modes:
|
||||||
poll_url: "http://myhealthservice/health/node/{nodename}"
|
- type: NODE_STATUS_POLL_URL
|
||||||
poll_url_healthy_response: "passing"
|
options:
|
||||||
poll_url_retry_limit: 3
|
poll_url: "http://myhealthservice/health/node/{nodename}"
|
||||||
poll_url_retry_interval: 2
|
poll_url_healthy_response: "passing"
|
||||||
node_update_timeout: 240
|
poll_url_retry_limit: 3
|
||||||
|
poll_url_retry_interval: 2
|
||||||
recovery:
|
recovery:
|
||||||
actions:
|
actions:
|
||||||
- name: RECREATE
|
- name: RECREATE
|
||||||
|
|