Update docs and examples for health policy v1.1

The documentation and example health policies need to be updated to reflect
health policy v1.1 schema changes.

Change-Id: Ie447397b001fba9798025370cdb0087e679fdfe4
Closes-Bug: #1834516
2019-06-28 20:33:04 +00:00 · 2019-06-28 20:33:04 +00:00 · ebefa20ca2
parent bc12179412
commit ebefa20ca2
5 changed files with 109 additions and 44 deletions
--- a/doc/source/contributor/policies/health_v1.rst
+++ b/doc/source/contributor/policies/health_v1.rst
@ -13,7 +13,7 @@


 ==================
-Health Policy V1.0
+Health Policy V1.1
 ==================

 The health policy is designed to automate the failure detection and recovery
@ -74,7 +74,10 @@ The current vision is that the health policy will support following types
 of failure detection:

 * ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
-  and check if there are nodes inactive.
+  and checks if there are nodes inactive.
+
+* ``NODE_STATUS_POLL_URL``: the *health manager* periodically polls a URL
+  and checks if a node is considered healthy based on the response.

 * ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
  sent by the backend service (e.g. nova-compute).
@ -104,27 +107,19 @@ periodically check the resource status by querying the backend service and see
 if the resource is active.  Below is a sample configuration::

  type: senlin.policy.health
-  version: 1.0
+  version: 1.1
  properties:
    detection:
-      type: NODE_STATUS_POLLING
-      options:
-        internal: 120
-
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLLING
    ...

-**NOTE**: The current polling logic is only about checking with the backend
-service whether a resource is in "ACTIVE" status. However, in future, this may
-get extended to having the *health manager* ping the IP address of a nova
-server or posting a "GET" request to a specific URL. We believe such
-extensions can better reveal whether a specific node is operating.
-
 Once such a policy object is attached to a cluster, Senlin registers the
 cluster to the *health manager* engine for failure detection, i.e., node
-health checking. A thread is created to issue a ``CLUSTER_CHECK`` RPC request
-to the ``senlin-engine`` periodically at the specified interval. The
-``CLUSTER_CHECK`` action only refreshes the status of each and every node in
-the cluster.
+health checking. A thread is created to periodically call Nova to check the
+status of the node. If the server status is ERROR, SHUTOFF or DELETED, the node
+is considered unhealthy.

 When one of the ``senlin-engine`` services is restarted, a new *health manager*
 engine will be launched. This new engine will check the database and see if
@ -133,6 +128,54 @@ health status maintained by a *health manager* that is no longer alive. The
 new *health manager* will pick up these clusters for health management.


+Polling Node URL
+----------------
+
+The health check for a node can also be configured to periodically query a
+URL with the ``NODE_STATUS_POLL_URL`` detection type. The URL can optionally
+contain expansion parameters.  Expansion parameters are strings enclosed in {}
+that will be substituted with the node-specific value by Senlin prior to
+querying the URL. The only valid expansion parameter at this point is
+``{nodename}``. This expansion parameter will be replaced with the name of the
+Senlin node. Below is a sample configuration::
+
+
+  type: senlin.policy.health
+  version: 1.1
+  properties:
+    detection:
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLL_URL
+          options:
+              poll_url: "http://{nodename}/healthstatus"
+              poll_url_healthy_response: "passing"
+              poll_url_conn_error_as_unhealthy: true
+              poll_url_retry_limit: 3
+              poll_url_retry_interval: 2
+    ...
+
+
+.. note::
+    ``{nodename}`` can be used to query a URL implemented by an
+    application running on each node.  This requires that the OpenStack cloud
+    is set up to automatically register the name of new server instances with
+    the DNS service. In the future, support for a new expansion parameter for
+    node IP addresses may be added.
+
+Once such a policy object is attached to a cluster, Senlin registers the
+cluster to the *health manager* engine for failure detection, i.e., node
+health checking. A thread is created to periodically make a GET request on the
+specified URL. ``poll_url_conn_error_as_unhealthy`` specifies the behavior if
+the URL is unreachable. A node is considered healthy if the response to the GET
+request includes the string specified by ``poll_url_healthy_response``.  If it
+does not, Senlin will retry the URL health check for the number of times
+specified by ``poll_url_retry_limit`` while waiting the number of seconds in
+``poll_url_retry_interval`` between each retry.  If the URL response still does
+not contain the expected string after the retries, the node is considered
+unhealthy.
+
+
 Listening to Event Notifications
 --------------------------------

@ -151,7 +194,7 @@ to listen to event notifications, users can attach their cluster(s) a health
 policy which looks like the following example::

  type: senlin.policy.health
-  version: 1.0
+  version: 1.1
  properties:
    detection:
      type: LIFECYCLE_EVENTS
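
The ``NODE_STATUS_POLL_URL`` behavior documented in the health_v1.rst changes above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Senlin's implementation: `expand_poll_url`, `node_is_healthy`, and the injected `fetch` and `sleep` callables are hypothetical names. The sketch also assumes one initial request plus up to `poll_url_retry_limit` retries, and that a connection error ends the check immediately with the outcome selected by `poll_url_conn_error_as_unhealthy`; the documentation does not spell out either detail.

```python
import time


def expand_poll_url(poll_url, node_name):
    """Substitute {nodename}, the only expansion parameter the docs describe."""
    return poll_url.replace("{nodename}", node_name)


def node_is_healthy(fetch, poll_url, node_name, healthy_response,
                    retry_limit=3, retry_interval=2,
                    conn_error_as_unhealthy=True, sleep=time.sleep):
    """Return True if the response body contains the expected string.

    fetch is a hypothetical callable that performs a GET request on a URL
    and returns the body as text, raising OSError if the URL is unreachable.
    """
    url = expand_poll_url(poll_url, node_name)
    # Assumption: one initial attempt, then up to retry_limit retries.
    for attempt in range(retry_limit + 1):
        if attempt > 0:
            sleep(retry_interval)  # wait between retries
        try:
            body = fetch(url)
        except OSError:
            # Assumption: an unreachable URL is not retried; the flag
            # decides whether the node counts as unhealthy.
            return not conn_error_as_unhealthy
        if healthy_response in body:
            return True
    # The expected string never appeared: the node is unhealthy.
    return False
```

Injecting `fetch` and `sleep` keeps the sketch runnable without network access; a real caller would pass a function that issues the GET request.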
--- a/doc/source/user/policy_types/health.rst
+++ b/doc/source/user/policy_types/health.rst
@ -41,24 +41,37 @@ A typical spec for a health policy looks like the following example:
 .. literalinclude :: /../../examples/policies/health_policy_poll.yaml
  :language: yaml

-There are two groups of properties (``detection`` and ``recovery``),  each of
+There are two groups of properties (``detection`` and ``recovery``), each of
 which provides information related to the failure detection and the failure
 recovery aspect respectively.

-For failure detection, you can specify one of the following two values:
+For failure detection, you can specify a detection mode that can be one of the
+following three values:

 - ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
  manager service) is expected to poll each and every nodes periodically to
  find out if they are "alive" or not.

+- ``NODE_STATUS_POLL_URL``: Senlin engine (more specifically, the health
+  manager service) is expected to poll the specified URL periodically to
+  find out if a node is considered healthy or not.
+
 - ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
  message queue when configured. Senlin engine is expected to listen to these
  events and react to them appropriately.

-Both detection types can carry an optional map of ``options``. When the
-detection type is set to "``NODE_STATUS_POLLING``", for example, you can
-specify a value for ``interval`` property to customize the frequency at which
-your cluster nodes are polled.
+It is possible to combine ``NODE_STATUS_POLLING`` and ``NODE_STATUS_POLL_URL``
+detections by specifying multiple detection modes. In the case of multiple
+detection modes, Senlin engine tries each detection type in the order
+specified. The behavior of a failed health check in the case of multiple
+detection modes is specified using ``recovery_conditional``.
+
+``LIFECYCLE_EVENTS`` cannot be combined with any other detection type.
+
+All detection types can carry an optional map of ``options``. When the
+detection type is set to "``NODE_STATUS_POLL_URL``", for example, you can
+set the ``poll_url`` property to specify the URL to be used for health
+checking.

 As the policy type implementation stabilizes, more options may be added later.

@ -106,3 +119,10 @@ Snapshots
 There have been some requirements to take snapshots of a node before recovery
 so that the recovered node(s) will resume from where they failed. This feature
 is also on the TODO list for the development team.
+
+
+References
+~~~~~~~~~~
+
+For more detailed information on how the health policy works, please check
+:doc:`Health Policy V1.1 <../../contributor/policies/health_v1>`.
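
The health.rst changes above say that multiple detection modes are tried in the order specified and that ``recovery_conditional`` decides how failed checks are handled. A rough Python sketch of that decision follows. It is illustrative only: `needs_recovery` and the `checks` callables are hypothetical, and the values `ANY_FAILED` and `ALL_FAILED` are assumed here because this change does not list the values ``recovery_conditional`` accepts.

```python
def needs_recovery(checks, node, recovery_conditional="ANY_FAILED"):
    """Decide whether a node should be recovered.

    checks is an ordered list of hypothetical callables, one per detection
    mode, each returning True when that mode considers the node healthy.
    """
    if recovery_conditional not in ("ANY_FAILED", "ALL_FAILED"):
        raise ValueError("unsupported recovery_conditional: %r"
                         % (recovery_conditional,))
    if not checks:
        return False  # nothing was checked, so nothing failed
    for check in checks:  # modes are tried in the order specified
        healthy = check(node)
        if recovery_conditional == "ANY_FAILED" and not healthy:
            return True   # a single failed mode is enough
        if recovery_conditional == "ALL_FAILED" and healthy:
            return False  # one passing mode rules out recovery
    # ANY_FAILED: no mode failed. ALL_FAILED: every mode failed.
    return recovery_conditional == "ALL_FAILED"
```

For example, with a status check that passes and a URL check that fails, the sketch returns True under `ANY_FAILED` and False under `ALL_FAILED`.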
--- a/examples/policies/health_policy_event.yaml
+++ b/examples/policies/health_policy_event.yaml
@ -1,12 +1,13 @@
 # Sample health policy based on VM lifecycle events
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
  detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: LIFECYCLE_EVENTS
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: LIFECYCLE_EVENTS

  recovery:
    # Action that can be retried on a failed node, will improve to
--- a/examples/policies/health_policy_poll.yaml
+++ b/examples/policies/health_policy_poll.yaml
@ -1,16 +1,16 @@
 # Sample health policy based on node health checking
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
  detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: NODE_STATUS_POLLING
+    # Number of seconds between two consecutive checks
+    interval: 600

-    options:
-      # Number of seconds between two adjacent checking
-      interval: 600
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: NODE_STATUS_POLLING

  recovery:
    # Action that can be retried on a failed node, will improve to
--- a/examples/policies/health_policy_poll_url.yaml
+++ b/examples/policies/health_policy_poll_url.yaml
@ -1,16 +1,17 @@
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health by polling a URL
 properties:
  detection:
-    type: NODE_STATUS_POLL_URL
-    options:
-      interval: 120
-      poll_url: "http://myhealthservice/health/node/{nodename}"
-      poll_url_healthy_response: "passing"
-      poll_url_retry_limit: 3
-      poll_url_retry_interval: 2
-      node_update_timeout: 240
+    interval: 120
+    node_update_timeout: 240
+    detection_modes:
+      - type: NODE_STATUS_POLL_URL
+        options:
+          poll_url: "http://myhealthservice/health/node/{nodename}"
+          poll_url_healthy_response: "passing"
+          poll_url_retry_limit: 3
+          poll_url_retry_interval: 2
  recovery:
    actions:
      - name: RECREATE