Update docs and examples for health policy v1.1

The documentation and example health policies need to be updated to
reflect health policy v1.1 schema changes.

Change-Id: Ie447397b001fba9798025370cdb0087e679fdfe4
Closes-Bug: #1834516
Duc Truong 2019-06-28 20:33:04 +00:00
parent bc12179412
commit ebefa20ca2
5 changed files with 109 additions and 44 deletions


@@ -13,7 +13,7 @@
 ==================
-Health Policy V1.0
+Health Policy V1.1
 ==================
 The health policy is designed to automate the failure detection and recovery
@@ -74,7 +74,10 @@ The current vision is that the health policy will support following types
 of failure detection:

-* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
-  and check if there are nodes inactive.
+* ``NODE_STATUS_POLLING``: the *health manager* periodically polls a cluster
+  and checks if there are nodes inactive.
+* ``NODE_STATUS_POLL_URL``: the *health manager* periodically polls a URL
+  and checks if a node is considered healthy based on the response.
 * ``LIFECYCLE_EVENTS``: the *health manager* listens to event notifications
   sent by the backend service (e.g. nova-compute).
@@ -104,27 +107,19 @@ periodically check the resource status by querying the backend service and see
 if the resource is active. Below is a sample configuration::

   type: senlin.policy.health
-  version: 1.0
+  version: 1.1
   properties:
     detection:
-      type: NODE_STATUS_POLLING
-      options:
-        internal: 120
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLLING
   ...

-**NOTE**: The current polling logic is only about checking with the backend
-service whether a resource is in "ACTIVE" status. However, in future, this may
-get extended to having the *health manager* ping the IP address of a nova
-server or posting a "GET" request to a specific URL. We believe such
-extensions can better reveal whether a specific node is operating.
-
 Once such a policy object is attached to a cluster, Senlin registers the
 cluster to the *health manager* engine for failure detection, i.e., node
-health checking. A thread is created to issue a ``CLUSTER_CHECK`` RPC request
-to the ``senlin-engine`` periodically at the specified interval. The
-``CLUSTER_CHECK`` action only refreshes the status of each and every node in
-the cluster.
+health checking. A thread is created to periodically call Nova to check the
+status of the node. If the server status is ERROR, SHUTOFF or DELETED, the
+node is considered unhealthy.

 When one of the ``senlin-engine`` services is restarted, a new *health manager*
 engine will be launched. This new engine will check the database and see if
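The status classification introduced in this hunk (ERROR, SHUTOFF or DELETED means unhealthy) can be sketched as a small Python helper. The function name and structure are illustrative only, not Senlin's actual implementation:

```python
# Sketch of the NODE_STATUS_POLLING decision described above: a node backed
# by a Nova server is unhealthy when the server status is one of these.
UNHEALTHY_STATUSES = {"ERROR", "SHUTOFF", "DELETED"}


def node_is_healthy(server_status: str) -> bool:
    """Return True unless the Nova server status marks the node unhealthy."""
    return server_status.upper() not in UNHEALTHY_STATUSES
```

In the real engine the status would come from a Nova API call made by the health manager thread at each ``interval``; here it is simply a parameter.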
@@ -133,6 +128,54 @@ health status maintained by a *health manager* that is no longer alive. The
 new *health manager* will pick up these clusters for health management.

+Polling Node URL
+----------------
+
+The health check for a node can also be configured to periodically query a
+URL with the ``NODE_STATUS_POLL_URL`` detection type. The URL can optionally
+contain expansion parameters. Expansion parameters are strings enclosed in
+``{}`` that are substituted with a node-specific value by Senlin prior to
+querying the URL. The only valid expansion parameter at this point is
+``{nodename}``, which is replaced with the name of the Senlin node. Below is
+a sample configuration::
+
+  type: senlin.policy.health
+  version: 1.1
+  properties:
+    detection:
+      interval: 120
+      detection_modes:
+        - type: NODE_STATUS_POLL_URL
+          options:
+            poll_url: "http://{nodename}/healthstatus"
+            poll_url_healthy_response: "passing"
+            poll_url_conn_error_as_unhealthy: true
+            poll_url_retry_limit: 3
+            poll_url_retry_interval: 2
+  ...
+
+.. note::
+
+   ``{nodename}`` can be used to query a URL implemented by an application
+   running on each node. This requires that the OpenStack cloud is set up to
+   automatically register the names of new server instances with the DNS
+   service. In the future, support for a new expansion parameter for node IP
+   addresses may be added.
+
+Once such a policy object is attached to a cluster, Senlin registers the
+cluster to the *health manager* engine for failure detection, i.e., node
+health checking. A thread is created to periodically make a GET request on
+the specified URL. ``poll_url_conn_error_as_unhealthy`` specifies the
+behavior if the URL is unreachable. A node is considered healthy if the
+response to the GET request includes the string specified by
+``poll_url_healthy_response``. If it does not, Senlin retries the health
+check the number of times specified by ``poll_url_retry_limit``, waiting
+``poll_url_retry_interval`` seconds between retries. If the URL response
+still does not contain the expected string after the retries, the node is
+considered unhealthy.

 Listening to Event Notifications
 --------------------------------
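The retry behavior of the URL health check described above can be sketched in Python. Here ``fetch`` is an injected stand-in for the HTTP GET so the logic stays self-contained; all names are illustrative, not Senlin's actual code:

```python
import time


def url_health_check(node_name, poll_url, healthy_response,
                     retry_limit, retry_interval, fetch,
                     conn_error_as_unhealthy=True):
    """Sketch of NODE_STATUS_POLL_URL: expand {nodename}, GET the URL via
    `fetch`, and retry before declaring the node unhealthy."""
    url = poll_url.replace("{nodename}", node_name)
    for attempt in range(retry_limit):
        try:
            body = fetch(url)
        except ConnectionError:
            if not conn_error_as_unhealthy:
                # Unreachable URL is not treated as a failure (assumption:
                # the check is then considered passing).
                return True
            body = ""
        if healthy_response in body:
            return True
        if attempt < retry_limit - 1:
            time.sleep(retry_interval)
    return False
```

A real implementation would use an HTTP client (e.g. ``requests``) for ``fetch``; injecting it keeps the retry logic testable.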
@@ -151,7 +194,7 @@ to listen to event notifications, users can attach their cluster(s) a health
 policy which looks like the following example::

   type: senlin.policy.health
-  version: 1.0
+  version: 1.1
   properties:
     detection:
       type: LIFECYCLE_EVENTS


@@ -41,24 +41,37 @@ A typical spec for a health policy looks like the following example:
 .. literalinclude :: /../../examples/policies/health_policy_poll.yaml
    :language: yaml

 There are two groups of properties (``detection`` and ``recovery``), each of
 which provides information related to the failure detection and the failure
 recovery aspect respectively.

-For failure detection, you can specify one of the following two values:
+For failure detection, you can specify a detection mode that can be one of
+the following values:

 - ``NODE_STATUS_POLLING``: Senlin engine (more specifically, the health
   manager service) is expected to poll each and every node periodically to
   find out if they are "alive" or not.
+- ``NODE_STATUS_POLL_URL``: Senlin engine (more specifically, the health
+  manager service) is expected to poll the specified URL periodically to
+  find out if a node is considered healthy or not.
 - ``LIFECYCLE_EVENTS``: Many services can emit notification messages on the
   message queue when configured. Senlin engine is expected to listen to these
   events and react to them appropriately.

-Both detection types can carry an optional map of ``options``. When the
-detection type is set to "``NODE_STATUS_POLLING``", for example, you can
-specify a value for ``interval`` property to customize the frequency at which
-your cluster nodes are polled.
+It is possible to combine ``NODE_STATUS_POLLING`` and ``NODE_STATUS_POLL_URL``
+detections by specifying multiple detection modes. In the case of multiple
+detection modes, Senlin engine tries each detection type in the order
+specified. The behavior of a failed health check in the case of multiple
+detection modes is specified using ``recovery_conditional``.
+
+``LIFECYCLE_EVENTS`` cannot be combined with any other detection type.
+
+All detection types can carry an optional map of ``options``. When the
+detection type is set to "``NODE_STATUS_POLL_URL``", for example, you can
+specify a value for the ``poll_url`` property to specify the URL to be used
+for health checking.

 As the policy type implementation stabilizes, more options may be added later.
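To illustrate combining the two polling modes, the fragment below sketches what a multi-mode spec might look like. It assumes ``recovery_conditional`` lives under ``detection`` and accepts a value such as ``ANY_FAILED`` (recover when any one mode reports failure); verify against the v1.1 schema before use:

```yaml
type: senlin.policy.health
version: 1.1
properties:
  detection:
    interval: 120
    # Assumed v1.1 option: trigger recovery when any one mode fails.
    recovery_conditional: ANY_FAILED
    detection_modes:
      - type: NODE_STATUS_POLLING
      - type: NODE_STATUS_POLL_URL
        options:
          poll_url: "http://myhealthservice/health/node/{nodename}"
          poll_url_healthy_response: "passing"
  recovery:
    actions:
      - name: RECREATE
```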
@@ -106,3 +119,10 @@ Snapshots
 There have been some requirements to take snapshots of a node before recovery
 so that the recovered node(s) will resume from where they failed. This feature
 is also on the TODO list for the development team.
+
+References
+~~~~~~~~~~
+
+For more detailed information on how the health policy works, please check
+:doc:`Health Policy V1.1 <../../contributor/policies/health_v1>`.


@@ -1,12 +1,13 @@
 # Sample health policy based on VM lifecycle events
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
   detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: LIFECYCLE_EVENTS
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: LIFECYCLE_EVENTS
   recovery:
     # Action that can be retried on a failed node, will improve to


@@ -1,16 +1,16 @@
 # Sample health policy based on node health checking
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health from a cluster.
 properties:
   detection:
-    # Type for health checking, valid values include:
-    # NODE_STATUS_POLLING, LB_STATUS_POLLING, LIFECYCLE_EVENTS
-    type: NODE_STATUS_POLLING
-    options:
-      # Number of seconds between two adjacent checking
-      interval: 600
+    # Number of seconds between two adjacent checks
+    interval: 600
+
+    detection_modes:
+      # Type for health checking, valid values include:
+      # NODE_STATUS_POLLING, NODE_STATUS_POLL_URL, LIFECYCLE_EVENTS
+      - type: NODE_STATUS_POLLING
   recovery:
     # Action that can be retried on a failed node, will improve to


@@ -1,16 +1,17 @@
 type: senlin.policy.health
-version: 1.0
+version: 1.1
 description: A policy for maintaining node health by polling a URL
 properties:
   detection:
-    type: NODE_STATUS_POLL_URL
-    options:
-      interval: 120
-      poll_url: "http://myhealthservice/health/node/{nodename}"
-      poll_url_healthy_response: "passing"
-      poll_url_retry_limit: 3
-      poll_url_retry_interval: 2
-      node_update_timeout: 240
+    interval: 120
+    node_update_timeout: 240
+    detection_modes:
+      - type: NODE_STATUS_POLL_URL
+        options:
+          poll_url: "http://myhealthservice/health/node/{nodename}"
+          poll_url_healthy_response: "passing"
+          poll_url_retry_limit: 3
+          poll_url_retry_interval: 2
   recovery:
     actions:
       - name: RECREATE