senlin

History

policies	Add node poll url detection type to health policy	2018-06-26 00:56:19 +00:00
profiles	Fix grammar error	2018-02-11 13:57:56 +08:00

Duc Truong 309fefa68c Add node poll url detection type to health policy

The current implementation of health policy is able to check node health
either using lifecycle events sent by Nova or by actively polling the
node status using the Nova driver.  However, neither of those detection
types will detect that a node is unhealthy because the compute node on
which the node is running went down.  In that case Nova will still
report the node status as active even though the node is no longer
running.

This change adds a new detection type that actively polls the node
health using a URL specified in the health policy.  That way the user
can integrate Senlin health policy with another custom or third-party
health check service.  Using this external health check service, the
health policy will be able to detect the failure scenario in which the
compute node hosting the node goes down.

When a health policy with NODE_STATUS_POLL_URL is attached to a cluster,
the health manager performs a GET operation on the URL. The URL
specified in NODE_STATUS_POLL_URL can contain expansion parameters that
are resolved by the health manager before the GET operation is executed.
The only valid expansion parameter at this time is {nodename}.
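
The expansion step described above can be pictured with a minimal
sketch.  The function name expand_poll_url and the example URL are
hypothetical and are not taken from the Senlin code; only the
{nodename} parameter comes from this change.

```python
def expand_poll_url(url_template: str, node_name: str) -> str:
    """Resolve the {nodename} expansion parameter in a poll URL.

    Hypothetical helper for illustration only. str.replace is used
    instead of str.format so that any other braces in the URL are
    left untouched rather than raising an error.
    """
    return url_template.replace("{nodename}", node_name)


if __name__ == "__main__":
    # The URL below is a made-up example of an external health service.
    print(expand_poll_url("http://health.example/status/{nodename}", "node-1"))
```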

If the response body of the GET operation contains the string specified
by poll_url_healthy_response in the health policy, the node is
considered to be healthy.  If it does not contain that string, the
health manager will retry the GET operation the number of times
specified in poll_url_retry_limit while sleeping the time interval
specified in poll_url_retry_interval between each retry.  If the
response body of each of those GET operations still does not contain the
poll_url_healthy_response string, the node is considered to be
unhealthy.
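
The check-and-retry behaviour above could look roughly like the sketch
below.  It is an illustration under stated assumptions, not the health
manager's code: the names is_node_healthy, http_get, fetch and sleep
are hypothetical, and it assumes one initial GET followed by up to
poll_url_retry_limit retries, which is one literal reading of the
description.

```python
import time
import urllib.request
from typing import Callable


def http_get(url: str, timeout: float = 10.0) -> str:
    """Return the response body of a GET request as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_node_healthy(url: str,
                    healthy_response: str,
                    retry_limit: int,
                    retry_interval: float,
                    fetch: Callable[[str], str] = http_get,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url and report whether the healthy string ever appears.

    Assumption: one initial attempt plus retry_limit retries, sleeping
    retry_interval seconds before each retry.
    """
    if healthy_response in fetch(url):
        return True
    for _ in range(retry_limit):
        sleep(retry_interval)
        if healthy_response in fetch(url):
            return True
    # Every response lacked the healthy string: treat the node as unhealthy.
    return False
```

Passing fetch and sleep as parameters keeps the sketch testable without
a real HTTP endpoint or real delays.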

Once a node is determined to be unhealthy, the health manager will
attempt to recover the node using the recovery type specified in the
health policy.  In the case when a compute node was shut down, the
deletion of an existing node will never finish.  If node_force_recreate
is set to True in the health policy's recovery options, the health
manager will wait for the time interval in node_delete_timeout before
continuing to create the node even if the node deletion failed due to
timeout.
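
The force-recreate behaviour can be sketched as follows.  This is a
simplified illustration, not the server profile's implementation: the
function recover_by_recreate and its callback parameters are
hypothetical, and only the node_delete_timeout and node_force_recreate
semantics are taken from the description above.

```python
import time
from typing import Callable


def recover_by_recreate(delete_node: Callable[[], None],
                        create_node: Callable[[], None],
                        is_deleted: Callable[[], bool],
                        delete_timeout: float,
                        force_recreate: bool,
                        poll_interval: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> bool:
    """Delete a node, wait up to delete_timeout seconds, then recreate it.

    Returns True if create_node was called. If the deletion times out,
    the node is recreated anyway only when force_recreate is True.
    """
    delete_node()
    deadline = clock() + delete_timeout
    while not is_deleted():
        if clock() >= deadline:
            if not force_recreate:
                # Deletion timed out and forcing is disabled: give up.
                return False
            # Deletion timed out but forcing is enabled: continue.
            break
        sleep(poll_interval)
    create_node()
    return True
```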

The detailed changes include:
* Add new options to health policy related to node poll url detection
type
* Update health manager to support node status polling by URL
* Update server profile to handle force_recreate option
* Add supported status to health policy
* Add new configuration value health_check_interval_min

The senlin documentation and senlin-tempest-plugin will be updated to
include the health policy detection type in separate patch sets.

Change-Id: Iaf9174b4a372df6fc38935a67648229bfb1ebc16

2018-06-26 00:56:19 +00:00