Merge "Add an API to manually mark a resource as unhealthy"

2016-02-29 16:57:56 +00:00 · 2016-02-29 16:57:56 +00:00 · b3c15e6fb6
commit b3c15e6fb6
parent d87a12c91a a1c3eecbdd
1 changed files with 201 additions and 0 deletions
--- a/specs/mitaka/mark-unhealthy.rst
+++ b/specs/mitaka/mark-unhealthy.rst
@@ -0,0 +1,201 @@
 ..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.
 http://creativecommons.org/licenses/by/3.0/legalcode
 ..
 This template should be in ReSTructured text. The filename in the git
 repository should match the launchpad URL, for example a URL of
 https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
 awesome-thing.rst .  Please do not delete any of the sections in this
 template.  If you have nothing to say for a whole section, just write: None
 For help with syntax, see http://sphinx-doc.org/rest.html
 To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html
 ==========================
 Mark Unhealthy Resources
 ==========================
 https://blueprints.launchpad.net/heat/+spec/mark-unhealthy
 Add an interface to allow the user to communicate information about the health
 of a resource that Heat cannot determine on its own.
 Problem description
 ===================
 The only mechanism that Heat has for evaluating the health of a resource is to
 compare its properties against the output of the relevant OpenStack API. (This
 happens via the stack-check command in the current architecture, but will be
 automatic on updates in the proposed Phase 2 of the Convergence architecture.)
 However, there may exist resources that the user (or application) knows are
 unhealthy where Heat has no way of determining that. The obvious example is a
 server which is running as far as Nova is concerned but is, in point of fact,
 borked as far as the application is concerned.
 Currently there is no way for a user (or application) to replace such a
 resource without performing multiple orchestration passes, renaming the
 resource, or both. Both are undesirable, and this leaves the user unable to take
 advantage of Heat's ability to correctly replace a resource as part of a single
 workflow.
 Proposed change
 ===============
 Add a PATCH handler to the Resource endpoint::
  /stacks/<stack_name>/<stack_id>/resources/<resource_id>
 The PATCH method will accept a JSON body of the form::
  {
    "mark_unhealthy": <bool>,
    "resource_status_reason": <string>
  }
 For legacy stacks, this call will fail if it cannot acquire the stack lock. For
 Convergence (phase 1) stacks, the call will fail if it cannot acquire the
 resource lock. This failure mode will be indicated by raising an
 ActionInProgress exception in the engine, which manifests as a 409 Conflict
 response to the ReST API request.
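As a rough illustration of the call from the client side, the sketch below builds (without sending) a PATCH request of the shape described above, using only the Python standard library. The base endpoint URL, the function names and the argument names are assumptions made for this example; only the URL path, the two body fields and the 409 Conflict response come from this spec. ``X-Auth-Token`` is the usual OpenStack authentication header.

```python
import json
import urllib.request
from typing import Optional


def build_mark_unhealthy_request(
    endpoint: str,
    stack_name: str,
    stack_id: str,
    resource_id: str,
    unhealthy: bool,
    reason: Optional[str] = None,
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for the resource endpoint.

    `endpoint` is a hypothetical base URL for the Heat API, supplied by
    the caller; nothing here contacts a server.
    """
    url = "{}/stacks/{}/{}/resources/{}".format(
        endpoint.rstrip("/"), stack_name, stack_id, resource_id)

    # Only the two fields named in the spec are ever sent.
    body = {"mark_unhealthy": unhealthy}
    if reason is not None:
        body["resource_status_reason"] = reason

    headers = {"Content-Type": "application/json"}
    if token is not None:
        headers["X-Auth-Token"] = token

    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="PATCH",
    )


def is_lock_conflict(status_code: int) -> bool:
    """True if the response indicates the stack/resource lock was held."""
    return status_code == 409
```

A caller that receives a 409 would presumably wait for the in-progress operation to finish and then retry; the retry policy is left to the client and is not specified here.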
 Upon receipt of this call, Heat will put the resource into the CHECK_FAILED
 state if the 'mark_unhealthy' field is true. If the field is false, Heat will
 put the resource in the CHECK_COMPLETE state if it was in the CHECK_FAILED
 state; otherwise it will make no change.
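The state transition just described can be written as a small pure function. This is a minimal sketch, assuming a resource state is represented as an ``(action, status)`` pair; the function name and that representation are illustrative and not taken from the engine code.

```python
from typing import Tuple

State = Tuple[str, str]  # (action, status), e.g. ("CREATE", "COMPLETE")

CHECK_FAILED: State = ("CHECK", "FAILED")
CHECK_COMPLETE: State = ("CHECK", "COMPLETE")


def next_state(current: State, mark_unhealthy: bool) -> State:
    """Return the state a resource should be in after the PATCH call."""
    if mark_unhealthy:
        # Marking unhealthy always results in CHECK_FAILED.
        return CHECK_FAILED
    if current == CHECK_FAILED:
        # Marking healthy only undoes a previous CHECK_FAILED.
        return CHECK_COMPLETE
    # Any other state is left untouched.
    return current
```

Note that marking a resource healthy never clears a failure from another action: a resource in CREATE_FAILED, for example, stays in CREATE_FAILED.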
 Presence of any other fields or a missing 'mark_unhealthy' field will trigger
 an Invalid Request error.
 The 'resource_status_reason' field is optional. If present, its value will be
 used as the status_reason for the status change; otherwise an appropriate
 default message will be recorded to indicate that the state change was due to
 the resource being explicitly marked unhealthy.
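The validation rules for the request body could look roughly like the following. This is a sketch under stated assumptions: the ``InvalidRequest`` exception class, the wording of the default message, and the strict boolean type check are all hypothetical choices for this example and are not specified above.

```python
from typing import Any, Dict, Tuple

ALLOWED_FIELDS = {"mark_unhealthy", "resource_status_reason"}

# Hypothetical wording; the spec only asks for "an appropriate default".
DEFAULT_REASON = "Resource explicitly marked unhealthy by user"


class InvalidRequest(ValueError):
    """Stand-in for whatever error the API layer maps to Invalid Request."""


def parse_patch_body(body: Dict[str, Any]) -> Tuple[bool, str]:
    """Validate the PATCH body and return (mark_unhealthy, reason)."""
    unknown = set(body) - ALLOWED_FIELDS
    if unknown:
        raise InvalidRequest(
            "Unexpected fields: {}".format(", ".join(sorted(unknown))))
    if "mark_unhealthy" not in body:
        raise InvalidRequest("Missing required field: mark_unhealthy")

    flag = body["mark_unhealthy"]
    # Assumption: only genuine JSON booleans are accepted.
    if not isinstance(flag, bool):
        raise InvalidRequest("mark_unhealthy must be a boolean")

    reason = body.get("resource_status_reason", DEFAULT_REASON)
    return flag, reason
```

Rejecting unknown fields up front keeps the PATCH body free for future operations, in line with the assumption that they will not be combined with this one in a single call.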
 It is assumed that should any future additional operations be added using the
 PATCH verb on a resource, it will be invalid for them to occur in the same call
 as this one. As such, the RPC call will have a specific mark_unhealthy_resource
 call rather than a general patch_resource call.
 Change the _needs_update() method of the StackResource and RemoteStack resource
 types, such that the resource is replaced on update if it is in the
 CHECK_FAILED state.  A user who wants to manually force replacement of a
 *member* of a nested stack (as opposed to the nested stack itself) should mark
 the member(s) as unhealthy rather than the stack itself.  Resources of any
 other type that are in a FAILED state already will be replaced on a subsequent
 stack update, regardless of the action (CHECK or otherwise), and this applies
 equally to both legacy and convergence stacks.
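To make the intended behaviour of the nested-stack types concrete, here is a simplified, self-contained sketch. It does not reproduce Heat's actual ``_needs_update()`` signature; the class, the two-argument method and the ``UpdateReplace`` exception defined here are stand-ins for however the engine signals that a resource must be replaced.

```python
class UpdateReplace(Exception):
    """Stand-in signal meaning 'replace this resource instead of updating'."""


class NestedStackLikeResource:
    """Toy model of a StackResource/RemoteStack for illustration only."""

    def __init__(self, action: str, status: str) -> None:
        self.action = action
        self.status = status

    def _needs_update(self, new_props: dict, old_props: dict) -> bool:
        # Proposed change: a nested stack marked CHECK_FAILED is replaced
        # on the next update rather than updated in place.
        if (self.action, self.status) == ("CHECK", "FAILED"):
            raise UpdateReplace("resource is in CHECK_FAILED state")
        # Otherwise an update is needed only if the properties changed.
        return new_props != old_props
```

In this model a healthy nested stack with unchanged properties is left alone, while one in CHECK_FAILED is always replaced, whether or not its properties changed.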
 Modify the InstanceGroup (and, by extension, Heat and AWS AutoscalingGroup)
 types to give members in a FAILED state the highest priority for being removed
 when scaling down or being updated in a rolling update. Currently, FAILED
 resources are omitted when building a new template for the scaling group, so
 any such resources would never be replaced by one of the same name. This change
 will allow for continuity of naming in the case of a change that doesn't
 permanently remove the resource due to scaling down.
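The removal priority described above amounts to a sort that puts FAILED members first. The sketch below assumes a hypothetical ``Member`` record and, as a tie-break that this spec does not prescribe, orders the remaining members oldest first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    name: str
    status: str          # e.g. "COMPLETE" or "FAILED"
    created_order: int   # hypothetical: lower means created earlier


def removal_order(members: List[Member]) -> List[Member]:
    """Order members so that FAILED ones are removed or updated first."""
    # False sorts before True, so FAILED members come first; the second
    # key element is an assumed oldest-first tie-break.
    return sorted(
        members, key=lambda m: (m.status != "FAILED", m.created_order))


def pick_for_removal(members: List[Member], count: int) -> List[Member]:
    """Choose `count` members to drop when scaling down."""
    return removal_order(members)[:count]
```

With this ordering, scaling a group down by one while it contains a FAILED member removes that member rather than a healthy one.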
 Once bug 1508736 is fixed there should be no further need to make any change to
 ResourceGroup. However, note that ResourceGroup and InstanceGroup both use the
 same grouputils.get_members() function that filters out failed members, so the
 modifications above may require changes to ResourceGroup to maintain the same
 behaviour.
 Alternatives
 ------------
 It might appear desirable to have a single call to both mark the resource as
 unhealthy and initiate a stack update with the existing template and
 environment. However, it is better to keep the API calls orthogonal, as the
 user may want to make other changes to the stack at the same time. It also
 considerably simplifies implementation and testing.
 We could add a separate healthy=False column to the database instead of
 re-using CHECK_FAILED, but given that this is effectively a way of manually
 providing information that is not available to stack-check, it makes sense to
 re-use the same state. It also simplifies the logic in the engine, as we
 already check for a FAILED state in many places, so re-using this state should
 result in Heat just doing the Right Thing without having to add multiple checks
 for another field.
 An earlier version of this proposal suggested using a SOAP-style POST request
 to a "mark_unhealthy" action endpoint, rather than a PATCH request to the
 resource.  This is consistent with how many OpenStack APIs operate today, but
 is widely regarded as a non-ReSTful abomination. The currently `proposed
 guidelines`_ of the API working group suggest a single "actions" endpoint for
 POST requests of this type, where the body would be of the form::
  {
    "name": "mark_unhealthy",
    "args": {
      "unhealthy": <bool>,
      "resource_status_reason": <string>
    }
  }
 However, this proposal is still controversial (and has been described, not
 inaccurately, in reviews as "SOAP in ReST clothing"). The main driver behind it
 seems to be a belief that projects will be unwilling to implement a fully
 ReSTful interface like that proposed here.
 .. _proposed guidelines: https://review.openstack.org/#/c/234994/
 We could re-use the existing signal API instead of adding a new endpoint.
 However, that would mean a mix of responsibility in handling signals between
 the resource plugin (which is responsible today) and Heat (since this new
 proposal is independent of resource type). It would be more consistent with the
 currently-proposed API guidelines; it's arguable whether that is a good thing
 or not, since those recommendations are still very much liable to change.
 Alternatively, we could make this a call on a stack (rather than an individual
 resource), so that the user can mark multiple resources unhealthy with a single
 call. One downside of this is that it requires the resource identifier to be
 included in the body of the request rather than the URL, so it could end up
 harder than it needs to be to include in e.g. Mistral workflows. It's also less
 logical from a ReST perspective, and complicates error handling and reporting.
 We can always add this later if it really turns out to be required.
 Instead of defining a particular state transition, we could allow the user to
 set arbitrary resource states. This is a giant can of worms.
 This proposal is an alternative to the one presented in
 https://review.openstack.org/#/c/212205/ which involved mechanisms to place the
 member IDs of various types of scaling groups under user control. This proposal
 is both more generic and more relevant to the future convergence plans than
 that one.
 Implementation
 ==============
 Assignee(s)
 -----------
 Primary assignee:
  ahmed-h-elkhouly <ahmed.h.elkhouly@gmail.com>
 Milestones
 ----------
 Target Milestone for completion:
  mitaka-3
 Work Items
 ----------
 - Modify StackResource and RemoteStack such that they are replaced on update
  when in the CHECK_FAILED state.
 - Implement an RPC API to mark resources as CHECK_FAILED in both the legacy and
  convergence architectures in heat-engine
 - Implement a ReST front end to the RPC API call in heat-api
 - Implement client support for the API call
 - Modify InstanceGroup to keep FAILED resources in the template (so that they
  are replaced by another of the same name)
 Dependencies
 ============
 It is possible that the changes to InstanceGroup could be greatly simplified
 after the completion of the blueprint scaling-group-common.
 The replacement of failed ResourceGroup members will not work correctly in the
 case of a rolling update until bug 1508736 is fixed.