Merge "Add an API to manually mark a resource as unhealthy"

Jenkins 2016-02-29 16:57:56 +00:00 committed by Gerrit Code Review
commit b3c15e6fb6


@ -0,0 +1,201 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..
 This template should be in ReSTructured text. The filename in the git
 repository should match the launchpad URL, for example a URL of
 https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
 awesome-thing.rst . Please do not delete any of the sections in this
 template. If you have nothing to say for a whole section, just write: None
 For help with syntax, see http://sphinx-doc.org/rest.html
 To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html

==========================
Mark Unhealthy Resources
==========================
https://blueprints.launchpad.net/heat/+spec/mark-unhealthy
Add an interface to allow the user to communicate information about the health
of a resource that Heat cannot determine on its own.
Problem description
===================
The only mechanism that Heat has for evaluating the health of a resource is to
compare its properties against the output of the relevant OpenStack API. (This
happens via the stack-check command in the current architecture, but will be
automatic on updates in the proposed Phase 2 of the Convergence architecture.)
However, there may exist resources that the user (or application) knows are
unhealthy where Heat has no way of determining that. The obvious example is a
server which is running as far as Nova is concerned but is, in point of fact,
borked as far as the application is concerned.
Currently there is no way for a user (or application) to replace such a
resource without performing multiple orchestration passes or renaming the
resource or both. Both are undesirable, and this leaves the user unable to take
advantage of Heat's ability to correctly replace a resource as part of a single
workflow.
Proposed change
===============
Add a PATCH handler to the Resource endpoint::

    /stacks/<stack_name>/<stack_id>/resources/<resource_id>

The PATCH method will accept a JSON body of the form::

    {
      "mark_unhealthy": <bool>,
      "resource_status_reason": <string>
    }

For legacy stacks, this call will fail if it cannot acquire the stack lock. For
Convergence (phase 1) stacks, the call will fail if it cannot acquire the
resource lock. This failure mode will be indicated by raising an
ActionInProgress exception in the engine, which manifests as a 409 Conflict
response to the ReST API request.
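
As a rough client-side illustration of the request described above, the following sketch builds (but does not send) the PATCH request using only the Python standard library. The function name, the base URL used in the usage note, and the choice of ``urllib`` are assumptions made for illustration; the path layout and body fields are the ones proposed in this spec, and ``X-Auth-Token`` is the usual Keystone token header.

```python
import json
import urllib.request


def build_mark_unhealthy_request(base_url, stack_name, stack_id, resource_id,
                                 unhealthy, reason=None, token=None):
    """Build a PATCH request for the proposed endpoint (hypothetical helper)."""
    url = "{}/stacks/{}/{}/resources/{}".format(
        base_url.rstrip("/"), stack_name, stack_id, resource_id)

    # Only the two fields named in the spec are ever sent.
    body = {"mark_unhealthy": bool(unhealthy)}
    if reason is not None:
        body["resource_status_reason"] = reason

    headers = {"Content-Type": "application/json"}
    if token is not None:
        headers["X-Auth-Token"] = token

    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="PATCH",
    )
```

A caller would pass the returned object to ``urllib.request.urlopen()``. Given the locking behaviour described above, a 409 Conflict response would presumably mean that another action holds the lock, so the caller may want to retry later.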
Upon receipt of this call, Heat will put the resource into the CHECK_FAILED
state if the 'mark_unhealthy' field is true. If the field is false, Heat will
put the resource in the CHECK_COMPLETE state if it was in the CHECK_FAILED
state; otherwise it will make no change.
Presence of any other fields or a missing 'mark_unhealthy' field will trigger
an Invalid Request error.
The 'resource_status_reason' field is optional. If present, the value of this field will
be used as the status_reason for the status change; otherwise an appropriate
default message will be recorded to indicate that the state change was due to
the resource being explicitly marked unhealthy.
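
The validation and state-transition rules above can be sketched as a small pure function. This is not the heat-engine implementation: the function name, the exception class, the default messages and the extra type check on 'mark_unhealthy' are all assumptions for illustration. The spec also only describes a default message for the unhealthy case, so the default used when clearing CHECK_FAILED is a guess.

```python
ALLOWED_FIELDS = {"mark_unhealthy", "resource_status_reason"}

# Hypothetical default messages; the spec only asks for "an appropriate
# default message" and does not give the wording.
DEFAULT_UNHEALTHY_REASON = "Resource explicitly marked unhealthy"
DEFAULT_HEALTHY_REASON = "Resource explicitly marked healthy"


class InvalidRequest(ValueError):
    """Stand-in for the API's Invalid Request error."""


def apply_mark_unhealthy(action, status, body):
    """Return the new (action, status, reason), or None for no change."""
    unknown = set(body) - ALLOWED_FIELDS
    if unknown:
        raise InvalidRequest("unexpected fields: %s" % sorted(unknown))
    if "mark_unhealthy" not in body:
        raise InvalidRequest("missing 'mark_unhealthy' field")
    flag = body["mark_unhealthy"]
    if not isinstance(flag, bool):
        # Assumption: a non-boolean value is rejected rather than coerced.
        raise InvalidRequest("'mark_unhealthy' must be a boolean")

    reason = body.get("resource_status_reason")
    if flag:
        # Any state moves to CHECK_FAILED when marked unhealthy.
        return ("CHECK", "FAILED", reason or DEFAULT_UNHEALTHY_REASON)
    if (action, status) == ("CHECK", "FAILED"):
        # Only CHECK_FAILED is cleared by mark_unhealthy=false.
        return ("CHECK", "COMPLETE", reason or DEFAULT_HEALTHY_REASON)
    return None
```

Returning ``None`` for the "no change" case keeps the caller from writing a spurious event when a healthy resource is marked healthy again.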
It is assumed that if further operations are later added using the PATCH verb
on a resource, it will be invalid to combine them with this one in a single
call. As such, the RPC API will have a specific mark_unhealthy_resource
call rather than a general patch_resource call.
Change the _needs_update() method of the StackResource and RemoteStack resource
types, such that the resource is replaced on update if it is in the
CHECK_FAILED state. A user who wants to manually force replacement of a
*member* of a nested stack (as opposed to the nested stack itself) should mark
the member(s) as unhealthy rather than the stack itself. Resources of any
other type that are in a FAILED state already will be replaced on a subsequent
stack update, regardless of the action (CHECK or otherwise), and this applies
equally to both legacy and convergence stacks.
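
The replacement rule in the preceding paragraph can be summarised by the following sketch. It is a simplified model rather than the real ``_needs_update()`` method: the function name and the use of plain strings for resource kinds and states are assumptions, and the behaviour of nested stacks in other FAILED states is not covered by this spec, so the sketch simply leaves them un-replaced.

```python
# Resource kinds whose _needs_update() this spec proposes to change.
STACK_LIKE_KINDS = {"StackResource", "RemoteStack"}


def replaced_on_update(resource_kind, action, status):
    """Return True if the resource would be replaced on the next update."""
    if status != "FAILED":
        return False
    if resource_kind in STACK_LIKE_KINDS:
        # Proposed change: nested/remote stacks are replaced only when they
        # are in CHECK_FAILED. Other FAILED states are outside this spec;
        # returning False for them is an assumption of this sketch.
        return action == "CHECK"
    # Any other resource type in a FAILED state is replaced regardless of
    # the action, in both legacy and convergence stacks.
    return True
```

For example, ``replaced_on_update("StackResource", "CHECK", "FAILED")`` is true, while marking a member of the nested stack unhealthy leaves the nested stack itself untouched and replaces only that member.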
Modify the InstanceGroup (and, by extension, Heat and AWS AutoscalingGroup)
types to give members in a FAILED state the highest priority for being removed
when scaling down or being updated in a rolling update. Currently, FAILED
resources are omitted when building a new template for the scaling group, so
any such resources would never be replaced by one of the same name. This change
will allow for continuity of naming in the case of a change that doesn't
permanently remove the resource due to scaling down.
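
To illustrate the proposed priority rule, here is a minimal sketch of choosing which members to drop when scaling down. It does not use the real InstanceGroup or grouputils code: members are modelled as hypothetical ``(name, status)`` tuples, and the tie-breaking rule (keep the existing order among members of equal priority) is an assumption.

```python
def removal_order(members):
    """Order members so that FAILED ones are removed first.

    ``members`` is a list of (name, status) tuples. sorted() is stable, so
    members with the same priority keep their original relative order.
    """
    return sorted(members, key=lambda member: member[1] != "FAILED")


def scale_down(members, new_size):
    """Return the members that survive shrinking the group to ``new_size``."""
    excess = max(len(members) - new_size, 0)
    doomed = {name for name, _status in removal_order(members)[:excess]}
    # Survivors are returned in their original order.
    return [member for member in members if member[0] not in doomed]
```

With members ``a`` (COMPLETE), ``b`` (FAILED) and ``c`` (COMPLETE), scaling down to two members removes ``b`` first. If the group size does not change, ``b`` stays in the member list and can therefore be replaced by a resource of the same name on update.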
Once bug 1508736 is fixed there should be no further need to make any change to
ResourceGroup. However, note that ResourceGroup and InstanceGroup both use the
same grouputils.get_members() function that filters out failed members, so the
modifications above may require changes to ResourceGroup to maintain the same
behaviour.
Alternatives
------------
It might appear desirable to have a single call to both mark the resource as
unhealthy and initiate a stack update with the existing template and
environment. However, it is better to keep the API calls orthogonal, as the
user may want to make other changes to the stack at the same time. It also
considerably simplifies implementation and testing.
We could add a separate healthy=False column to the database instead of
re-using CHECK_FAILED, but given that this is effectively a way of manually
providing information that is not available to stack-check, it makes sense to
re-use the same state. It also simplifies the logic in the engine, as we
already check for a FAILED state in many places, so re-using this state should
result in Heat just doing the Right Thing without having to add multiple checks
for another field.
An earlier version of this proposal suggested using a SOAP-style POST request
to a "mark_unhealthy" action endpoint, rather than a PATCH request to the
resource. This is consistent with how many OpenStack APIs operate today, but is
widely regarded as a non-ReSTful abomination. The currently `proposed
guidelines`_ of the API working group suggest a single "actions" endpoint for
POST requests of this type, where the body would be of the form::

    {
      "name": "mark_unhealthy",
      "args": {
        "unhealthy": <bool>,
        "resource_status_reason": <string>
      }
    }

However, this proposal is still controversial (and has been described, not
inaccurately, in reviews as "SOAP in ReST clothing"). The main driver behind it
seems to be a belief that projects will be unwilling to implement a fully
ReSTful interface like that proposed here.
.. _proposed guidelines: https://review.openstack.org/#/c/234994/
We could re-use the existing signal API instead of adding a new endpoint.
However, that would mean a mix of responsibility in handling signals between
the resource plugin (which is responsible today) and Heat (since this new
proposal is independent of resource type). It would be more consistent with the
currently-proposed API guidelines; it's arguable whether that is a good thing
or not, since those recommendations are still very much liable to change.
Alternatively, we could make this a call on a stack (rather than an individual
resource), so that the user can mark multiple resources unhealthy with a single
call. One downside of this is that it requires the resource identifier to be
included in the body of the request rather than the URL, so it could end up
harder than it needs to be to include in e.g. Mistral workflows. It's also less
logical from a ReST perspective, and complicates error handling and reporting.
We can always add this later if it really turns out to be required.
Instead of defining a particular state transition, we could allow the user to
set arbitrary resource states. This is a giant can of worms.
This proposal is an alternative to the one presented in
https://review.openstack.org/#/c/212205/ which involved mechanisms to place the
member IDs of various types of scaling groups under user control. This proposal
is both more generic and more relevant to the future convergence plans than
that one.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
  ahmed-h-elkhouly <ahmed.h.elkhouly@gmail.com>
Milestones
----------
Target Milestone for completion:
  mitaka-3
Work Items
----------
- Modify StackResource and RemoteStack such that they are replaced on update
  when in the CHECK_FAILED state.
- Implement an RPC API to mark resources as CHECK_FAILED in both the legacy and
  convergence architectures in heat-engine.
- Implement a ReST front end to the RPC API call in heat-api.
- Implement client support for the API call.
- Modify InstanceGroup to keep FAILED resources in the template (so that they
  are replaced by another of the same name).
Dependencies
============
It is possible that the changes to InstanceGroup could be greatly simplified
after the completion of the blueprint scaling-group-common.
The replacement of failed ResourceGroup members will not work correctly in the
case of a rolling update until bug 1508736 is fixed.