Merge "Add an API to manually mark a resource as unhealthy"
commit b3c15e6fb6

specs/mitaka/mark-unhealthy.rst (new file, 201 lines)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..
 This template should be in ReSTructured text. The filename in the git
 repository should match the launchpad URL, for example a URL of
 https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
 awesome-thing.rst . Please do not delete any of the sections in this
 template. If you have nothing to say for a whole section, just write: None
 For help with syntax, see http://sphinx-doc.org/rest.html
 To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html

==========================
 Mark Unhealthy Resources
==========================

https://blueprints.launchpad.net/heat/+spec/mark-unhealthy

Add an interface to allow the user to communicate information about the health
of a resource that Heat cannot determine on its own.

Problem description
===================

The only mechanism that Heat has for evaluating the health of a resource is to
compare its properties against the output of the relevant OpenStack API. (This
happens via the stack-check command in the current architecture, but will be
automatic on updates in the proposed Phase 2 of the Convergence architecture.)
However, there may be resources that the user (or application) knows are
unhealthy but that Heat has no way of identifying as such. The obvious example
is a server which is running as far as Nova is concerned but is, in point of
fact, borked as far as the application is concerned.

Currently there is no way for a user (or application) to replace such a
resource without performing multiple orchestration passes, renaming the
resource, or both. Either option is undesirable, and this leaves the user
unable to take advantage of Heat's ability to correctly replace a resource as
part of a single workflow.

Proposed change
===============

Add a PATCH handler to the Resource endpoint::

    /stacks/<stack_name>/<stack_id>/resources/<resource_id>

The PATCH method will accept a JSON body of the form::

    {
        "mark_unhealthy": <bool>,
        "resource_status_reason": <string>
    }

For legacy stacks, this call will fail if it cannot acquire the stack lock. For
Convergence (phase 1) stacks, the call will fail if it cannot acquire the
resource lock. This failure mode will be indicated by raising an
ActionInProgress exception in the engine, which manifests as a 409 Conflict
response to the ReST API request.

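As a rough illustration of the request described above, the following sketch
builds (but does not send) such a PATCH request using only the Python standard
library. The function names, the example endpoint and the use of the
``X-Auth-Token`` header are assumptions made for the example; they are not
part of this spec.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_mark_unhealthy_request(endpoint: str, stack_name: str,
                                 stack_id: str, resource_id: str,
                                 unhealthy: bool,
                                 reason: Optional[str] = None,
                                 token: Optional[str] = None
                                 ) -> urllib.request.Request:
    """Build the PATCH request; the caller decides when to send it."""
    # Quote each path segment so unusual names cannot alter the URL path.
    path = "/stacks/{}/{}/resources/{}".format(
        urllib.parse.quote(stack_name, safe=""),
        urllib.parse.quote(stack_id, safe=""),
        urllib.parse.quote(resource_id, safe=""))
    body = {"mark_unhealthy": unhealthy}
    if reason is not None:
        # The reason is optional, so it is only sent when supplied.
        body["resource_status_reason"] = reason
    headers = {"Content-Type": "application/json"}
    if token is not None:
        headers["X-Auth-Token"] = token  # assumed authentication header
    return urllib.request.Request(endpoint.rstrip("/") + path,
                                  data=json.dumps(body).encode("utf-8"),
                                  headers=headers, method="PATCH")


def is_lock_conflict(status_code: int) -> bool:
    """True when the response means the stack or resource lock was held."""
    return status_code == 409
```

A caller that receives a 409 Conflict could reasonably wait and retry, since
the conflict only indicates that another operation currently holds the lock.
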
Upon receipt of this call, Heat will put the resource into the CHECK_FAILED
state if the 'mark_unhealthy' field is true. If the field is false, Heat will
put the resource in the CHECK_COMPLETE state if it was in the CHECK_FAILED
state; otherwise it will make no change.

Presence of any other fields or a missing 'mark_unhealthy' field will trigger
an Invalid Request error.

The 'resource_status_reason' field is optional. If present, the value of this
field will be used as the status reason for the status change; otherwise an
appropriate default message will be recorded to indicate that the state change
was due to the resource being explicitly marked unhealthy.

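The validation and state-transition rules above can be sketched as a pure
function. This is only an illustration under stated assumptions: the
``InvalidRequest`` class, the function name, the default reason strings and
the strict boolean check are all hypothetical and do not describe the actual
engine code.

```python
from typing import Tuple

ALLOWED_FIELDS = {"mark_unhealthy", "resource_status_reason"}
# Hypothetical default messages; the spec only asks for "an appropriate
# default message".
DEFAULT_UNHEALTHY_REASON = "Resource explicitly marked unhealthy"
DEFAULT_HEALTHY_REASON = "Resource explicitly marked healthy"

State = Tuple[str, str]  # (action, status), e.g. ("CHECK", "FAILED")


class InvalidRequest(ValueError):
    """Stand-in for the Invalid Request error (hypothetical name)."""


def apply_mark_unhealthy(state: State, reason: str,
                         body: dict) -> Tuple[State, str]:
    """Return the (state, status reason) that results from the request."""
    unknown = set(body) - ALLOWED_FIELDS
    if unknown:
        raise InvalidRequest("unexpected fields: %s" % sorted(unknown))
    if "mark_unhealthy" not in body:
        raise InvalidRequest("missing 'mark_unhealthy' field")
    flag = body["mark_unhealthy"]
    if not isinstance(flag, bool):  # assumption: only real booleans pass
        raise InvalidRequest("'mark_unhealthy' must be a boolean")

    supplied = body.get("resource_status_reason")
    if flag:
        # Marking unhealthy always moves the resource to CHECK_FAILED.
        return ("CHECK", "FAILED"), supplied or DEFAULT_UNHEALTHY_REASON
    if state == ("CHECK", "FAILED"):
        # Marking healthy only undoes a CHECK_FAILED state.
        return ("CHECK", "COMPLETE"), supplied or DEFAULT_HEALTHY_REASON
    # Any other state is left untouched, including its reason.
    return state, reason
```

Keeping the rule free of side effects makes each case easy to test on its own:
marking a resource healthy changes nothing unless it was in CHECK_FAILED.
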
It is assumed that, should any additional operations using the PATCH verb on a
resource be added in future, it will be invalid to combine them with this one
in the same call. As such, the RPC API will have a specific
mark_unhealthy_resource call rather than a general patch_resource call.

Change the _needs_update() method of the StackResource and RemoteStack resource
types, such that the resource is replaced on update if it is in the
CHECK_FAILED state. A user who wants to manually force replacement of a
*member* of a nested stack (as opposed to the nested stack itself) should mark
the member(s) as unhealthy rather than the stack itself. Resources of any
other type that are already in a FAILED state will be replaced on a subsequent
stack update, regardless of the action (CHECK or otherwise), and this applies
equally to both legacy and convergence stacks.

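The replacement rule described above can be summarised in a few lines. This is
a simplified sketch rather than the real _needs_update() method: the function
name and arguments are hypothetical, and it assumes that nested stack
resources in other FAILED states are not replaced, which this spec does not
state explicitly.

```python
def needs_replacement(is_nested_stack: bool, action: str, status: str) -> bool:
    """Decide whether an update should replace the resource."""
    if is_nested_stack:
        # StackResource and RemoteStack: replace only when CHECK_FAILED
        # (assumption: other FAILED states are handled in place).
        return (action, status) == ("CHECK", "FAILED")
    # Any other resource type: replace whenever it is FAILED,
    # regardless of which action failed.
    return status == "FAILED"
```
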
Modify the InstanceGroup (and, by extension, Heat and AWS AutoscalingGroup)
types to give members in a FAILED state the highest priority for being removed
when scaling down or being updated in a rolling update. Currently, FAILED
resources are omitted when building a new template for the scaling group, so
any such resources would never be replaced by one of the same name. This change
will allow for continuity of naming in the case of a change that doesn't
permanently remove the resource due to scaling down.

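One way to picture the intended ordering is the sketch below. The ``Member``
type, the function names and the oldest-first tie-break among healthy members
are assumptions made for the example, not the InstanceGroup implementation.

```python
from typing import List, NamedTuple


class Member(NamedTuple):
    name: str
    status: str   # "COMPLETE" or "FAILED"
    created: int  # hypothetical creation order, lower means older


def removal_order(members: List[Member]) -> List[Member]:
    """FAILED members first, then oldest first (tie-break is an assumption)."""
    return sorted(members, key=lambda m: (m.status != "FAILED", m.created))


def scale_down(members: List[Member], new_size: int) -> List[Member]:
    """Drop the highest-priority members until new_size remain."""
    excess = max(len(members) - new_size, 0)
    doomed = {m.name for m in removal_order(members)[:excess]}
    # Survivors keep their original order and names. A FAILED member that
    # survives stays in the template so it can be replaced under its name.
    return [m for m in members if m.name not in doomed]
```

When the group is not shrinking, the FAILED member is kept in the list, which
is what allows it to be replaced by a resource of the same name.
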
Once bug 1508736 is fixed there should be no further need to make any change to
ResourceGroup. However, note that ResourceGroup and InstanceGroup both use the
same grouputils.get_members() function that filters out failed members, so the
modifications above may require changes to ResourceGroup to maintain the same
behaviour.

Alternatives
------------

It might appear desirable to have a single call to both mark the resource as
unhealthy and initiate a stack update with the existing template and
environment. However, it is better to keep the API calls orthogonal, as the
user may want to make other changes to the stack at the same time. It also
considerably simplifies implementation and testing.

We could add a separate healthy=False column to the database instead of
re-using CHECK_FAILED, but given that this is effectively a way of manually
providing information that is not available to stack-check, it makes sense to
re-use the same state. It also simplifies the logic in the engine, as we
already check for a FAILED state in many places, so re-using this state should
result in Heat just doing the Right Thing without having to add multiple checks
for another field.

An earlier version of this proposal suggested using a SOAP-style POST request
to a "mark_unhealthy" action endpoint, rather than a PATCH request to the
resource. This is consistent with how many OpenStack APIs operate today, but is
widely regarded as a non-ReSTful abomination. The currently `proposed
guidelines`_ of the API working group suggest a single "actions" endpoint for
POST requests of this type, where the body would be of the form::

    {
        "name": "mark_unhealthy",
        "args": {
            "unhealthy": <bool>,
            "resource_status_reason": <string>
        }
    }

However, this proposal is still controversial (and has been described, not
inaccurately, in reviews as "SOAP in ReST clothing"). The main driver behind it
seems to be a belief that projects will be unwilling to implement a fully
ReSTful interface like that proposed here.

.. _proposed guidelines: https://review.openstack.org/#/c/234994/

We could re-use the existing signal API instead of adding a new endpoint.
However, that would split responsibility for handling signals between the
resource plugin (which is responsible today) and Heat (since this new
proposal is independent of resource type). It would be more consistent with the
currently-proposed API guidelines; it's arguable whether that is a good thing
or not, since those recommendations are still very much liable to change.

Alternatively, we could make this a call on a stack (rather than an individual
resource), so that the user can mark multiple resources unhealthy with a single
call. One downside of this is that it requires the resource identifier to be
included in the body of the request rather than the URL, so it could end up
harder than it needs to be to include in e.g. Mistral workflows. It's also less
logical from a ReST perspective, and complicates error handling and reporting.
We can always add this later if it really turns out to be required.

Instead of defining a particular state transition, we could allow the user to
set arbitrary resource states. This is a giant can of worms.

This proposal is an alternative to the one presented in
https://review.openstack.org/#/c/212205/ which involved mechanisms to place the
member IDs of various types of scaling groups under user control. This proposal
is both more generic and more relevant to the future convergence plans than
that one.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  ahmed-h-elkhouly <ahmed.h.elkhouly@gmail.com>

Milestones
----------

Target Milestone for completion:
  mitaka-3

Work Items
----------

- Modify StackResource and RemoteStack such that they are replaced on update
  when in the CHECK_FAILED state.
- Implement an RPC API to mark resources as CHECK_FAILED in both the legacy and
  convergence architectures in heat-engine
- Implement a ReST front end to the RPC API call in heat-api
- Implement client support for the API call
- Modify InstanceGroup to keep FAILED resources in the template (so that they
  are replaced by another of the same name)

Dependencies
============

It is possible that the changes to InstanceGroup could be greatly simplified
after the completion of the blueprint scaling-group-common.

The replacement of failed ResourceGroup members will not work correctly in the
case of a rolling update until bug 1508736 is fixed.