When updating a resource that hasn't changed, we previously did not retry
the write when the atomic_key of the resource didn't match what we
expected.
The atomic key is incremented not only when locking a resource for an
update, but also when modifying metadata and when storing cached
attribute values. Evidently one of these can occur in the time between
when the resource is loaded and when we attempt to update the template ID
&c. in the DB.
When the resource is not locked and its template ID hasn't changed since we
loaded it, we can assume that the update failed due to a mismatched atomic
key alone. Handle this case by sending another resource-check RPC message,
so that the check operation will be retried with fresh data from the DB.
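A rough sketch of the retry condition, with attribute and helper names
chosen for illustration rather than taken from the actual change:

    def should_retry_check(rsrc, loaded_template_id):
        # The UPDATE ... WHERE on the resource row failed. If no engine
        # holds a lock (engine_id unset) and the template ID is unchanged
        # since we loaded the resource, the only plausible cause is a
        # bumped atomic key (metadata write or attribute caching), so it
        # is safe to send another resource-check RPC message.
        return (rsrc.engine_id is None and
                rsrc.current_template_id == loaded_template_id)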
Change-Id: I5afd5602096be54af5da256927fe828366dbd63b
Closes-Bug: #1763021
Because of quotas, creating one resource and then deleting another can
sometimes fail where doing it in the reverse order would have worked,
even though the resources are independent of one another.
When enqueueing 'check_resource' messages, send those for cleanup nodes
prior to those for update nodes. This means that all things being equal
(i.e. no dependency relationship), deletions will be started first. It
doesn't guarantee success whenever quotas would allow it, since only a
dependency relationship will cause Heat to wait for the deletion to
complete before starting the creation, but it is a risk-free way to give
us a better chance of succeeding.
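A minimal sketch of the enqueue ordering, with names assumed for
illustration:

    def propagate_check_resources(rpc_client, cnxt, stack, graph_nodes):
        # Cleanup nodes (is_update=False) sort before update nodes, so
        # deletions are enqueued first when no dependency edge orders
        # them.
        for node in sorted(graph_nodes, key=lambda n: n.is_update):
            rpc_client.check_resource(cnxt, node.rsrc_id,
                                      stack.current_traversal,
                                      node.is_update)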
Change-Id: I9727d906cd0ad8c4bf9c5e632a47af6d7aad0c72
Partial-Bug: #1713900
There's nothing more frustrating than trying to debug when a message
appears to get lost without rhyme or reason. When we're not going to
process a resource (because it has already been replaced), log what is
happening so that we can see it when debugging.
Change-Id: I89d4b51205798d3c930f0132c81ff987d46382c7
If we force-cancel a resource check operation (i.e. by killing the
thread), there is a short window in which another traversal may have
started and be waiting on the not-yet-cancelled resource. That traversal
would then hang forever, as there is nothing to retrigger it. This change
ensures we retrigger the latest traversal on the resource after
cancellation.
Change-Id: Iae27c9cc5c0895b52aef2f2c72686dc48ec5983c
Closes-Bug: #1727007
Use a single transaction to create the replacement resource and set it as
the replaced_by link in the old resource. Also, ensure that no other
traversal has taken a lock on the old resource before we modify it.
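A sketch of the transactional create-and-link, assuming a SQLAlchemy
session plus a Resource model and an UpdateInProgress exception that
stand in for whatever the real code uses:

    def create_replacement(session, old_rsrc_id, values):
        with session.begin():
            replacement = Resource(**values)
            session.add(replacement)
            session.flush()  # assigns replacement.id
            # Link old -> new only if no other traversal holds a lock on
            # the old resource (engine_id unset); zero rows updated means
            # another traversal got there first.
            rows = (session.query(Resource)
                    .filter_by(id=old_rsrc_id, engine_id=None)
                    .update({'replaced_by': replacement.id}))
            if rows != 1:
                raise UpdateInProgress(old_rsrc_id)
        return replacement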
If we end up bailing out and not creating a replacement or sending an RPC
message to check it, make sure we retrigger any new traversal.
Change-Id: I23db4f06a4060f3d26a78f7b26700de426f355e3
Closes-Bug: #1727128
If a resource times out, we still need to check whether there is a new
traversal underway that we need to retrigger, otherwise the new traversal
will never complete.
Change-Id: I4ac7ac88797b7fb14046b5668649b2277ee55517
Closes-Bug: #1721654
The node key in the convergence graph is a (resource id, update/!cleanup)
tuple. Sometimes it would be convenient to access the members by name, so
convert to a namedtuple.
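A minimal sketch of the conversion (field names assumed):

    import collections

    # Node key in the convergence graph: the resource ID plus a flag
    # that is True for the update phase and False for cleanup.
    GraphKey = collections.namedtuple('GraphKey', ['rsrc_id', 'is_update'])

    key = GraphKey(rsrc_id=42, is_update=True)
    rsrc_id, is_update = key        # tuple unpacking still works...
    assert key.rsrc_id == 42        # ...and members are now named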
Change-Id: Id8c159b0137df091e96f1f8d2312395d4a5664ee
If we get an unexpected exception when checking a resource, try to clean
up by marking the stack as FAILED. The graph traversal will stop if we
can't propagate any more RPC messages, so without this the stack would
be stuck IN_PROGRESS indefinitely.
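A sketch of the cleanup path, with helper names invented for
illustration:

    def check_resource(self, cnxt, resource_id, current_traversal, data):
        try:
            self._do_check_resource(cnxt, resource_id,
                                    current_traversal, data)
        except Exception as exc:
            # The graph traversal stalls if we cannot propagate any more
            # RPC messages, so record the failure on the stack rather
            # than leaving it IN_PROGRESS indefinitely.
            self._mark_stack_failed(cnxt, current_traversal,
                                    reason=str(exc))
            raise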
Change-Id: I56ecfa7a9a328d1435c1f34ab14e56effb81bb21
Closes-Bug: #1703043
Store resource attributes that may be cached in the DB, saving the
cost of re-resolving them later. This works for most resources,
specifically those that do not override the get_attribute() method.
Change-Id: I71f8aa431a60457326167b8c82adc03ca750eda6
Partial-Bug: #1660831
If a traversal is interrupted by a fresh update before a particular
resource is created, then the resource is left stored in the DB with the
old template ID. While an update always uses the new template, a create
assumes that the template ID in the DB is correct. Since the resource has
never been created, the new traversal will create it using the old
template.
To resolve this, detect the case where the resource has not yet been
created, we are about to create it, and the traversal ID is still
current; in that case, always use the new resource definition.
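A sketch of the detection, with names chosen for illustration:

    def definition_for_create(rsrc, stack, traversal_id,
                              new_template_id, new_defn):
        # The resource was stored but never created (still in INIT), we
        # are about to create it, and our traversal is still current, so
        # the template ID recorded in the DB may be stale; trust the new
        # definition instead.
        if (rsrc.action == rsrc.INIT and
                stack.current_traversal == traversal_id):
            return new_template_id, new_defn
        # Otherwise fall back to what the DB says.
        return rsrc.current_template_id, rsrc.t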
Change-Id: Ifa0ce9e1e08f86b30df00d92488301ea05b45b14
Closes-Bug: #1663745
Handle the restore operation as a normal convergence update instead of a
legacy one.
Change-Id: I6ee46cdf7a8fdf89c58c9812d08af21c97fb0f9e
Related-Bug: #1687006
Sometimes we know we will only access particular fields of a resource
object, rather than *all* of them. This commit allows the caller to
specify (optionally) the fields that should be populated when the
resource object is instantiated. This saves memory, trips to the db,
and in some cases avoids extra join queries (e.g. for resource.data or
resource.rsrc_prop_data).
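A hypothetical usage sketch (the exact argument name may differ from the
one this change introduces):

    # Populate only the columns we need; leaving out data and
    # rsrc_prop_data also avoids their join queries.
    rsrc = resource_objects.Resource.get_obj(
        cnxt, resource_id,
        fields=('id', 'name', 'action', 'status', 'atomic_key'))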
Change-Id: I405888f46451d2657aa28f610f8ca555215ff5cf
Partial-Bug: #1680658
This will allow the snapshotting of attribute/refid values to occur on both
the legacy and convergence paths (currently it is used only for
convergence).
Change-Id: I9a8fce9c6d22d84ec967087b62bff77f5a6de3db
Partially-Implements: blueprint stack-definition
Formalise the format for the output data from a node in the convergence
graph (i.e. resource reference ID, attributes, &c.) by creating an object
with an API rather than ad-hoc dicts.
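A minimal sketch of such an object (method and key names assumed):

    class NodeData(object):
        """Output data from one node in the convergence graph."""

        def __init__(self, primary_key, reference_id, attributes):
            self.primary_key = primary_key
            self._reference_id = reference_id
            self._attributes = attributes

        def reference_id(self):
            return self._reference_id

        def attributes(self):
            return self._attributes

        def as_dict(self):
            # Serialisable form for passing between workers over RPC.
            return {'id': self.primary_key,
                    'reference_id': self._reference_id,
                    'attrs': self._attributes}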
Change-Id: I7a705b41046bfbf81777e233e56aba24f3166510
Partially-Implements: blueprint stack-definition
Two improvements:
1. Do not iterate through the stack outputs when determining which
attributes to send as input_data to the next resources in a traversal;
at best it is extra processing, and at worst it results in extra
attributes being included in input_data.
2. Do not re-resolve attributes / re-calculate input_data when we
already have it: when a resource constructs input_data to send to a
requirer resource, that same input_data may be used for all the other
requirer resources.
Change-Id: I64089fb0774c10f172d986c3f87090e91cb3f263
Closes-Bug: #1656125
The interleaving of locks when an update-replace of a resource is needed
was found to be the reason for the new traversal not being triggered.
Consider the order of events below:
1. A server is being updated. The worker locks the server resource.
2. A rollback is triggered because someone cancelled the stack.
3. As part of the rollback, a new update using the old template is
started.
4. The new update tries to take the lock, but it is already held by (1).
The new update now expects that when the old resource is done, the new
traversal will be re-triggered.
5. The old update decides to create a new resource for replacement. The
replacement resource is initiated for creation and a check_resource RPC
call is made for the new resource.
6. A worker, possibly in another engine, receives the call and then bails
out when it finds that a new traversal has been initiated (from 2). Now
there is no progress from here, because it is expected (from 4) that
there will be a re-trigger when the old resource is done.
This change takes care of re-triggering the new traversal from the worker
when it finds both a new traversal and an update-replace. Note that this
issue is not seen when there is no update-replace, because the old
resource will finish (either fail or complete) and, in the same thread,
will find the new traversal and trigger it.
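A sketch of the fix in the worker, with the helper name assumed:

    def check_resource(self, cnxt, rsrc, stack, traversal_id, is_update):
        if traversal_id != stack.current_traversal:
            # If we got here via an update-replace, the old resource's
            # thread will never retrigger the new traversal for us, so
            # do it here before bailing out.
            self._retrigger_check_resource(cnxt, is_update,
                                           rsrc.id, stack)
            return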
Closes-Bug: #1625073
Change-Id: Icea5ba498ef8ca45cd85a9721937da2f4ac304e0
This allows a convergence operation to be cancelled at an appropriate point
(i.e. between steps in a task) by sending a message to a queue.
Note that there's no code yet to actually cancel any operations
(specifically, sending a cancel message to the stack will _not_ cause the
check_resource operations to be cancelled under convergence).
Change-Id: I9469c31de5e40334083ef1dd20243f2f6779549e
Related-Bug: #1545063
Co-Authored-By: Anant Patil <anant.patil@hpe.com>
The input to check_stack_complete should be the resource ID of the
resource that the current resource replaces, not its own. Failing to do
so results in the stack remaining IN_PROGRESS forever.
Change-Id: I6f2856c82c8cc73f628976b7296ab0fb20af5ff3
Closes-Bug: #1614960
A deeply misguided effort to move all exceptions out of the
heat.engine.resource module, where they belong, and into the
heat.common.exception module, where they largely do not, broke the API for
third-party resource plugins. Unfortunately this happened a couple of
releases back already, so we can't simply put UpdateReplace back where it
belongs as that would also break the de-facto third-party API.
This change adds an alias in the correct location and a comment indicating
that it will move back at some time in the future. It also switches all of
the in-tree uses back to heat.engine.resource.UpdateReplace, in the hope
that third-party developers will be more inclined to copy from that.
This reverts commit 4e2cfb991a.
Change-Id: Iedd5d07d6c0c07e39e51a0fb810665b3e9c61f87
Closes-Bug: #1611104
In convergence, where concurrent updates are possible, if a resource is
deleted (by a previous traversal) after the dependency graph has been
created for a new traversal, the resource remains in the graph but is no
longer available in the DB for processing.
It is a prerequisite to have resources in the DB before any action can be
taken on them.
Hence, during the convergence resource delete action, the resource entry
is not removed from the DB but soft deleted, so that the latest/new
update can still find the entry.
All of these soft-deleted resources are deleted once the stack has
completed its operation.
Closes-Bug: #1528560
Change-Id: I0b36ce098022560d7fe01623ce7b66d1d5b38d55
The changes include:
1. Avoid hard-coding resource and output keys.
2. Decouple HOT and CFN handling of outputs.
Change-Id: I1fd7e08ff5c699ddfcf98c81aed5f0d91c4248b3
The patch does the following:
1. Adds missing translations for log messages.
2. Uses dictionaries for substitutions in the updated messages.
TrivialFix
Change-Id: Id03a600694e561c4647094887fd55de127678cd1
Use `action` from the ResourceFailure exception, if available, as there
are cases, such as the update restriction check, where we have not yet
updated the resource action.
Change-Id: I5d43c220669c7a9e8d7dbce6611e062101f8b86b
Closes-Bug: #1592631
The GetAttThenSelect dep_attrs implementation is changed to return only
the attribute, instead of (attribute + path component). Returning
(attribute + path component) was wrong and was breaking GetAttThenSelect
for convergence.
Change-Id: I117dc3e587386f4d48e70ef89c61bb857c751717
Closes-Bug: #1582649
Refactor the worker service: move the check-resource code into its own
class in a separate file, keeping the convergence worker RPC API clean.
This refactor helps contain the convergence logic in a separate class
file instead of in the RPC API; the RPC service class should only have
the APIs it implements.
Change-Id: Ie9cf4daba7e6bf61f4cac3388494e8c9efefa4d7
While constructing the input data used to build the cache, the resource
attributes must be resolved without hitting the cache again. It is
unnecessary to look in the cache when resolving the attributes of a
freshly baked resource.
Change-Id: I0893c17d87c687ca5cf370c4443f471160bd2f3c
Add a new functional test using Zaqar as a target for event sinks. This
fixes the behavior when convergence is on.
Change-Id: I4bbdec55b98d0a261168229540a411d423e9406d
When an engine worker crashes or is restarted, the resources being
provisioned by it remain in the IN_PROGRESS state. The next stack update
should pick up these resources and work on them. The implementation is to
set the status of such a resource to FAILED and re-trigger
check_resource.
Change-Id: Ib7fd73eadd0127f8fae47881b59388b31131daf4
Closes-Bug: #1501161
Presently, when a resource from a previous traversal completes its action
successfully, we re-trigger that resource for the latest traversal (since
the latest traversal will be waiting for its completion). However, if a
resource from a previous traversal fails, we do not re-trigger it, which
leaves the latest traversal waiting endlessly.
This patch re-triggers the resource for the latest traversal even when
the resource fails.
Change-Id: I9f70878ad7f1ff7c2facb950e496681425b54fc4
Partial-Bug: #1512343
To avoid certain concurrency-related issues, the DB update API needs to
be given the traversal ID of the stack intended to be updated. By making
this change, we can avoid having the following at all call sites:
    if current_traversal != stack.current_traversal:
        return
The check for the current traversal should be implicit, as part of the
stack's store and state_set methods, where self.current_traversal is used
as the expected traversal to be updated. All state changes or DB updates
to the stack object go through this implicit check (using
UPDATE ... WHERE).
When a stack update is triggered, the current traversal is backed up as
the previous traversal, a new traversal is generated, and the stack is
stored in the DB with the previous traversal as the expected one. This
ensures that no two updates can simultaneously succeed on the same stack
with the same traversal ID, which was one of our primary goals.
The following example cases describe the issues we encountered:
1. Two updates, U1 and U2, try to update a stack concurrently:
1. The current traversal (CT) is X.
2. U1 loads the stack with CT=X.
3. U2 loads the stack with CT=X.
4. U2 stores the stack and updates CT=Y.
5. U1 stores the stack and updates CT=Z.
Both updates have succeeded, and both would keep running until one of
the workers notices that stack.current_traversal no longer matches its
own current_traversal and bails out.
Ideally, U1 should have failed: only one update should be allowed to
proceed in case of a concurrent update. When both U1 and U2 pass X as
the expected traversal ID of the stack, this problem is solved.
2. A resource R is being provisioned for a stack with current traversal
CT=X:
1. A new update U is issued; it loads the stack with CT=X.
2. Resource R fails and loads the stack with CT=X to mark it as FAILED.
3. Update U updates the stack with CT=Y and goes ahead with sync_point
etc., marking the stack as UPDATE_IN_PROGRESS.
4. Resource R marks the stack as UPDATE_FAILED, which to the user means
that update U has failed, even though it is actually still running.
With this patch, when resource R fails, it supplies CT=X as the expected
traversal to be updated, and that state update will eventually fail
because update U with CT=Y has taken over.
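A sketch of the compare-and-swap style update, assuming SQLAlchemy and a
Stack model, with the exception type invented for illustration:

    class ConcurrentTransaction(Exception):
        pass

    def stack_update(session, stack_id, expected_traversal, values):
        # The expected traversal ID is part of the WHERE clause, so of
        # two concurrent updaters only one can match the row and win.
        rows = (session.query(Stack)
                .filter_by(id=stack_id,
                           current_traversal=expected_traversal)
                .update(values))
        if rows != 1:
            raise ConcurrentTransaction(
                'traversal %s is no longer current' % expected_traversal)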
Partial-Bug: #1512343
Change-Id: I6ca11bed1f353786bb05fec62c89708d98159050
When loading a resource, load the stack with the template of the
resource. The appropriate stack needs to be assigned to the resource
(resource.stack), or resource actions will fail.
Co-Authored-By: Anant Patil <anant.patil@hp.com>
Partial-Bug: #1512343
Change-Id: Ic4526152c8fd027049514b71554036321a61efd2
Fix failing convergence gate functional tests:
- Store the resource UUID, action and status in cache data. Most of the
code requires the resource to have a proper status and UUID to work.
- Initialize rsrc._data to None so that the resource data is fetched
from the DB the first time it is accessed.
Change-Id: I7309c7da8fe1ce3e1c7e3d3027dea2e400111015
Co-Authored-By: Anant Patil <anant.patil@hp.com>
Partial-Bug: #1492116
Closes-Bug: #1495094
It is convenient to have all exceptions in the exception module. It also
reduces namespace clutter in the resource module and decreases the number
of dependencies in other modules (in some cases we no longer need to
import resource at all).
The UpdateInProgress exception is moved in this patch.
Change-Id: If694c264639bbce5334e1e6e7403b225ce1d3aee
It is convenient to have all exceptions in the exception module. It also
reduces namespace clutter in the resource module and decreases the number
of dependencies in other modules (in some cases we no longer need to
import resource at all).
The UpdateReplace exception is moved in this patch.
Change-Id: Ief441ca2022a0d50e88d709d1a062631479715b7
Store the attribute name and path so attributes don't get shadowed, e.g.:
e.g. get_attr: [res1, attr_x, show]
get_attr: [res1, attr_x, something]
Change-Id: I724e91b32776aa5813d2b821c2062424e0635a69
1. We are caching the result of FnGetRefId, which can be the name.
2. cache_data_resource_attribute() was trying to access "attributes"
instead of "attrs".
Change-Id: I59d55dcee2af521924fdb5da14e012dcc7b4dd3f
The resource provisioning work is distributed among heat engines, so the
timeout also has to be distributed and brought down to resource-level
granularity.
Thus,
1. Before invoking check_resource on a resource, ensure that the stack
has not timed out.
2. Pass the remaining amount of time to the resource converge method so
that it can raise a timeout exception if it cannot finish in the remaining
time.
Once a timeout exception is raised by a resource's converge method, the
corresponding stack is marked as FAILED with "Timed out" as the failure
reason. Then, if rollback is enabled on the stack, it is triggered.
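A simplified sketch of the budget calculation:

    import time

    def remaining_time(stack_started_at, stack_timeout_secs):
        # The stack-level timeout minus elapsed time is the budget passed
        # down to each resource's converge call.
        remaining = stack_timeout_secs - (time.time() - stack_started_at)
        if remaining <= 0:
            raise TimeoutError('Stack timed out')
        return remaining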
Change-Id: Id1806d546c67505137f57f72d5b463dc229a666d
All new resources will have an INIT state. Instead of having a complex
strategy to decide whether a resource should be created or updated, just
check whether its action is still INIT; if it is not, always trigger the
update workflow.
This also fixes a bug where we triggered a create for a resource without
a resource ID that should originally have been updated, because it was in
the UPDATE_FAILED state, which was the unhandled case.
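A sketch of the simplified dispatch (argument lists abbreviated):

    # Anything past INIT goes through the update workflow, which also
    # covers the previously unhandled UPDATE_FAILED resource that has
    # no physical resource ID yet.
    if rsrc.action == rsrc.INIT:
        rsrc.create_convergence(template_id, resource_data, engine_id)
    else:
        rsrc.update_convergence(template_id, resource_data, engine_id)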
Change-Id: I3f2318fecfe76592e8b54e9c09fdf1614197e83f
In the worker, most arguments are called "data", and it is not clear
whether they are serialized or not (or whether they contain adopt data).
1. Split adopt data out (add RPC support for the new argument).
2. Name arguments "resource_data" for deserialized data.
3. Name arguments "rpc_data" for serialized data.
4. Make sure all data passed into client.check_resource() is serialized.
Change-Id: Ie6bd0e45d2857d3a23235776c2b96cce02cb711a
1. remove the duplication between service.py and worker.py
2. use the topic, version & engine_id when logging
Change-Id: I2b7dfbbe1d5a68a9f1739ab53ba5c08691b495e1
When resources are replaced, the needed_by data needs to be updated. When
the resources are visited in the clean-up phase, after all the updates
are done, new needed_by data is sent to the resources.
Co-Authored-By: Rakesh HS <rh-s@hp.com>
All attributes are retrieved (including ones referenced in the outputs).
In our functional test there is a bug where a nested stack did not have
an output that was referenced from the outer stack's output. This didn't
affect the stack creation, as:
- outputs are not normally required during stack creation
- if the output is used by a TemplateResource it *may* be used
The way outputs are used currently, we do not fail on the first output
that doesn't work, as it is useful to get the ones that do work. To help
with this, errors are placed alongside the value (in a key "error_msg").
This is somewhat problematic in convergence, as we now *require* all
attributes (including those based on outputs).
Note: if you have an outer resource referencing a non-existent template
resource output it *will* fail, but when the outer stack's output
references the inner stack's output, it is not validated.
Change-Id: Id07c617a19eae56543f92ee21aea58cd38fa3606
Ensure the stack operation is re-triggered when SyncPointNotFound is
encountered and the stack has been updated (i.e. has a new traversal ID).
Change-Id: Ia1b670aa5766c57dafdcc84d642c42007371a087
Fix the incorrect comparison of the current traversal. Ensure
check_resource is re-triggered when the sync point is not found and the
stack has been updated (i.e. has a new traversal ID).
Change-Id: I9d7fa539b565f852b9431bb206823cfb16178607
Refactored the worker to remove duplicate code. check_resource is broken
down into many smaller methods for readability and unit testing.
Change-Id: Id32b9aa11ecb637bf737dc1d86261a9b78739535