In the original prototype for convergence, we passed the input_data from
the SyncPoint to the resource when calling the equivalent of
convergence_delete(), so that we could clear any needed_by references that
no longer exist. This is pointless for a few reasons:
* It's implemented incorrectly - it *sets* the referenced resources into
needed_by instead of clearing them from it.
* We don't actually pass any input data - in WorkerService.check_resource()
it's always set to an empty dict for cleanup nodes, regardless of what
came in on the wire.
* We don't store the result to the database unless we're deleting the
resource anyway - in which case it doesn't matter.
* It turns out that even in the prototype, the whole needed_by mechanism
isn't actually used for anything:
c74aac1f07
Rather than pretend that we're doing something useful with the input_data
here, just set the needed_by to an empty list, which is what was happening
anyway.
Change-Id: I73f6cf1646584dc4a83497f5a583c2c8973e8aba
When updating a resource that hasn't changed, we didn't previously retry
the write when the atomic_key of the resource didn't match what we
expected.
In addition to locking a resource to update it, the atomic key is also
incremented when modifying metadata and storing cached attribute values.
Apparently there is some mechanism that can cause this to happen in the
time between when the resource is loaded and when we attempt to update the
template ID &c. in the DB.
When the resource is not locked and its template ID hasn't changed since we
loaded it, we can assume that the update failed due to a mismatched atomic
key alone. Handle this case by sending another resource-check RPC message,
so that the operation check will be retried with fresh data from the DB.
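A minimal, self-contained sketch of the retry decision (the helper names
here are illustrative stand-ins, not the actual Heat API; the real code
does the compare-and-swap in SQL with an UPDATE ... WHERE clause):

    # Model the atomic-key compare-and-swap and the retry decision.
    def try_store_template_id(row, expected_key, new_template_id):
        if row['atomic_key'] != expected_key:
            return False  # key was bumped (metadata or attr-cache write)
        row['current_template_id'] = new_template_id
        row['atomic_key'] += 1
        return True

    def resend_resource_check(rsrc_id):
        print('re-sending resource-check RPC for %s' % rsrc_id)

    def update_unchanged(row, loaded_key, old_tmpl_id, new_tmpl_id):
        if try_store_template_id(row, loaded_key, new_tmpl_id):
            return
        if row.get('engine_id') is None and \
                row['current_template_id'] == old_tmpl_id:
            # Unlocked and template unchanged: only the atomic key moved,
            # so retry the whole check with fresh data from the DB.
            resend_resource_check(row['id'])
        else:
            raise RuntimeError('resource locked or already updated')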
Change-Id: I5afd5602096be54af5da256927fe828366dbd63b
Closes-Bug: #1763021
If a resource times out, we still need to check whether there is a new
traversal underway that we need to retrigger; otherwise the new
traversal will never complete.
Change-Id: I4ac7ac88797b7fb14046b5668649b2277ee55517
Closes-Bug: #1721654
When requesting the NodeData for a Resource, generate placeholder values if
the resource has not been created yet. Currently, all attributes are given
a value of None (consistent with how the get_attr intrinsic functions
substitute them when the resource state indicates that the attributes are
not valid now). Reference IDs are generated as usual (since the logic for
what to return when the state indicates the data are not valid is contained
in the resource plugins).
In future, this will allow us to add placeholder NodeData for all resources
to the StackDefinition prior to validation. In the long term, that should
help us to remove the state-checking logic from the intrinsic functions. In
the short term, it will allow us to validate template data that is parsed
against the StackDefinition (rather than the Stack), since it gives us a
list of valid attribute names. In the medium term, this should give us a
place to substitute in more sophisticated placeholder values to do better
validation.
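A rough sketch of what such a placeholder might look like (illustrative
only; the field names are stand-ins for Heat's actual NodeData):

    # Build placeholder node data for a resource that does not exist yet:
    # every attribute is None, while the reference ID comes from the
    # plugin's usual logic.
    def placeholder_node_data(name, reference_id, attr_names):
        return {
            'name': name,
            'reference_id': reference_id,
            'attrs': {attr: None for attr in attr_names},
        }

    # e.g. placeholder_node_data('server', 'server', ['first_address'])
    # -> {'name': 'server', 'reference_id': 'server',
    #     'attrs': {'first_address': None}}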
Change-Id: Ife69f563a3748c6cb611ca02f826229e76bee4d0
Partially-Implements: blueprint stack-definition
If a traversal is interrupted by a fresh update before a particular
resource is created, then the resource is left stored in the DB with the
old template ID. While an update always uses the new template, a create
assumes that the template ID in the DB is correct. Since the resource has
never been created, the new traversal will create it using the old
template.
To resolve this, detect the case where the resource has not yet been
created, we are about to create it, and the traversal ID is still
current; in that case, always use the new resource definition.
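The detection amounts to a condition like the following sketch (the
names are hypothetical stand-ins for the engine internals):

    # Use the new definition when the resource was never created and the
    # traversal that dispatched this check is still current; otherwise
    # trust the definition referenced by the template ID stored in the DB.
    def pick_definition(rsrc_action, my_traversal, stack_traversal,
                        new_defn, stored_defn):
        if rsrc_action == 'INIT' and my_traversal == stack_traversal:
            return new_defn
        return stored_defn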
Change-Id: Ifa0ce9e1e08f86b30df00d92488301ea05b45b14
Closes-Bug: #1663745
This reverts commit f94d76cb32.
All stack deployments in TripleO are taking about 10 minutes longer (so
20 minutes for the ovb-updates job). We found out that this is partially
caused by this patch. See the details in the bug report:
https://bugs.launchpad.net/tripleo/+bug/1684272
Change-Id: I80ced2dbc9127bc1501f8729eb52195c0d206813
Partial-Bug: #1684272
Before validating resources, generate placeholder NodeData for them in the
StackDefinition. Currently, all attributes are given a value of None
(consistent with how the get_attr intrinsic functions substitute them when
the resource state indicates that the attributes are not valid now).
Reference IDs are generated as usual (since the logic for what to return
when the state indicates the data are not valid is contained in the
resource plugins).
In the long term, this should help us to remove the state-checking logic
from the intrinsic functions.
In the short term, this allows us to validate template data that is parsed
against the StackDefinition (rather than the Stack), since it gives us a
list of valid attribute names.
In the medium term, this should give us a place to substitute in more
sophisticated placeholder values to do better validation.
Change-Id: I154ff0cd019d279379886fccbd708cf0d39ce53f
Instead of calling check_resource on all leaves in the resource graph at
once, sleep a little bit between each call. As it's a tad slower,
delegate it to a thread so that the stack_create RPC message doesn't
time out when you have lots of resources.
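A minimal sketch of the approach (Heat dispatches this via its own
thread-group machinery rather than a bare thread; check_resource here
stands in for the RPC call):

    import threading
    import time

    def trigger_leaves(leaf_ids, check_resource, delay=0.1):
        def _worker():
            for leaf in leaf_ids:
                check_resource(leaf)
                time.sleep(delay)  # throttle between calls
        # Run in the background so stack_create can return promptly.
        threading.Thread(target=_worker, daemon=True).start()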
Change-Id: I84d2b34d65b3ce7d8d858de106dac531aff509b7
Partial-Bug: #1566845
This will allow the snapshotting of attribute/refid values to occur on both
the legacy and convergence paths (currently it is used only for
convergence).
Change-Id: I9a8fce9c6d22d84ec967087b62bff77f5a6de3db
Partially-Implements: blueprint stack-definition
Formalise the format for the output data from a node in the convergence
graph (i.e. resource reference ID, attributes, &c.) by creating an object
with an API rather than ad-hoc dicts.
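The rough shape of such an object (a sketch only, not the exact Heat
API) replaces the ad-hoc dicts with one serializable class:

    class NodeData(object):
        """Output data for one node in the convergence graph."""

        def __init__(self, primary_key, name, reference_id, attributes):
            self.primary_key = primary_key
            self.name = name
            self._reference_id = reference_id
            self._attributes = attributes

        def reference_id(self):
            return self._reference_id

        def attributes(self):
            return self._attributes

        def as_dict(self):
            # Stable wire format for RPC and the DB.
            return {'id': self.primary_key, 'name': self.name,
                    'reference_id': self._reference_id,
                    'attrs': self._attributes}

        @classmethod
        def from_dict(cls, d):
            return cls(d['id'], d['name'], d['reference_id'], d['attrs'])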
Change-Id: I7a705b41046bfbf81777e233e56aba24f3166510
Partially-Implements: blueprint stack-definition
It turns out that the interleaving of locks when an update-replace of a
resource is needed is the reason for the new traversal not being
triggered.
Consider the order of events below:
1. A server is being updated. The worker locks the server resource.
2. A rollback is triggered because someone cancelled the stack.
3. As part of rollback, new update using old template is started.
4. The new update tries to take the lock, but it has already been
acquired in (1). The new update now expects that when the old
resource is done, it will re-trigger the new traversal.
5. The old update decides to create a new resource for replacement. The
replacement resource is initiated for creation, and a check_resource
RPC call is made for the new resource.
6. A worker, possibly in another engine, receives the call and then
bails out when it finds that a new traversal has been initiated (in
2). Now there is no progress from here, because it is expected (from
4) that there will be a re-trigger when the old resource is done.
This change takes care of re-triggering the new traversal from the
worker when it finds both a new traversal and an update-replace. Note
that this issue is not seen when there is no update-replace, because the
old resource will finish (either fail or complete) and, in the same
thread, the new traversal will be found and triggered.
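Sketched roughly (the names are stand-ins, not the actual worker code):

    # Worker-side handling of a stale traversal: normally just bail out,
    # but for an update-replace nobody else will wake the new traversal,
    # so re-trigger it before returning.
    def check_resource(my_traversal, stack, is_update_replace,
                       retrigger_new_traversal, do_check):
        if my_traversal != stack['current_traversal']:
            if is_update_replace:
                retrigger_new_traversal(stack)
            return
        do_check()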
Closes-Bug: #1625073
Change-Id: Icea5ba498ef8ca45cd85a9721937da2f4ac304e0
Implements a mechanism to cancel existing workers (in-progress resources).
The stack-cancel-update request lands in one of the engines, and if
there are any workers in that engine which are working for the stack,
they are cancelled first, and then the other engines are requested to
cancel their workers.
Change-Id: I464c4fdb760247d436473af49448f7797dc0130d
This allows a convergence operation to be cancelled at an appropriate point
(i.e. between steps in a task) by sending a message to a queue.
Note that there's no code yet to actually cancel any operations
(specifically, sending a cancel message to the stack will _not_ cause the
check_resource operations to be cancelled under convergence).
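A minimal sketch of the idea, using a plain Python queue (Heat's real
task runner and message plumbing differ; 'cancel' is an illustrative
message):

    import queue

    def run_cancellable(steps, msg_queue):
        # Check for a cancel message between steps of the task, i.e. at
        # points where it is safe to stop.
        for step in steps:
            try:
                if msg_queue.get_nowait() == 'cancel':
                    return False
            except queue.Empty:
                pass
            step()
        return True

    # q = queue.Queue(); q.put('cancel')
    # run_cancellable([lambda: None], q)  -> False, the step never runs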
Change-Id: I9469c31de5e40334083ef1dd20243f2f6779549e
Related-Bug: #1545063
Co-Authored-By: Anant Patil <anant.patil@hpe.com>
The input to the check-stack-complete call should be the resource ID of
the resource that the current resource replaces, instead of its own.
Failing to do so will leave the stack in the IN_PROGRESS state forever.
Change-Id: I6f2856c82c8cc73f628976b7296ab0fb20af5ff3
Closes-Bug: #1614960
A deeply misguided effort to move all exceptions out of the
heat.engine.resource module, where they belong, and into the
heat.common.exception module, where they largely do not, broke the API for
third-party resource plugins. Unfortunately this happened a couple of
releases back already, so we can't simply put UpdateReplace back where it
belongs as that would also break the de-facto third-party API.
This change adds an alias in the correct location and a comment indicating
that it will move back at some time in the future. It also switches all of
the in-tree uses back to heat.engine.resource.UpdateReplace, in the hope
that third-party developers will be more inclined to copy from that.
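The alias might look something like this sketch (not the verbatim
patch):

    # In heat/engine/resource.py:
    from heat.common import exception

    # Alias in the historically correct location; the class should move
    # back here in the future without breaking the de-facto third-party
    # plugin API.
    UpdateReplace = exception.UpdateReplace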
This reverts commit 4e2cfb991a.
Change-Id: Iedd5d07d6c0c07e39e51a0fb810665b3e9c61f87
Closes-Bug: #1611104
In convergence, where concurrent updates are possible, if a resource
is deleted (by a previous traversal) after the dependency graph is
created for a new traversal, the resource remains in the graph but is
no longer available in the DB for processing.
It is a prerequisite to have resources in the DB before any action can
be taken on them.
Hence, during the convergence resource delete action, the resource
entry is not removed from the DB but soft deleted, so that the
latest/new update can find the entry.
All of these soft deleted resources will be deleted when the stack has
completed its operation.
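Conceptually (a simplified model, not the actual DB layer):

    # Soft delete: keep the row so concurrent traversals can load it.
    def soft_delete_resource(row):
        row['action'] = 'DELETE'
        row['status'] = 'COMPLETE'

    # Once the stack operation completes, purge the stale rows for real.
    def purge_deleted(rows):
        return [r for r in rows
                if (r['action'], r['status']) != ('DELETE', 'COMPLETE')]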
Closes-Bug: #1528560
Change-Id: I0b36ce098022560d7fe01623ce7b66d1d5b38d55
Refactor the worker service; move the check resource code to its own
class in another file and keep the convergence worker RPC API clean.
This refactor will help us contain the convergence logic in a separate
class file instead of in the RPC API. The RPC service class should only
have the APIs it implements.
Change-Id: Ie9cf4daba7e6bf61f4cac3388494e8c9efefa4d7
While constructing input-data for building the cache, the resource
attributes must resolve without hitting the cache again. It is
unnecessary to look into the cache when resolving the attributes of a
freshly baked resource.
Change-Id: I0893c17d87c687ca5cf370c4443f471160bd2f3c
When an engine worker crashes or is restarted, the resources being
provisioned in it remain in the IN_PROGRESS state. The next stack update
should pick up these resources and work on them. The implementation sets
the status of the resource to FAILED and re-triggers check_resource.
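A simplified model of the recovery step (the helper names are
hypothetical):

    def reset_abandoned_resources(rows, live_engine_ids, resend_check):
        for row in rows:
            abandoned = (row['status'] == 'IN_PROGRESS'
                         and row['engine_id'] not in live_engine_ids)
            if abandoned:
                # The engine that held it is gone; mark it FAILED so the
                # current update can process the resource again.
                row['status'] = 'FAILED'
                resend_check(row['id'])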
Change-Id: Ib7fd73eadd0127f8fae47881b59388b31131daf4
Closes-Bug: #1501161
Presently, when a resource of a previous traversal completes its action
successfully, we re-trigger that resource for the latest traversal
(since the latest traversal will be waiting for its completion).
However, if a resource of a previous traversal fails, we do not
re-trigger it, which leaves the latest traversal waiting endlessly.
This patch re-triggers the resource for the latest traversal even when
the resource fails.
Change-Id: I9f70878ad7f1ff7c2facb950e496681425b54fc4
Partial-Bug: #1512343
To avoid certain concurrency-related issues, the DB update API needs to
be given the traversal ID of the stack intended to be updated. By making
this change, we can avoid having the following check all over the place:

    if current_traversal != stack.current_traversal:
        return

The check for the current traversal should be implicit, as a part of the
stack's store and state_set methods, where self.current_traversal is
used as the expected traversal to be updated. All state changes or DB
updates to the stack object go through this implicit check (using
update ... where; see the sketch after the examples below).
When stack updates are triggered, the current traversal should be backed
up as the previous traversal, a new traversal should be generated, and
the stack should be stored in the DB with the expected traversal being
the previous one. This ensures that no two updates can simultaneously
succeed on the same stack with the same traversal ID, which was one of
our primary goals.
The following example cases describe the issues we encounter:
1. When two updates, U1 and U2, try to update a stack concurrently:
   1. The current traversal (CT) is X.
   2. U1 loads the stack with CT=X.
   3. U2 loads the stack with CT=X.
   4. U2 stores the stack and updates CT=Y.
   5. U1 stores the stack and updates CT=Z.
   Both updates have succeeded, and both would keep running until one
   of the workers checks stack.current_traversal == current_traversal
   and bails out.
   Ideally, U1 should have failed: only one update should be allowed in
   the case of concurrent updates. When both U1 and U2 pass X as the
   expected traversal ID of the stack, this problem is solved.
2. A resource R is being provisioned for a stack with current traversal
   CT=X:
   1. A new update U is issued; it loads the stack with CT=X.
   2. Resource R fails and loads the stack with CT=X to mark it as
      FAILED.
   3. Update U updates the stack with CT=Y, goes ahead with sync_point
      etc., and marks the stack as UPDATE_IN_PROGRESS.
   4. Resource R marks the stack as UPDATE_FAILED, which to the user
      means that update U has failed, although it is actually still
      going on.
   With this patch, when resource R fails, it will supply CT=X as the
   expected traversal to be updated and will eventually fail, because
   update U with CT=Y has taken over.
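The compare-and-swap itself is sketched below (the real code uses an
UPDATE ... WHERE on the stack table; this model just shows the effect):

    def store_stack(db, stack_id, values, exp_trvsl):
        row = db[stack_id]
        if row['current_traversal'] != exp_trvsl:
            return False  # another update took over; the caller bails out
        row.update(values)
        return True

    # Case 1 above: U1 and U2 both loaded CT=X. Whichever stores second
    # fails the check:
    db = {'s1': {'current_traversal': 'X'}}
    store_stack(db, 's1', {'current_traversal': 'Y'}, 'X')  # True  (U2)
    store_stack(db, 's1', {'current_traversal': 'Z'}, 'X')  # False (U1)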
Partial-Bug: #1512343
Change-Id: I6ca11bed1f353786bb05fec62c89708d98159050
When loading a resource, load the stack with the template of the
resource. The appropriate stack needs to be assigned to the resource
(resource.stack), else resource actions will fail.
Co-Authored-By: Anant Patil <anant.patil@hp.com>
Partial-Bug: #1512343
Change-Id: Ic4526152c8fd027049514b71554036321a61efd2
Due to recent code changes, the stack is marked complete/failed when
releasing the stack lock. But in the case of convergence, the stack lock
is not used.
Hence, in the case of convergence, we still need to persist the stack
status when state_set is called.
Change-Id: Ibe8c59a69e8dc8be3f0d15af238e166c852ab675
Closes-Bug: #1503296
Some tests may fail because of the delay between the actual function
call and the assert_called_with check of the same function: we have
arguments that depend on time. So mock the stack's remaining_time().
Change-Id: I2968437f7875ce6e8d437ec28a119f89cba720e8
Closes-Bug: #1493906
Fix failing convergence gate functional tests
- store resource uuid, action, status in cache data. Most of the code
requires the resource to have proper status and uuid to work.
- initialize rsrc._data to None so that the resource data is fetched
from the db the first time.
Change-Id: I7309c7da8fe1ce3e1c7e3d3027dea2e400111015
Co-Authored-By: Anant Patil <anant.patil@hp.com>
Partial-Bug: #1492116
Closes-Bug: #1495094
It is convenient to have all exceptions in the exception module.
It also reduces the namespace clutter of the resource module and
decreases the number of dependencies in other modules (we no longer need
to import resource in some cases).
The UpdateInProgress exception is moved in this patch.
Change-Id: If694c264639bbce5334e1e6e7403b225ce1d3aee
It is convenient to have all exceptions in the exception module.
It also reduces the namespace clutter of the resource module and
decreases the number of dependencies in other modules (we no longer need
to import resource in some cases).
The UpdateReplace exception is moved in this patch.
Change-Id: Ief441ca2022a0d50e88d709d1a062631479715b7
Store the attr name and path so attributes don't get shadowed, e.g.:

    get_attr: [res1, attr_x, show]
    get_attr: [res1, attr_x, something]
Change-Id: I724e91b32776aa5813d2b821c2062424e0635a69
Patch Id1806d546c67505137f57f72d5b463dc229a666d added timeout logic for
convergence. The new method time_remaining was used in the
check_resource_cleanup and check_resource_update calls in worker.py.
So now we should also use the same function in our unit tests.
Change-Id: Iffaf88362d3fa3057ffa642a72dc1902ecbdf096
1. We are caching the result of FnGetRefId, which can be the name.
2. cache_data_resource_attribute() was trying to access "attributes"
instead of "attrs".
Change-Id: I59d55dcee2af521924fdb5da14e012dcc7b4dd3f
The resource provisioning work is distributed among heat engines, so
the timeout also has to be distributed and brought down to
resource-level granularity.
Thus,
1. Before invoking check_resource on a resource, ensure that the stack
has not timed out.
2. Pass the remaining amount of time to the resource converge method so
that it can raise timeout exception if it cannot finish in the remaining
time.
Once a timeout exception is raised by a resource converge method, the
corresponding stack is marked as FAILED with "Timed out" as the failure
reason. Then, if rollback is enabled on the stack, it is triggered.
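An illustrative sketch only (the deadline handling stands in for Heat's
stack timeout bookkeeping):

    import time

    def check_resource_within_deadline(stack_deadline, converge):
        remaining = stack_deadline - time.time()
        if remaining <= 0:
            # Stack already timed out: don't even invoke the resource.
            raise TimeoutError('stack timed out')
        # Hand the remaining budget to the resource's converge method,
        # which may itself raise if it cannot finish in time.
        converge(timeout=remaining)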
Change-Id: Id1806d546c67505137f57f72d5b463dc229a666d
* test_case: mock.ANY is not hashable, so replace with a random graph key
* test_case: convert all maps to lists before asserting against raw data
partial blueprint heat-python34-support
Change-Id: I103a6bc297d8f533ca4885af4c6593b6b8176b80
All resources that are new will have an INIT state. Instead
of having a complex strategy to decide whether the resource
should be created or updated, just check the action to see
whether it is in the INIT state. If it is not, then always
trigger the update workflow.
Also, this fixes a bug where we triggered a create for a
resource without a resource ID that should originally have been
updated because it was in UPDATE_FAILED, which was the unhandled
case.
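The simplified decision boils down to something like this sketch
(create/update stand in for the real workflows):

    def converge_action(rsrc_action, create, update):
        # All brand-new resources are in INIT; everything else, including
        # the previously unhandled UPDATE_FAILED case, takes the update
        # workflow even if the resource has no physical ID yet.
        if rsrc_action == 'INIT':
            create()
        else:
            update()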
Change-Id: I3f2318fecfe76592e8b54e9c09fdf1614197e83f
In the worker we mostly have arguments called "data"; it is not clear
whether these are serialized or not (and whether they have adopt data in
them).
1. split adopt data out (add RPC support for the new argument)
2. name arguments "resource_data" for deserialized data
3. name arguments "rpc_data" for serialized data
4. make sure all data into client.check_resource() is serialized
Change-Id: Ie6bd0e45d2857d3a23235776c2b96cce02cb711a
When resources are replaced, the needed_by needs to be updated. When
the resources are visited in the clean-up phase - after all the updates
are done - the new needed_by data is sent to the resources.
Change-Id: Ib3bff461ab4cdd43391c7fcdfff6d8eb17fe2555
Co-Authored-By: Rakesh HS <rh-s@hp.com>
All attributes are retrieved (including ones referenced in the output).
In our functional test there is a bug where a nested stack did not
have an output that was referenced from the outer stack's output. This
didn't affect the stack creation as:
- outputs are not normally required in the stack creation
- if the output is used by a TemplateResource it *may* be used
The way outputs are currently used, they do not fail on the first
output that doesn't work, as it is useful to get the ones that do work.
To support this, errors are placed alongside the value (in a key
"error_msg").
This is somewhat problematic in convergence as we now *require* all
attributes (including those based off of outputs).
Note: if you have an outer resource referencing a non-existent template
resource output, it *will* fail; but when the outer stack's output
references the inner stack's output, it is not validated.
Change-Id: Id07c617a19eae56543f92ee21aea58cd38fa3606
Ensure the stack operation is re-triggered when SyncPointNotFound is
encountered and the stack has been updated (has a new traversal ID).
Change-Id: Ia1b670aa5766c57dafdcc84d642c42007371a087
Fix the incorrect comparison of the current traversal. Ensure
check_resource is re-triggered when the sync point is not found and the
stack has been updated (has a new traversal ID).
Change-Id: I9d7fa539b565f852b9431bb206823cfb16178607
Refactored the worker to remove duplicate code. check_resource is
broken down into many smaller methods for readability and unit testing.
Change-Id: Id32b9aa11ecb637bf737dc1d86261a9b78739535
This change is required in order to mock the worker rpc
behaviour for the convergence scenario tests.
Change-Id: I2fd11acf00bbc331a80e12c5e1290d6587c5b27a
Currently the stack rollback logic is in the worker class, which causes
the stack rollback code to be duplicated in other places, like the
simulator tests.
Change-Id: I2c907ca1c76b39addc22a5d8c36fb50c47af4785
This is needed to fix the following functional test:
test_template_resource.TemplateResourceErrorMessageTest.test_fail
Change-Id: I90d6e6f688c214a73c043aa88d0a25fc0b2ebc83
When an error occurs while provisioning a resource and rollback is
enabled for the stack, rollback is triggered.
The stack table has a reference to the previous completed template,
which is used to issue an update on the stack in order to roll back. If
a previous template doesn't exist, then an empty template is supplied
(this is the case for create rollback).
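Roughly (a conceptual sketch; prev_raw_template is illustrative of the
stack table's reference to the previous template):

    def trigger_rollback(stack, empty_template, do_update):
        prev = stack.get('prev_raw_template')
        # Create-rollback has no previous template; updating to an empty
        # one removes whatever the failed create provisioned.
        do_update(stack, prev if prev is not None else empty_template)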
Change-Id: I3c9ef8665cc3b2006b8d0c330b00fa5e226201b9
Implements: convergence-rollback
When a resource fails:
1. Mark the stack as FAILED.
2. If rollback is enabled, invoke rollback for the stack. Otherwise,
clean up stale entries from the database.
After the stack is completed (or failed), the database needs to be
cleaned up.
Change-Id: Ic5713e8c0c78c38ee1aaee8865346cb55228043f
This is a merge of 4 reviews:
I52f1611d34def3474acba0e5eee054e11c5fc5ad
Ic374a38c9d76763be341d3a80f53fa396c9c2256
Iecd21ccb4392369f66fa1b3a0cf55aad754aeac4
I77b81097d2dcf01efa540237ed5ae14896ed1670
- make sure sender is a tuple (otherwise the serialization
function in sync_point breaks).
- Update updated_time on any lifecycle operation (CREATE/UPDATE/DELETE)
over a stack.
- Adjust sync_point logic to account for deletes. This is done by having
only a single stack sync point for both updates and deletes.
- Serialize/deserialize input_data for RPC
- Make GraphKey's the norm in convergence worker
- Move temp_update_requires functionality to tests.
During the initial stages of convergence, some of the worker code was
written in stack.py to simulate the entire cycle. Now that the
convergence worker is implemented, this code needs to be executed only
in tests.
- Fix the dictionary structure that's passed to resource.(create/update).
- Temporarily disable loading cache_data for stack to help fix other
issues.
Change-Id: Iecd21ccb4392369f66fa1b3a0cf55aad754aeac4
Co-Authored-by: Sirushti Murugesan <sirushti.murugesan@hp.com>
Co-Authored-by: Rakesh H S <rh-s@hp.com>