af3918fb69
The agent driver uses RPCs to call back from the driver to the conductor asynchronously. With the RPC call() method, some nodes ended up with stuck locks when using the agent driver during cleaning. The agent driver would issue a call() to continue_node_cleaning() after either the first heartbeat (from prepare_cleaning) or a heartbeat after a clean step had completed. The conductor would attempt to acquire a lock on the node and fail. The node stayed locked (so far as I could tell), even after the error, while other nodes continued and completed cleaning just fine. The exception raised by continue_node_cleaning() was likely not caught by the agent driver, but was caught by vendor_passthru() in the conductor as an expected exception. Switching to cast() avoids the issue because errors are not sent back to the caller. I didn't experience any more stuck locks with this change. Change-Id: I4dbb04ccb93199bba4e1a1614bc19b70a068a9ea Closes-Bug: 1442810 |
||
---|---|---|
.. | ||
__init__.py | ||
test_conductor_utils.py | ||
test_manager.py | ||
test_rpcapi.py | ||
test_task_manager.py | ||
test_utils.py | ||
utils.py |
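
The difference between call() and cast() described in the commit message can be sketched with a toy model. This is not ironic's code and does not reproduce the stuck lock itself; `ToyConductor`, `ToyRPCClient`, and `NodeLocked` are hypothetical names, and the cast here runs synchronously for determinism, whereas a real cast is asynchronous. It only shows that call() hands the conductor's exception back to the caller, while cast() returns nothing and leaves the error on the conductor side.

```python
class NodeLocked(Exception):
    """Hypothetical stand-in for the conductor failing to lock a node."""


class ToyConductor:
    """Toy conductor: tracks locked nodes and errors it handled itself."""

    def __init__(self):
        self.locked_nodes = set()
        self.errors = []

    def continue_node_cleaning(self, node_id):
        # Fail if the node is already locked, as in the scenario above.
        if node_id in self.locked_nodes:
            raise NodeLocked(node_id)
        return "cleaning continued for %s" % node_id


class ToyRPCClient:
    """Toy RPC client modelling only the call()/cast() semantics."""

    def __init__(self, conductor):
        self._conductor = conductor

    def call(self, method, **kwargs):
        # Blocking request: the result, or the exception, reaches the caller.
        return getattr(self._conductor, method)(**kwargs)

    def cast(self, method, **kwargs):
        # Fire-and-forget: no result and no exception reach the caller.
        try:
            getattr(self._conductor, method)(**kwargs)
        except Exception as exc:
            # The error stays on the conductor side.
            self._conductor.errors.append(exc)
        return None


if __name__ == "__main__":
    conductor = ToyConductor()
    conductor.locked_nodes.add("node-1")
    client = ToyRPCClient(conductor)

    try:
        client.call("continue_node_cleaning", node_id="node-1")
    except NodeLocked as exc:
        print("call() surfaced the error to the caller:", repr(exc))

    client.cast("continue_node_cleaning", node_id="node-1")
    print("cast() returned; conductor recorded:", conductor.errors)
```

With cast(), the driver never sees the failure, so any error handling has to happen entirely in the conductor.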