From 01fb0efda0efa1379aa9b053688bb82441ee72cf Mon Sep 17 00:00:00 2001
From: Michele Baldessari
Date: Tue, 22 Dec 2020 16:26:08 +0100
Subject: [PATCH] Make workflow monitoring more resilient

Let's make the monitoring of the mistral executions a bit more robust.
If for some reason a tcp reset occurs while monitoring the execution
state of a workflow, tripleoclient completely gives up:

2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR
tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while
running the command: keystoneauth1.exceptions.connection.ConnectFailure:
Unable to establish connection to
https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25:
('Connection aborted.', RemoteDisconnected('Remote end closed connection
without response',))

This is bad UX because the execution is actually still running just fine
via mistral in the background; it's just the monitoring that did not
survive a hiccup.

With this patch we were able to inject artificial problems on mistral
with the reproducer below [1] and the client survived them just fine:

TASK [Gathering Facts] *********************************************************
Tuesday 22 December 2020 13:27:38 +0000 (0:00:03.337) 0:00:03.438 ******
ok: [controller-1]
2020-12-22 13:27:41.242 3728 WARNING tripleoclient.workflows.base [-]
Connection failure while fetching execution ID. Retrying: Unable to
establish connection to
https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209:
HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded
with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by
NewConnectionError(': Failed to establish a new connection: [Errno 111]
Connection refused',)): keystoneauth1.exceptions.connection.ConnectFailure:
Unable to establish connection to
https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209:
HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded
with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by
NewConnectionError(': Failed to establish a new connection: [Errno 111]
Connection refused',))
ok: [controller-0]

PLAY [Load global variables] ***************************************************

The mistral execution continues correctly and tripleoclient deals with
the hiccup without erroring out.

[1] A quick reproducer for this issue is to run a longer workflow (minor
update or ffu of a node) and, once tripleoclient is just monitoring the
mistral execution, run:

iptables -I INPUT -p tcp --dport 13989 -j REJECT
sleep 13
iptables -D INPUT 1

Co-Authored-By: Damien Ciabrini
Change-Id: Ie08f3bc7c7cd7796f067a9ee0a99c017a5567ea2
Closes-Bug: #1909019
---
 tripleoclient/workflows/base.py | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tripleoclient/workflows/base.py b/tripleoclient/workflows/base.py
index e85f369a6..91cf9d9e6 100644
--- a/tripleoclient/workflows/base.py
+++ b/tripleoclient/workflows/base.py
@@ -10,6 +10,7 @@
 #    License for the specific language governing permissions and limitations
 #    under the License.
 import json
+import keystoneauth1
 import logging
 
 from tripleoclient import exceptions
@@ -78,7 +79,13 @@ def wait_for_messages(mistral, websocket, execution, timeout=None):
             # Workflows should end with SUCCESS or ERROR statuses.
             if payload.get('status', 'RUNNING') != "RUNNING":
                 return
-            execution = mistral.executions.get(execution.id)
+            try:
+                execution = mistral.executions.get(execution.id)
+            except keystoneauth1.exceptions.connection.ConnectFailure as e:
+                LOG.warning("Connection failure while fetching execution ID."
+                            "Retrying: %s" % e)
+                continue
+
             if execution.state != "RUNNING":
                 # yield the output as the last payload which was missed
                 yield json.loads(execution.output)
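
Note (not part of the patch itself): the change boils down to catching
keystoneauth1's ConnectFailure around the executions.get() poll and
continuing the loop instead of letting the exception propagate to the
caller. The sketch below is a self-contained illustration of that retry
pattern under the same assumptions as the patch (a mistralclient-style
object exposing executions.get()); the poll_execution() helper, its
arguments and the sleep interval are hypothetical and do not exist in
tripleoclient.

# Illustrative sketch only; poll_execution() and its parameters are
# made up for the example.
import logging
import time

from keystoneauth1 import exceptions as ks_exceptions

LOG = logging.getLogger(__name__)


def poll_execution(mistral, execution_id, interval=5):
    """Poll a Mistral execution until it leaves the RUNNING state.

    Transient API outages (keystoneauth ConnectFailure) are logged and
    retried instead of aborting the loop, mirroring the behaviour the
    patch adds to wait_for_messages().
    """
    while True:
        try:
            execution = mistral.executions.get(execution_id)
        except ks_exceptions.ConnectFailure as e:
            # The Mistral API was briefly unreachable; the workflow is
            # still running server-side, so keep polling.
            LOG.warning("Connection failure while fetching execution. "
                        "Retrying: %s", e)
            time.sleep(interval)
            continue
        if execution.state != "RUNNING":
            return execution
        time.sleep(interval)

As in the patch, the retry is unbounded: the assumption is that the
workflow keeps running server-side and the operator can interrupt the
client if the API never comes back.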