Make workflow monitoring more resilient

Let's make the monitoring of Mistral executions a bit more robust.
If a TCP reset occurs while monitoring the execution state of a
workflow, tripleoclient currently gives up completely:

  2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command:
  keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25:
  ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

This is bad UX because the execution is actually still running just
fine via Mistral in the background; it is only the monitoring that did
not survive the hiccup.

With this patch we were able to inject artificial failures into Mistral
with the reproducer below [1], and the client survived them just fine:

TASK [Gathering Facts] *********************************************************
Tuesday 22 December 2020  13:27:38 +0000 (0:00:03.337)       0:00:03.438 ******
ok: [controller-1]

2020-12-22 13:27:41.242 3728 WARNING tripleoclient.workflows.base [-] Connection failure while fetching execution ID. Retrying: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',)): keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',))^[[00m
ok: [controller-0]

PLAY [Load global variables] ***************************************************

The mistral execution continues correctly and tripleoclient deals with the hiccup without erroring out.
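
The essence of the fix is a catch-and-continue retry around the polling
call. A minimal standalone sketch of that pattern (hypothetical names
throughout; a local ConnectFailure class stands in for
keystoneauth1.exceptions.connection.ConnectFailure, which is not
imported here):

```python
class ConnectFailure(Exception):
    """Stand-in for keystoneauth1.exceptions.connection.ConnectFailure."""


def poll_execution(fetch, is_done, max_polls=10):
    """Poll fetch() until is_done(state) is true.

    Transient connection failures are logged and retried instead of
    aborting the whole monitoring loop (hypothetical helper, not the
    actual tripleoclient code).
    """
    for _ in range(max_polls):
        try:
            state = fetch()
        except ConnectFailure as e:
            # A TCP reset or refused connection is treated as a hiccup:
            # warn and keep polling rather than erroring out.
            print("Connection failure while fetching execution ID. "
                  "Retrying: %s" % e)
            continue
        if is_done(state):
            return state
    raise TimeoutError("execution did not finish in time")
```

With a fetch function that fails twice and then reports SUCCESS, the
loop rides out both failures and returns the final state.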

[1] A quick reproducer for this issue is to run a longer workflow (minor
update or FFU of a node) and, once tripleoclient is just monitoring the
mistral execution, run:

  iptables -I INPUT -p tcp --dport 13989 -j REJECT
  sleep 13
  iptables -D INPUT 1

Co-Authored-By: Damien Ciabrini <dciabrin@redhat.com>

Change-Id: Ie08f3bc7c7cd7796f067a9ee0a99c017a5567ea2
Closes-Bug: #1909019
This commit is contained in:
Michele Baldessari 2020-12-22 16:26:08 +01:00
parent 44b94ee9bf
commit 01fb0efda0


@@ -10,6 +10,7 @@
 # License for the specific language governing permissions and limitations
 # under the License.
 import json
+import keystoneauth1
 import logging
 from tripleoclient import exceptions
@@ -78,7 +79,13 @@ def wait_for_messages(mistral, websocket, execution, timeout=None):
         # Workflows should end with SUCCESS or ERROR statuses.
         if payload.get('status', 'RUNNING') != "RUNNING":
             return
-        execution = mistral.executions.get(execution.id)
+        try:
+            execution = mistral.executions.get(execution.id)
+        except keystoneauth1.exceptions.connection.ConnectFailure as e:
+            LOG.warning("Connection failure while fetching execution ID."
+                        " Retrying: %s" % e)
+            continue
         if execution.state != "RUNNING":
             # yield the output as the last payload which was missed
             yield json.loads(execution.output)