Make workflow monitoring more resilient
Let's make the monitoring of Mistral executions a bit more robust. If a
TCP reset occurs while monitoring the execution state of a workflow,
tripleoclient gives up completely:

    2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

This is bad UX: the execution keeps running just fine via Mistral in the
background; it is only the monitoring that did not survive the hiccup.

With this patch we were able to inject artificial problems into Mistral
with the reproducer below [1], and the client survived them just fine:

    TASK [Gathering Facts] *********************************************************
    Tuesday 22 December 2020 13:27:38 +0000 (0:00:03.337) 0:00:03.438 ******
    ok: [controller-1]
    2020-12-22 13:27:41.242 3728 WARNING tripleoclient.workflows.base [-] Connection failure while fetching execution ID. Retrying: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',)): keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',))
    ok: [controller-0]

    PLAY [Load global variables] ***************************************************

The Mistral execution continues correctly and tripleoclient handles the
hiccup without erroring out.

[1] A quick reproducer for this issue is to run a longer workflow (minor
update or FFU of a node) and, once tripleoclient is just monitoring the
Mistral execution, run:

    iptables -I INPUT -p tcp --dport 13989 -j REJECT
    sleep 13
    iptables -D INPUT 1

Co-Authored-By: Damien Ciabrini <dciabrin@redhat.com>
Change-Id: Ie08f3bc7c7cd7796f067a9ee0a99c017a5567ea2
Closes-Bug: #1909019
commit 01fb0efda0 (parent 44b94ee9bf)
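To see why a single connection failure used to abort the whole command: command-level code consumes wait_for_messages() as a generator, so any exception raised while polling Mistral propagates straight out of the loop. Below is a rough, hypothetical illustration of that call pattern; only wait_for_messages() and its signature come from the diff further down, while the run_and_monitor wrapper and the payload handling are assumptions made for the example.

# Hypothetical caller sketch, not actual tripleoclient command code.
from tripleoclient.workflows import base


def run_and_monitor(mistral, websocket, execution):
    # Before this patch, a keystoneauth1 ConnectFailure raised while
    # wait_for_messages() re-fetched the execution escaped this loop and
    # aborted the whole client command, even though the Mistral execution
    # kept running server-side.
    for payload in base.wait_for_messages(mistral, websocket, execution):
        if 'message' in payload:
            print(payload['message'])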
@@ -10,6 +10,7 @@
 # License for the specific language governing permissions and limitations
 # under the License.
 import json
+import keystoneauth1
 import logging

 from tripleoclient import exceptions
@@ -78,7 +79,13 @@ def wait_for_messages(mistral, websocket, execution, timeout=None):
         # Workflows should end with SUCCESS or ERROR statuses.
         if payload.get('status', 'RUNNING') != "RUNNING":
             return
-        execution = mistral.executions.get(execution.id)
+        try:
+            execution = mistral.executions.get(execution.id)
+        except keystoneauth1.exceptions.connection.ConnectFailure as e:
+            LOG.warning("Connection failure while fetching execution ID. "
+                        "Retrying: %s" % e)
+            continue

         if execution.state != "RUNNING":
             # yield the output as the last payload which was missed
             yield json.loads(execution.output)