Fix switching the connection destination when a RabbitMQ cluster node disappears
In a clustered RabbitMQ deployment, when a node disappears we get a
ConnectionRefusedError because the socket gets disconnected.
Accessing the socket also yields an OSError, because the heartbeat
tries to reach an unreachable host (No route to host).
Catch these exceptions to ensure that we call ensure_connection to switch
the connection destination.
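The fix follows a common pattern. A refused connection or an unreachable host is treated as a recoverable error, and the driver reconnects, which lets it move on to another cluster node. The following is a minimal sketch of that pattern, not the oslo.messaging or kombu implementation. It makes these assumptions:

- `ClusterConnection` is a hypothetical class.
- Its `ensure_connection()` does nothing more than rotate to the next host in a list.
- The host names `rabbit-1` to `rabbit-3` are made up for the example.

```python
import errno
import itertools


class ClusterConnection:
    """Hypothetical stand-in for a connection to a RabbitMQ cluster.

    ensure_connection() only rotates to the next host; retries, backoff
    and channel re-creation done by oslo.messaging/kombu are not modeled.
    """

    def __init__(self, hosts, failures):
        self._hosts = itertools.cycle(hosts)
        self._failures = failures  # host name -> exception its socket raises
        self.current = next(self._hosts)
        self.reconnects = 0

    def drain_events(self):
        # Simulate a vanished node by raising the configured socket error.
        exc = self._failures.get(self.current)
        if exc is not None:
            raise exc

    def ensure_connection(self):
        # Switch the connection destination to the next cluster node.
        self.current = next(self._hosts)
        self.reconnects += 1


def heartbeat_tick(conn):
    """One heartbeat iteration: reconnect on recoverable socket errors."""
    try:
        conn.drain_events()
    # ConnectionRefusedError is already an OSError on Python 3; it is listed
    # separately only to mirror the patch, which also targets Python 2.
    except (ConnectionRefusedError, OSError) as exc:
        print("recoverable error on %s, reconnecting: %s" % (conn.current, exc))
        conn.ensure_connection()


conn = ClusterConnection(
    ["rabbit-1", "rabbit-2", "rabbit-3"],
    {
        "rabbit-1": OSError(errno.ECONNREFUSED, "Connection refused"),
        "rabbit-2": OSError(errno.EHOSTUNREACH, "No route to host"),
    },
)
for _ in range(3):
    heartbeat_tick(conn)
print(conn.current, conn.reconnects)  # rabbit-3 2
```

Any other exception, such as a `ValueError`, is not caught by `heartbeat_tick` and propagates to the caller.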
A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc
Example:
$ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
$ cd rabbitmq-oslo_messging-error-poc
$ python -m virtualenv .
$ source bin/activate
$ pip install -r requirements.txt
$ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
$ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')
And in parallel, in another shell or tmux pane:
$ sudo podman stop rabbit
$ # observe the output of poc.py: ensure_connection is now called
You can now observe output showing that the connection destination is
switched, a failure that went uncaught before these changes.
Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399
Closes-Bug: #1828841
Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
(cherry picked from commit 9d8b1430e5)
parent ab03435ad0
commit 26fccea843
@@ -974,6 +974,14 @@ class Connection(object):
    def _heartbeat_thread_job(self):
        """Thread that maintains inactive connections
        """
        # NOTE(hberaud): Python2 doesn't have ConnectionRefusedError
        # defined so to switch connections destination on failure
        # with python2 and python3 we need to wrapp adapt connection refused
        try:
            ConnectRefuseError = ConnectionRefusedError
        except NameError:
            ConnectRefuseError = socket.error

        while not self._heartbeat_exit_event.is_set():
            with self._connection_lock.for_heartbeat():
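The `try`/`except NameError` block in the first hunk picks the builtin `ConnectionRefusedError` on Python 3 and falls back to `socket.error` on Python 2, where the builtin does not exist. The snippet below is a small Python 3 check of the facts that this fallback relies on. It reuses the patch's alias name `ConnectRefuseError` but is not part of the driver.

```python
import errno
import socket

# Same fallback as in the patch: the builtin only exists on Python 3.
try:
    ConnectRefuseError = ConnectionRefusedError
except NameError:  # Python 2 path; never taken on Python 3
    ConnectRefuseError = socket.error

# On Python 3 the alias is the builtin, which sits under OSError.
print(ConnectRefuseError is ConnectionRefusedError)  # True
print(issubclass(ConnectRefuseError, OSError))       # True
print(socket.error is OSError)                       # True

# PEP 3151: an OSError built with ECONNREFUSED is promoted to the subclass.
refused = OSError(errno.ECONNREFUSED, "Connection refused")
print(type(refused).__name__)                        # ConnectionRefusedError
```

On Python 2, `socket.error` derives from `IOError` rather than `OSError`. This is why the patch names it explicitly instead of relying on `OSError` alone.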
@@ -990,7 +998,17 @@ class Connection(object):
                            self.connection.drain_events(timeout=0.001)
                        except socket.timeout:
                            pass
                    # NOTE(hberaud): In a clustered rabbitmq when
                    # a node disappears, we get a ConnectionRefusedError
                    # because the socket get disconnected.
                    # The socket access yields a OSError because the heartbeat
                    # tries to reach an unreachable host (No route to host).
                    # Catch these exceptions to ensure that we call
                    # ensure_connection for switching the
                    # connection destination.
                    except (socket.timeout,
                            ConnectRefuseError,
                            OSError,
                            kombu.exceptions.OperationalError) as exc:
                        LOG.info(_LI("A recoverable connection/channel error "
                                     "occurred, trying to reconnect: %s"), exc)
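The second hunk widens the heartbeat's `except` clause. This illustrates why `OSError` is listed on its own. Under PEP 3151, an `OSError` carrying `ECONNREFUSED` is promoted to `ConnectionRefusedError`. One carrying `EHOSTUNREACH` ("No route to host") stays a plain `OSError`.

The sketch below mirrors the patch's exception tuple using Python 3 names. It makes two assumptions:

- `OperationalError` is a hypothetical local stand-in for `kombu.exceptions.OperationalError`, so the sketch runs without kombu.
- `is_recoverable` is an illustrative helper, not a driver function.

```python
import errno
import socket


class OperationalError(Exception):
    """Hypothetical stand-in for kombu.exceptions.OperationalError,
    so this sketch runs without kombu installed."""


# Mirrors the except tuple from the patch (Python 3 names).
RECOVERABLE = (socket.timeout, ConnectionRefusedError, OSError, OperationalError)


def is_recoverable(exc):
    """Return True if the heartbeat's except clause would catch exc."""
    return isinstance(exc, RECOVERABLE)


no_route = OSError(errno.EHOSTUNREACH, "No route to host")
refused = OSError(errno.ECONNREFUSED, "Connection refused")

# "No route to host" stays a plain OSError: only the OSError entry catches it.
print(type(no_route).__name__, is_recoverable(no_route))   # OSError True
print(type(refused).__name__, is_recoverable(refused))     # ConnectionRefusedError True
print(is_recoverable(OperationalError("broker gone")))     # True
print(is_recoverable(ValueError("not a socket problem")))  # False
```

On Python 3, `socket.timeout` and `ConnectionRefusedError` are both `OSError` subclasses, so the `OSError` entry alone would already catch them there. From Python 3.10 on, `socket.timeout` is an alias of `TimeoutError`.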