From 7ad0d7eaf9cb095a14b07a08c814d9f1f9c8ff12 Mon Sep 17 00:00:00 2001 From: Jens Rosenboom Date: Fri, 27 Jun 2014 16:46:47 +0200 Subject: [PATCH] Fix reconnect race condition with RabbitMQ cluster Retry Queue creation to workaround race condition that may happen when both the client and broker race over exchange creation and deletion respectively which happen only when the Queue/Exchange were created with auto-delete flag. Queues/Exchange declared with auto-delete instruct the Broker to delete the Queue when the last Consumer disconnect from it, and the Exchange when the last Queue is deleted from this Exchange. Now in a RabbitMQ cluster setup, if the cluster node that we are connected to go down, 2 things will happen: 1. From RabbitMQ side, the Queues w/ auto-delete will be deleted from the other cluster nodes and then the Exchanges that the Queues are bind to if they were also created w/ auto-delete. 2. From client side, client will reconnect to another cluster node and call queue.declare() which create Exchanges then Queues then Binding in that order. Now in a happy path the queues/exchanges will be deleted from the broker before client start re-creating them again, but it also possible that the client first start by creating queues/exchange as part of the queue.declare() call, which are no-op operations b/c they alreay existed, but before it could bind Queue to Exchange, RabbitMQ nodes just received the 'signal' that the queue doesn't have any consumer so it should be delete, and the same with exchanges, which will lead to binding fail with NotFound error. Illustration of the time line from Client and RabbitMQ cluster respectively when the race condition happen: e-declare(E) q-declare(Q) q-bind(Q, E) -----+------------------+----------------+-----------> e-delete(E) ------------------------------+----------------------> Change-Id: Ideb73af6f246a8282780cdb204d675d5d4555bf0 Closes-Bug: #1318721 --- oslo/messaging/_drivers/impl_rabbit.py | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/oslo/messaging/_drivers/impl_rabbit.py b/oslo/messaging/_drivers/impl_rabbit.py index 939a3cec2..537bb235a 100644 --- a/oslo/messaging/_drivers/impl_rabbit.py +++ b/oslo/messaging/_drivers/impl_rabbit.py @@ -161,7 +161,20 @@ class ConsumerBase(object): self.channel = channel self.kwargs['channel'] = channel self.queue = kombu.entity.Queue(**self.kwargs) - self.queue.declare() + try: + self.queue.declare() + except Exception as e: + # NOTE: This exception may be triggered by a race condition. + # Simply retrying will solve the error most of the time and + # should work well enough as a workaround until the race condition + # itself can be fixed. + # TODO(jrosenboom): In order to be able to match the Execption + # more specifically, we have to refactor ConsumerBase to use + # 'channel_errors' of the kombu connection object that + # has created the channel. + # See https://bugs.launchpad.net/neutron/+bug/1318721 for details. + LOG.exception(_("Declaring queue failed with (%s), retrying"), e) + self.queue.declare() def _callback_handler(self, message, callback): """Call callback with deserialized message.