Adjust RabbitMQ HA policy to make reply queues HA

Changes in oslo.messaging for 2023.1 exposed a known race
condition in RabbitMQ when dealing with non-HA classic queues.
When a RabbitMQ cluster member is taken down, clients failing over
to other members may erroneously be told a queue exists when it
is in the process of being deleted. This can cause them to
permanently sit waiting for messages from a queue that no longer
exists until their services are restarted.

Making the reply queues HA resolves this issue, at the expense
of a 3x increase in reply queues across the cluster. My
assumption is that reply queues were previously excluded from HA
policy as a performance gain given their link to the number of
compute nodes in an OpenStack deployment.

Context: https://bugs.launchpad.net/oslo.messaging/+bug/2031512

Depends-On: https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/916042
Change-Id: Iee6b5f8cc1ad04988c8634f8b6e026e2f8c75b52
Andrew Bonney 2024-04-17 08:23:25 +01:00
parent 506d3bae49
commit d4530e242d
2 changed files with 8 additions and 1 deletions

@@ -32,7 +32,7 @@ oslomsg_rabbit_quorum_queues: False
 rabbitmq_policies:
   - name: "HA"
-    pattern: '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'
+    pattern: '^(?!(amq\.)|(.*_fanout_)).*'
     priority: 0
     tags: "ha-mode=all"
     state: "{{ (oslomsg_rabbit_quorum_queues | default(True) or not rabbitmq_queue_replication) | ternary('absent', 'present') }}"
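The effect of dropping `(reply_)` from the negative lookahead can be checked with a short sketch. This is only an approximation: RabbitMQ evaluates policy patterns with Erlang's regex engine, not Python's `re`, and the queue names below are hypothetical examples rather than names taken from a real deployment.

```python
import re

# Patterns copied from the hunk above. Python's `re` stands in for the
# Erlang regex engine RabbitMQ actually uses (assumption: both treat this
# negative lookahead the same way).
OLD_PATTERN = re.compile(r'^(?!(amq\.)|(.*_fanout_)|(reply_)).*')
NEW_PATTERN = re.compile(r'^(?!(amq\.)|(.*_fanout_)).*')

# Hypothetical queue names, chosen for illustration only.
SAMPLE_QUEUES = [
    "reply_0123456789abcdef",
    "amq.gen-abc",
    "scheduler_fanout_0123",
    "conductor",
]


def covered_by(pattern, names):
    """Return the names the policy pattern would apply to."""
    return [name for name in names if pattern.search(name)]


old_covered = covered_by(OLD_PATTERN, SAMPLE_QUEUES)
new_covered = covered_by(NEW_PATTERN, SAMPLE_QUEUES)

print("old policy covers:", old_covered)
print("new policy covers:", new_covered)
```

Under these assumptions, the only difference between the two results is the `reply_` queue: it is skipped by the old pattern and picked up by the new one, while `amq.` and `_fanout_` queues stay excluded in both cases.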

@@ -0,0 +1,7 @@
+---
+upgrade:
+  - |
+    When using RabbitMQ in a high availability cluster (non-quorum queues),
+    transient 'reply\_' queues are now included in the HA policy where they
+    previously were not. Note that this will increase the load on the RabbitMQ
+    cluster, particularly for deployments with large numbers of compute nodes.