Allow removal of classic queue mirroring for internal RabbitMQ
Backport note: This patch has been updated to retain the existing
behaviour by default. A temporary variable, rabbitmq_remove_ha_all_policy,
has been added, which may be set to true in order to remove the ha-all
policy. To support changing the policy without upgrading, the ha-all
policy is removed on deploys as well as on upgrades.

When OpenStack is deployed with Kolla-Ansible, by default there are no
durable queues or exchanges created by the OpenStack services in
RabbitMQ. In Rabbit terminology, not being durable is referred to as
`transient`, and this means that the queue is generally held in memory.

Whether OpenStack services create durable or transient queues is
traditionally controlled by the Oslo Notification config option:
`amqp_durable_queues`. In Kolla-Ansible, this remains set to the default
of `False` in all services. The only `durable` objects are the `amq*`
exchanges which are internal to RabbitMQ.

More recently, Oslo Notification has introduced support for Quorum
queues [7]. These are a successor to durable classic queues, however it
isn't yet clear if they are a good fit for OpenStack in general [8].

For clustered RabbitMQ deployments, Kolla-Ansible configures all queues
as `replicated` [1]. Replication occurs over all nodes in the cluster.
RabbitMQ refers to this as 'mirroring of classic queues'.

In summary, this means that a multi-node Kolla-Ansible deployment will
end up with a large number of transient, mirrored queues and exchanges.
However, the RabbitMQ documentation warns against this, stating that
'For replicated queues, the only reasonable option is to use durable
queues' [2]. This is discussed further in the following bug report: [3].

Whilst we could try enabling the `amqp_durable_queues` option for each
service (this is suggested in [4]), there are a number of complexities
with this approach, not limited to:

1) RabbitMQ is planning to remove classic queue mirroring in favor of
   'Quorum queues' in a forthcoming release [5].

2) Durable queues will be written to disk, which may cause performance
   problems at scale. Note that this includes Quorum queues, which are
   always durable.

3) Potential for race conditions and other complexity discussed
   recently on the mailing list under:
   `[ops] [kolla] RabbitMQ High Availability`

The remaining option, proposed here, is to use classic non-mirrored
queues everywhere, and rely on services to recover if the node hosting
a queue or exchange they are using fails. There is some discussion of
this approach in [6]. The downside of potential message loss needs to
be weighed against the real upsides of increasing the performance of
RabbitMQ, and moving to a configuration which is officially supported
and hopefully more stable. In the future, we can then consider
promoting specific queues to quorum queues, in cases where message loss
can result in failure states which are hard to recover from.

[1] https://www.rabbitmq.com/ha.html
[2] https://www.rabbitmq.com/queues.html
[3] https://github.com/rabbitmq/rabbitmq-server/issues/2045
[4] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
[5] https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/
[6] https://fuel-ccp.readthedocs.io/en/latest/design/ref_arch_1000_nodes.html#replication
[7] https://bugs.launchpad.net/oslo.messaging/+bug/1942933
[8] https://www.rabbitmq.com/quorum-queues.html#use-cases

Partial-Bug: #1954925
Change-Id: I91d0e23b22319cf3fdb7603f5401d24e3b76a56e
(cherry picked from commit 6bfe1927f0)
(cherry picked from commit 425ead5792)
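For operators picking up this backport, the opt-in is a single variable
override. Below is a minimal sketch, assuming the usual Kolla-Ansible
override file at /etc/kolla/globals.yml; the variable name and its
default of false come from this patch, the file path and commands are
the standard Kolla-Ansible workflow:

    # /etc/kolla/globals.yml (excerpt)
    # Opt in to clearing the ha-all policy. Leaving this at the default
    # of false retains the existing mirrored-queue behaviour.
    rabbitmq_remove_ha_all_policy: true

The policy is then cleared on the next `kolla-ansible deploy` or
`kolla-ansible upgrade` run, as the task changes below show.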
@@ -74,6 +74,8 @@ rabbitmq_server_additional_erl_args: "+S 2:2 +sbwt none +sbwtdcpu none +sbwtdio
 rabbitmq_tls_options: {}
 
 # To avoid split-brain
 rabbitmq_cluster_partition_handling: "pause_minority"
 
+# Whether to remove the ha-all policy.
+rabbitmq_remove_ha_all_policy: false
+
 ####################
 # Plugins
@@ -1,4 +1,7 @@
 ---
+- include_tasks: remove-ha-all-policy.yml
+  when: rabbitmq_remove_ha_all_policy | bool
+
 - import_tasks: config.yml
 
 - import_tasks: check-containers.yml
ansible/roles/rabbitmq/tasks/remove-ha-all-policy.yml (new file, 29 lines)
@@ -0,0 +1,29 @@
+---
+- block:
+    - name: Get container facts
+      become: true
+      kolla_container_facts:
+        name:
+          - "{{ service.container_name }}"
+      register: container_facts
+
+    - block:
+        - name: List RabbitMQ policies
+          become: true
+          command: "docker exec {{ service.container_name }} rabbitmqctl list_policies --silent"
+          register: rabbitmq_policies
+          changed_when: false
+
+        # NOTE(dszumski): This can be removed in the Zed cycle
+        - name: Remove ha-all policy from RabbitMQ
+          become: true
+          command: "docker exec {{ service.container_name }} rabbitmqctl clear_policy ha-all"
+          when:
+            - "'ha-all' in rabbitmq_policies.stdout"
+      when: container_facts[service.container_name] is defined
+
+  delegate_to: "{{ groups[role_rabbitmq_groups] | first }}"
+  run_once: true
+  vars:
+    service_name: "rabbitmq"
+    service: "{{ rabbitmq_services[service_name] }}"
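Because RabbitMQ policies are cluster-wide (per vhost), clearing ha-all
once is sufficient, which is why the block above delegates to the first
node of the RabbitMQ group with run_once. As a rough operator aid, and
not part of the patch, the following hypothetical playbook lists the
policies remaining on that node so the removal can be confirmed; it
assumes an inventory group and a container both named rabbitmq, as in a
default Kolla-Ansible deployment:

    ---
    # check-rabbitmq-policies.yml -- hypothetical verification playbook,
    # not shipped by this patch.
    - hosts: rabbitmq[0]
      gather_facts: false
      tasks:
        - name: List remaining RabbitMQ policies
          become: true
          command: docker exec rabbitmq rabbitmqctl list_policies
          register: result
          changed_when: false

        - name: Show policies
          debug:
            var: result.stdout_lines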
@@ -16,6 +16,9 @@
   when: inventory_hostname in groups[role_rabbitmq_groups]
   register: rabbitmq_differs
 
+- include_tasks: remove-ha-all-policy.yml
+  when: rabbitmq_remove_ha_all_policy | bool
+
 - import_tasks: config.yml
 
 - import_tasks: check-containers.yml
@@ -16,9 +16,13 @@
     {"user": "{{ murano_agent_rabbitmq_user }}", "vhost": "{{ murano_agent_rabbitmq_vhost }}", "configure": ".*", "write": ".*", "read": ".*"}
 {% endif %}
   ],
+{% if rabbitmq_remove_ha_all_policy | bool %}
+  "policies":[]
+{% else %}
   "policies":[
     {"vhost": "/", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}{% if project_name == 'outward_rabbitmq' %},
     {"vhost": "{{ murano_agent_rabbitmq_vhost }}", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}
 {% endif %}
   ]
+{% endif %}
 }
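To make the template change concrete, the rendered definitions file
ends up with one of two forms for the policies key (a sketch of the
output above, ignoring the outward_rabbitmq branch). With
rabbitmq_remove_ha_all_policy enabled it renders as

    "policies":[]

and otherwise it keeps the existing mirrored-queue policy:

    "policies":[
        {"vhost": "/", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all"}, "priority":0}
    ]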
@@ -0,0 +1,10 @@
+---
+fixes:
+  - |
+    Fixes an issue where RabbitMQ was configured to mirror classic transient
+    queues for all services. According to the RabbitMQ documentation this is
+    not a supported configuration, and contributed to numerous bug reports.
+    In order to avoid making unexpected changes to the RabbitMQ cluster, it is
+    necessary to set ``rabbitmq_remove_ha_all_policy`` to ``yes`` in order to
+    apply this fix. This variable will be removed in the Yoga release.
+    `LP#1954925 <https://launchpad.net/bugs/1954925>`__