openstack-ansible/doc/source/draft-operations-guide/maintenance-tasks/rabbitmq-maintain.rst
Amy Marrich (spotz) 39d1f38c39 DOC - Add note regarding Ansible hostname bug
Added a note to the operations guide and a release note for the issue

Change-Id: I8cc3d8b3c46de5e99fd5e2aa03a44be36efe28ba
Closes-Bug: #1669125
2017-03-08 11:10:10 +00:00


============================
RabbitMQ cluster maintenance
============================
A RabbitMQ broker is a logical grouping of one or several Erlang nodes, with each
node running the RabbitMQ application and sharing users, virtual hosts, queues,
exchanges, bindings, and runtime parameters. A collection of nodes is often
referred to as a `cluster`. For more information on RabbitMQ clustering, see
`RabbitMQ cluster <https://www.rabbitmq.com/clustering.html>`_.

Within OpenStack-Ansible, all data and state required for operation of the RabbitMQ
cluster are replicated across all nodes, including the message queues, which
provides high availability. RabbitMQ nodes address each other using domain names.
The hostnames of all cluster members must be resolvable from all cluster
nodes, as well as from any machines where RabbitMQ CLI tools might be
used. There are alternatives that may work in more
restrictive environments. For more details on that setup, see
`Inet Configuration <http://erlang.org/doc/apps/erts/inet_cfg.html>`_.
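
As a quick check of the name-resolution requirement, the following sketch
reports any hostname that the local resolver cannot look up. It assumes that
``getent`` is available, as it is on most Linux systems. The function name
``check_resolvable`` and the hostnames in the usage example are illustrative
only.

.. code-block:: shell

   # Print every hostname that does not resolve.
   # Returns 0 only if all of the given hostnames resolve.
   check_resolvable() {
       rc=0
       for host in "$@"; do
           if ! getent hosts "$host" >/dev/null 2>&1; then
               echo "unresolvable: $host"
               rc=1
           fi
       done
       return "$rc"
   }

For example, run ``check_resolvable rabbit1 rabbit2 rabbit3`` on every cluster
node and on every machine where you use the RabbitMQ CLI tools.
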

.. note::

   There is currently an Ansible bug regarding ``HOSTNAME``. If
   the host ``.bashrc`` sets a variable named ``HOSTNAME``, the container that
   the ``lxc_container`` module attaches to inherits this variable and may
   set the wrong ``$HOSTNAME``. See
   `the Ansible fix <https://github.com/ansible/ansible/pull/22246>`_, which will
   be released in Ansible version 2.3.

Create a RabbitMQ cluster
~~~~~~~~~~~~~~~~~~~~~~~~~
RabbitMQ clusters can be formed in two ways:

* Manually with ``rabbitmqctl``

* Declaratively (by listing the cluster nodes in a configuration file, or by
  using the ``rabbitmq-autocluster`` or ``rabbitmq-clusterer`` plugin)

.. note::

   RabbitMQ brokers can tolerate the failure of individual nodes within the
   cluster. Nodes can be stopped and started at will, as long as they can
   reach a cluster member that was known at the time of shutdown.

There are two types of nodes you can configure: disk and RAM nodes. Most
commonly, you will use your nodes as disk nodes (preferred). RAM nodes are
a special configuration used in performance clusters.

RabbitMQ nodes and the CLI tools use an Erlang cookie to determine whether
they have permission to communicate with each other. The cookie is a string
of alphanumeric characters and can be as short or as long as you would like.

.. note::

   The cookie value is a shared secret and should be protected and kept private.

The default location of the cookie on ``*nix`` environments is
``/var/lib/rabbitmq/.erlang.cookie`` or ``$HOME/.erlang.cookie``.

.. tip::

   While troubleshooting, if you notice that one node is refusing to join the
   cluster, check whether its Erlang cookie matches the cookie on the other
   nodes. When the cookie is misconfigured (for example, not identical),
   RabbitMQ logs errors such as "Connection attempt from disallowed node" and
   "Could not auto-cluster". See
   `clustering <https://www.rabbitmq.com/clustering.html>`_
   for more information.
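
One way to compare cookies without displaying the secret is to compare a
digest of the cookie file on each node. The following is a minimal sketch that
assumes ``sha256sum`` is installed. The function name ``cookie_digest`` is
illustrative, and the default path is the first location named above.

.. code-block:: shell

   # Print the SHA-256 digest of an Erlang cookie file.
   # The cookie itself is never printed, only its digest.
   cookie_digest() {
       sha256sum "${1:-/var/lib/rabbitmq/.erlang.cookie}" | cut -d ' ' -f 1
   }

Run ``cookie_digest`` on each node. Nodes that print different digests do not
share the same cookie.
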

To form a RabbitMQ cluster, you start by taking independent RabbitMQ brokers
and reconfiguring these nodes into a cluster configuration.

Using a three-node example, you would be telling nodes 2 and 3 to join the
cluster of the first node.

#. Log in to the second and third nodes and stop the RabbitMQ application.

#. Join the cluster, then restart the application. For example, on the
   second node:

   .. code-block:: console

      rabbit2$ rabbitmqctl stop_app
      Stopping node rabbit@rabbit2 ...done.
      rabbit2$ rabbitmqctl join_cluster rabbit@rabbit1
      Clustering node rabbit@rabbit2 with [rabbit@rabbit1] ...done.
      rabbit2$ rabbitmqctl start_app
      Starting node rabbit@rabbit2 ...done.

Check the RabbitMQ cluster status
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Run ``rabbitmqctl cluster_status`` from either node.

You will see that ``rabbit1`` and ``rabbit2`` are both running as before.
The difference is that, in the cluster status section of the output, both
nodes are now grouped together:

.. code-block:: console

   rabbit1$ rabbitmqctl cluster_status
   Cluster status of node rabbit@rabbit1 ...
   [{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
   {running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]
   ...done.

To add the third RabbitMQ node to the cluster, repeat the above
process by stopping the RabbitMQ application on the third node.

#. Join the cluster, and restart the application on the third node.

#. Execute ``rabbitmqctl cluster_status`` to see all three nodes:

   .. code-block:: console

      rabbit1$ rabbitmqctl cluster_status
      Cluster status of node rabbit@rabbit1 ...
      [{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit3]}]},
      {running_nodes,[rabbit@rabbit3,rabbit@rabbit2,rabbit@rabbit1]}]
      ...done.
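
When checking the status from a script, it can help to extract only the list
of running nodes. The following sketch assumes the Erlang-term output format
shown in the examples above. Other RabbitMQ releases may print the status in
a different layout, in which case the pattern needs adjusting. The function
name ``running_nodes`` is illustrative.

.. code-block:: shell

   # Read `rabbitmqctl cluster_status` output on stdin and print the
   # running nodes, one per line. Returns 1 if no running_nodes entry
   # is found.
   running_nodes() {
       # Join the lines first so that a wrapped list still matches.
       list=$(tr -d '\n' | sed -n 's/.*{running_nodes,\[\([^]]*\)\]}.*/\1/p')
       [ -n "$list" ] || return 1
       printf '%s\n' "$list" | tr ',' '\n'
   }

For example, ``rabbitmqctl cluster_status | running_nodes`` prints each
running node on its own line.
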
Stop and restart a RabbitMQ cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When you stop and start the cluster, keep in mind the order in
which you shut the nodes down. The last node you stop needs to be the
first node you start. This node is the `master`.

If you start the nodes out of order, RabbitMQ may decide that the current
`master` should not be the master and drop messages, to ensure that no new
messages are queued while the real master is down.
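
If you record the order in which you stopped the nodes, you can derive a safe
start order from it. The following sketch reverses the stop order. Only the
first entry (the last node stopped) is required by the rule above. Starting
the remaining nodes in full reverse order is a conservative assumption, and
the function name ``start_order`` is illustrative.

.. code-block:: shell

   # Given node names in the order they were stopped, print them in the
   # order they should be started (last stopped first), one per line.
   start_order() {
       reversed=""
       for node in "$@"; do
           reversed="$node${reversed:+ $reversed}"
       done
       # Unquoted on purpose: split the list back into one name per line.
       printf '%s\n' $reversed
   }

For example, ``start_order rabbit1 rabbit2 rabbit3`` lists ``rabbit3`` first.
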

RabbitMQ and Mnesia
~~~~~~~~~~~~~~~~~~~
Mnesia is a distributed database that RabbitMQ uses to store information about
users, exchanges, queues, and bindings. Messages, however,
are not stored in the database.

For more information about Mnesia, see the
`Mnesia overview <http://erlang.org/doc/apps/mnesia/Mnesia_overview.html>`_.

To view the locations of important RabbitMQ files, see
`File Locations <https://www.rabbitmq.com/relocate.html>`_.

Repair a partitioned RabbitMQ cluster for a single-node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Invariably, something in your environment will cause you to lose a
node in your cluster. In this scenario, multiple LXC containers on the same host
are running RabbitMQ and are in a single RabbitMQ cluster.

If the node still shows as part of the cluster but is not running,
execute:

.. code-block:: console

   # rabbitmqctl start_app

However, you may notice some issues with your application, as clients may be
trying to push messages to the unresponsive node. To remedy this, forget the
node from the cluster by executing the following:

#. Ensure RabbitMQ is not running on the failing node:

   .. code-block:: console

      # rabbitmqctl stop_app

#. On the ``rabbit2`` node, execute:

   .. code-block:: console

      # rabbitmqctl forget_cluster_node rabbit@rabbit1

By doing this, the cluster can continue to run effectively and you can repair
the failing node.

.. important::

   When you restart the node, it still thinks it is part of
   the cluster, and you must reset it. After resetting, you
   should be able to rejoin it to other nodes as needed.

.. code-block:: console

   rabbit1$ rabbitmqctl start_app
   Starting node rabbit@rabbit1 ...
   Error: inconsistent_cluster: Node rabbit@rabbit1 thinks it's clustered with node rabbit@rabbit2, but rabbit@rabbit2 disagrees
   rabbit1$ rabbitmqctl reset
   Resetting node rabbit@rabbit1 ...done.
   rabbit1$ rabbitmqctl start_app
   Starting node rabbit@rabbit1 ...
   ...done.
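
The reset shown above leaves the node running on its own. To return it to the
cluster, you can combine the reset with the ``join_cluster`` step from the
cluster creation procedure. The following is a sketch only: the function name
``reset_and_rejoin`` and the ``RABBITMQCTL`` override variable are
illustrative, not part of RabbitMQ. Note that ``rabbitmqctl reset`` removes
the node's local state, so run this only on a node that has already been
forgotten by the cluster.

.. code-block:: shell

   # Command used to talk to the local node. Overridable for dry runs,
   # for example RABBITMQCTL=echo.
   RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

   # Reset the local node and join it to the cluster of the given member.
   # Stops at the first step that fails.
   reset_and_rejoin() {
       "$RABBITMQCTL" stop_app &&
       "$RABBITMQCTL" reset &&
       "$RABBITMQCTL" join_cluster "$1" &&
       "$RABBITMQCTL" start_app
   }

For example, on ``rabbit1``, run ``reset_and_rejoin rabbit@rabbit2``. Setting
``RABBITMQCTL=echo`` first prints the steps instead of executing them.
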
Repair a partitioned RabbitMQ cluster for a multi-node cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The concepts that apply to a single-node cluster also apply to a multi-node
cluster. The only difference is that the various nodes will actually be
running on different hosts. The key things to keep in mind when dealing with a
multi-node cluster are:

* When the entire cluster is brought down, the last node to go down must be the
  first node to be brought online. If this does not happen, the nodes will wait
  30 seconds for the last disc node to come back online, and fail afterwards.

  If the last node to go offline cannot be brought back up, it can be removed
  from the cluster using the :command:`forget_cluster_node` command.

* If all cluster nodes stop in a simultaneous and uncontrolled manner
  (for example, with a power cut), you can be left with a situation in which
  all nodes think that some other node stopped after them. In this case, you
  can use the :command:`force_boot` command on one node to make it
  bootable again.

Consult the ``rabbitmqctl`` man page for more information.