Repair a RabbitMQ node
Important
This page has been identified as being affected by the breaking
changes introduced between versions 2.9.x and 3.x of the Juju client.
Read support note juju_29_3x_changes before continuing.
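For orientation, the action commands on this page use the Juju 3.x syntax. Under Juju 2.9, an action was typically invoked with a bare --wait flag, for example:
juju run-action --wait rabbitmq-server/0 pause
Under Juju 3.x, juju run invokes actions and stays in the foreground by default (a --background option restores the old behaviour), while --wait is now an optional timeout that requires a value:
juju run rabbitmq-server/0 pause
juju run --wait=2m rabbitmq-server/0 pause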
Introduction
There may be times when a cloud's RabbitMQ cluster experiences difficulties. When this happens, one option for bringing the cluster back to a healthy state is to repair one of its nodes.
Important
This procedure will not result in cloud downtime, provided that at least one functional RabbitMQ node remains present.
Advantageously, it will not trigger a restart of every client application (cloud service) that has a relation to the rabbitmq-server application.
Repair vs replacement
An alternative approach to repairing a node is to replace it. See
operation Replace a RabbitMQ node <ops-replace-rabbitmq-node>.
Below is a comparison of the repair and replacement methods.
| Method | Pros | Cons |
|---|---|---|
| repair | less time required; no service restarts | less comprehensive |
| replace | comprehensive | more time required*; service restarts** |
* especially if a new machine must be provisioned
** can be mitigated with the Deferred service events <deferred-events>
feature
Initial state of the cluster
For this example, the juju status rabbitmq-server command describes the
initial state of the cluster:
Model Controller Cloud/Region Version SLA Timestamp
openstack maas-controller maas-one/default 2.9.18 unsupported 23:18:34Z
App Version Status Scale Charm Store Channel Rev OS Message
rabbitmq-server 3.8.2 active 3 rabbitmq-server charmstore stable 118 ubuntu Unit is ready and clustered
Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/0 error idle 0/lxd/0 10.0.0.177 5672/tcp ????????????????
rabbitmq-server/1* active idle 1/lxd/0 10.0.0.175 5672/tcp Unit is ready and clustered
rabbitmq-server/2 active idle 2/lxd/0 10.0.0.176 5672/tcp Unit is ready and clustered
Machine State DNS Inst id Series AZ Message
0 started 10.0.0.172 node2 focal default Deployed
0/lxd/0 started 10.0.0.177 juju-64dabb-0-lxd-0 focal default Container started
1 started 10.0.0.173 node4 focal default Deployed
1/lxd/0 started 10.0.0.175 juju-64dabb-1-lxd-0 focal default Container started
2 started 10.0.0.174 node3 focal default Deployed
2/lxd/0 started 10.0.0.176 juju-64dabb-2-lxd-0 focal default Container started
In this example, the unhealthy node is assumed to be rabbitmq-server/0. This is the node that will be repaired.
The status report of the cluster shows that all three nodes are both
present and running. From a healthy node
(rabbitmq-server/2):
juju ssh rabbitmq-server/2 sudo rabbitmqctl cluster_status
Partial output:
Disk Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Running Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Pause the unhealthy node's service
Pause the RabbitMQ service on the unhealthy node/unit:
juju run rabbitmq-server/0 pause
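As an optional sanity check, confirm that the service is down on the unit. This assumes the workload runs as the rabbitmq-server systemd service:
juju ssh rabbitmq-server/0 systemctl is-active rabbitmq-server
A paused node should report inactive.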
Identify the unhealthy node's hostname
The status report of the cluster should show that a node is no longer running.
From a healthy node:
juju ssh rabbitmq-server/2 sudo rabbitmqctl cluster_status
The cluster's status output now includes:
Disk Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Running Nodes
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
The hostname of the unhealthy node is the one that is no longer running. In this example, it is rabbit@juju-64dabb-0-lxd-0.
Remove the unhealthy node from the cluster
Apply the forget-cluster-node action to a healthy node's unit, referring to the unhealthy node by the hostname just identified. This removes the unhealthy node from the cluster:
juju run rabbitmq-server/2 forget-cluster-node node=rabbit@juju-64dabb-0-lxd-0
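As before, query the cluster's status from a healthy node:
juju ssh rabbitmq-server/2 sudo rabbitmqctl cluster_status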
The cluster's status output should now include:
Disk Nodes
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Running Nodes
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Refresh the unhealthy node and rejoin it to the cluster
We refresh the unhealthy node and rejoin it to the cluster by invoking commands directly on the machine that is hosting the unhealthy node. Begin by connecting to it:
juju ssh rabbitmq-server/0
Move the current database out of the way:
Note
Normally, a previously working node cannot join a cluster with its old database intact. It must first be removed or renamed.
sudo mv /var/lib/rabbitmq/{mnesia,mnesia.bak}
Start the RabbitMQ service in standalone (unclustered) mode
sudo rabbitmq-server -detached
Stop the RabbitMQ server but keep the Erlang VM running
sudo rabbitmqctl stop_app
Rejoin the unhealthy node to the cluster
Rejoin the unhealthy node to the cluster by referencing an existing cluster member:
sudo rabbitmqctl join_cluster rabbit@juju-64dabb-2-lxd-0
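While still connected to the machine, the cluster's status can be checked locally:
sudo rabbitmqctl cluster_status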
The cluster's status output should now include:
Disk Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Running Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-2-lxd-0
Start a refreshed RabbitMQ server
sudo rabbitmqctl start_app
The cluster's status output should now include:
Disk Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
Running Nodes
rabbit@juju-64dabb-0-lxd-0
rabbit@juju-64dabb-1-lxd-0
rabbit@juju-64dabb-2-lxd-0
The repaired node has rejoined the cluster and is now running.
Stop the RabbitMQ service
The manually started RabbitMQ service must now be stopped so that it can once again be managed by Juju:
sudo rabbitmqctl stop
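Then end the SSH session to return to the Juju client:
exit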
Resume the repaired node's service
Resume the RabbitMQ service on the repaired node/unit:
juju run rabbitmq-server/0 resume
Verify model health
Verify the model's health with the juju status rabbitmq-server command. Its partial
output should look like:
App Version Status Scale Charm Store Channel Rev OS Message
rabbitmq-server 3.8.2 active 3 rabbitmq-server charmstore stable 118 ubuntu Unit is ready and clustered
Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/0 active idle 0/lxd/0 10.0.0.177 5672/tcp Unit is ready and clustered
rabbitmq-server/1* active idle 1/lxd/0 10.0.0.175 5672/tcp Unit is ready and clustered
rabbitmq-server/2 active idle 2/lxd/0 10.0.0.176 5672/tcp Unit is ready and clustered
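Optionally, re-run the earlier cluster check from any node to confirm that all three nodes appear under both Disk Nodes and Running Nodes:
juju ssh rabbitmq-server/0 sudo rabbitmqctl cluster_status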