etcd: Add support for more scenarios
This commit addresses a few shortcomings in the etcd service:
* Adding or removing etcd nodes required manual intervention.
* The etcd service would have brief outages during upgrades or
reconfigures because restarts weren't always serialised.
This makes the etcd service follow a similar pattern to mariadb:
* There is now a distinction between bootstrapping the cluster
and adding / removing another member.
* This more closely follows etcd's upstream bootstrapping
guidelines.
* The etcd role now serialises restarts internally so the
kolla_serial pattern is no longer appropriate (or necessary).
This does not remove the need for manual intervention in all
failure modes: the documentation has been updated to address the
most common issues.
Note that there's repetition in the container specifications: this
is somewhat deliberate. In a future cleanup, it's intended to reduce
the duplication.
Change-Id: I39829ba0c5894f8e549f9b83b416e6db4fafd96f
committed by
Dr. Jens Harbott
parent
db79eb0a55
commit
ed3b27cc92
doc/source/admin/etcd.rst (new file, 97 lines)
@@ -0,0 +1,97 @@
.. _etcd:

=============
Managing etcd
=============

Kolla Ansible can manage the lifecycle of an etcd cluster and supports the
following operations:

* Bootstrapping a clean multi-node etcd cluster
* Adding a new member to the etcd cluster
* Optionally, automatically removing a deleted node from the etcd cluster

It is highly recommended to read the operator documentation for the version
of etcd deployed in the cluster.

.. note::

   Once an etcd cluster is bootstrapped, the etcd service takes most of its
   configuration from the etcd database itself.

   This pattern is very different from many other Kolla Ansible services, and
   is a source of confusion for operators unfamiliar with etcd.

Cluster vs Node Bootstrapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Kolla Ansible distinguishes between two forms of bootstrapping in an etcd
cluster:

* Bootstrapping multiple nodes at the same time to bring up a new cluster
* Bootstrapping a single node to add it to an existing cluster

These correspond to the ``new`` and ``existing`` values of
``ETCD_INITIAL_CLUSTER_STATE`` in the upstream documentation. Once an etcd
node has completed bootstrap, the bootstrap configuration is ignored, even if
it is changed.

Kolla Ansible will perform a new cluster bootstrap if it detects that there
is no existing data on the etcd nodes. Otherwise it assumes that there is a
healthy etcd cluster and will add a new node to it.
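
To see which state a node was bootstrapped with, the running container's
environment can be inspected. This is a diagnostic sketch: it assumes the
deployment passes the etcd settings as environment variables and that the
container is named ``etcd``:

.. code-block:: console

   docker exec etcd env | grep ETCD_INITIAL_CLUSTER_STATE

As noted above, this value is only honoured during bootstrap; changing it
afterwards has no effect on a node that has already joined.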

Forcing Bootstrapping
~~~~~~~~~~~~~~~~~~~~~

Kolla Ansible looks for the ``kolla_etcd`` volume on the node. If this volume
is available, it assumes that the bootstrap process has run on the node and
that the volume contains the required configuration.

However, if the process was interrupted (externally, or by an error), this
volume might be misconfigured. In order to prevent data loss, manual
intervention is required.

Before retriggering bootstrap, make sure that there is no valuable data on
the volume. This could be because the node was not in service, or because the
data is persisted elsewhere.

To retrigger a bootstrap (for either the cluster, or for a single node),
remove the volume from all affected nodes:

.. code-block:: console

   docker volume rm kolla_etcd

Rerunning Kolla Ansible will then trigger the appropriate workflow: either a
blank cluster will be bootstrapped, or an empty member will be added to the
existing cluster.
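
On a single node, the full sequence might look like the following. This is a
sketch, not a prescribed procedure: the inventory path and the use of
``--tags etcd`` are assumptions that should be adapted to the deployment:

.. code-block:: console

   docker stop etcd
   docker volume rm kolla_etcd
   kolla-ansible -i /etc/kolla/multinode deploy --tags etcd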

Manual Commands
~~~~~~~~~~~~~~~

In order to manage etcd manually, the ``etcdctl`` command can be used inside
the ``etcd`` container. This command has been set up with the appropriate
environment variables for integrating with automation.

``etcdctl`` is configured with JSON output by default:

.. code-block:: console

   # list cluster members in a human-readable table
   docker exec -it etcd etcdctl -w table member list
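
Other read-only ``etcdctl`` subcommands are useful for diagnosis, for example
(standard v3 subcommands, assuming the container is named ``etcd`` as above):

.. code-block:: console

   # check that the endpoint is reachable and healthy
   docker exec -it etcd etcdctl endpoint health

   # show leader status, version and database size
   docker exec -it etcd etcdctl -w table endpoint status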

Removing Dead Nodes
~~~~~~~~~~~~~~~~~~~

If ``globals.yml`` has the value ``etcd_remove_deleted_members: "yes"``, then
etcd nodes that are not in the inventory will be removed from the etcd
cluster.

Any errors in the inventory can therefore cause unintended removal.

To manually remove a dead node from the etcd cluster, use the following
commands:

.. code-block:: console

   # list cluster members and identify the dead member
   docker exec -it etcd etcdctl -w table member list

   # remove the dead member
   docker exec -it etcd etcdctl member remove MEMBER_ID_IN_HEX
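
The automatic variant described above is a single setting in ``globals.yml``,
shown here as a minimal config fragment:

.. code-block:: yaml

   etcd_remove_deleted_members: "yes"

Given the removal risk noted above, keep this disabled unless the inventory
is known to be accurate.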

@@ -9,5 +9,6 @@ Admin Guides

   tls
   acme
   mariadb-backup-and-restore
   etcd
   production-architecture-guide
   deployment-philosophy
@@ -173,6 +173,14 @@ For each host, clean up its services:

.. _removing-existing-compute-nodes:

If the node is also running the ``etcd`` service, set
``etcd_remove_deleted_members: "yes"`` in ``globals.yml`` to automatically
remove nodes from the ``etcd`` cluster that have been removed from the
inventory.

Alternatively, the ``etcd`` members can be removed manually with ``etcdctl``.
For more details, please consult the runtime reconfiguration documentation
section for the version of etcd in operation.

Removing existing compute nodes
-------------------------------