etcd: Add support for more scenarios

This commit addresses a few shortcomings in the etcd service:
  * Adding or removing etcd nodes required manual intervention.

  * The etcd service would have brief outages during upgrades or
    reconfigures because restarts weren't always serialised.

This makes the etcd service follow a similar pattern to mariadb:
  * There is now a distinction between bootstrapping the cluster
    and adding / removing another member.

  * This more closely follows etcd's upstream bootstrapping
    guidelines.

  * The etcd role now serialises restarts internally so the
    kolla_serial pattern is no longer appropriate (or necessary).

This does not remove the need for manual intervention in all
failure modes: the documentation has been updated to address the
most common issues.

Note that there is some repetition in the container specifications: this
is deliberate for now. A future cleanup is intended to reduce the
duplication.

Change-Id: I39829ba0c5894f8e549f9b83b416e6db4fafd96f
Jan Gutter
2023-07-09 11:49:04 +01:00
committed by Dr. Jens Harbott
parent db79eb0a55
commit ed3b27cc92
18 changed files with 471 additions and 19 deletions

doc/source/admin/etcd.rst Normal file

@@ -0,0 +1,97 @@
.. etcd:

=============
Managing etcd
=============

Kolla Ansible can manage the lifecycle of an etcd cluster and supports the
following operations:

* Bootstrapping a clean multi-node etcd cluster
* Adding a new member to the etcd cluster
* Optionally, automatically removing a deleted node from the etcd cluster
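
For example, a minimal invocation to deploy or grow the cluster might look
like the following (a sketch, not part of the upstream guide: the inventory
path and the use of ``--tags`` are assumptions):

.. code-block:: console

   # deploy etcd to the hosts listed in the inventory, or add new members
   kolla-ansible -i ./multinode deploy --tags etcd
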
It is highly recommended to read the operator documentation for the version
of etcd deployed in the cluster.

.. note::

   Once an etcd cluster is bootstrapped, the etcd service takes most of its
   configuration from the etcd database itself.

   This pattern is very different from many other Kolla Ansible services, and
   is a source of confusion for operators unfamiliar with etcd.
Cluster vs Node Bootstrapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Kolla Ansible distinguishes between two forms of bootstrapping in an etcd
cluster:

* Bootstrapping multiple nodes at the same time to bring up a new cluster
* Bootstrapping a single node to add it to an existing cluster

These correspond to the ``new`` and ``existing`` values of
``ETCD_INITIAL_CLUSTER_STATE`` in the upstream documentation. Once an etcd
node has completed bootstrap, the bootstrap configuration is ignored, even if
it is changed.

Kolla Ansible will perform a new cluster bootstrap if it detects that there
is no existing data on the etcd nodes. Otherwise, it assumes that there is a
healthy etcd cluster and will add a new node to it.
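
As a quick check of which path will be taken, the presence of existing etcd
data on a node can be inspected before running Kolla Ansible (a sketch; it
assumes the Docker CLI on the node and that the bootstrap state is exposed as
an environment variable in the running container):

.. code-block:: console

   # an existing kolla_etcd volume means the node has already bootstrapped
   docker volume ls --filter name=kolla_etcd

   # on a running member, the recorded bootstrap state may be visible as
   docker exec etcd env | grep ETCD_INITIAL_CLUSTER_STATE
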
Forcing Bootstrapping
~~~~~~~~~~~~~~~~~~~~~

Kolla Ansible looks for the ``kolla_etcd`` volume on the node. If this volume
is available, it assumes that the bootstrap process has run on the node and
that the volume contains the required config.

However, if the process was interrupted (externally, or by an error), this
volume might be misconfigured. In order to prevent data loss, manual
intervention is required.

Before retriggering bootstrap, make sure that there is no valuable data on
the volume. This could be because the node was not in service, or because the
data is persisted elsewhere.

To retrigger a bootstrap (for either the cluster, or for a single node),
remove the volume from all affected nodes:

``docker volume rm kolla_etcd``

Rerunning Kolla Ansible will then trigger the appropriate workflow: either a
blank cluster will be bootstrapped, or an empty member will be added to the
existing cluster.
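
For a single node, the sequence might look like the following (a sketch; the
container stop/remove steps and the inventory path are assumptions):

.. code-block:: console

   # on the affected node: remove the etcd container and its volume
   docker stop etcd
   docker rm etcd
   docker volume rm kolla_etcd

   # from the deployment host: rerun Kolla Ansible for the etcd service
   kolla-ansible -i ./multinode deploy --tags etcd
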
Manual Commands
~~~~~~~~~~~~~~~

In order to manage etcd manually, the ``etcdctl`` command can be used inside
the ``etcd`` container. This command has been set up with the appropriate
environment variables for integrating with automation.

``etcdctl`` is configured with JSON output by default:

.. code-block:: console

   # list cluster members in a human-readable table
   docker exec -it etcd etcdctl -w table member list
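
Other read-only checks follow the same pattern, for example (a sketch; it
assumes an etcd v3 ``etcdctl`` that supports the ``--cluster`` flag):

.. code-block:: console

   # check the health of every endpoint in the cluster
   docker exec -it etcd etcdctl -w table endpoint health --cluster

   # show per-endpoint status, including the current leader
   docker exec -it etcd etcdctl -w table endpoint status --cluster
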
Removing Dead Nodes
~~~~~~~~~~~~~~~~~~~

If ``globals.yml`` has the value ``etcd_remove_deleted_members: "yes"`` then
etcd nodes that are not in the inventory will be removed from the etcd
cluster. Any errors in the inventory can therefore cause unintended removal.

To manually remove a dead node from the etcd cluster, use the following
commands:

.. code-block:: console

   # list cluster members and identify dead member
   docker exec -it etcd etcdctl -w table member list

   # remove dead member
   docker exec -it etcd etcdctl member remove MEMBER_ID_IN_HEX
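
With automatic removal enabled, rerunning the deployment is expected to prune
members whose hosts are no longer in the inventory (a sketch; the inventory
path is an assumption):

.. code-block:: console

   # with etcd_remove_deleted_members: "yes", deleted hosts are pruned here
   kolla-ansible -i ./multinode deploy --tags etcd
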

@@ -9,5 +9,6 @@ Admin Guides
tls
acme
mariadb-backup-and-restore
etcd
production-architecture-guide
deployment-philosophy

@@ -173,6 +173,14 @@ For each host, clean up its services:

.. _removing-existing-compute-nodes:

If the node is also running the ``etcd`` service, set
``etcd_remove_deleted_members: "yes"`` in ``globals.yml`` to automatically
remove nodes from the ``etcd`` cluster that have been removed from the
inventory. Alternatively, the ``etcd`` members can be removed manually with
``etcdctl``. For more details, please consult the `runtime reconfiguration`
documentation section for the version of etcd in operation.
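
As a sketch of the manual path (reusing the commands from the etcd admin
guide above), the member that belonged to the removed host can be dropped
with ``etcdctl``:

.. code-block:: console

   # identify the member that belonged to the removed host
   docker exec -it etcd etcdctl -w table member list

   # remove it by its hexadecimal member ID
   docker exec -it etcd etcdctl member remove MEMBER_ID_IN_HEX
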
Removing existing compute nodes
-------------------------------