From 67ac2902f2b8615e316e0f359b2fe2f43ed3c533 Mon Sep 17 00:00:00 2001
From: Martin Kalcok
Date: Tue, 10 Jan 2023 17:58:03 +0100
Subject: [PATCH] Add docs for ovn-central downscaling.

Change-Id: I4384f4a5b4f53a222b59af4c95f3e095192c1155
---
 doc/source/admin/index.rst                    |   1 +
 .../admin/ops-scale-back-ovn-central.rst      | 130 ++++++++++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 doc/source/admin/ops-scale-back-ovn-central.rst

diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst
index 59707e85..94d2d888 100644
--- a/doc/source/admin/index.rst
+++ b/doc/source/admin/index.rst
@@ -28,6 +28,7 @@ General cloud operations:
    ops-restart-partitioned-rabbitmq-cluster
    ops-replace-control-plane-service-ha
    ops-replace-hyperconverged-compute-node
+   ops-scale-back-ovn-central
 
 Ceph storage operations (published in the Charmed Ceph documentation):
 
diff --git a/doc/source/admin/ops-scale-back-ovn-central.rst b/doc/source/admin/ops-scale-back-ovn-central.rst
new file mode 100644
index 00000000..33d2dab3
--- /dev/null
+++ b/doc/source/admin/ops-scale-back-ovn-central.rst
@@ -0,0 +1,130 @@
+:orphan:
+
+=======================================
+Scale back the ovn-central application
+=======================================
+
+Preamble
+--------
+
+Clean downscaling of the ovn-central application is supported from release
+23.03 onwards. Earlier versions of the charm will require some manual steps.
+
+Think about the impact
+----------------------
+
+OVN central uses the Raft consensus algorithm to provide HA. With N members, Raft:
+
+* Tolerates up to (N-1)/2 node failures (rounded down)
+* Requires a minimum quorum of (N/2)+1 members (N/2 rounded down)
+
+Changes to the number of members in the OVN cluster affect its fault tolerance
+as well as its minimum requirements for quorum. Before you downscale your
+cluster, think about the impact it will have on both of these properties.
+
+It is not recommended to downscale the ovn-central application below 3 members.
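+
+As a worked example of the two Raft formulas above, the following evaluates
+them for a few cluster sizes, assuming both divisions round down. The left
+value is the number of tolerated failures and the right value is the quorum.
+This is illustrative arithmetic only, not output from OVN:
+
+.. math::
+
+   N = 5 &: \quad \lfloor (5-1)/2 \rfloor = 2, \quad \lfloor 5/2 \rfloor + 1 = 3 \\
+   N = 4 &: \quad \lfloor (4-1)/2 \rfloor = 1, \quad \lfloor 4/2 \rfloor + 1 = 3 \\
+   N = 3 &: \quad \lfloor (3-1)/2 \rfloor = 1, \quad \lfloor 3/2 \rfloor + 1 = 2 \\
+   N = 2 &: \quad \lfloor (2-1)/2 \rfloor = 0, \quad \lfloor 2/2 \rfloor + 1 = 2
+
+Going from 3 members to 2 leaves the cluster unable to tolerate any failure,
+which is in line with the recommendation not to go below 3 members.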
+
+
+Procedure for releases before 23.03
+-----------------------------------
+
+With older releases of the ovn-central charm, the operator can run the
+``juju remove-unit`` command, but internally the OVN cluster will not perform
+reconfiguration and will keep expecting servers from the removed unit to
+rejoin the cluster. To cleanly remove units, you have to complete a few manual
+steps.
+
+Log into the unit that you intend to remove (using ``juju ssh``) and execute
+the following commands as root:
+
+.. code-block:: none
+
+   ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/leave OVN_Southbound
+   ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/leave OVN_Northbound
+
+This causes the OVN servers hosted on this unit to gracefully leave both the
+Southbound and Northbound OVN clusters.
+
+Perform unit removal with:
+
+.. code-block:: none
+
+   juju remove-unit
+
+To verify that the downscaling completed successfully, log into one of the
+remaining units of ovn-central and check the state of both clusters (again as
+root).
+
+.. code-block:: none
+
+   ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
+   ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
+
+If both clusters have the expected number of members, you are done. However, if
+either cluster did not reconfigure and removed servers are still listed, you
+can kick them manually with the following command, where CLUSTER_NAME is either
+"OVN_Southbound" or "OVN_Northbound" and SERVER_ID is the short hexadecimal ID
+from the "cluster/status" output (use ``ovnnb_db.ctl`` for "OVN_Northbound").
+
+.. code-block:: none
+
+   ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick <CLUSTER_NAME> <SERVER_ID>
+
+.. note::
+
+   The ``cluster/kick`` command does everything needed to decrease the number
+   of cluster members by one. It both removes the targeted server and informs
+   the remaining members so their view of the cluster can be automatically
+   updated.
+
+Procedure for release 23.03 and after
+-------------------------------------
+
+Starting with this release, a removed ovn-central unit attempts to leave the
+cluster gracefully, so the operator should not need to do anything other than
+remove the unit with:
+
+.. code-block:: none
+
+   juju remove-unit
+
+To verify that the unit departed the cluster cleanly, wait for the ovn-central
+application to settle and run:
+
+.. code-block:: none
+
+   juju run-action --wait cluster-status
+
+The output shows the YAML-formatted status of both the Southbound and
+Northbound OVN clusters. Each cluster status contains the key "unit_map". If
+this map does not contain any servers in the category "UNKNOWN", downscaling
+completed successfully.
+
+Example of "unit_map" after successful downscaling:
+
+.. code-block:: console
+
+   unit_map:
+     ovn-central/3: 7ed2
+     ovn-central/1: f1ca
+     ovn-central/2: 92d5
+
+
+However, "UNKNOWN" servers may also be listed, for example:
+
+.. code-block:: console
+
+   unit_map:
+     ovn-central/3: 7ed2
+     ovn-central/1: f1ca
+     ovn-central/2: 92d5
+     UNKNOWN:
+       - ba21
+
+This means that downscaling did not complete successfully, and you will have to
+manually kick the servers listed as "UNKNOWN" using the `cluster-kick`_ action
+provided by the charm.
+
+.. LINKS
+.. _cluster-kick: https://charmhub.io/ovn-central/actions?channel=edge#cluster-kick
+