Merge "Add docs for ovn-central downscaling."

This commit is contained in:
Zuul 2023-01-24 15:10:46 +00:00 committed by Gerrit Code Review
commit 27c03fbb86
2 changed files with 131 additions and 0 deletions

View File

@ -28,6 +28,7 @@ General cloud operations:
ops-restart-partitioned-rabbitmq-cluster
ops-replace-control-plane-service-ha
ops-replace-hyperconverged-compute-node
ops-scale-back-ovn-central
Ceph storage operations (published in the Charmed Ceph documentation):

View File

@ -0,0 +1,130 @@
:orphan:
=======================================
Scale back the ovn-central application
=======================================
Preamble
--------
Clean downscaling of the ovn-central application is supported from release
23.03 onwards. Earlier versions of the charm will require some manual steps.
Think about the impact
----------------------
OVN central is using the Raft consensus algorithm to facilitate HA. Raft
* Tolerates up to (N-1)/2 node failures
* Requires minimum quorum of (N/2)+1 members
Changes to the number of members in the OVN cluster affect its fault tolerance
as well as its minimum requirements for quorum. Before you downscale your
cluster, think about the impact it will have on both of these properties.
It is not recommended to downscale ovn-central application below 3 members.
Procedure for releases before 23.03
-----------------------------------
With older releases of ovn-central charm, the operator can run
``juju remove-unit`` command, but internally, OVN cluster will not perform
reconfiguration and it will keep expecting servers from the removed unit to
rejoin the cluster. To cleanly remove units, you have to complete a few manual
steps.
Log into the unit that you intend to remove (using ``juju ssh``) and execute
the following commands as root:
.. code-block:: none
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/leave OVN_Southbound
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/leave OVN_Northbound
This will cause OVN servers hosted on this unit to gracefully leave both
Southbound and Northbound OVN clusters.
Perform unit removal with:
.. code-block:: none
juju remove-unit <UNIT_NAME>
To verify that the downscaling completed successfully, log into one of the
remaining units of ovn-central and check the state of both clusters (again as
root).
.. code-block:: none
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
If both clusters have an expected number of members, you are done. However if
any of the clusters did not perform reconfiguration and removed servers are
still hanging around, you can kick them manually using following command where
CLUSTER_NAME is either "OVN_Southbound" or "OVN_Northbound" and SERVER_ID is
a short hexadecimal number from "cluster/status" output.
.. code-block:: none
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick <CLUSTER_NAME> <SERVER_ID>
.. note::
The ``cluster/kick`` command does everything needed to decrease the number
of cluster members by one. It both removes the targeted server and informs
the remaining members so their view of the cluster can be automatically
updated.
Procedure for release 23.03 and after
-------------------------------------
Starting with this release, the removed ovn-central unit will attempt to perform
a graceful departure from the cluster so the operator should not need to do
anything else than remove the unit with:
.. code-block:: none
juju remove-unit <UNIT_NAME>
To verify that the unit departed cluster cleanly, wait for the ovn-central
application to settle and run:
.. code-block:: none
juju run-action --wait <OVN_CENTRAL_UNIT> cluster-status
This output will show yaml-formatted status of both Southbound and Northbound
OVN clusters. Each cluster status will contain key "unit_map", if this list
does not contain any servers in category "UNKNOWN", it means that downscaling
completed successfully.
Example of "unit_map" after successful downscaling:
.. code-block:: console
unit_map:
ovn-central/3: 7ed2
ovn-central/1: f1ca
ovn-central/2: 92d5
However if there are "UNKNOWN" servers, for example like this:
.. code-block:: console
unit_map:
ovn-central/3: 7ed2
ovn-central/1: f1ca
ovn-central/2: 92d5
UNKNOWN:
- ba21
It means that downscaling did not complete successfully, and you'll have to
manually kick servers listed as "UNKNOWN" using the `cluster-kick`_ action
provided by the charm.
.. LINKS
.. _cluster-kick: https://charmhub.io/ovn-central/actions?channel=edge#cluster-kick