Add Distributed Cloud GEO Redundancy docs (r9, dsr8MR3)

- Overview of the feature
- Procedure for configuring the feature

Story: 2010852
Task: 48493

Change-Id: If5fd6792adbb7e77ab2e92f29527c951be0134ee
Signed-off-by: Litao Gao <litao.gao@windriver.com>
Signed-off-by: Ngairangbam Mili <ngairangbam.mili@windriver.com>

@@ -23,6 +23,8 @@
.. |os-prod-hor| replace:: OpenStack |prod-hor|
.. |prod-img| replace:: https://mirror.starlingx.windriver.com/mirror/starlingx/
.. |prod-abbr| replace:: StX
.. |prod-dc-geo-red| replace:: Distributed Cloud Geo Redundancy
.. |prod-dc-geo-red-long| replace:: Distributed Cloud System Controller Geographic Redundancy
.. Guide names; will be formatted in italics by default.
.. |node-doc| replace:: :title:`StarlingX Node Configuration and Management`


@@ -16,6 +16,12 @@ system data backup file has been generated on the subcloud, it will be
transferred to the system controller and stored at a dedicated central location
``/opt/dc-vault/backups/<subcloud-name>/<release-version>``.
.. note::
Enabling the GEO Redundancy function will affect some of the subcloud
backup functions. For more information on GEO Redundancy and its
restrictions, see :ref:`configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662`.
Backup data creation requires the subcloud to be online, managed, and in a
healthy state.


@@ -0,0 +1,617 @@
.. _configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662:
============================================================
Configure Distributed Cloud System Controller GEO Redundancy
============================================================
.. rubric:: |context|
You can configure a distributed cloud System Controller GEO Redundancy
using DC manager |CLI| commands.
System administrators can follow the procedures below to enable and
disable the GEO Redundancy feature.
.. note::
In this release, the GEO Redundancy feature supports only two
distributed clouds in one protection group.
.. contents::
:local:
:depth: 1
---------------------
Enable GEO Redundancy
---------------------
Set up a protection group for two distributed clouds, making these two
distributed clouds operational in 1+1 active GEO Redundancy mode.
For example, let us assume we have two distributed clouds, site A and site B.
When the operation is performed on site A, the local site is site A and the
peer site is site B. When the operation is performed on site B, the local
site is site B and the peer site is site A.
.. rubric:: |prereq|
The peer system controllers' |OAM| networks must be reachable from each other,
and each system controller must be able to access the subclouds via both the
|OAM| and management networks.
In a production system, queries from the peer site must be authenticated and
protected. This requires the system REST API to be served over HTTPS, using a
certificate signed by a well-known and trusted |CA|, so that the peers can
communicate securely.
If you are using an internally trusted |CA|, ensure that the system trusts the |CA| by installing
its certificate with the following command.
.. code-block:: none
~(keystone_admin)]$ system certificate-install --mode ssl_ca <trusted-ca-bundle-pem-file>
where:
``<trusted-ca-bundle-pem-file>``
is the path to the intermediate or Root |CA| certificate associated
with the |prod| REST API's Intermediate or Root |CA|-signed certificate.
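You can confirm that the |CA| certificate has been installed by listing the certificates on the system, for example:
.. code-block:: none
~(keystone_admin)]$ system certificate-list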
.. rubric:: |proc|
You can enable the GEO Redundancy feature between site A and site B from the
command line. In this procedure, the subclouds managed by site A are
configured to be managed by a GEO Redundancy protection group that consists of
site A and site B. When site A goes offline for some reason, an alarm notifies
the administrator, who initiates the group-based batch migration to rehome the
subclouds of site A to site B for centralized management.
Similarly, you can configure the subclouds managed by site B to be taken over
by site A when site B is offline by following the same procedure, where site B
is the local site and site A is the peer site.
#. Log in to the active controller node of site B and collect the information
about site B that is required to create a protection group:
* Unique |UUID| of the central cloud of the peer system controller
* URI of Keystone endpoint of peer system controller
* Gateway IP address of the management network of peer system controller
For example:
.. code-block:: bash
# On site B
sysadmin@controller-0:~$ source /etc/platform/openrc
~(keystone_admin)]$ system show | grep -i uuid
| uuid | 223fcb30-909d-4edf-8c36-1aebc8e9bd4a |
~(keystone_admin)]$ openstack endpoint list --service keystone \
--interface public --region RegionOne -c URL
+-----------------------------+
| URL |
+-----------------------------+
| http://10.10.10.2:5000 |
+-----------------------------+
~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$"
gateway
10.10.27.1
#. Log in to the active controller node of the central cloud of site A. Create
a System Peer instance of site B on site A so that site A has the information
required to access site B.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager system-peer add \
--peer-uuid 223fcb30-909d-4edf-8c36-1aebc8e9bd4a \
--peer-name siteB \
--manager-endpoint http://10.10.10.2:5000 \
--peer-controller-gateway-address 10.10.27.1
Enter the admin password for the system peer:
Re-enter admin password to confirm:
+----+--------------------------------------+-----------+-----------------------------+----------------------------+
| id | peer uuid | peer name | manager endpoint | controller gateway address |
+----+--------------------------------------+-----------+-----------------------------+----------------------------+
| 2 | 223fcb30-909d-4edf-8c36-1aebc8e9bd4a | siteB | http://10.10.10.2:5000 | 10.10.27.1 |
+----+--------------------------------------+-----------+-----------------------------+----------------------------+
#. Collect the information from site A.
.. code-block:: bash
# On site A
sysadmin@controller-0:~$ source /etc/platform/openrc
~(keystone_admin)]$ system show | grep -i uuid
~(keystone_admin)]$ openstack endpoint list --service keystone --interface public --region RegionOne -c URL
~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$"
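For example, the output on site A might look like the following. These values are illustrative and match the ones used when creating the system peer in the next step:
.. code-block:: bash
# On site A
~(keystone_admin)]$ system show | grep -i uuid
| uuid | 3963cb21-c01a-49cc-85dd-ebc1d142a41d |
~(keystone_admin)]$ openstack endpoint list --service keystone --interface public --region RegionOne -c URL
+------------------------+
| URL                    |
+------------------------+
| http://10.10.11.2:5000 |
+------------------------+
~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$"
gateway
10.10.25.1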
#. Log in to the active controller node of the central cloud of site B. Create
a System Peer instance of site A on site B so that site B has information about site A.
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager system-peer add \
--peer-uuid 3963cb21-c01a-49cc-85dd-ebc1d142a41d \
--peer-name siteA \
--manager-endpoint http://10.10.11.2:5000 \
--peer-controller-gateway-address 10.10.25.1
Enter the admin password for the system peer:
Re-enter admin password to confirm:
#. Create a |SPG| for site A.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager subcloud-peer-group add --peer-group-name group1
#. Add the subclouds that need redundancy protection on site A.
Ensure that the bootstrap data of these subclouds is up to date. The bootstrap
data is the data used to bootstrap the subcloud, which includes the |OAM| and
management network information, the system controller gateway information, and
the Docker registry information needed to pull the images required to bootstrap
the system.
For an example of a typical bootstrap file, see :ref:`installing-and-provisioning-a-subcloud`.
#. Update the subcloud information with the bootstrap values.
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud update subcloud1 \
--bootstrap-address <Subcloud_OAM_IP_Address> \
--bootstrap-values <Path_of_Bootstrap-Value-File>
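For example, with an illustrative |OAM| address and bootstrap file path (both values are hypothetical and only show the syntax):
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud update subcloud1 \
--bootstrap-address 10.10.30.2 \
--bootstrap-values /home/sysadmin/subcloud1-bootstrap-values.yml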
#. Update the subcloud information with the |SPG| created locally.
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud update <SiteA-Subcloud1-Name> \
--peer-group <SiteA-Subcloud-Peer-Group-ID-or-Name>
For example,
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group group1
#. If you want to remove one subcloud from the |SPG|, run the
following command:
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud update <SiteA-Subcloud-Name> --peer-group none
For example,
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group none
#. Check the subclouds that are under the |SPG|.
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds <SiteA-Subcloud-Peer-Group-ID-or-Name>
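For example, to list the subclouds in ``group1``:
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds group1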
#. Create an association between the System Peer and |SPG|.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager peer-group-association add \
--system-peer-id <SiteB-System-Peer-ID> \
--peer-group-id <SiteA-System-Peer-Group1> \
--peer-group-priority <priority>
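For example, using the system peer created above (ID ``2``) and the |SPG| ``group1`` (ID ``1``); these IDs are illustrative, so substitute the values reported on your own system:
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager peer-group-association add \
--system-peer-id 2 \
--peer-group-id 1 \
--peer-group-priority 2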
The ``peer-group-priority`` parameter accepts an integer value greater than 0.
It sets the priority of the |SPG| that is created in the peer site, using the
peer site's dcmanager API, during association synchronization.
* The default priority of the |SPG| is 0 when it is created in the local site.
* The smallest integer has the highest priority.
During association creation, the |SPG| in the association and the subclouds
belonging to it are synchronized from the local site to the peer site.
Confirm that the local |SPG| and its subclouds have been synchronized
into site B with the same name.
* Show the association information just created in site A and ensure that
``sync_status`` is ``in-sync``.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager peer-group-association show <Association-ID>
+----+---------------+----------------+---------+-----------------+---------------------+
| id | peer_group_id | system_peer_id | type | sync_status | peer_group_priority |
+----+---------------+----------------+---------+-----------------+---------------------+
| 1 | 1 | 2 | primary | in-sync | 2 |
+----+---------------+----------------+---------+-----------------+---------------------+
* Show ``subcloud-peer-group`` in site B and ensure that it has been created.
* List the subcloud in ``subcloud-peer-group`` in site B and ensure that all
the subclouds have been synchronized as secondary subclouds.
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager subcloud-peer-group show <SiteA-Subcloud-Peer-Group-Name>
~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds <SiteA-Subcloud-Peer-Group-Name>
When you create the primary association on site A, a non-primary association
on site B will automatically be created to associate the synchronized |SPG|
from site A and the system peer pointing to site A.
You can check the association list to confirm if the non-primary association
was created on site B.
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager peer-group-association list
+----+---------------+----------------+-------------+-------------+---------------------+
| id | peer_group_id | system_peer_id | type | sync_status | peer_group_priority |
+----+---------------+----------------+-------------+-------------+---------------------+
| 2 | 26 | 1 | non-primary | in-sync | None |
+----+---------------+----------------+-------------+-------------+---------------------+
#. (Optional) Update the protection group related configuration.
After the peer group association has been created, you can still update the
related resources configured in the protection group:
* Update subcloud with bootstrap values
* Add subcloud(s) into the |SPG|
* Remove subcloud(s) from the |SPG|
After any of the above operations, ``sync_status`` changes to ``out-of-sync``.
After the update is complete, use the :command:`sync` command to push the |SPG|
changes to the peer site so that the |SPG| stays in the same state on both
sites.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager peer-group-association sync <SiteA-Peer-Group-Association1-ID>
.. warning::
The :command:`dcmanager peer-group-association sync` command must be run
after any of the following changes:
- A subcloud is removed from the |SPG| for a subcloud name change.
- A subcloud is removed from the |SPG| for a subcloud management network
reconfiguration.
- A subcloud is updated with one or both of the ``--bootstrap-address`` and
``--bootstrap-values`` parameters.
Similarly, verify that the information has been synchronized by showing the
association information on site A and confirming that ``sync_status`` is
``in-sync``.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager peer-group-association show <Association-ID>
+----+---------------+----------------+---------+-----------------+---------------------+
| id | peer_group_id | system_peer_id | type | sync_status | peer_group_priority |
+----+---------------+----------------+---------+-----------------+---------------------+
| 1 | 1 | 2 | primary | in-sync | 2 |
+----+---------------+----------------+---------+-----------------+---------------------+
.. rubric:: |result|
You have configured a GEO Redundancy protection group between site A and site B.
If site A goes offline, the subclouds configured in the |SPG| can be manually
migrated in a batch to site B for centralized management.
----------------------------
Health Monitor and Migration
----------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Peer Monitoring and Alarming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After the peer protection group is formed, if site A becomes unreachable from
site B, an alarm is raised on site B.
For example:
.. code-block:: bash
# On site B
~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+
| 280.004 | Peer siteA is in disconnected state. Following subcloud peer groups are impacted: group1. | peer=223fcb30-909d-4edf- | major | 2023-08-18T10:25:29. |
| | | 8c36-1aebc8e9bd4a | | 670977 |
| | | | | |
+----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+
The administrator can suppress the alarm with the following command:
.. code-block:: bash
# On site B
~(keystone_admin)]$ fm event-suppress --alarm_id 280.004
+----------+------------+
| Event ID | Status |
+----------+------------+
| 280.004 | suppressed |
+----------+------------+
~~~~~~~~~
Migration
~~~~~~~~~
If site A is down, after receiving the alarm the administrator can choose to
perform the migration on site B, which migrates the subclouds in the |SPG| from
site A to site B.
.. note::
Before initiating the migration operation, ensure that ``sync_status`` of the
peer group association is ``in-sync`` so that the latest updates from site A
have been successfully synchronized to site B. If ``sync_status`` is not
``in-sync``, the migration may fail.
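For example, you can confirm the synchronization status of the non-primary association on site B before starting the migration:
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager peer-group-association list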
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate <Subcloud-Peer-Group-ID-or-Name>
# For example:
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1
During the batch migration, you can check the migration status of each
subcloud in the |SPG| by querying the status of the |SPG| being migrated.
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager subcloud-peer-group status <Subcloud-Peer-Group-ID-or-Name>
After successful migration, the subcloud(s) should be in
``managed/online/complete`` status on site B.
For example:
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager subcloud list
+----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+
| id | name | management | availability | deploy status | sync | backup status | backup datetime |
+----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+
| 45 | subcloud3-node2 | managed | online | complete | in-sync | None | None |
| 46 | subcloud1-node6 | managed | online | complete | in-sync | None | None |
+----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+
~~~~~~~~~~~~~~
Post Migration
~~~~~~~~~~~~~~
When site A is restored, the subcloud(s) on site A are set to
``unmanaged/secondary`` status. An alarm is raised on site A notifying the
administrator that the |SPG| is managed by a peer site (site B) with lower
priority, because this |SPG| on site A has the higher priority.
.. code-block:: bash
~(keystone_admin)]$ fm alarm-list
+----------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+-----------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+-----------------------+
| 280.005 | Subcloud peer group (peer_group_name=group1) is managed by remote system | subcloud_peer_group=7 | warning | 2023-09-04T04:51:58. |
| | (peer_uuid=223fcb30-909d-4edf-8c36-1aebc8e9bd4a) with lower priority. | | | 435539 |
| | | | | |
+----------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+-----------------------+
Then, the administrator can decide if and when to migrate the subcloud(s) back.
.. code-block:: bash
# On site A
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate <Subcloud-Peer-Group-ID-or-Name>
# For example:
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1
After a successful migration, the subclouds should return to the
``managed/online/complete`` status on site A.
For example:
.. code-block:: bash
+----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+
| id | name | management | availability | deploy status | sync | backup status | backup datetime |
+----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+
| 33 | subcloud3-node2 | managed | online | complete | in-sync | None | None |
| 34 | subcloud1-node6 | managed | online | complete | in-sync | None | None |
+----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+
Also, the alarm mentioned above is cleared after the subclouds are migrated back.
.. code-block:: bash
~(keystone_admin)]$ fm alarm-list
----------------------
Disable GEO Redundancy
----------------------
You can disable the GEO Redundancy feature from the command line.
Before disabling the GEO Redundancy feature, ensure that the environment is
stable and that the subclouds are managed by the expected site.
.. rubric:: |proc|
#. Delete the primary association on both sites.
.. code-block:: bash
# site A
~(keystone_admin)]$ dcmanager peer-group-association delete <SiteA-Peer-Group-Association1-ID>
#. Delete the |SPG|.
.. code-block:: bash
# site A
~(keystone_admin)]$ dcmanager subcloud-peer-group delete group1
#. Delete the system peer.
.. code-block:: bash
# site A
~(keystone_admin)]$ dcmanager system-peer delete siteB
# site B
~(keystone_admin)]$ dcmanager system-peer delete siteA
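Optionally, confirm that no associations, |SPGs|, or system peers remain on either site. This is a minimal check; it assumes that the corresponding ``list`` commands are available in your dcmanager release:
.. code-block:: bash
~(keystone_admin)]$ dcmanager peer-group-association list
~(keystone_admin)]$ dcmanager subcloud-peer-group list
~(keystone_admin)]$ dcmanager system-peer list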
.. rubric:: |result|
You have torn down the protection group between site A and site B.
---------------------------
Backup and Restore Subcloud
---------------------------
You can back up and restore a subcloud in a distributed cloud environment.
However, GEO redundancy does not support the replication of subcloud backup
files from one site to another.
A subcloud backup is valid only for the current system controller. When a
subcloud is migrated from site A to site B, the existing backup becomes
unavailable. In this case, you can create a new backup of that subcloud on site
B. Subsequently, you can restore the subcloud from this newly created backup
when it is managed under site B.
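For example, once the subcloud is managed by site B, a new backup can be created from site B. This is a minimal sketch that assumes the ``dcmanager subcloud-backup`` command set described in the documents referenced below:
.. code-block:: bash
# On site B
~(keystone_admin)]$ dcmanager subcloud-backup create --subcloud subcloud1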
For information on how to back up and restore a subcloud, see
:ref:`backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42`
and :ref:`restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e`.
-------------------------------------------
Operations Performed by Protected Subclouds
-------------------------------------------
The table below lists the operations that can/cannot be performed on the protected subclouds.
**Primary site**: The site where the |SPG| was created.
**Secondary site**: The peer site where the subclouds in the |SPG| can be migrated to.
**Protected subcloud**: The subcloud that belongs to a |SPG|.
**Local/Unprotected subcloud**: The subcloud that does not belong to any |SPG|.
.. list-table::
   :widths: 20 10 70
   :header-rows: 1

   * - Operation
     - Allowed (Y/N/Maybe)
     - Note
   * - Unmanage
     - N
     - Subcloud must be removed from the |SPG| before it can be manually unmanaged.
   * - Manage
     - N
     - Subcloud must be removed from the |SPG| before it can be manually managed.
   * - Delete
     - N
     - Subcloud must be removed from the |SPG| before it can be manually unmanaged and deleted.
   * - Update
     - Maybe
     - Subcloud can only be updated while it is managed in the primary site, because the sync command can only be issued from the system controller where the |SPG| was created.

       .. warning::

          The subcloud network cannot be reconfigured while the subcloud is being managed by the secondary site. If this operation is necessary, perform the following steps:

          #. Remove the subcloud from the |SPG| to make it a local/unprotected subcloud.
          #. Update the subcloud.
          #. (Optional) Manually rehome the subcloud to the primary site after it is restored.
          #. (Optional) Re-add the subcloud to the |SPG|.
   * - Rename
     - Y
     - - If the subcloud in the primary site is already part of an |SPG|, remove it from the |SPG|, then unmanage, rename, and manage the subcloud, add it back to the |SPG|, and perform the sync operation.

       - If the subcloud is in the secondary site, perform the following steps:

         #. Remove the subcloud from the |SPG| to make it a local/unprotected subcloud.
         #. Unmanage the subcloud.
         #. Rename the subcloud.
         #. (Optional) Manually rehome the subcloud to the primary site after it is restored.
         #. (Optional) Re-add the subcloud to the |SPG|.
   * - Patch
     - Y
     - .. warning::

          There may be a patch out-of-sync alarm when the subcloud is migrated to another site.
   * - Upgrade
     - Y
     - All the system controllers in the protection group must be upgraded before any of the subclouds are upgraded.
   * - Rehome
     - N
     - Subcloud cannot be manually rehomed while it is part of the |SPG|.
   * - Backup
     - Y
     -
   * - Restore
     - Maybe
     - - If the subcloud in the primary site is already part of an |SPG|, remove it from the |SPG|, then unmanage and restore the subcloud, add it back to the |SPG|, and perform the sync operation.

       - If the subcloud is in the secondary site, perform the following steps:

         #. Remove the subcloud from the |SPG| to make it a local/unprotected subcloud.
         #. Unmanage the subcloud.
         #. Restore the subcloud from the backup.
         #. (Optional) Manually rehome the subcloud to the primary site after it is restored.
         #. (Optional) Re-add the subcloud to the |SPG|.
   * - Prestage
     - Y
     - .. warning::

          The prestage data will get overwritten because it is not guaranteed that both system controllers always run on the same patch level (ostree repo) and/or have the same images list.
   * - Reinstall
     - Y
     -
   * - Remove from |SPG|
     - Maybe
     - Subcloud can be removed from the |SPG| in the primary site. Subcloud can only be removed from the |SPG| in the secondary site if the primary site is currently down.
   * - Add to |SPG|
     - Maybe
     - Subcloud can only be added to the |SPG| in the primary site, as a manual sync is required.


@@ -175,6 +175,16 @@ Upgrade Orchestration for Distributed Cloud SubClouds
failure-prior-to-the-installation-of-n-plus-1-load-on-a-subcloud
failure-during-the-installation-or-data-migration-of-n-plus-1-load-on-a-subcloud
--------------------------------------------------
Distributed Cloud System Controller GEO Redundancy
--------------------------------------------------
.. toctree::
:maxdepth: 1
overview-of-distributed-cloud-geo-redundancy
configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662
--------
Appendix
--------


@@ -0,0 +1,118 @@
.. eho1558617205547
.. _overview-of-distributed-cloud-geo-redundancy:
============================================
Overview of Distributed Cloud GEO Redundancy
============================================
|prod-long| |prod-dc-geo-red| configuration supports the ability to recover from
a catastrophic event that requires subclouds to be rehomed away from the failed
system controller site to the available site(s) that have enough spare capacity.
This way, even if the failed site cannot be restored in a short time, the
subclouds can still be rehomed to the available peer system controller(s) for
centralized management.
In this configuration, the following items are addressed:
* 1+1 GEO redundancy
- Active-Active redundancy model
- The total number of subclouds should not exceed 1000
* Automated operations
- Synchronization and liveness check between peer systems
- Alarm generation if peer system controller is down
* Manual operations
- Batch rehoming from the surviving peer system controller
---------------------------------------------
Distributed Cloud GEO Redundancy Architecture
---------------------------------------------
The 1+1 Distributed Cloud GEO Redundancy architecture consists of two local
high-availability Distributed Cloud clusters. They are mutual peers that form a
protection group, as illustrated in the figure below:
.. image:: figures/dcg1695034653874.png
The architecture features a synchronized distributed control plane for
geographic redundancy: a system peer instance is created in each local
Distributed Cloud cluster, pointing to the other cluster via its Keystone
endpoint, to form a system protection group.
If the administrator wants the peer site to take over the subclouds when the
local system controller is in a failure state, an |SPG| needs to be created and
subclouds need to be assigned to it. Then, a peer group association needs to be
created to link the system peer and the |SPG| together. The |SPG| information
and the subclouds in it are synchronized to the peer site via the endpoint
information stored in the system peer instance.
The peer sites perform health checks on each other via the endpoint information
stored in the system peer instance. If the local site detects that the peer
site is not reachable, it raises an alarm to alert the administrator.
If the failed site cannot be restored quickly, the administrator needs to
initiate batch subcloud migration by performing migration on the |SPG| from the
healthy peer of the failed site.
When the failed site has been restored and is ready for service, the
administrator can initiate a batch subcloud migration from the restored site to
migrate back all the subclouds in the |SPG| for geographic proximity.
**Protection Group**
A group of peer sites that are configured to monitor each other and decide how
to take over the subclouds (based on predefined |SPGs|) if any peer in the
group fails.
**System Peer**
A logical entity created in a system controller site. The system controller
site uses the information (Keystone endpoint, credentials) stored in the system
peer for health checks and data synchronization.
**Subcloud Secondary Deploy State**
This is a newly introduced state for a subcloud. If a subcloud is in the
secondary deploy state, the subcloud instance is only a placeholder holding the
configuration parameters, which can be used to migrate the corresponding
subcloud from the peer site. After rehoming, the subcloud's state changes from
secondary to complete and the subcloud is managed by the local site, while the
subcloud instance on the peer site changes to secondary.
**Subcloud Peer Group**
A group of locally managed subclouds that is duplicated to a peer site as
secondary subclouds. The |SPG| instance is also created in the peer site and
contains all of the duplicated secondary subclouds.
Multiple |SPGs| are supported, and |SPG| membership is decided by the
administrator, so the administrator can divide local subclouds into different
groups.
An |SPG| can be used to initiate subcloud batch migration. For example, when
the peer site has been detected to be down and the local site is supposed to
take over the management of the subclouds in the failed peer site, the
administrator can perform an |SPG| migration to migrate all the subclouds in
the |SPG| to the local site for centralized management.
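For example, batch migration of an |SPG| named ``group1`` (an illustrative name) is initiated from the surviving site with the dcmanager |CLI|; see :ref:`configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662` for the complete procedure:
.. code-block:: bash
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1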
**Subcloud Peer Group Priority**
The priority is an attribute of the |SPG| instance; the |SPG| is synchronized
to each peer site in the protection group with a different priority value.
In a protection group, there can be multiple system peers. The site that owns
the |SPG| with the highest priority (smallest value) is the leader site, which
needs to initiate the batch migration to take over the subclouds grouped by the
|SPG|.
**Subcloud Peer Group and System Peer Association**
Association refers to the binding relationship between an |SPG| and a system
peer. When the association between an |SPG| and a system peer is created on the
local site, the |SPG| and the subclouds in the group are duplicated to the peer
site to which the system peer in this association points. This way, when the
local site is down, the peer site has enough information to initiate the
|SPG|-based batch migration to take over centralized management of the
subclouds previously managed by the failed site.
One system peer can be associated with multiple |SPGs|. One |SPG| can be
associated with multiple system peers, with a priority specified for each. This
priority is used to decide which peer site has the higher priority to take over
the subclouds when batch migration needs to be performed.


@@ -17,6 +17,12 @@ controller using the rehoming playbook.
The rehoming playbook does not work with freshly installed/bootstrapped
subclouds.
.. note::
Manual rehoming is not possible if a subcloud is included in an |SPG|.
Use the :command:`dcmanager subcloud-peer-group migrate` command for automatic
rehoming. For more information, see :ref:`configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662`.
.. note::
The system time should be accurately configured on the system controllers
@@ -27,7 +33,7 @@ controller using the rehoming playbook.
Do not rehome a subcloud if the RECONCILED status on the system resource or
any host resource of the subcloud is FALSE. To check the RECONCILED status,
run the :command:`kubectl -n deployment get system` and :command:`kubectl -n deployment get hosts` commands.
Use the following procedure to enable subcloud rehoming and to update the new
subcloud configuration (networking parameters, passwords, etc.) to be
compatible with the new system controller.