Merge "Types of Errors reported on Subclouds introduced in Distributed Cloud System Controller GEO Redundancy - Phase1 (r9, dsr8MR3)"
		@@ -205,3 +205,4 @@ Appendix
    distributed-cloud-ports-reference
    certificate-management-for-admin-rest-api-endpoints
    subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae
@@ -0,0 +1,137 @@
.. _subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae:

==============================================================
Subcloud GEO Redundancy Error Root Cause and Correction Action
==============================================================

This section describes error scenarios that can occur while using the GEO
Redundancy feature. The scenarios assume two distributed clouds, site A and
site B, with the GEO Redundancy feature activated, site A designated as the
primary site, and site B designated as the non-primary site. The feature allows
subclouds to be migrated to the non-primary site when the primary site becomes
unavailable, and migrated back to the primary site when it becomes available
again.

The error scenarios are divided into the following categories:

.. contents::
   :local:
   :depth: 1

----------------------
Protection group setup
----------------------

This section covers errors and issues detected during setup of the protection
group.

.. table::
    :widths: auto

    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Error scenarios                                                     |  Recovery mechanism                                                                                                                                                |
    +=====================================================================+====================================================================================================================================================================+
    | Site A goes down temporarily in the middle of association.          |  Upon site A recovery, the peer group association will automatically change its sync status to ``failed``.                                                         |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  The administrator can trigger re-sync from the ``primary`` site if ``sync_status`` is either ``failed`` or ``out-of-sync``.                                       |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  Possible values of ``sync_status`` include ``syncing``, ``in-sync``, ``out-of-sync``, ``failed``, and ``unknown``.                                                |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  Possible values of ``association_type`` include ``primary``, ``non-primary``.                                                                                     |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Site A is down in the middle of synchronization and remains offline |  The administrator can check the peer group association sync status in the non-primary site to decide the next step. If the sync status is ``in-sync``,            |
    | for an extended period of time.                                     |  migration can be initiated.                                                                                                                                       |
    |                                                                     |                                                                                                                                                                    |
    | How does the user check the syncing status from site B to initiate  |                                                                                                                                                                    |
    | the migration?                                                      |                                                                                                                                                                    |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | After initial sync is completed, site B goes down.                  |                                                                                                                                                                    |
    | How does site A sync to site B after site B comes back online?      |  Site A needs to keep track of subcloud group updates when site B is down. The sync status will go into unknown status in site A.                                  |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  The peer group association sync status in site A will change to ``unknown`` as soon as site B becomes unavailable. Upon the recovery of site B, the sync status   |
    |                                                                     |  will become ``in-sync`` on both sites again.                                                                                                                      |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  If changes are made to the peer group while site B is offline, the sync status in site A will change to ``failed``. Upon the recovery of site B,                  |
    |                                                                     |  the sync status in site A will change to ``out-of-sync``. The administrator will need to re-initiate the sync in site A using                                     |
    |                                                                     |  the :command:`dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>` command.                                                                   |
    |                                                                     |                                                                                                                                                                    |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Site B is offline while creating peer group association to associate|  Creation of association will be accepted but ``sync_status`` will be ``failed``. Protection group cannot be created.                                              |
    | a peer and a |SPG|.                                                 |                                                                                                                                                                    |
    |                                                                     |  The administrator can re-sync the association after site B is online using the :command:`dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>` |
    |                                                                     |  command.                                                                                                                                                          |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Swact occurs in site A while a peer group association is syncing.   |  The expected behavior is similar to that of an abrupt site A shutdown during sync.                                                                                |
    |                                                                     |  Re-sync needs to be done.                                                                                                                                         |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Swact occurs in site B while a peer group association is syncing.   |  The expected behavior is similar to that of an abrupt site B shutdown during sync.                                                                                |
    |                                                                     |  Re-sync needs to be done.                                                                                                                                         |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | In the event of either site going down or swact occurring:          |  a) Use the :command:`dcmanager peer-group-association show <association-id>` command to view the sync status in the available site.                               |
    |                                                                     |     If the status is ``in-sync``, all the subclouds are added; otherwise synchronization has not finished and it needs to be re-initiated in the primary site when |
    |                                                                     |     both sites are online.                                                                                                                                         |
    | a) How to track secondary subclouds added to site B                 |  b) Run the :command:`dcmanager subcloud-peer-group list-subclouds <peer-group>` command on site B to check                                                        |
    |    and subclouds yet to be added to site B as secondary subclouds?  |     total number of secondary subclouds and the subcloud details.                                                                                                  |
    | b) How to track subclouds newly added to the peer group and         |                                                                                                                                                                    |
    |    subclouds yet to be added to the peer group?                     |                                                                                                                                                                    |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |                                                                                                                                                                    |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
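
The recovery decisions in the table above can be summarized as a small lookup.
The following is a minimal sketch and is not part of dcmanager:
``recommend_action`` is a hypothetical helper that takes a ``sync_status``
value, read manually from the output of
:command:`dcmanager peer-group-association show <association-id>`, and prints
the step the table suggests. The handling of ``syncing`` and ``unknown`` (wait
and check again) is an assumption drawn from the scenarios above.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical helper: maps a sync_status value to the recovery step
    # described in the table above. It does not call dcmanager itself.
    recommend_action() {
        local status="$1"
        case "$status" in
            in-sync)
                # All subclouds are synchronized to the peer site.
                echo "proceed: no re-sync needed; migration can be initiated"
                ;;
            failed|out-of-sync)
                # Re-sync is triggered from the primary site.
                echo "resync: run 'dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>' on the primary site"
                ;;
            syncing)
                # Assumption: a sync in progress should be left to finish.
                echo "wait: synchronization is still in progress"
                ;;
            unknown)
                # Assumption: the peer site is unreachable; re-check later.
                echo "wait: peer site is unavailable; check again after it recovers"
                ;;
            *)
                echo "error: unrecognized sync_status '$status'" >&2
                return 1
                ;;
        esac
    }

For example, ``recommend_action out-of-sync`` prints the re-sync reminder, and
an unrecognized value makes the function return a non-zero exit code.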

---------
Migration
---------

Assumption: Subclouds will be migrated to site B if site A goes down.

The following error scenarios can occur during peer group migration.

.. table::
    :widths: auto

    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Error scenarios                                                     |  Recovery mechanism                                                                                                                                                |
    +=====================================================================+====================================================================================================================================================================+
    | What will be the status of the |SPG| if some subclouds fail         |  After the migration, you can use :command:`dcmanager subcloud-peer-group list-subclouds` to check the subclouds status under this |SPG| and you can check the     |
    | to migrate?                                                         |  |SPG| status using :command:`dcmanager subcloud-peer-group status`.                                                                                               |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  Re-run the :command:`dcmanager subcloud-peer-group migrate PEER_GROUP` command after fixing the failure.                                                          |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | How to recover when the subcloud rehome fails because of            |  When site A goes down, migrate |SPG| to site B. The subcloud will go to the ``rehome-failed`` deploy status when it has the wrong bootstrap address or bootstrap  |
    | incorrect bootstrap address or bootstrap values and site A cannot   |  values. You can update the bootstrap address and bootstrap values if the subcloud migration fails and the primary site is down using the                          |
    | recover in a timely manner?                                         |  :command:`dcmanager subcloud update --bootstrap-address` and :command:`dcmanager subcloud update --bootstrap-values` commands. You do not need to remove          |
    |                                                                     |  the rehome-failed subcloud from the |SPG|.                                                                                                                        |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | How to fix when the subcloud has incorrect bootstrap address        |  Check the |SPG| migration status using the :command:`dcmanager subcloud-peer-group status` command to confirm if it has a subcloud in ``rehoming`` status.        |
    | or bootstrap values in the following situations during |SPG|        |  If there is no subcloud in ``rehoming`` status, it means the |SPG| migration was completed and you need to migrate the |SPG| back to site A. You can update the   |
    | migration to site B?                                                |  subcloud after the migration failure and try again. If you want to recover the subcloud, follow the instructions below:                                           |
    |                                                                     |                                                                                                                                                                    |
    | - Site A is recovered during migration.                             |  - When site A is recovered during migration, you can update the subcloud on site A. After the update, you need to wait for the |SPG| migration process to finish. |
    |                                                                     |    You can then migrate |SPG| back to site A to recover the subcloud.                                                                                              |
    | - Site A is recovered post migration.                               |  - When site A is recovered post migration, you can migrate the |SPG| back to site A. If the subcloud rehome fails again in site A, you can update the subcloud.   |
    |                                                                     |                                                                                                                                                                    |
    | - Site A is online before the migration process.                    |  - When site A is online before the migration process, you can update the subcloud on site A and sync the updated subcloud to site B.                              |
    |                                                                     |                                                                                                                                                                    |
    |                                                                     |  Use the :command:`dcmanager subcloud update --bootstrap-address` and :command:`dcmanager subcloud update --bootstrap-values` commands to update the subcloud.     |
    |                                                                     |  You do not need to remove the rehome-failed subcloud from the |SPG|.                                                                                              |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Site B goes down during |SPG| migration.                            |  Re-execute the |SPG| migration if there is any subcloud with ``rehome-failed`` deploy status after site B is online.                                              |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
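
The checks described in the table above follow a simple order: first confirm
that no subcloud is still ``rehoming``, then deal with any ``rehome-failed``
subclouds. The following is a minimal sketch of that order.
``migration_next_step`` is a hypothetical helper, not a dcmanager command, and
it assumes that the two counts are read manually from the output of
:command:`dcmanager subcloud-peer-group status`.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical helper: given the number of subclouds in "rehoming" and
    # "rehome-failed" status, print the next step suggested by the table above.
    migration_next_step() {
        local rehoming="$1"
        local rehome_failed="$2"
        if [ "$rehoming" -gt 0 ]; then
            # A subcloud in rehoming status means the migration is not finished.
            echo "wait: |SPG| migration is still in progress"
        elif [ "$rehome_failed" -gt 0 ]; then
            # Fix the cause (for example, bootstrap address or values), then retry.
            echo "retry: fix the failure, then re-run 'dcmanager subcloud-peer-group migrate PEER_GROUP'"
        else
            echo "done: no subclouds are rehoming or rehome-failed"
        fi
    }

For example, ``migration_next_step 0 2`` prints the retry reminder, because the
migration has finished but two subclouds failed to rehome.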

--------------
Post migration
--------------

Audit operations are triggered when the network is restored or when the
retrieved ``migration_status`` of the peer group changes to ``complete``.

.. table::
    :widths: auto

    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Error scenarios                                                     |  Recovery mechanism                                                                                                                                                |
    +=====================================================================+====================================================================================================================================================================+
    |                                                                     |                                                                                                                                                                    |
    |  Site B goes down after the |SPG| has been migrated to its site.    | Upon site A recovery, the administrator can trigger the migration of the |SPG| back to site A.                                                                     |
    |                                                                     |                                                                                                                                                                    |
    +---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+