Merge "Types of Errors reported on Subclouds introduced in Distributed Cloud System Controller GEO Redundancy - Phase1 (r9, dsr8MR3)"
Commit 0194155b00
@ -205,3 +205,4 @@ Appendix
distributed-cloud-ports-reference
certificate-management-for-admin-rest-api-endpoints
subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae
@ -0,0 +1,137 @@
.. _subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae:

==============================================================
Subcloud GEO Redundancy Error Root Cause and Correction Action
==============================================================

This section describes the error scenarios that can occur while using the
GEO Redundancy feature. The scenarios assume two distributed clouds, site A
and site B, with the GEO Redundancy feature activated and site A designated
as the primary site and site B as the non-primary site. The GEO Redundancy
feature allows subclouds to be migrated to the non-primary site when the
primary site becomes unavailable, and to be migrated back to the primary site
when it becomes available again.

The error scenarios are divided into the following categories:

.. contents::
   :local:
   :depth: 1

----------------------
Protection group setup
----------------------

This section covers the errors that can occur while setting up the protection group.

.. table::
:widths: auto

+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error scenarios | Recovery mechanism |
+=====================================================================+====================================================================================================================================================================+
| Site A goes down temporarily in the middle of association. | Upon site A recovery, the peer group association will automatically change its sync status to ``failed``. |
| | |
| | The administrator can trigger re-sync from the ``primary`` site if ``sync_status`` is either ``failed`` or ``out-of-sync``. |
| | |
|                                                                     | Possible values of ``sync_status`` include ``syncing``, ``in-sync``, ``out-of-sync``, ``failed``, and ``unknown``.                                                |
| | |
| | Possible values of ``association_type`` include ``primary``, ``non-primary``. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Site A is down in the middle of synchronization and remains offline | The administrator can check the peer group association sync status in the non-primary site to decide the next step. If the sync status is ``in-sync``, |
| for an extended period of time. | migration can be initiated. |
| | |
| How does the user check the syncing status from site B to initiate | |
| the migration? | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| After initial sync is completed, site B goes down. | |
| How does site A sync to site B after site B comes back online?      | Site A needs to keep track of subcloud group updates while site B is down.                                                                                        |
| | |
| | The peer group association sync status in site A will change to ``unknown`` as soon as site B becomes unavailable. Upon the recovery of site B, the sync status |
| | will become ``in-sync`` on both sites again. |
| | |
| | If changes are made to the peer group while site B is offline, the sync status in site A will change to ``failed``. Upon the recovery of site B, |
| | the sync status in site A will change to ``out-of-sync``. The administrator will need to re-initiate the sync in site A using |
| | the :command:`dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>` command. |
| | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Site B is offline while creating a peer group association between a | Creation of the association is accepted, but ``sync_status`` is set to ``failed`` and the protection group cannot be created.                                     |
| peer and a |SPG|. | |
| | The administrator can re-sync the association after site B is online using the :command:`dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>` |
| | command. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Swact occurs in site A while a peer group association is syncing. | Expected behavior should be similar to that of site A abrupt shutdown during sync. |
| | Re-sync needs to be done. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Swact occurs in site B while a peer group association is syncing. | Expected behavior should be similar to that of site B abrupt shutdown during sync. |
| | Re-sync needs to be done. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| In the event of either site going down or swact occurring:          | a) Use the :command:`dcmanager peer-group-association show <association-id>` command to view the sync status on the available site.                               |
|                                                                     | If the status is ``in-sync``, all the subclouds have been added. Otherwise, synchronization has not finished and must be re-initiated on the primary site when    |
| | both sites are online. |
| a) How to track secondary subclouds already added to site B         | b) Run the :command:`dcmanager subcloud-peer-group list-subclouds <peer-group>` command on site B to check the                                                    |
| and subclouds not yet added to site B as secondary subclouds?       | total number of secondary subclouds and the subcloud details.                                                                                                     |
| b) How to track subclouds newly added to the peer group and         |                                                                                                                                                                   |
| subclouds not yet added to the peer group?                          |                                                                                                                                                                   |
| | |
| | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
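
The recovery rules in the table above can be summarized as a lookup on ``sync_status``. The following Python sketch is illustrative only: ``next_action`` and ``RESYNC_COMMAND`` are hypothetical names that are not part of ``dcmanager``, and treating ``syncing`` as a state to wait on is an assumption. The status values and the re-sync command are taken from the table.

.. code-block:: python

    # Hypothetical helper, not part of dcmanager.
    # Re-sync command as given in the table; the ID is a placeholder.
    RESYNC_COMMAND = (
        "dcmanager peer-group-association sync "
        "<SiteA-Peer-Group-Association-ID>"
    )


    def next_action(sync_status):
        """Map a peer group association sync_status to a suggested follow-up."""
        if sync_status == "in-sync":
            return "none: the peer group association is synchronized"
        if sync_status in ("failed", "out-of-sync"):
            # Re-sync is triggered from the primary site.
            return "re-sync from the primary site: " + RESYNC_COMMAND
        if sync_status == "unknown":
            # The peer site is unavailable; the status is updated on recovery.
            return "wait: the peer site is unavailable"
        if sync_status == "syncing":
            # Assumption: a synchronization in progress is left to finish.
            return "wait: synchronization is in progress"
        raise ValueError("unexpected sync_status: %r" % (sync_status,))

For example, ``next_action("out-of-sync")`` returns the re-sync instruction, while ``next_action("unknown")`` advises waiting for the peer site to recover.
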

---------
Migration
---------

Assumption: Subclouds will be migrated to site B if site A goes down.

The following are the error scenarios that can occur during peer group migration.

.. table::
:widths: auto

+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error scenarios | Recovery mechanism |
+=====================================================================+====================================================================================================================================================================+
| What will be the status of the |SPG| if some subclouds fail         | After the migration, use :command:`dcmanager subcloud-peer-group list-subclouds` to check the status of the subclouds in this |SPG|, and check the                |
| to migrate? | |SPG| status using :command:`dcmanager subcloud-peer-group status`. |
| | |
| | Re-run the :command:`dcmanager subcloud-peer-group migrate PEER_GROUP` command after fixing the failure. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| How to recover when the subcloud rehome fails because of | When site A goes down, migrate |SPG| to site B. The subcloud will go to the ``rehome-failed`` deploy status when it has the wrong bootstrap address or bootstrap |
| incorrect bootstrap address or bootstrap values and site A cannot | values. You can update the bootstrap address and bootstrap values if the subcloud migration fails and the primary site is down using the |
| recover in a time period? | :command:`dcmanager subcloud update --bootstrap-address` and :command:`dcmanager subcloud update --bootstrap-values` commands. You do not need to remove |
| | the rehome failed subcloud from the |SPG|. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| How to fix when the subcloud has incorrect bootstrap address        | Check the |SPG| migration status using the :command:`dcmanager subcloud-peer-group status` command to confirm if it has a subcloud in ``rehoming`` status.        |
| or bootstrap values in the following situations during |SPG|        | If there is no subcloud in ``rehoming`` status, it means the |SPG| migration was completed and you need to migrate the |SPG| back to site A. You can update the   |
| migration to site B?                                                | subcloud after the migration failure and try again. If you want to recover the subcloud, follow the instructions below:                                           |
| | |
| - Site A is recovered during migration. | - When site A is recovered during migration, you can update the subcloud on site A. After the update, you need to wait for the |SPG| migration process to finish. |
| | You can then migrate |SPG| back to site A to recover the subcloud. |
| - Site A is recovered post migration. | - When site A is recovered post migration, you can migrate the |SPG| back to site A. If the subcloud rehome fails again in site A, you can update the subcloud. |
| | |
| - Site A is online before the migration process. | - When site A is online before the migration process, you can update the subcloud on site A and sync the updated subcloud to site B. |
| | |
| | Use the :command:`dcmanager subcloud update --bootstrap-address` and :command:`dcmanager subcloud update --bootstrap-values` commands to update the subcloud. |
| | You do not need to remove the rehome failed subcloud from the |SPG|. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Site B goes down during |SPG| migration. | Re-execute the |SPG| migration if there is any subcloud with ``rehome-failed`` deploy status after site B is online. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
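
The checks described in the table above can be expressed as a short decision function. This Python example is a minimal sketch, assuming the deploy statuses of the subclouds in the |SPG| have already been collected by some means; ``migration_follow_up`` and ``MIGRATE_COMMAND`` are hypothetical names that are not part of ``dcmanager``. The ``rehoming`` and ``rehome-failed`` statuses and the migrate command are taken from the table.

.. code-block:: python

    # Hypothetical helper, not part of dcmanager.
    # Migrate command as given in the table; PEER_GROUP is a placeholder.
    MIGRATE_COMMAND = "dcmanager subcloud-peer-group migrate PEER_GROUP"


    def migration_follow_up(deploy_statuses):
        """Suggest a follow-up from the deploy statuses of the subclouds in a peer group."""
        statuses = set(deploy_statuses)
        if "rehoming" in statuses:
            # A subcloud is still being rehomed, so the migration has not finished.
            return "wait: migration is still in progress"
        if "rehome-failed" in statuses:
            # Fix the cause (for example, the bootstrap address or bootstrap
            # values), then re-run the migration.
            return "fix the failure, then re-run: " + MIGRATE_COMMAND
        return "none: no subcloud is rehoming or rehome-failed"

The ``rehoming`` check comes first because, according to the table, the migration is only complete once no subcloud remains in that status.
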

--------------
Post migration
--------------

Audit operations are triggered when the network is restored or when the
retrieved ``migration_status`` of the peer group changes to ``complete``.

.. table::
:widths: auto

+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error scenarios | Recovery mechanism |
+=====================================================================+====================================================================================================================================================================+
| | |
| Site B goes down after the |SPG| has been migrated to its site. | Upon site A recovery, the administrator can trigger the migration of the |SPG| back to site A. |
| | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+