From cbc821d7509879e5c5721467b6b13fb2c0dd6e03 Mon Sep 17 00:00:00 2001
From: Keane Lim
Date: Tue, 22 Mar 2022 11:53:12 -0400
Subject: [PATCH] New disk replacement procedures

Change-Id: I703a61da792e59fdd19dfcaef5376b7f2f2ca975
Signed-off-by: Keane Lim
---
 .../index-storage-kub-e797132c87a8.rst        |   4 +
 .../replace-osds-and-journal-disks.rst        |  20 +-
 ...osds-on-a-standard-system-f3b1e376304c.rst | 176 +++++++++++++
 ...-aio-sx-multi-disk-system-b4ddd1c1257c.rst | 203 +++++++++++++++
 ...e-disk-system-with-backup-770c9324f372.rst | 236 ++++++++++++++++++
 ...isk-system-without-backup-951eefebd1f2.rst |  71 ++++++
 6 files changed, 707 insertions(+), 3 deletions(-)
 create mode 100644 doc/source/storage/kubernetes/replace-osds-on-a-standard-system-f3b1e376304c.rst
 create mode 100644 doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c.rst
 create mode 100644 doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372.rst
 create mode 100644 doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2.rst

diff --git a/doc/source/storage/kubernetes/index-storage-kub-e797132c87a8.rst b/doc/source/storage/kubernetes/index-storage-kub-e797132c87a8.rst
index b5a6dfffb..1ddf8ab63 100644
--- a/doc/source/storage/kubernetes/index-storage-kub-e797132c87a8.rst
+++ b/doc/source/storage/kubernetes/index-storage-kub-e797132c87a8.rst
@@ -123,7 +123,11 @@ Configure Ceph OSDs on a Host
    provision-storage-on-a-controller-or-storage-host-using-horizon
    provision-storage-on-a-storage-host-using-the-cli
    replace-osds-and-journal-disks
+   replace-osds-on-a-standard-system-f3b1e376304c
    replace-osds-on-an-aio-dx-system-319b0bc2f7e6
+   replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c
+   replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2
+   replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372
 
 -------------------------
 Persistent Volume Support

diff --git a/doc/source/storage/kubernetes/replace-osds-and-journal-disks.rst b/doc/source/storage/kubernetes/replace-osds-and-journal-disks.rst
index c6f6d7802..7e5ee57eb 100644
--- a/doc/source/storage/kubernetes/replace-osds-and-journal-disks.rst
+++ b/doc/source/storage/kubernetes/replace-osds-and-journal-disks.rst
@@ -17,9 +17,23 @@ the same peer group. Do not substitute a smaller disk than the original.
    Due to a limitation in **udev**, the device path of a disk connected
    through a SAS controller changes when the disk is replaced. Therefore, in
    the general procedure below, you must lock, delete, and re-install the node.
-   However, for an |AIO-DX| system, use the following alternative procedure to
-   replace |OSDs| without reinstalling the host:
-   :ref:`Replace OSDs on an AIO-DX System <replace-osds-on-an-aio-dx-system-319b0bc2f7e6>`.
+   However, for standard, |AIO-SX|, and |AIO-DX| systems, use the following
+   alternative procedures to replace |OSDs| without reinstalling the host:
+
+   - :ref:`Replace OSDs on a Standard System
+     <replace-osds-on-a-standard-system-f3b1e376304c>`
+
+   - :ref:`Replace OSDs on an AIO-DX System
+     <replace-osds-on-an-aio-dx-system-319b0bc2f7e6>`
+
+   - :ref:`Replace OSDs on an AIO-SX Multi-Disk System
+     <replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c>`
+
+   - :ref:`Replace OSDs on an AIO-SX Single Disk System without Backup
+     <replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2>`
+
+   - :ref:`Replace OSDs on an AIO-SX Single Disk System with Backup
+     <replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372>`
 
 .. rubric:: |proc|
 
diff --git a/doc/source/storage/kubernetes/replace-osds-on-a-standard-system-f3b1e376304c.rst b/doc/source/storage/kubernetes/replace-osds-on-a-standard-system-f3b1e376304c.rst
new file mode 100644
index 000000000..e92700fd9
--- /dev/null
+++ b/doc/source/storage/kubernetes/replace-osds-on-a-standard-system-f3b1e376304c.rst
@@ -0,0 +1,176 @@
+.. _replace-osds-on-a-standard-system-f3b1e376304c:
+
+=================================
+Replace OSDs on a Standard System
+=================================
+
+You can replace |OSDs| on a standard system to increase capacity or to
+replace faulty disks, without reinstalling the host.
+
+.. rubric:: |prereq|
+
+For standard systems with controller storage, ensure that the controller
+with the |OSD| to be replaced is the standby controller.
+
+For example, if the disk replacement has to be done on controller-1 and
+controller-1 is the active controller, use the following command to swact
+services to controller-0:
+
+.. code-block:: none
+
+   ~(keystone_admin)$ system host-swact controller-1
+
+After the swact completes, reconnect via SSH to the newly active
+controller-0.
+
+.. rubric:: |proc|
+
+**Standard systems with controller storage**
+
+#. If controller-1 has the |OSD| to be replaced, lock it.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock controller-1
+
+#. Run the :command:`ceph osd destroy osd.<osd-id> --yes-i-really-mean-it` command.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd destroy osd.<osd-id> --yes-i-really-mean-it
+
+#. Power down controller-1.
+
+#. Replace the storage disk.
+
+#. Power on controller-1.
+
+#. Unlock controller-1.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-unlock controller-1
+
+#. Wait for the recovery process in the Ceph cluster to start and finish.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_WARN
+                  Degraded data redundancy: 13/50 objects degraded (26.000%), 10 pgs degraded
+
+        services:
+          mon: 1 daemons, quorum controller (age 68m)
+          mgr: controller-0(active, since 66m)
+          mds: kube-cephfs:1 {0=controller-0=up:active} 1 up:standby
+          osd: 2 osds: 2 up (since 9s), 2 in (since 9s)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 25 objects, 300 MiB
+          usage:   655 MiB used, 15 GiB / 16 GiB avail
+          pgs:     13/50 objects degraded (26.000%)
+                   182 active+clean
+                   8   active+recovery_wait+degraded
+                   2   active+recovering+degraded
+
+        io:
+          recovery: 24 B/s, 1 keys/s, 1 objects/s
+
+#. Ensure that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_OK
+
+        services:
+          mon: 1 daemons, quorum controller (age 68m)
+          mgr: controller-0(active, since 66m), standbys: controller-1
+          mds: kube-cephfs:1 {0=controller-0=up:active} 1 up:standby
+          osd: 2 osds: 2 up (since 36s), 2 in (since 36s)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 25 objects, 300 MiB
+          usage:   815 MiB used, 15 GiB / 16 GiB avail
+          pgs:     192 active+clean
+
+**Standard systems with dedicated storage nodes**
+
+#. If storage-1 has the |OSD| to be replaced, lock it.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock storage-1
+
+#. Run the :command:`ceph osd destroy osd.<osd-id> --yes-i-really-mean-it` command.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd destroy osd.<osd-id> --yes-i-really-mean-it
+
+#. Power down storage-1.
+
+#. Replace the storage disk.
+
+#. Power on storage-1.
+
+#. Unlock storage-1.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-unlock storage-1
+
+#. Wait for the recovery process in the Ceph cluster to start and finish.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_WARN
+                  Degraded data redundancy: 13/50 objects degraded (26.000%), 10 pgs degraded
+
+        services:
+          mon: 1 daemons, quorum controller (age 68m)
+          mgr: controller-0(active, since 66m)
+          mds: kube-cephfs:1 {0=controller-0=up:active} 1 up:standby
+          osd: 2 osds: 2 up (since 9s), 2 in (since 9s)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 25 objects, 300 MiB
+          usage:   655 MiB used, 15 GiB / 16 GiB avail
+          pgs:     13/50 objects degraded (26.000%)
+                   182 active+clean
+                   8   active+recovery_wait+degraded
+                   2   active+recovering+degraded
+
+        io:
+          recovery: 24 B/s, 1 keys/s, 1 objects/s
+
+#. Ensure that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_OK
+
+        services:
+          mon: 1 daemons, quorum controller (age 68m)
+          mgr: controller-0(active, since 66m), standbys: controller-1
+          mds: kube-cephfs:1 {0=controller-0=up:active} 1 up:standby
+          osd: 2 osds: 2 up (since 36s), 2 in (since 36s)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 25 objects, 300 MiB
+          usage:   815 MiB used, 15 GiB / 16 GiB avail
+          pgs:     192 active+clean
diff --git a/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c.rst b/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c.rst
new file mode 100644
index 000000000..97d339867
--- /dev/null
+++ b/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c.rst
@@ -0,0 +1,203 @@
+.. _replace-osds-on-an-aio-sx-multi-disk-system-b4ddd1c1257c:
+
+===========================================
+Replace OSDs on an AIO-SX Multi-Disk System
+===========================================
+
+You can replace |OSDs| on an |AIO-SX| multi-disk system to increase capacity
+or to replace faulty disks, without reinstalling the host.
+
+.. rubric:: |proc|
+
+**Replication factor > 1**
+
+#. Make sure there is more than one |OSD| installed; otherwise, there could
+   be data loss.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd tree
+
+#. Verify that all Ceph pools are present.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd lspools
+
+#. For each pool, make sure its size attribute is larger than 1; otherwise,
+   there could be data loss.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool get <pool-name> size
+
+#. Disable pool size changes during the procedure. This must be run for each
+   pool.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange true
+
+#. Verify that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_OK
+
+#. Lock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock controller-0
+
+#. Power down the controller.
+
+#. Replace the disk.
+
+#. Power on the controller.
+
+#. Unlock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-unlock controller-0
+
+#. Wait for the recovery process in the Ceph cluster to start and finish.
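+
+   As in the standard system procedure, you can monitor the recovery
+   progress by checking the cluster status periodically:
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s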
+
+#. Ensure that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_OK
+
+#. Enable pool size changes again. This must be run for each pool.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange false
+
+**Replication factor 1 with space to backup**
+
+#. Make sure there is more than one |OSD| installed; otherwise, there could
+   be data loss.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd tree
+
+#. Verify that all Ceph pools are present.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd lspools
+
+#. For each pool, make sure its size attribute is larger than 1; otherwise,
+   there could be data loss.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool get <pool-name> size
+
+#. Disable pool size changes during the procedure. This must be run for each
+   pool.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange true
+
+#. Verify that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_OK
+
+#. Lock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock controller-0
+
+#. Power down the controller.
+
+#. Replace the disk.
+
+#. Power on the controller.
+
+#. Unlock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-unlock controller-0
+
+#. Wait for the recovery process in the Ceph cluster to start and finish.
+
+#. Ensure that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     50ce952f-bd16-4864-9487-6c7e959be95e
+          health: HEALTH_OK
+
+#. Enable pool size changes again. This must be run for each pool.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange false
+
+#. Set the replication factor to 1 for all pools.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool set <pool-name> size 1
+
+**Replication factor 1 without space to backup**
+
+#. Lock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock controller-0
+
+#. Back up the /etc/pmon.d/ceph.conf file to a safe location, then remove it
+   from the /etc/pmon.d folder.
+
+#. Mark the |OSD| as out and down, stop it, and destroy it.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd out osd.<osd-id>
+      ~(keystone_admin)$ ceph osd down osd.<osd-id>
+      ~(keystone_admin)$ sudo /etc/init.d/ceph stop osd.<osd-id>
+      ~(keystone_admin)$ ceph osd destroy osd.<osd-id> --yes-i-really-mean-it
+
+#. Shut down the machine, replace the disk, power it on, and wait for the
+   boot to finish.
+
+#. Unlock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-unlock controller-0
+
+#. Copy the backed up ceph.conf file back to /etc/pmon.d/.
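+
+   For example, assuming the file was backed up to the sysadmin home
+   directory in step 2:
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ sudo cp ~/ceph.conf /etc/pmon.d/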
+
+#. Verify that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
\ No newline at end of file
diff --git a/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372.rst b/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372.rst
new file mode 100644
index 000000000..27d124848
--- /dev/null
+++ b/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372.rst
@@ -0,0 +1,236 @@
+.. _replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372:
+
+========================================================
+Replace OSDs on an AIO-SX Single Disk System with Backup
+========================================================
+
+When replacing an |OSD| on an |AIO-SX| system with replication factor 1, you
+can back up its data to a second |OSD| before replacing the disk.
+
+.. rubric:: |prereq|
+
+Verify that there is an available disk that can be used to create a new
+|OSD| for backing up data from the existing |OSD|. Make sure the disk is at
+least the same size as the disk to be replaced.
+
+.. code-block:: none
+
+   ~(keystone_admin)$ system host-disk-list controller-0
+
+.. rubric:: |proc|
+
+#. Add the new |OSD| using the UUID of the available disk identified in the
+   prerequisites.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-stor-add controller-0 <disk-uuid>
+
+#. Wait for the new |OSD| to be configured. Run :command:`ceph -s` to verify
+   that the output shows two |OSDs| and that the cluster has finished
+   recovery. Make sure the Ceph cluster is healthy (``HEALTH_OK``) before
+   proceeding.
+
+#. Change the replication factor of the pools to 2. This must be run for
+   each pool.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd lspools # list all Ceph pools
+      ~(keystone_admin)$ ceph osd pool set <pool-name> size 2
+      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange true
+
+   This will make the cluster enter a recovery state:
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     38563514-4726-4664-9155-5efd5701de86
+          health: HEALTH_WARN
+                  Degraded data redundancy: 3/57 objects degraded (5.263%), 3 pgs degraded
+
+        services:
+          mon: 1 daemons, quorum controller-0 (age 28m)
+          mgr: controller-0(active, since 27m)
+          mds: kube-cephfs:1 {0=controller-0=up:active}
+          osd: 2 osds: 2 up (since 6m), 2 in (since 6m)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 32 objects, 1000 MiB
+          usage:   1.2 GiB used, 16 GiB / 18 GiB avail
+          pgs:     2.604% pgs not active
+                   3/57 objects degraded (5.263%)
+                   184 active+clean
+                   5   activating
+                   2   active+recovery_wait+degraded
+                   1   active+recovering+degraded
+
+        io:
+          recovery: 323 B/s, 1 keys/s, 3 objects/s
+
+#. Wait for the recovery to end and the Ceph cluster to become healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     38563514-4726-4664-9155-5efd5701de86
+          health: HEALTH_OK
+
+        services:
+          mon: 1 daemons, quorum controller-0 (age 28m)
+          mgr: controller-0(active, since 28m)
+          mds: kube-cephfs:1 {0=controller-0=up:active}
+          osd: 2 osds: 2 up (since 7m), 2 in (since 7m)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 32 objects, 1000 MiB
+          usage:   2.2 GiB used, 15 GiB / 18 GiB avail
+          pgs:     192 active+clean
+
+#. Lock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock controller-0
+
+#. Mark the |OSD| out.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd out osd.<osd-id>
+
+#. Wait for the rebalance to finish.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+        cluster:
+          id:     38563514-4726-4664-9155-5efd5701de86
+          health: HEALTH_OK
+
+        services:
+          mon: 1 daemons, quorum controller-0 (age 37m)
+          mgr: controller-0(active, since 36m)
+          mds: kube-cephfs:1 {0=controller-0=up:active}
+          osd: 2 osds: 2 up (since 15m), 1 in (since 2s)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 32 objects, 1000 MiB
+          usage:   808 MiB used, 8.0 GiB / 8.8 GiB avail
+          pgs:     192 active+clean
+
+        progress:
+          Rebalancing after osd.0 marked out
+            [..............................]
+
+#. Stop the |OSD|. Move the **pmon** configuration for Ceph aside first so
+   that the |OSD| process is not restarted automatically.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ sudo mv /etc/pmon.d/ceph.conf ~/
+      ~(keystone_admin)$ sudo /etc/init.d/ceph stop osd.<osd-id>
+
+#. Obtain the stor UUID and delete it from the platform.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-stor-list controller-0 # list all stors
+      ~(keystone_admin)$ system host-stor-delete <stor-uuid> # delete stor
+
+#. Purge the |OSD| from the Ceph cluster.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd purge osd.<osd-id> --yes-i-really-mean-it
+
+#. Remove the |OSD| entry from /etc/ceph/ceph.conf.
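+
+   For example, open the file in an editor and delete the section for the
+   removed |OSD| (typically an ``[osd.<osd-id>]`` block; check the file for
+   the exact entry before removing it):
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ sudo vi /etc/ceph/ceph.conf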
+
+#. Unmount and remove any remaining |OSD| folders.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ sudo umount /var/lib/ceph/osd/ceph-<osd-id>
+      ~(keystone_admin)$ sudo rm -rf /var/lib/ceph/osd/ceph-<osd-id>/
+
+#. Set the pools to allow size changes again. This must be run for each pool.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange false
+
+#. Unlock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-unlock controller-0
+
+#. Verify that the Ceph cluster is healthy.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph -s
+
+   If you see a ``HEALTH_ERR`` status like the following, the services are
+   still coming back up:
+
+   .. code-block:: none
+
+      controller-0:~$ ceph -s
+        cluster:
+          id:     38563514-4726-4664-9155-5efd5701de86
+          health: HEALTH_ERR
+                  1 filesystem is degraded
+                  1 filesystem has a failed mds daemon
+                  1 filesystem is offline
+                  no active mgr
+
+        services:
+          mon: 1 daemons, quorum controller-0 (age 38s)
+          mgr: no daemons active (since 3s)
+          mds: kube-cephfs:0/1, 1 failed
+          osd: 1 osds: 1 up (since 14m), 1 in (since 15m)
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 32 objects, 1000 MiB
+          usage:   1.1 GiB used, 7.7 GiB / 8.8 GiB avail
+          pgs:     192 active+clean
+
+   Wait a few minutes until the Ceph cluster shows ``HEALTH_OK``.
+
+   .. code-block:: none
+
+      controller-0:~$ ceph -s
+        cluster:
+          id:     38563514-4726-4664-9155-5efd5701de86
+          health: HEALTH_OK
+
+        services:
+          mon: 1 daemons, quorum controller-0 (age 2m)
+          mgr: controller-0(active, since 96s)
+          mds: kube-cephfs:1 {0=controller-0=up:active}
+          osd: 1 osds: 1 up (since 46s), 1 in (since 17m)
+
+        task status:
+
+        data:
+          pools:   3 pools, 192 pgs
+          objects: 32 objects, 1000 MiB
+          usage:   1.1 GiB used, 7.7 GiB / 8.8 GiB avail
+          pgs:     192 active+clean
+
+#. Verify that the |OSD| tree displays the new |OSD| and not the previous
+   one.
+
+   .. code-block:: none
+
+      controller-0:~$ ceph osd tree
+      ID CLASS WEIGHT  TYPE NAME                  STATUS REWEIGHT PRI-AFF
+      -1       0.00850 root storage-tier
+      -2       0.00850     chassis group-0
+      -3       0.00850         host controller-0
+       1   hdd 0.00850             osd.1              up  1.00000 1.00000
diff --git a/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2.rst b/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2.rst
new file mode 100644
index 000000000..1c9bf0528
--- /dev/null
+++ b/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2.rst
@@ -0,0 +1,71 @@
+.. _replace-osds-on-an-aio-sx-single-disk-system-without-backup-951eefebd1f2:
+
+===========================================================
+Replace OSDs on an AIO-SX Single Disk System without Backup
+===========================================================
+
+You can replace the |OSD| on an |AIO-SX| single disk system without backing
+up its data. The Ceph pools are deleted and recreated as part of the
+procedure, so any data stored in the Ceph cluster is lost.
+
+.. rubric:: |proc|
+
+#. Get a list of all pools and their settings (size, min_size, pg_num,
+   pgp_num).
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd lspools # list all pools
+      ~(keystone_admin)$ ceph osd pool get $POOLNAME $SETTING
+
+   Keep the pool names and settings, as they will be used in step 12.
+
+#. Lock the controller.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system host-lock controller-0
+
+#. Remove all applications that use Ceph pools.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ system application-list # list the applications
+      ~(keystone_admin)$ system application-remove $APPLICATION_NAME # remove an application
+
+   Keep the names of the removed applications, as they will be used in step
+   11.
+
+#. Make a backup of /etc/pmon.d/ceph.conf to a safe location and remove the
+   ceph.conf file from the /etc/pmon.d folder.
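+
+   For example, assuming the sysadmin home directory is used as the safe
+   location:
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ sudo mv /etc/pmon.d/ceph.conf ~/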
+
+#. Stop ``ceph-mds``.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ sudo /etc/init.d/ceph stop mds
+
+#. Declare the Ceph filesystem as failed and delete it.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph mds fail 0
+      ~(keystone_admin)$ ceph fs rm <fs-name> --yes-i-really-mean-it
+
+#. Allow Ceph pools to be deleted.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
+
+#. Remove all the pools.
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd pool ls | xargs -i ceph osd pool delete {} {} --yes-i-really-really-mean-it
+
+#. Shut down the machine, replace the disk, power it on, and wait for the
+   boot to finish.
+
+#. Move the backed up ceph.conf file from step 4 back to /etc/pmon.d and
+   unlock the controller.
+
+#. Add back the applications that were removed in step 3.
+
+#. Verify that all pools and settings listed in step 1 are recreated.
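+
+   For example, re-run the commands from step 1 and compare the output with
+   the values recorded earlier:
+
+   .. code-block:: none
+
+      ~(keystone_admin)$ ceph osd lspools
+      ~(keystone_admin)$ ceph osd pool get $POOLNAME $SETTING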