docs/doc/source/storage/kubernetes/replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372.rst
Keane Lim cbc821d750 New disk replacement procedures
Change-Id: I703a61da792e59fdd19dfcaef5376b7f2f2ca975
Signed-off-by: Keane Lim <keane.lim@windriver.com>
2022-03-30 17:50:44 -04:00


.. _replace-osds-on-an-aio-sx-single-disk-system-with-backup-770c9324f372:

========================================================
Replace OSDs on an AIO-SX Single Disk System with Backup
========================================================

When replacing |OSDs| on an AIO-SX system with replication factor 1, you can
first back up the data by adding a new |OSD| on a spare disk and replicating
the pools to it before the original |OSD| is removed.

.. rubric:: |prereq|

Verify that a disk is available on which to create a new |OSD| to back up
data from the existing |OSD|. Make sure the disk is at least as large as
the disk to be replaced.

.. code-block:: none

   ~(keystone_admin)$ system host-disk-list controller-0

.. rubric:: |proc|

#. Add the new |OSD|, using the UUID of the available disk identified in the
   prerequisites.

   .. code-block:: none

      ~(keystone_admin)$ system host-stor-add controller-0 <disk uuid>

#. Wait for the new |OSD| to be configured. Run :command:`ceph -s` to verify
   that the output shows two |OSDs| and that the cluster has finished recovery.
   Make sure the Ceph cluster is healthy (``HEALTH_OK``) before proceeding.

#. Change the replication factor of the pools to 2. Run the two
   ``ceph osd pool set`` commands for each pool that is listed.

   .. code-block:: none

      ~(keystone_admin)$ ceph osd lspools # list all Ceph pools
      ~(keystone_admin)$ ceph osd pool set <pool-name> size 2
      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange true

   This causes the cluster to enter a recovery state:

   .. code-block:: none

      [sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
        cluster:
          id:     38563514-4726-4664-9155-5efd5701de86
          health: HEALTH_WARN
                  Degraded data redundancy: 3/57 objects degraded (5.263%), 3 pgs degraded

        services:
          mon: 1 daemons, quorum controller-0 (age 28m)
          mgr: controller-0(active, since 27m)
          mds: kube-cephfs:1 {0=controller-0=up:active}
          osd: 2 osds: 2 up (since 6m), 2 in (since 6m)

        data:
          pools:   3 pools, 192 pgs
          objects: 32 objects, 1000 MiB
          usage:   1.2 GiB used, 16 GiB / 18 GiB avail
          pgs:     2.604% pgs not active
                   3/57 objects degraded (5.263%)
                   184 active+clean
                   5   activating
                   2   active+recovery_wait+degraded
                   1   active+recovering+degraded

        io:
          recovery: 323 B/s, 1 keys/s, 3 objects/s

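
   Because this step applies to all pools, the two ``ceph osd pool set``
   commands are repeated once per pool. The following is a minimal sketch of a
   loop that does this, not part of the documented procedure. The helper name
   ``set_pools_replication`` is hypothetical, and the sketch assumes that
   :command:`ceph osd pool ls`, which prints only the pool names, is available
   on the controller.

   .. code-block:: bash

      #!/usr/bin/env bash
      # Hypothetical helper: apply the replication settings to every pool.
      # Usage: set_pools_replication <size>
      set_pools_replication() {
          local size="$1" pool
          # Read pool names on file descriptor 3 so that the commands in the
          # loop body cannot consume the list through stdin.
          while IFS= read -r pool <&3; do
              [ -n "$pool" ] || continue
              # Set the replication factor, then block further size changes.
              ceph osd pool set "$pool" size "$size" || return 1
              ceph osd pool set "$pool" nosizechange true || return 1
          done 3< <(ceph osd pool ls)
      }

   Calling ``set_pools_replication 2`` would then issue the same commands as
   above for each pool in turn and stop at the first command that fails.
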
#. Wait for recovery to end and the Ceph cluster to become healthy.

   .. code-block:: none

      ~(keystone_admin)$ ceph -s
        cluster:
          id:     38563514-4726-4664-9155-5efd5701de86
          health: HEALTH_OK

        services:
          mon: 1 daemons, quorum controller-0 (age 28m)
          mgr: controller-0(active, since 28m)
          mds: kube-cephfs:1 {0=controller-0=up:active}
          osd: 2 osds: 2 up (since 7m), 2 in (since 7m)

        data:
          pools:   3 pools, 192 pgs
          objects: 32 objects, 1000 MiB
          usage:   2.2 GiB used, 15 GiB / 18 GiB avail
          pgs:     192 active+clean

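
   Instead of rerunning :command:`ceph -s` by hand, the wait can be scripted.
   The following is a minimal sketch, assuming that :command:`ceph health`
   prints a line beginning with the health status. The helper name
   ``wait_for_health_ok`` and its default timeout and polling interval are
   illustrative choices, not values taken from the platform.

   .. code-block:: bash

      #!/usr/bin/env bash
      # Hypothetical helper: poll the cluster until it reports HEALTH_OK.
      # Usage: wait_for_health_ok [timeout_seconds] [poll_interval_seconds]
      wait_for_health_ok() {
          local timeout="${1:-1800}" interval="${2:-10}" waited=0 status
          while :; do
              # The first word of the output is HEALTH_OK, HEALTH_WARN,
              # or HEALTH_ERR.
              status="$(ceph health 2>/dev/null)"
              case "$status" in
                  HEALTH_OK*) return 0 ;;
              esac
              if [ "$waited" -ge "$timeout" ]; then
                  echo "Timed out; last status: ${status:-unknown}" >&2
                  return 1
              fi
              sleep "$interval"
              waited=$((waited + interval))
          done
      }

   For example, ``wait_for_health_ok 3600 30`` would check every 30 seconds
   and give up with a non-zero exit status after about one hour.
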
#. Lock the host.

   .. code-block:: none

      ~(keystone_admin)$ system host-lock controller-0

#. Mark the |OSD| to be replaced as out.

   .. code-block:: none

      ~(keystone_admin)$ ceph osd out osd.<id>

#. Wait for the rebalance to finish.

   .. code-block:: none

      [sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
        cluster:
          id:     38563514-4726-4664-9155-5efd5701de86
          health: HEALTH_OK

        services:
          mon: 1 daemons, quorum controller-0 (age 37m)
          mgr: controller-0(active, since 36m)
          mds: kube-cephfs:1 {0=controller-0=up:active}
          osd: 2 osds: 2 up (since 15m), 1 in (since 2s)

        data:
          pools:   3 pools, 192 pgs
          objects: 32 objects, 1000 MiB
          usage:   808 MiB used, 8.0 GiB / 8.8 GiB avail
          pgs:     192 active+clean

        progress:
          Rebalancing after osd.0 marked out
            [..............................]

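
   As an optional extra check that is not part of the procedure above, Ceph
   provides :command:`ceph osd safe-to-destroy`, which exits with status 0
   only when removing the given |OSD| would not reduce data durability. The
   sketch below polls that command. The helper name
   ``wait_until_safe_to_destroy`` and the default attempt count and interval
   are hypothetical, and the sketch assumes that the Ceph release on the
   platform includes the command.

   .. code-block:: bash

      #!/usr/bin/env bash
      # Hypothetical helper: wait until Ceph reports that an OSD can be removed.
      # Usage: wait_until_safe_to_destroy <osd-id> [attempts] [interval_seconds]
      wait_until_safe_to_destroy() {
          local osd_id="$1" attempts="${2:-60}" interval="${3:-10}" i
          for ((i = 0; i < attempts; i++)); do
              # Exit status 0 means that no data depends on this OSD anymore.
              if ceph osd safe-to-destroy "osd.${osd_id}" >/dev/null 2>&1; then
                  return 0
              fi
              sleep "$interval"
          done
          return 1
      }

   A call such as ``wait_until_safe_to_destroy 0`` uses the ID of the |OSD|
   that was marked out, which is ``0`` in the example output above.
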
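#. Stop the |OSD|. Move the pmon configuration file for Ceph out of
   ``/etc/pmon.d`` first, so that the process monitor does not restart the
   |OSD| after it is stopped.

   .. code-block:: none

      ~(keystone_admin)$ sudo mv /etc/pmon.d/ceph.conf ~/
      ~(keystone_admin)$ sudo /etc/init.d/ceph stop osd.<id>
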
#. Obtain the stor UUID of the |OSD| and delete the stor from the platform.

   .. code-block:: none

      ~(keystone_admin)$ system host-stor-list controller-0 # list all stors
      ~(keystone_admin)$ system host-stor-delete <stor uuid> # delete stor

#. Purge the |OSD| from the Ceph cluster.

   .. code-block:: none

      ~(keystone_admin)$ ceph osd purge osd.<id> --yes-i-really-mean-it

#. Remove the |OSD| entry from ``/etc/ceph/ceph.conf``.

#. Unmount and remove any remaining |OSD| directories.

   .. code-block:: none

      ~(keystone_admin)$ sudo umount /var/lib/ceph/osd/ceph-<id>
      ~(keystone_admin)$ sudo rm -rf /var/lib/ceph/osd/ceph-<id>/

#. Set each pool to allow size changes again.

   .. code-block:: none

      ~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange false

#. Unlock the host.

   .. code-block:: none

      ~(keystone_admin)$ system host-unlock controller-0

#. Verify that the Ceph cluster is healthy.

   .. code-block:: none

      ~(keystone_admin)$ ceph -s

   If you see a ``HEALTH_ERR`` message like the following:

   .. code-block:: none

      controller-0:~$ ceph -s
        cluster:
          id:     38563514-4726-4664-9155-5efd5701de86
          health: HEALTH_ERR
                  1 filesystem is degraded
                  1 filesystem has a failed mds daemon
                  1 filesystem is offline
                  no active mgr

        services:
          mon: 1 daemons, quorum controller-0 (age 38s)
          mgr: no daemons active (since 3s)
          mds: kube-cephfs:0/1, 1 failed
          osd: 1 osds: 1 up (since 14m), 1 in (since 15m)

        data:
          pools:   3 pools, 192 pgs
          objects: 32 objects, 1000 MiB
          usage:   1.1 GiB used, 7.7 GiB / 8.8 GiB avail
          pgs:     192 active+clean

   Wait a few minutes until the Ceph cluster shows ``HEALTH_OK``.

   .. code-block:: none

      controller-0:~$ ceph -s
        cluster:
          id:     38563514-4726-4664-9155-5efd5701de86
          health: HEALTH_OK

        services:
          mon: 1 daemons, quorum controller-0 (age 2m)
          mgr: controller-0(active, since 96s)
          mds: kube-cephfs:1 {0=controller-0=up:active}
          osd: 1 osds: 1 up (since 46s), 1 in (since 17m)

        task status:

        data:
          pools:   3 pools, 192 pgs
          objects: 32 objects, 1000 MiB
          usage:   1.1 GiB used, 7.7 GiB / 8.8 GiB avail
          pgs:     192 active+clean

#. Verify that the |OSD| tree displays the new |OSD| and not the previous one.

   .. code-block:: none

      controller-0:~$ ceph osd tree
      ID CLASS WEIGHT  TYPE NAME                 STATUS REWEIGHT PRI-AFF
      -1       0.00850 root storage-tier
      -2       0.00850     chassis group-0
      -3       0.00850         host controller-0
       1   hdd 0.00850             osd.1             up  1.00000 1.00000

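
The final check can also be scripted. The following is a minimal sketch,
assuming that :command:`ceph osd ls` prints one numeric |OSD| ID per line.
The helper name ``osd_is_present`` is hypothetical.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper: succeed only if the given OSD ID exists in the cluster.
   # Usage: osd_is_present <osd-id>
   osd_is_present() {
       local osd_id="$1"
       # -x matches whole lines only, so ID 1 does not match ID 10.
       ceph osd ls | grep -qx -- "$osd_id"
   }

With the IDs from the example output, where ``osd.1`` is the new |OSD| and
``osd.0`` is the one that was replaced, the replacement is complete when
``osd_is_present 1`` succeeds and ``osd_is_present 0`` fails.
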