      objects: 208 objects, 3359 bytes
      usage:   2635 MB used, 44675 GB / 44678 GB avail
      pgs:     182 active+clean

Case: A host machine where ceph-mon is running is down
========================================================

This is for the case when a host machine (where ceph-mon is running) is down.

Symptom:
--------

After the host is down (node ``voyager3``), the node status changes to ``NotReady``.

.. code-block:: console

  $ kubectl get nodes
  NAME       STATUS     ROLES     AGE       VERSION
  voyager1   Ready      master    14d       v1.10.5
  voyager2   Ready      <none>    14d       v1.10.5
  voyager3   NotReady   <none>    14d       v1.10.5
  voyager4   Ready      <none>    14d       v1.10.5

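To see why the node is reported ``NotReady`` (for example, an unreachable host or a
stopped kubelet), the node conditions can be inspected. This is a supplementary check,
not part of the original test output:

.. code-block:: console

  $ kubectl describe node voyager3
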
Ceph status shows that the ceph-mon running on ``voyager3`` falls out of quorum.
Also, the 6 osds running on ``voyager3`` are down (i.e., 18 out of 24 osds are up).
Some placement groups become degraded and undersized.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              6 osds down
              1 host (6 osds) down
              Degraded data redundancy: 227/720 objects degraded (31.528%), 8 pgs degraded
              too few PGs per OSD (17 < min 30)
              mon voyager1 is low on available space
              1/3 mons down, quorum voyager1,voyager2

    services:
      mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 18 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2695 MB used, 44675 GB / 44678 GB avail
      pgs:     227/720 objects degraded (31.528%)
               126 active+undersized
               48  active+clean
               8   active+undersized+degraded

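To confirm exactly which osds are down and that they all belong to ``voyager3``,
``ceph osd tree`` can be run from the mon pod; this is an additional check (output
omitted here):

.. code-block:: console

  (mon-pod):/# ceph osd tree
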
The status of the ceph-mon and ceph-osd pods on ``voyager3`` is reported as ``NodeLost``.

.. code-block:: console

  $ kubectl get pods -n ceph -o wide|grep voyager3
  ceph-mgr-55f68d44b8-hncrq         1/1       Unknown    6     8d    135.207.240.43   voyager3
  ceph-mon-6bbs6                    1/1       NodeLost   8     8d    135.207.240.43   voyager3
  ceph-osd-default-64779b8c-lbkcd   1/1       NodeLost   1     6d    135.207.240.43   voyager3
  ceph-osd-default-6ea9de2c-gp7zm   1/1       NodeLost   2     8d    135.207.240.43   voyager3
  ceph-osd-default-7544b6da-7mfdc   1/1       NodeLost   2     8d    135.207.240.43   voyager3
  ceph-osd-default-7cfc44c1-hhk8v   1/1       NodeLost   2     8d    135.207.240.43   voyager3
  ceph-osd-default-83945928-b95qs   1/1       NodeLost   2     8d    135.207.240.43   voyager3
  ceph-osd-default-f9249fa9-n7p4v   1/1       NodeLost   3     8d    135.207.240.43   voyager3

After 10+ minutes, Ceph starts rebalancing with one node lost (i.e., 6 osds down)
and the status stabilizes with 18 osds.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              mon voyager1 is low on available space
              1/3 mons down, quorum voyager1,voyager2

    services:
      mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
      mgr: voyager1(active), standbys: voyager2
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 18 up, 18 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2025 MB used, 33506 GB / 33508 GB avail
      pgs:     182 active+clean

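The roughly ten-minute delay corresponds to Ceph's ``mon_osd_down_out_interval``
(600 seconds by default): osds that stay ``down`` for that long are marked ``out``
(note ``18 in`` above), which triggers the rebalance. As a supplementary check,
the current value can be read through a monitor's admin socket, assuming the socket
for ``mon.voyager1`` is reachable from its mon pod; the output below shows the default:

.. code-block:: console

  (mon-pod):/# ceph daemon mon.voyager1 config get mon_osd_down_out_interval
  {
      "mon_osd_down_out_interval": "600"
  }
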
Recovery:
---------

The node status of ``voyager3`` changes to ``Ready`` after the node is up again.
Also, Ceph pods are restarted automatically.
The Ceph status shows that the monitor running on ``voyager3`` is back in quorum
and the 6 osds come back up (i.e., a total of 24 osds are up).

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (22 < min 30)
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager1(active), standbys: voyager2
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 24 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2699 MB used, 44675 GB / 44678 GB avail
      pgs:     182 active+clean

Also, the pod status of ceph-mon and ceph-osd changes from ``NodeLost`` back to ``Running``.

.. code-block:: console

  $ kubectl get pods -n ceph -o wide|grep voyager3
  ceph-mon-6bbs6                    1/1       Running   9     8d    135.207.240.43   voyager3
  ceph-osd-default-64779b8c-lbkcd   1/1       Running   2     7d    135.207.240.43   voyager3
  ceph-osd-default-6ea9de2c-gp7zm   1/1       Running   3     8d    135.207.240.43   voyager3
  ceph-osd-default-7544b6da-7mfdc   1/1       Running   3     8d    135.207.240.43   voyager3
  ceph-osd-default-7cfc44c1-hhk8v   1/1       Running   3     8d    135.207.240.43   voyager3
  ceph-osd-default-83945928-b95qs   1/1       Running   3     8d    135.207.240.43   voyager3
  ceph-osd-default-f9249fa9-n7p4v   1/1       Running   4     8d    135.207.240.43   voyager3

The status of the pods (where the three Monitor processes are killed)
changed as follows: ``Running`` -> ``Error`` -> ``CrashLoopBackOff``
-> ``Running``, and this recovery process takes about 1 minute.

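These transitions can be watched live with ``kubectl get pods -w``; this is a
supplementary command, not part of the original test output (namespace ``ceph``
as in the other examples):

.. code-block:: console

  $ kubectl get pods -n ceph -w | grep ceph-mon
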
Case: Monitor database is destroyed
===================================

We intentionally destroy a Monitor database by removing
``/var/lib/openstack-helm/ceph/mon/mon/ceph-voyager3/store.db``.

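For example, the database can be removed on the host as follows (shown here as a
sketch; the host path is the one quoted above):

.. code-block:: console

  $ sudo rm -rf /var/lib/openstack-helm/ceph/mon/mon/ceph-voyager3/store.db
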
Symptom:
--------

The Ceph Monitor running on ``voyager3`` (whose Monitor database is destroyed) falls out of quorum,
and the mon-pod's status cycles through ``Running`` -> ``Error`` -> ``CrashLoopBackOff`` as it keeps restarting.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (22 < min 30)
              mon voyager1 is low on available space
              1/3 mons down, quorum voyager1,voyager2

    services:
      mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 24 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2675 MB used, 44675 GB / 44678 GB avail
      pgs:     182 active+clean

.. code-block:: console

  $ kubectl get pods -n ceph -o wide|grep ceph-mon
  ceph-mon-4gzzw   1/1       Running            0     6d    135.207.240.42   voyager2
  ceph-mon-6bbs6   0/1       CrashLoopBackOff   5     6d    135.207.240.43   voyager3
  ceph-mon-qgc7p   1/1       Running            0     6d    135.207.240.41   voyager1

The logs of the failed mon-pod show that the ceph-mon process cannot start because
``/var/lib/ceph/mon/ceph-voyager3/store.db`` does not exist.

.. code-block:: console

  $ kubectl logs ceph-mon-6bbs6 -n ceph
  + ceph-mon --setuser ceph --setgroup ceph --cluster ceph -i voyager3 --inject-monmap /etc/ceph/monmap-ceph --keyring /etc/ceph/ceph.mon.keyring --mon-data /var/lib/ceph/mon/ceph-voyager3
  2018-07-10 18:30:04.546200 7f4ca9ed4f00 -1 rocksdb: Invalid argument: /var/lib/ceph/mon/ceph-voyager3/store.db: does not exist (create_if_missing is false)
  2018-07-10 18:30:04.546214 7f4ca9ed4f00 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-voyager3': (22) Invalid argument

Recovery:
---------

Remove the entire ceph-mon directory on ``voyager3``; Ceph then automatically
recreates the database from the databases of the other ceph-mon instances.

.. code-block:: console

  $ sudo rm -rf /var/lib/openstack-helm/ceph/mon/mon/ceph-voyager3

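Once the directory is removed, the failing mon pod rebuilds its store from the
monitors that are still in quorum on its next restart, as described above. Its
progress can be followed with ``kubectl logs -f`` (a supplementary check; the pod
name is the one from the listing above):

.. code-block:: console

  $ kubectl logs -f ceph-mon-6bbs6 -n ceph
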
.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (22 < min 30)
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 24 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2675 MB used, 44675 GB / 44678 GB avail
      pgs:     182 active+clean