monitor-failture.rst updated

Change-Id: I979ef44b11ddc56cb41b29c9e60bb1c0bc9191d6
Hee Won Lee 2018-07-18 16:36:18 -04:00
parent 803faedac0
commit ef3e143bd1
2 changed files with 220 additions and 0 deletions


@ -96,3 +96,140 @@ Ceph status shows that the monitor running on ``voyager3`` is now in quorum.
objects: 208 objects, 3359 bytes
usage: 2635 MB used, 44675 GB / 44678 GB avail
pgs: 182 active+clean
Case: A host machine where ceph-mon is running is down
========================================================
This case covers the failure of a host machine on which ceph-mon is running.
Symptom:
--------
After the host (node ``voyager3``) goes down, its node status changes to ``NotReady``.
.. code-block:: console
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
voyager1 Ready master 14d v1.10.5
voyager2 Ready <none> 14d v1.10.5
voyager3 NotReady <none> 14d v1.10.5
voyager4 Ready <none> 14d v1.10.5
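On a larger cluster it can help to filter the node list rather than read it by eye. The following is a minimal sketch, assuming the column layout shown above (NAME first, STATUS second); the function name ``not_ready_nodes`` is hypothetical and not part of any tool.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Note: a combined status such as "Ready,SchedulingDisabled" is also
# reported, because the comparison is an exact string match.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage against a live cluster (assumes kubectl is already configured):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Fed the listing above, this prints ``voyager3``.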
Ceph status shows that the ceph-mon running on ``voyager3`` drops out of quorum.
Also, the 6 osds running on ``voyager3`` are down (i.e., 18 out of 24 osds are up).
Some placement groups become degraded and undersized.
.. code-block:: console
(mon-pod):/# ceph -s
cluster:
id: 9d4d8c61-cf87-4129-9cef-8fbf301210ad
health: HEALTH_WARN
6 osds down
1 host (6 osds) down
Degraded data redundancy: 227/720 objects degraded (31.528%), 8 pgs degraded
too few PGs per OSD (17 < min 30)
mon voyager1 is low on available space
1/3 mons down, quorum voyager1,voyager2
services:
mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
mgr: voyager1(active), standbys: voyager3
mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
osd: 24 osds: 18 up, 24 in
rgw: 2 daemons active
data:
pools: 18 pools, 182 pgs
objects: 240 objects, 3359 bytes
usage: 2695 MB used, 44675 GB / 44678 GB avail
pgs: 227/720 objects degraded (31.528%)
126 active+undersized
48 active+clean
8 active+undersized+degraded
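The two figures called out above (monitors out of quorum, osds down) can be pulled from the plain-text status. This is a rough sketch that assumes the ``mon:`` and ``osd:`` line formats shown in this output; other Ceph releases may format these lines differently. The function names are hypothetical.

```shell
# Both helpers read plain-text `ceph -s` output on stdin.

# Number of OSDs that are not up, from a line such as
# "osd: 24 osds: 18 up, 24 in" (total is field 2, up count is field 4).
osds_down() {
    awk '$1 == "osd:" { print $2 - $4 }'
}

# Monitors listed after "out of quorum:" on the "mon:" services line.
# Prints nothing when every monitor is in quorum.
mons_out_of_quorum() {
    sed -n 's/^[[:space:]]*mon: .*out of quorum:[[:space:]]*//p'
}

# Usage from inside the mon pod (assumes the ceph CLI is available there):
#   ceph -s | osds_down
#   ceph -s | mons_out_of_quorum
```

For the status above, the first helper prints ``6`` and the second prints ``voyager3``.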
The pod status of ceph-mon and ceph-osd shows as ``NodeLost``.
.. code-block:: console
$ kubectl get pods -n ceph -o wide|grep voyager3
ceph-mgr-55f68d44b8-hncrq 1/1 Unknown 6 8d 135.207.240.43 voyager3
ceph-mon-6bbs6 1/1 NodeLost 8 8d 135.207.240.43 voyager3
ceph-osd-default-64779b8c-lbkcd 1/1 NodeLost 1 6d 135.207.240.43 voyager3
ceph-osd-default-6ea9de2c-gp7zm 1/1 NodeLost 2 8d 135.207.240.43 voyager3
ceph-osd-default-7544b6da-7mfdc 1/1 NodeLost 2 8d 135.207.240.43 voyager3
ceph-osd-default-7cfc44c1-hhk8v 1/1 NodeLost 2 8d 135.207.240.43 voyager3
ceph-osd-default-83945928-b95qs 1/1 NodeLost 2 8d 135.207.240.43 voyager3
ceph-osd-default-f9249fa9-n7p4v 1/1 NodeLost 3 8d 135.207.240.43 voyager3
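Instead of ``grep``, the node column can be matched exactly and the listing reduced to pods that are not ``Running``. A minimal sketch, assuming the seven-column ``-o wide`` layout shown above (STATUS in column 3, NODE in column 7); the function name ``pods_not_running`` is hypothetical.

```shell
# Print "<pod> <status>" for every pod on the given node whose STATUS
# is not "Running". Reads `kubectl get pods -o wide --no-headers`
# output on stdin; $1 is the node name.
pods_not_running() {
    awk -v node="$1" '$7 == node && $3 != "Running" { print $1, $3 }'
}

# Usage (assumes kubectl is configured and the namespace is "ceph"):
#   kubectl get pods -n ceph -o wide --no-headers | pods_not_running voyager3
```

Matching on the column avoids false hits when the node name also appears inside a pod name.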
After 10+ minutes, Ceph starts rebalancing with one node lost (i.e., 6 osds down)
and the status stabilizes with 18 osds.
.. code-block:: console
(mon-pod):/# ceph -s
cluster:
id: 9d4d8c61-cf87-4129-9cef-8fbf301210ad
health: HEALTH_WARN
mon voyager1 is low on available space
1/3 mons down, quorum voyager1,voyager2
services:
mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
mgr: voyager1(active), standbys: voyager2
mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
osd: 24 osds: 18 up, 18 in
rgw: 2 daemons active
data:
pools: 18 pools, 182 pgs
objects: 240 objects, 3359 bytes
usage: 2025 MB used, 33506 GB / 33508 GB avail
pgs: 182 active+clean
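The delay of roughly ten minutes is consistent with Ceph's ``mon_osd_down_out_interval`` setting, whose upstream default is 600 seconds; whether this deployment overrides it is not stated here. The drop in total capacity from 44678 GB to 33508 GB can be checked with simple arithmetic, assuming all 24 osds contribute an equal share (an assumption the status output does not confirm). The function name below is hypothetical.

```shell
# Expected raw capacity after some OSDs are marked "out", assuming
# every OSD contributes an equal share of the total.
#   $1 = total capacity in GB
#   $2 = number of OSDs still "in"
#   $3 = total number of OSDs
expected_capacity_gb() {
    awk -v total="$1" -v in_osds="$2" -v all_osds="$3" \
        'BEGIN { printf "%.1f\n", total * in_osds / all_osds }'
}

expected_capacity_gb 44678 18 24   # prints 33508.5
```

The result is within 1 GB of the 33508 GB reported once only 18 osds remain ``in``.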
Recovery:
---------
The node status of ``voyager3`` changes to ``Ready`` after the node is up again.
Also, Ceph pods are restarted automatically.
The Ceph status shows that the monitor running on ``voyager3`` is now in quorum
and 6 osds get back up (i.e., a total of 24 osds are up).
.. code-block:: console
(mon-pod):/# ceph -s
cluster:
id: 9d4d8c61-cf87-4129-9cef-8fbf301210ad
health: HEALTH_WARN
too few PGs per OSD (22 < min 30)
mon voyager1 is low on available space
services:
mon: 3 daemons, quorum voyager1,voyager2,voyager3
mgr: voyager1(active), standbys: voyager2
mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
osd: 24 osds: 24 up, 24 in
rgw: 2 daemons active
data:
pools: 18 pools, 182 pgs
objects: 240 objects, 3359 bytes
usage: 2699 MB used, 44675 GB / 44678 GB avail
pgs: 182 active+clean
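Whether every monitor has rejoined the quorum can be checked from the ``mon:`` services line. The sketch below assumes the line format shown in this output (daemon count in field 2, comma-separated quorum members in field 5); the function name ``quorum_is_full`` is hypothetical.

```shell
# Succeed (exit status 0) only if the number of monitors in quorum
# equals the number of monitor daemons. Reads `ceph -s` on stdin.
quorum_is_full() {
    [ "$(awk '$1 == "mon:" {
        n = split($5, names, ",")
        # A trailing comma (as in "voyager1,voyager2, out of quorum: ...")
        # leaves an empty last element; do not count it.
        if (names[n] == "") n--
        state = (n == $2) ? "full" : "partial"
        print state
        exit
    }')" = "full" ]
}

# Usage from inside the mon pod (assumes the ceph CLI is available there):
#   ceph -s | quorum_is_full && echo "all monitors in quorum"
```

If no ``mon:`` line is found, the function reports failure rather than success.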
Also, the pod status of ceph-mon and ceph-osd changes from ``NodeLost`` back to ``Running``.
.. code-block:: console
$ kubectl get pods -n ceph -o wide|grep voyager3
ceph-mon-6bbs6 1/1 Running 9 8d 135.207.240.43 voyager3
ceph-osd-default-64779b8c-lbkcd 1/1 Running 2 7d 135.207.240.43 voyager3
ceph-osd-default-6ea9de2c-gp7zm 1/1 Running 3 8d 135.207.240.43 voyager3
ceph-osd-default-7544b6da-7mfdc 1/1 Running 3 8d 135.207.240.43 voyager3
ceph-osd-default-7cfc44c1-hhk8v 1/1 Running 3 8d 135.207.240.43 voyager3
ceph-osd-default-83945928-b95qs 1/1 Running 3 8d 135.207.240.43 voyager3
ceph-osd-default-f9249fa9-n7p4v 1/1 Running 4 8d 135.207.240.43 voyager3


@ -123,3 +123,86 @@ symptoms are similar to when 1 or 2 Monitor processes are killed:
The status of the pods (where the three Monitor processes are killed)
changed as follows: ``Running`` -> ``Error`` -> ``CrashLoopBackOff``
-> ``Running`` and this recovery process takes about 1 minute.
Case: Monitor database is destroyed
===================================
We intentionally destroy a Monitor database by removing
``/var/lib/openstack-helm/ceph/mon/mon/ceph-voyager3/store.db``.
Symptom:
--------
The Ceph Monitor running on ``voyager3`` (whose Monitor database was destroyed) drops out of quorum,
and the mon-pod keeps restarting, cycling through ``Running`` -> ``Error`` -> ``CrashLoopBackOff``.
.. code-block:: console
(mon-pod):/# ceph -s
cluster:
id: 9d4d8c61-cf87-4129-9cef-8fbf301210ad
health: HEALTH_WARN
too few PGs per OSD (22 < min 30)
mon voyager1 is low on available space
1/3 mons down, quorum voyager1,voyager2
services:
mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
mgr: voyager1(active), standbys: voyager3
mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
osd: 24 osds: 24 up, 24 in
rgw: 2 daemons active
data:
pools: 18 pools, 182 pgs
objects: 240 objects, 3359 bytes
usage: 2675 MB used, 44675 GB / 44678 GB avail
pgs: 182 active+clean
.. code-block:: console
$ kubectl get pods -n ceph -o wide|grep ceph-mon
ceph-mon-4gzzw 1/1 Running 0 6d 135.207.240.42 voyager2
ceph-mon-6bbs6 0/1 CrashLoopBackOff 5 6d 135.207.240.43 voyager3
ceph-mon-qgc7p 1/1 Running 0 6d 135.207.240.41 voyager1
The logs of the failed mon-pod show that the ceph-mon process cannot run because ``/var/lib/ceph/mon/ceph-voyager3/store.db`` does not exist.
.. code-block:: console
$ kubectl logs ceph-mon-6bbs6 -n ceph
+ ceph-mon --setuser ceph --setgroup ceph --cluster ceph -i voyager3 --inject-monmap /etc/ceph/monmap-ceph --keyring /etc/ceph/ceph.mon.keyring --mon-data /var/lib/ceph/mon/ceph-voyager3
2018-07-10 18:30:04.546200 7f4ca9ed4f00 -1 rocksdb: Invalid argument: /var/lib/ceph/mon/ceph-voyager3/store.db: does not exist (create_if_missing is false)
2018-07-10 18:30:04.546214 7f4ca9ed4f00 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-voyager3': (22) Invalid argument
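To tell this failure apart from other causes of ``CrashLoopBackOff``, the pod log can be searched for the rocksdb message shown above. This is a minimal sketch; the matched text is copied from this log and may differ in other Ceph releases, and the function name ``mon_store_missing`` is hypothetical.

```shell
# Succeed (exit status 0) if the log read on stdin contains the
# "store.db: does not exist" message from rocksdb.
mon_store_missing() {
    grep -qF 'store.db: does not exist'
}

# Usage (assumes kubectl is configured; the pod name is the one shown above):
#   kubectl logs ceph-mon-6bbs6 -n ceph | mon_store_missing \
#       && echo "monitor database is missing"
```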
Recovery:
---------
Remove the entire ceph-mon directory on ``voyager3``; Ceph then automatically
recreates the database from the other ceph-mons' databases.
.. code-block:: console
$ sudo rm -rf /var/lib/openstack-helm/ceph/mon/mon/ceph-voyager3
.. code-block:: console
(mon-pod):/# ceph -s
cluster:
id: 9d4d8c61-cf87-4129-9cef-8fbf301210ad
health: HEALTH_WARN
too few PGs per OSD (22 < min 30)
mon voyager1 is low on available space
services:
mon: 3 daemons, quorum voyager1,voyager2,voyager3
mgr: voyager1(active), standbys: voyager3
mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
osd: 24 osds: 24 up, 24 in
rgw: 2 daemons active
data:
pools: 18 pools, 182 pgs
objects: 240 objects, 3359 bytes
usage: 2675 MB used, 44675 GB / 44678 GB avail
pgs: 182 active+clean