diff --git a/doc/source/troubleshooting_ceph.rst b/doc/source/troubleshooting_ceph.rst
new file mode 100644
index 000000000..3dca9bdef
--- /dev/null
+++ b/doc/source/troubleshooting_ceph.rst
@@ -0,0 +1,96 @@
+..
+  Licensed under the Apache License, Version 2.0 (the "License"); you may
+  not use this file except in compliance with the License. You may obtain
+  a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+  License for the specific language governing permissions and limitations
+  under the License.
+
+--------------------
+Troubleshooting Ceph
+--------------------
+
+.. contents:: Table of Contents
+   :depth: 3
+
+Initial Troubleshooting
+-----------------------
+
+Many stateful services in Airship rely on Ceph to function correctly.
+For more information on Ceph debugging, follow the official
+`Ceph debugging guide `__.
+
+Although Ceph tolerates failures of multiple OSDs, it is important
+to make sure that your Ceph cluster is healthy.
+
+::
+
+    # Many commands require the name of the Ceph monitor pod. Use the following
+    # shell command to assign the pod name to an environment variable for ease
+    # of use.
+    CEPH_MON=$(sudo kubectl get --no-headers pods -n=ceph \
+      -l="application=ceph,component=mon" | awk '{ print $1; exit }')
+
+    # Get the status of the Ceph cluster.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
+
+    # Get the health of the Ceph cluster.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph health detail
+
+The health indicators for Ceph are:
+
+* `HEALTH_OK`: Indicates that the cluster is healthy.
+* `HEALTH_WARN`: Indicates that there may be an issue, but all the data stored
+  in the cluster remains accessible. In some cases Ceph returns to `HEALTH_OK`
+  automatically, e.g. when Ceph finishes the rebalancing process.
+* `HEALTH_ERR`: Indicates a more serious problem that requires immediate
+  attention, as part or all of your data has become inaccessible.
+
+When the cluster is unhealthy and some Placement Groups are reported to be in
+degraded or down states, determine the problem by inspecting the logs of the
+Ceph OSD that is down using ``kubectl``.
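+
+For example, the failing OSD pod can be located and its logs inspected as
+follows (the label selector assumes the standard
+``application=ceph,component=osd`` pod labels; replace the placeholder with
+the name of an OSD pod from your own cluster):
+
+::
+
+    # List the OSD pods and find the one that is reported down.
+    sudo kubectl get pods -n ceph -l "application=ceph,component=osd" -o wide
+
+    # Inspect the logs of that OSD pod.
+    sudo kubectl logs -n ceph <name-of-the-down-osd-pod>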
+
+There are a few other commands that may be useful during debugging:
+
+::
+
+    # Make sure the CEPH_MON variable is set, as described above.
+    echo ${CEPH_MON}
+
+    # List a hierarchy of OSDs in the cluster to see which OSDs are down.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
+
+    # Get detailed information on the status of every Placement Group.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump
+
+    # List allocated block devices.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
+
+    # See which client is using the device.
+    # Note: The PVC name will be different in your cluster.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
+      kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01
+
+    # List all Ceph block devices mounted on a specific host.
+    mount | grep rbd
+
+    # Other Ceph commands can be run through the Monitor pod in the same way.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
diff --git a/doc/source/troubleshooting_guide.rst b/doc/source/troubleshooting_guide.rst
index 77c8f5f62..73ba2888f 100644
--- a/doc/source/troubleshooting_guide.rst
+++ b/doc/source/troubleshooting_guide.rst
@@ -18,6 +18,13 @@ to search and create issues.
 .. contents:: Table of Contents
    :depth: 3
 
+**Additional Troubleshooting**
+
+.. toctree::
+   :maxdepth: 3
+
+   troubleshooting_ceph.rst
+
 ---------------------
 Perform Health Checks
 ---------------------
@@ -232,62 +239,3 @@ by Kubernetes to satisfy replication factor.
 
     # Restart Armada API service.
     kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
-
-----
-Ceph
-----
-
-Many stateful services in Airship rely on Ceph to function correctly.
-For more information on Ceph debugging follow an official
-`Ceph debugging guide `__.
-
-Although Ceph tolerates failures of multiple OSDs, it is important
-to make sure that your Ceph cluster is healthy.
-
-::
-
-    # Get a name of Ceph Monitor pod.
-    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
-      grep ceph-mon | sed -n 1p | sed 's|pod/||')
-    # Get the status of the Ceph cluster.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
-
-Cluster is in a helthy state when ``health`` parameter is set to ``HEALTH_OK``.
-
-When the cluster is unhealthy, and some Placement Groups are reported to be in
-degraded or down states, determine the problem by inspecting the logs of
-Ceph OSD that is down using ``kubectl``.
-
-::
-
-    # Get a name of Ceph Monitor pod.
-    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
-      grep ceph-mon | sed -n 1p | sed 's|pod/||')
-    # List a hierarchy of OSDs in the cluster to see what OSDs are down.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
-
-There are a few other commands that may be useful during the debugging:
-
-::
-
-    # Get a name of Ceph Monitor pod.
-    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
-      grep ceph-mon | sed -n 1p | sed 's|pod/||')
-
-    # Get a detailed information on the status of every Placement Group.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump
-
-    # List allocated block devices.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
-    # See what client uses the device.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
-      kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01
-
-    # List all Ceph block devices mounted on a specific host.
-    mount | grep rbd
-
-    # Exec into the Monitor pod
-    MON_POD=$(sudo kubectl get --no-headers pods -n=ceph \
-      -l="application=ceph,component=mon" | awk '{ print $1; exit }')
-    echo $MON_POD
-    sudo kubectl exec -n ceph ${MON_POD} -- ceph -s