From a37202b0636842d20b169b3198d6869108deea86 Mon Sep 17 00:00:00 2001
From: Danny Massa
Date: Thu, 11 Jun 2020 11:16:15 -0500
Subject: [PATCH] Separate Ceph content into its own section

There is a lot of troubleshooting content related to Ceph that will be
submitted in subsequent patchsets, covering common issues related to
Monitors, OSDs, and PGs. That amount of content would overshadow the more
surface-level instructions in the troubleshooting guide, so Ceph now has
its own section under the Troubleshooting Guide.

Change-Id: I27f67f6813eed4823ea5ff43bcadd7d269a8afa9
---
 doc/source/troubleshooting_ceph.rst  | 83 ++++++++++++++++++++++++++++
 doc/source/troubleshooting_guide.rst | 66 +++------------------
 2 files changed, 90 insertions(+), 59 deletions(-)
 create mode 100644 doc/source/troubleshooting_ceph.rst

diff --git a/doc/source/troubleshooting_ceph.rst b/doc/source/troubleshooting_ceph.rst
new file mode 100644
index 000000000..3dca9bdef
--- /dev/null
+++ b/doc/source/troubleshooting_ceph.rst
@@ -0,0 +1,83 @@
+..
+  Licensed under the Apache License, Version 2.0 (the "License"); you may
+  not use this file except in compliance with the License. You may obtain
+  a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+  License for the specific language governing permissions and limitations
+  under the License.
+
+--------------------
+Troubleshooting Ceph
+--------------------
+
+.. contents:: Table of Contents
+   :depth: 3
+
+Initial Troubleshooting
+-----------------------
+
+Many stateful services in Airship rely on Ceph to function correctly.
+For more information on Ceph debugging, follow the official
+`Ceph debugging guide <https://docs.ceph.com/en/latest/rados/troubleshooting/>`__.
+
+Although Ceph tolerates failures of multiple OSDs, it is important
+to make sure that your Ceph cluster is healthy.
+
+::
+
+  # Many commands require the name of the Ceph monitor pod; use the following
+  # shell command to assign the pod name to an environment variable for ease
+  # of use.
+  CEPH_MON=$(sudo kubectl get --no-headers pods -n=ceph \
+    -l "application=ceph,component=mon" | awk '{ print $1; exit }')
+
+  # Get the status of the Ceph cluster.
+  sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
+
+  # Get the health of the Ceph cluster.
+  sudo kubectl exec -n ceph ${CEPH_MON} -- ceph health detail
+
+The health indicators for Ceph are:
+
+* `HEALTH_OK`: Indicates the cluster is healthy.
+* `HEALTH_WARN`: Indicates there may be an issue, but all the data stored in
+  the cluster remains accessible. In some cases Ceph returns to `HEALTH_OK`
+  automatically, e.g. when Ceph finishes the rebalancing process.
+* `HEALTH_ERR`: Indicates a more serious problem that requires immediate
+  attention, as part or all of your data has become inaccessible.
+
+When the cluster is unhealthy, and some Placement Groups are reported to be
+in degraded or down states, determine the problem by inspecting the logs of
+the Ceph OSD that is down using ``kubectl``.
+
+There are a few other commands that may be useful during debugging:
+
+::
+
+  # Make sure the CEPH_MON variable is set, as described above.
+  echo ${CEPH_MON}
+
+  # List a hierarchy of OSDs in the cluster to see which OSDs are down.
+  sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
+
+  # Get detailed information on the status of every Placement Group.
+  sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump
+
+  # List allocated block devices.
+  sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
+
+  # See which client uses the device.
+  # Note: the PVC name will be different in your cluster.
+  sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
+    kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01
+
+  # List all Ceph block devices mounted on a specific host.
+  mount | grep rbd
+
+  # Get the cluster status from within the Monitor pod.
+  sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
diff --git a/doc/source/troubleshooting_guide.rst b/doc/source/troubleshooting_guide.rst
index 77c8f5f62..73ba2888f 100644
--- a/doc/source/troubleshooting_guide.rst
+++ b/doc/source/troubleshooting_guide.rst
@@ -18,6 +18,13 @@ to search and create issues.
 .. contents:: Table of Contents
    :depth: 3
 
+**Additional Troubleshooting**
+
+.. toctree::
+   :maxdepth: 3
+
+   troubleshooting_ceph.rst
+
 ---------------------
 Perform Health Checks
 ---------------------
@@ -232,62 +239,3 @@ by Kubernetes to satisfy replication factor.
 
     # Restart Armada API service.
     kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
-
-----
-Ceph
-----
-
-Many stateful services in Airship rely on Ceph to function correctly.
-For more information on Ceph debugging follow an official
-`Ceph debugging guide `__.
-
-Although Ceph tolerates failures of multiple OSDs, it is important
-to make sure that your Ceph cluster is healthy.
-
-::
-
-    # Get a name of Ceph Monitor pod.
-    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
-    grep ceph-mon | sed -n 1p | sed 's|pod/||')
-    # Get the status of the Ceph cluster.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
-
-Cluster is in a helthy state when ``health`` parameter is set to ``HEALTH_OK``.
-
-When the cluster is unhealthy, and some Placement Groups are reported to be in
-degraded or down states, determine the problem by inspecting the logs of
-Ceph OSD that is down using ``kubectl``.
-
-::
-
-    # Get a name of Ceph Monitor pod.
-    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
-    grep ceph-mon | sed -n 1p | sed 's|pod/||')
-    # List a hierarchy of OSDs in the cluster to see what OSDs are down.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
-
-There are a few other commands that may be useful during the debugging:
-
-::
-
-    # Get a name of Ceph Monitor pod.
-    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
-    grep ceph-mon | sed -n 1p | sed 's|pod/||')
-
-    # Get a detailed information on the status of every Placement Group.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump
-
-    # List allocated block devices.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
-    # See what client uses the device.
-    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
-    kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01
-
-    # List all Ceph block devices mounted on a specific host.
-    mount | grep rbd
-
-    # Exec into the Monitor pod
-    MON_POD=$(sudo kubectl get --no-headers pods -n=ceph \
-    l="application=ceph,component=mon" | awk '{ print $1; exit }')
-    echo $MON_POD
-    sudo kubectl exec -n ceph ${MON_POD} -- ceph -s
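An aside for reviewers: the pod-name extraction this patch introduces pipes `kubectl get --no-headers` output through `awk '{ print $1; exit }'`, which simply takes the first column of the first line. A minimal offline sketch of just that extraction step, using made-up pod listing output (the pod names below are illustrative assumptions, not real cluster output):

```shell
# Simulated output of `kubectl get --no-headers pods -n ceph -l ...`.
# The pod names here are made up for illustration only.
sample_output='ceph-mon-85b4d   3/3   Running   0   4d
ceph-mon-9xk2q   3/3   Running   0   4d'

# Same extraction as in the patch: print the first field of the first
# record, then stop, so only one pod name is captured.
CEPH_MON=$(echo "$sample_output" | awk '{ print $1; exit }')
echo "${CEPH_MON}"
```

Because of the `exit`, only the first monitor pod is selected even when several are running, which is all the subsequent `kubectl exec` commands need.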