Merge "Added Health Checks to the troubleshooting guide"

This commit is contained in:
Zuul 2020-06-03 08:05:06 +00:00 committed by Gerrit Code Review
commit 15de25b1c7
2 changed files with 129 additions and 3 deletions

.gitignore

@@ -4,3 +4,13 @@ peggles/
# Unit test / coverage reports
.tox/
config-ssh
# Sphinx Build Files
_build
# Various user specific files
.DS_Store
.idea/
.vimrc
*.swp
.vscode/


@@ -1,3 +1,4 @@
=====================
Troubleshooting Guide
=====================
@@ -10,9 +11,117 @@ root cause of the problem.
For additional support you can contact the Airship team via
`IRC or mailing list <https://www.airshipit.org/community/>`__,
or use the `Airship bug tracker <https://storyboard.openstack.org/#!/project_group/Airship>`__
to search and create issues.
.. contents:: Table of Contents
:depth: 3
---------------------
Perform Health Checks
---------------------
The first step in troubleshooting an Airship deployment is to identify unhealthy
services by performing health checks.
Verify Peering is established
-----------------------------
::
sudo /opt/cni/bin/calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-----------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-----------+-------+------------+-------------+
| 172.29.0.2 | global | up | 2018-05-22 | Established |
| 172.29.0.3 | global | up | 2018-05-22 | Established |
+--------------+-----------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
Verify that **STATE** is ``up`` and **INFO** is ``Established``. However, if
**STATE** is ``start`` and **INFO** is ``Connect``, peering has failed.
For more information on Calico troubleshooting, visit the
`Calico Documentation <https://docs.projectcalico.org/introduction/>`__.
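If peering has not been established, listing the configured BGP peers and
Calico nodes can help confirm that the peer addresses are correct. These are
generic ``calicoctl`` queries and not part of the original guide.
::
# List configured BGP peers and Calico node status.
sudo /opt/cni/bin/calicoctl get bgpPeer
sudo /opt/cni/bin/calicoctl get nodes -o wide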
Verify the Health of Kubernetes
-------------------------------
::
# Verify that for all nodes, STATE is Ready.
#
# Note: After a reboot, it may take as long as 30 minutes for
# a node to stabilize and reach a Ready condition.
kubectl get nodes
# Verify that all pod containers are ready.
# This command exposes Running pods whose containers are not ready,
# e.g. because their readiness probes are failing.
kubectl get pods --all-namespaces | grep Running | grep 0/
# Verify that all pods are in the Running or Completed state.
# This command exposes pods that are not running or completed.
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# Look for crashed pods.
kubectl get pods --all-namespaces -o wide | grep Crash
# Check the health of core services.
kubectl get pods --all-namespaces -o wide | grep core
kubectl get services --all-namespaces | grep core
# Check the health of proxy services.
kubectl get pods --all-namespaces -o wide | grep proxy
# Get all pod details.
kubectl get pods --all-namespaces -o wide -w
# Look for failed jobs.
kubectl get jobs --all-namespaces -o wide | grep -v "1 1"
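# Once an unhealthy pod has been identified, describe it and check its
# logs for details. These are generic kubectl commands, not part of the
# original guide; <namespace> and <pod-name> are placeholders.
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --previous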
Verify the Health of OpenStack
------------------------------
Check OpenStack's health by issuing the following commands at the terminal.
In order to do so, you must have downloaded and sourced an OpenStack RC file;
see the `OpenStack documentation <https://docs.openstack.org/mitaka/cli-reference/common/cli_set_environment_variables_using_openstack_rc.html#download-and-source-the-openstack-rc-file>`__
for details.
::
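# Source the downloaded OpenStack RC file first; the file name below
# is an example placeholder.
source <project-name>-openrc.sh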
# Verify Keystone by requesting a token.
openstack token issue
# Verify networks.
openstack network list
# Verify subnets.
openstack subnet list
# Verify VMs.
openstack server list
# Verify compute hypervisors.
openstack hypervisor list
# Verify images.
openstack image list
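# Additional service checks that may be useful (standard OpenStack CLI
# commands, not listed in the original guide).
openstack compute service list
openstack network agent list
openstack volume service list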
Check for kube-proxy iptables NAT Issues
----------------------------------------
::
# Check the iptables and make sure the IP addresses are the same:
% iptables -n -t nat -L | grep coredns
% kubectl -n kube-system get -o wide pod | grep coredns
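# The coredns pod IP addresses reported by kubectl should match the
# destination addresses in the DNAT rules above. If they differ, the NAT
# table is stale; deleting the kube-proxy pod on the affected node so it
# is recreated is one possible remediation (a suggestion beyond the
# original guide; the pod name below is a placeholder).
% kubectl -n kube-system delete pod <kube-proxy-pod-name>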
-----------------------
Configuring Airship CLI
-----------------------
@@ -32,6 +141,7 @@ how to get it configured on your environment.
# Run it without arguments to get a help message.
sudo ./treasuremap/tools/airship
---------------------
Manifests Preparation
---------------------
@@ -62,6 +172,7 @@ Example:
sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
render -o rendered.txt ${SITE}
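# Note: ${SITE} above is assumed to hold the name of the target site,
# set beforehand, e.g. (hypothetical site name):
# export SITE=airship-seaworthy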
------------------
Deployment Failure
------------------
@@ -122,6 +233,7 @@ by Kubernetes to satisfy replication factor.
# Restart Armada API service.
kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
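# The pod name above is an example; list the current Armada API pod with:
kubectl get pods -n ucp | grep armada-api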
----
Ceph
----
@@ -132,8 +244,6 @@ For more information on Ceph debugging follow an official
Although Ceph tolerates failures of multiple OSDs, it is important
to make sure that your Ceph cluster is healthy.
Example:
::
# Get the name of a Ceph Monitor pod.
@@ -175,3 +285,9 @@ There are a few other commands that may be useful during debugging:
# List all Ceph block devices mounted on a specific host.
mount | grep rbd
# Exec into the Ceph Monitor pod.
MON_POD=$(sudo kubectl get --no-headers pods -n=ceph \
-l="application=ceph,component=mon" | awk '{ print $1; exit }')
echo $MON_POD
sudo kubectl exec -n ceph ${MON_POD} -- ceph -s
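# Additional cluster status commands that can be run the same way
# (standard Ceph CLI, not specific to this guide).
sudo kubectl exec -n ceph ${MON_POD} -- ceph health detail
sudo kubectl exec -n ceph ${MON_POD} -- ceph osd tree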