Added Health Checks to the troubleshooting guide

Added information about health checks that can be used to verify
deployment of airship 2. Also made two additions to .gitignore
to refrain from tracking Sphinx build files and IDE files.

Change-Id: Icbf39860e9e137261b302ad5649fb48b095f6220
changes/15/727815/8
Danny Massa 9 months ago
commit a656374c16
2 changed files with 129 additions and 3 deletions
  1. .gitignore (+10, -0)
  2. doc/source/troubleshooting_guide.rst (+119, -3)

.gitignore (+10, -0)

@@ -4,3 +4,13 @@ peggles/
# Unit test / coverage reports
.tox/
config-ssh
# Sphinx Build Files
_build
# Various user specific files
.DS_Store
.idea/
.vimrc
*.swp
.vscode/

doc/source/troubleshooting_guide.rst (+119, -3)

@@ -1,3 +1,4 @@
=====================
Troubleshooting Guide
=====================
@@ -10,9 +11,117 @@ root cause of the problem.
For additional support you can contact the Airship team via
`IRC or mailing list <https://www.airshipit.org/community/>`__,
use `Airship bug tracker <https://storyboard.openstack.org/#!/
project_group/Airship>`__
to search and create issues.
.. contents:: Table of Contents
:depth: 3
---------------------
Perform Health Checks
---------------------
The first step in troubleshooting an Airship deployment is to identify unhealthy
services by performing health checks.
Verify Peering Is Established
-----------------------------
::
sudo /opt/cni/bin/calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-----------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-----------+-------+------------+-------------+
| 172.29.0.2 | global | up | 2018-05-22 | Established |
| 172.29.0.3 | global | up | 2018-05-22 | Established |
+--------------+-----------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
Verify that **STATE** is ``up`` and **INFO** is ``Established``. However, if
**STATE** is ``start`` and **INFO** is ``Connect``, peering has failed.
For more information on Calico troubleshooting, visit the
`Calico Documentation <https://docs.projectcalico.org/introduction/>`__.
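
The peering check above can also be scripted. The following is a minimal
sketch, not part of Airship: ``check_bgp_peers`` is a hypothetical helper
name, and it assumes the table layout shown in the sample output, with
**STATE** in the third column and **INFO** in the fifth.

::

    # Hypothetical helper: reads `calicoctl node status` output on stdin,
    # prints every peer row that is not up/Established, and returns
    # non-zero if it found one.
    check_bgp_peers() {
        awk -F'|' '
            # Peer rows start with a pipe; skip the header row and borders.
            /^[[:space:]]*\|/ && $2 !~ /PEER ADDRESS/ {
                state = $4; info = $6
                gsub(/[[:space:]]/, "", state)
                gsub(/[[:space:]]/, "", info)
                if (state != "up" || info != "Established") { print; bad = 1 }
            }
            END { exit bad + 0 }
        '
    }

To use it, pipe the command output into the function, for example
``sudo /opt/cni/bin/calicoctl node status | check_bgp_peers``. A non-zero
exit status means at least one peer row needs attention.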
Verify the Health of Kubernetes
-------------------------------
::
# Verify that for all nodes, STATUS is Ready.
#
# Note: After a reboot, it may take as long as 30 minutes for
# a node to stabilize and reach a Ready condition.
kubectl get nodes
# Verify that readiness probes for all pods are working.
# This command exposes Running pods that report zero ready containers.
kubectl get pods --all-namespaces | grep Running | grep 0/
# Verify that all pods are in the Running or Completed state.
# This command exposes pods that are not running or completed.
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# Look for crashed pods.
kubectl get pods --all-namespaces -o wide | grep Crash
# Check the health of core services.
kubectl get pods --all-namespaces -o wide | grep core
kubectl get services --all-namespaces | grep core
# Check the health of proxy services.
kubectl get pods --all-namespaces -o wide | grep proxy
# Get all pod details.
kubectl get pods --all-namespaces -o wide -w
# Look for failed jobs.
kubectl get jobs --all-namespaces -o wide | grep -v "1 1"
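
Several of the pod checks above can be combined into one filter. This is a
minimal sketch under stated assumptions: ``unhealthy_pods`` is a hypothetical
helper name, and it relies on the default column order of
``kubectl get pods --all-namespaces`` (NAMESPACE, NAME, READY, STATUS). It is
slightly stricter than the ``grep 0/`` check, because it also reports Running
pods that are only partially ready.

::

    # Hypothetical helper: reads `kubectl get pods --all-namespaces` output
    # on stdin and prints pods that are neither Completed nor Running with
    # all containers ready.
    unhealthy_pods() {
        awk '
            NR == 1 { next }               # skip the header row
            $4 == "Completed" { next }     # finished jobs are fine
            {
                split($3, ready, "/")      # READY column, e.g. "1/2"
                if ($4 != "Running" || ready[1] != ready[2]) print
            }
        '
    }

Run it as ``kubectl get pods --all-namespaces | unhealthy_pods``. Empty
output means every pod is either Completed or Running and fully ready.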
Verify the Health of OpenStack
------------------------------
Check OpenStack's health by issuing the following commands at the terminal.
To do so, you must first source an OpenStack RC file, as described
`here <https://docs.openstack.org/mitaka/cli-reference/common/cli_set_
environment_variables_using_openstack_rc.html#download-and-source-the-
openstack-rc-file>`__.
::
# Verify Keystone by requesting a token.
openstack token issue
# Verify networks.
openstack network list
# Verify subnets.
openstack subnet list
# Verify VMs.
openstack server list
# Verify compute hypervisors.
openstack hypervisor list
# Verify images.
openstack image list
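
To run all of these checks in one pass, a small wrapper can help. The sketch
below is illustrative only: ``run_openstack_checks`` is a hypothetical name,
and it assumes the OpenStack RC file has already been sourced in the current
shell.

::

    # Hypothetical wrapper: runs each read-only check listed above and
    # reports which ones fail. Returns non-zero if any check failed.
    run_openstack_checks() {
        failed=0
        for check in "token issue" "network list" "subnet list" \
                     "server list" "hypervisor list" "image list"; do
            # $check is intentionally unquoted so it splits into two words.
            if openstack $check >/dev/null 2>&1; then
                echo "OK     openstack $check"
            else
                echo "FAILED openstack $check"
                failed=1
            fi
        done
        return "$failed"
    }

The wrapper discards command output, so rerun any failing command by hand
to see its error message.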
Check for kube-proxy iptables NAT Issues
----------------------------------------
::
# Compare the coredns addresses in the iptables NAT rules with the
# coredns pod IP addresses and make sure they are the same:
% iptables -n -t nat -L | grep coredns
% kubectl -n kube-system get -o wide pod | grep coredns
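
The comparison can be automated along the following lines. This is a sketch
under assumptions, not a definitive check: ``missing_nat_targets`` is a
hypothetical helper name, the pod IP is assumed to be the sixth column of the
``kubectl`` output, and the NAT rules are assumed to reference each pod as
``to:<ip>:<port>``.

::

    # Hypothetical helper: $1 is the output of the iptables command above,
    # $2 is the output of the kubectl command above. Prints every coredns
    # pod IP that no NAT rule targets and returns non-zero if any is found.
    missing_nat_targets() {
        nat_rules=$1
        pods=$2
        status=0
        # Column 6 of `kubectl get pod -o wide` is assumed to be the pod IP.
        for ip in $(printf '%s\n' "$pods" | awk '{ print $6 }'); do
            if ! printf '%s\n' "$nat_rules" | grep -qF "to:$ip:"; then
                echo "no NAT rule targets coredns pod $ip"
                status=1
            fi
        done
        return "$status"
    }

Call it with the two command outputs as arguments, for example
``missing_nat_targets "$(iptables -n -t nat -L | grep coredns)"
"$(kubectl -n kube-system get -o wide pod | grep coredns)"``.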
-----------------------
Configuring Airship CLI
-----------------------
@@ -32,6 +141,7 @@ how to get it configured on your environment.
# Run it without arguments to get a help message.
sudo ./treasuremap/tools/airship
---------------------
Manifests Preparation
---------------------
@@ -62,6 +172,7 @@ Example:
sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
render -o rendered.txt ${SITE}
------------------
Deployment Failure
------------------
@@ -122,6 +233,7 @@ by Kubernetes to satisfy replication factor.
# Restart Armada API service.
kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
----
Ceph
----
@@ -132,8 +244,6 @@ For more information on Ceph debugging follow an official
Although Ceph tolerates failures of multiple OSDs, it is important
to make sure that your Ceph cluster is healthy.
Example:
::
# Get a name of Ceph Monitor pod.
@@ -175,3 +285,9 @@ There are a few other commands that may be useful during the debugging:
# List all Ceph block devices mounted on a specific host.
mount | grep rbd
# Exec into the Monitor pod
MON_POD=$(sudo kubectl get --no-headers pods -n=ceph \
-l="application=ceph,component=mon" | awk '{ print $1; exit }')
echo $MON_POD
sudo kubectl exec -n ceph ${MON_POD} -- ceph -s
