Merge "Added Health Checks to the troubleshooting guide"
Commit 15de25b1c7

@ -4,3 +4,13 @@ peggles/
# Unit test / coverage reports
.tox/
config-ssh

# Sphinx Build Files
_build

# Various user specific files
.DS_Store
.idea/
.vimrc
*.swp
.vscode/

@ -1,3 +1,4 @@
=====================
Troubleshooting Guide
=====================

@ -10,9 +11,117 @@ root cause of the problem.
For additional support, you can contact the Airship team via
`IRC or mailing list <https://www.airshipit.org/community/>`__,
or use the `Airship bug tracker <https://storyboard.openstack.org/#!/project_group/Airship>`__
to search for and create issues.

.. contents:: Table of Contents
   :depth: 3

---------------------
Perform Health Checks
---------------------

The first step in troubleshooting an Airship deployment is to identify unhealthy
services by performing health checks.

Verify Calico Peering Is Established
------------------------------------

::

    sudo /opt/cni/bin/calicoctl node status

    Calico process is running.

    IPv4 BGP status
    +--------------+-----------+-------+------------+-------------+
    | PEER ADDRESS | PEER TYPE | STATE | SINCE      | INFO        |
    +--------------+-----------+-------+------------+-------------+
    | 172.29.0.2   | global    | up    | 2018-05-22 | Established |
    | 172.29.0.3   | global    | up    | 2018-05-22 | Established |
    +--------------+-----------+-------+------------+-------------+

    IPv6 BGP status
    No IPv6 peers found.

Verify that **STATE** is ``up`` and **INFO** is ``Established`` for every peer.
If **STATE** is ``start`` and **INFO** is ``Connect``, peering has failed.

For more information on Calico troubleshooting, see the
`Calico Documentation <https://docs.projectcalico.org/introduction/>`__.

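
Reading the peer table by eye is error-prone on nodes with many peers. The following bash sketch filters the ``calicoctl node status`` output and prints every peer whose **STATE** is not ``up`` or whose **INFO** is not ``Established``. It assumes the table layout shown above. The function name ``unhealthy_calico_peers`` is hypothetical and is not part of Calico.

```shell
#!/usr/bin/env bash
# Sketch: list Calico BGP peers that are not up/Established.
# Assumes the table layout printed by `calicoctl node status`:
#   | PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |

# Read `calicoctl node status` output on stdin and print
# "<peer address> <state> <info>" for each unhealthy peer.
# Returns 1 if any unhealthy peer is found, 0 otherwise.
unhealthy_calico_peers() {
  awk -F'|' '
    # Table rows split into 7 fields around the "|" separators;
    # border lines (+----+) contain no "|" and are skipped.
    NF >= 7 && $2 !~ /PEER ADDRESS/ {
      addr = $2; state = $4; info = $6
      # Trim the padding around each cell.
      gsub(/^ +| +$/, "", addr)
      gsub(/^ +| +$/, "", state)
      gsub(/^ +| +$/, "", info)
      if (state != "up" || info != "Established") {
        print addr, state, info
        bad = 1
      }
    }
    END { exit bad }
  '
}
```

For example, run ``sudo /opt/cni/bin/calicoctl node status | unhealthy_calico_peers``. Because the function exits non-zero when it reports a peer, it can also gate a script.
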

Verify the Health of Kubernetes
-------------------------------

::

    # Verify that for all nodes, STATUS is Ready.
    #
    # Note: After a reboot, it may take as long as 30 minutes for
    # a node to stabilize and reach a Ready condition.
    kubectl get nodes

    # Verify that the containers of all running pods are ready.
    # This command exposes Running pods with no ready containers
    # (READY shows 0/N), for example because a probe is failing.
    kubectl get pods --all-namespaces | awk '$4 == "Running" && $3 ~ /^0\//'

    # Verify that all pods are in the Running or Completed state.
    # This command exposes pods that are neither running nor completed.
    kubectl get pods --all-namespaces | grep -v -e Running -e Completed

    # Look for crashed pods.
    kubectl get pods --all-namespaces -o wide | grep Crash

    # Check the health of core services.
    kubectl get pods --all-namespaces -o wide | grep core
    kubectl get services --all-namespaces | grep core

    # Check the health of proxy services.
    kubectl get pods --all-namespaces -o wide | grep proxy

    # Get all pod details and watch for changes (press Ctrl+C to stop).
    kubectl get pods --all-namespaces -o wide -w

    # Look for failed jobs.
    kubectl get jobs --all-namespaces -o wide | grep -v "1 1"

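
The ``grep`` filters above match substrings anywhere in a line. As an alternative, the following bash sketch compares columns directly. It lists nodes whose **STATUS** is not exactly ``Ready``, and pods that are neither ``Completed`` nor ``Running`` with all containers ready. It assumes the default table layouts of ``kubectl get nodes`` and ``kubectl get pods --all-namespaces``, including the header row. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: column-based filters for `kubectl get` table output.
# Assumes the default layout with a header row (no --no-headers).

# Read `kubectl get nodes` output; print nodes whose STATUS (column 2)
# is not exactly "Ready". Cordoned nodes ("Ready,SchedulingDisabled")
# are reported as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Read `kubectl get pods --all-namespaces` output
# (NAMESPACE NAME READY STATUS ...); print pods that are not Completed
# and are either not Running or not fully ready.
unhealthy_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")          # READY is "<ready>/<total>"
    if ($4 == "Completed") next    # finished job pods are expected
    if ($4 != "Running" || ready[1] != ready[2]) print $1, $2, $3, $4
  }'
}
```

For example: ``kubectl get nodes | not_ready_nodes`` and ``kubectl get pods --all-namespaces | unhealthy_pods``.
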

Verify the Health of OpenStack
------------------------------

Check OpenStack's health by running the following commands. Before running
them, set up your environment by sourcing an OpenStack RC file, as described
in the `OpenStack CLI documentation <https://docs.openstack.org/mitaka/cli-reference/common/cli_set_environment_variables_using_openstack_rc.html#download-and-source-the-openstack-rc-file>`__.

::

    # Verify Keystone by requesting a token.
    openstack token issue

    # Verify networks.
    openstack network list

    # Verify subnets.
    openstack subnet list

    # Verify VMs.
    openstack server list

    # Verify compute hypervisors.
    openstack hypervisor list

    # Verify images.
    openstack image list

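
To run the checks above in one pass, the following bash sketch executes each listed ``openstack`` command and reports any that fail. It also flags hypervisors whose **State** is not ``up``. It assumes an RC file has already been sourced. ``run_openstack_checks`` and ``down_hypervisors`` are hypothetical helper names. The hypervisor filter expects the output of ``openstack hypervisor list -f value -c "Hypervisor Hostname" -c State``, and the available column names may differ between client versions.

```shell
#!/usr/bin/env bash
# Sketch: run the OpenStack health checks in one pass.
# Assumes an OpenStack RC file has already been sourced in this shell.

# Run each check, hide its normal output, and print the ones that fail.
# Returns 1 if any check fails.
run_openstack_checks() {
  local failed=0 sub
  for sub in "token issue" "network list" "subnet list" \
             "server list" "hypervisor list" "image list"; do
    # $sub is left unquoted on purpose so it splits into subcommand words.
    if ! openstack $sub > /dev/null; then
      echo "FAILED: openstack $sub"
      failed=1
    fi
  done
  return "$failed"
}

# Read "<hostname> <state>" lines, e.g. from
#   openstack hypervisor list -f value -c "Hypervisor Hostname" -c State
# and print hypervisors whose state is not "up". Returns 1 if any are found.
down_hypervisors() {
  awk 'NF >= 2 && $NF != "up" { print $1, $NF; bad = 1 } END { exit bad }'
}
```

Error messages from a failing command still go to standard error, which helps explain why that check failed.
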

Check for kube-proxy iptables NAT Issues
----------------------------------------

::

    # Check that the CoreDNS endpoint IPs in the NAT rules managed by
    # kube-proxy match the IPs of the running CoreDNS pods.
    sudo iptables -n -t nat -L | grep coredns
    kubectl -n kube-system get -o wide pod | grep coredns

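
Comparing the two listings by eye works for a few pods. The following bash sketch automates the comparison. It extracts the DNAT target addresses (``to:IP:PORT``) from saved ``iptables`` output and the IPv4 addresses from saved ``kubectl`` output, then reports any address that appears on only one side. It assumes IPv4 and that node names in the ``kubectl -o wide`` output do not look like IP addresses. It also assumes both listings were filtered with ``grep coredns`` as above. ``compare_coredns_ips`` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compare CoreDNS endpoint IPs in kube-proxy NAT rules with the
# IPs of the CoreDNS pods. Assumes IPv4 and pre-filtered input files.
export LC_ALL=C  # sort and comm must use the same collation

# Unique DNAT targets ("to:IP:PORT" -> IP) from `iptables -n -t nat -L` lines.
nat_target_ips() {
  grep -oE 'to:([0-9]{1,3}\.){3}[0-9]{1,3}' | cut -d: -f2 | sort -u
}

# Unique IPv4 addresses from `kubectl get -o wide pod` lines.
pod_ips() {
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u
}

# compare_coredns_ips NAT_FILE POD_FILE
# Prints addresses present on only one side; returns 1 on any mismatch.
compare_coredns_ips() {
  local stale missing
  # In the NAT rules but not assigned to any CoreDNS pod.
  stale=$(comm -23 <(nat_target_ips < "$1") <(pod_ips < "$2"))
  # Assigned to a CoreDNS pod but missing from the NAT rules.
  missing=$(comm -13 <(nat_target_ips < "$1") <(pod_ips < "$2"))
  if [ -z "$stale" ] && [ -z "$missing" ]; then
    echo "OK: NAT targets match CoreDNS pod IPs"
    return 0
  fi
  # Unquoted on purpose: one output line per address.
  if [ -n "$stale" ]; then printf 'stale NAT target: %s\n' $stale; fi
  if [ -n "$missing" ]; then printf 'missing from NAT: %s\n' $missing; fi
  return 1
}
```

Save the output of the two commands above to files (for example ``nat.txt`` and ``pods.txt``) and run ``compare_coredns_ips nat.txt pods.txt``. A non-zero exit status means the addresses differ.
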

-----------------------
Configuring Airship CLI
-----------------------

@ -32,6 +141,7 @@ how to get it configured on your environment.
    # Run it without arguments to get a help message.
    sudo ./treasuremap/tools/airship

---------------------
Manifests Preparation
---------------------

@ -62,6 +172,7 @@ Example:
    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
      render -o rendered.txt ${SITE}

------------------
Deployment Failure
------------------

@ -122,6 +233,7 @@ by Kubernetes to satisfy replication factor.
    # Restart Armada API service.
    kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv

----
Ceph
----

@ -132,8 +244,6 @@ For more information on Ceph debugging follow an official
Although Ceph tolerates failures of multiple OSDs, it is important
to make sure that your Ceph cluster is healthy.

::

    # Get the name of a Ceph Monitor pod.

@ -175,3 +285,9 @@ There are a few other commands that may be useful during the debugging:
    # List all Ceph block devices mounted on a specific host.
    mount | grep rbd

    # Run "ceph -s" inside a Ceph Monitor pod to print the cluster status.
    MON_POD=$(sudo kubectl get --no-headers pods -n ceph \
      -l "application=ceph,component=mon" | awk '{ print $1; exit }')
    echo "${MON_POD}"
    sudo kubectl exec -n ceph "${MON_POD}" -- ceph -s

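
To script the health check instead of reading the full ``ceph -s`` output, the following bash sketch interprets the output of ``ceph health``. The first word of that output is ``HEALTH_OK``, ``HEALTH_WARN`` or ``HEALTH_ERR``. The function name ``ceph_health_ok`` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a Ceph cluster is healthy from `ceph health`
# output, whose first word is HEALTH_OK, HEALTH_WARN or HEALTH_ERR.

# Read `ceph health` output on stdin, print the status word
# (UNKNOWN if the input is empty), and return 0 only for HEALTH_OK.
ceph_health_ok() {
  local status
  read -r status _ || true   # first word of the first line
  echo "${status:-UNKNOWN}"
  [ "$status" = "HEALTH_OK" ]
}
```

For example: ``sudo kubectl exec -n ceph "${MON_POD}" -- ceph health | ceph_health_ok``.
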