From a656374c16620f6e11580ca00b2562a8c4f6c21f Mon Sep 17 00:00:00 2001
From: Danny Massa
Date: Wed, 13 May 2020 09:51:15 -0500
Subject: [PATCH] Added Health Checks to the troubleshooting guide

Added information about health checks that can be used to verify
deployment of airship 2. Also made two additions to .gitignore to
refrain from tracking Sphinx build files and IDE files.

Change-Id: Icbf39860e9e137261b302ad5649fb48b095f6220
---
 .gitignore                           |  10 +++
 doc/source/troubleshooting_guide.rst | 122 ++++++++++++++++++++++++++-
 2 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/.gitignore b/.gitignore
index ee3c0fba8..13e9d592a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,13 @@ peggles/
 # Unit test / coverage reports
 .tox/
 config-ssh
+
+# Sphinx Build Files
+_build
+
+# Various user specific files
+.DS_Store
+.idea/
+.vimrc
+*.swp
+.vscode/
\ No newline at end of file
diff --git a/doc/source/troubleshooting_guide.rst b/doc/source/troubleshooting_guide.rst
index 072a4897d..77c8f5f62 100644
--- a/doc/source/troubleshooting_guide.rst
+++ b/doc/source/troubleshooting_guide.rst
@@ -1,3 +1,4 @@
+=====================
 Troubleshooting Guide
 =====================
 
@@ -10,9 +11,117 @@ root cause of the problem.
 
 For additional support you can contact the Airship team via
 `IRC or mailing list `__,
-use `Airship bug tracker `__
+use `Airship bug tracker `__
 to search and create issues.
 
+.. contents:: Table of Contents
+    :depth: 3
+
+---------------------
+Perform Health Checks
+---------------------
+
+The first step in troubleshooting an Airship deployment is to identify unhealthy
+services by performing health checks.
+
+Verify Peering is established
+-----------------------------
+
+::
+
+    sudo /opt/cni/bin/calicoctl node status
+
+    Calico process is running.
+    IPv4 BGP status
+    +--------------+-----------+-------+------------+-------------+
+    | PEER ADDRESS | PEER TYPE | STATE |   SINCE    |    INFO     |
+    +--------------+-----------+-------+------------+-------------+
+    | 172.29.0.2   | global    | up    | 2018-05-22 | Established |
+    | 172.29.0.3   | global    | up    | 2018-05-22 | Established |
+    +--------------+-----------+-------+------------+-------------+
+    IPv6 BGP status
+    No IPv6 peers found.
+
+Verify that **STATE** is ``up`` and **INFO** is ``Established``. However, if
+**STATE** is ``start`` and **INFO** is ``Connect``, peering has failed.
+
+For more information on Calico troubleshooting, visit the
+`Calico Documentation `__.
+
+Verify the Health of Kubernetes
+-------------------------------
+
+::
+
+    # Verify that for all nodes, STATUS is Ready.
+    #
+    # Note: After a reboot, it may take as long as 30 minutes for
+    # a node to stabilize and reach a Ready condition.
+    kubectl get nodes
+
+    # Verify that the health probes for all pods are passing.
+    # This command exposes Running pods that have no ready containers.
+    kubectl get pods --all-namespaces | grep Running | grep 0/
+
+    # Verify that all pods are in the Running or Completed state.
+    # This command exposes pods that are not running or completed.
+    kubectl get pods --all-namespaces | grep -v -E "Running|Completed"
+
+    # Look for crashed pods.
+    kubectl get pods --all-namespaces -o wide | grep Crash
+
+    # Check the health of core services.
+    kubectl get pods --all-namespaces -o wide | grep core
+    kubectl get services --all-namespaces | grep core
+
+    # Check the health of proxy services.
+    kubectl get pods --all-namespaces -o wide | grep proxy
+
+    # Get all pod details.
+    kubectl get pods --all-namespaces -o wide -w
+
+    # Look for failed jobs.
+    kubectl get jobs --all-namespaces -o wide | grep -v "1 1"
+
+Verify the Health of OpenStack
+------------------------------
+
+Check OpenStack's health by issuing the following commands at the terminal.
+To do so, you must first set up an OpenStack RC file. Details are available
+`here `__.
+
+::
+
+    # Verify Keystone by requesting a token.
+    openstack token issue
+
+    # Verify networks.
+    openstack network list
+
+    # Verify subnets.
+    openstack subnet list
+
+    # Verify VMs.
+    openstack server list
+
+    # Verify compute hypervisors.
+    openstack hypervisor list
+
+    # Verify images.
+    openstack image list
+
+Check for kube-proxy iptables NAT Issues
+----------------------------------------
+
+::
+
+    # Make sure the coredns IP addresses in iptables match the pod IPs:
+    % iptables -n -t nat -L | grep coredns
+    % kubectl -n kube-system get -o wide pod | grep coredns
+
+-----------------------
 Configuring Airship CLI
 -----------------------
 
@@ -32,6 +141,7 @@ how to get it configured on your environment.
     # Run it without arguments to get a help message.
     sudo ./treasuremap/tools/airship
 
+---------------------
 Manifests Preparation
 ---------------------
 
@@ -62,6 +172,7 @@ Example:
     sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
         render -o rendered.txt ${SITE}
 
+------------------
 Deployment Failure
 ------------------
 
@@ -122,6 +233,7 @@ by Kubernetes to satisfy replication factor.
     # Restart Armada API service.
     kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
 
+----
 Ceph
 ----
 
@@ -132,8 +244,6 @@ For more information on Ceph debugging follow an official
 Although Ceph tolerates failures of multiple OSDs, it is important to make sure
 that your Ceph cluster is healthy.
 
-Example:
-
 ::
 
     # Get a name of Ceph Monitor pod.
@@ -175,3 +285,9 @@ There are a few other commands that may be useful during the debugging:
 
     # List all Ceph block devices mounted on a specific host.
     mount | grep rbd
+
+    # Exec into the Monitor pod and check the cluster status.
+    MON_POD=$(sudo kubectl get --no-headers pods -n=ceph \
+        -l "application=ceph,component=mon" | awk '{ print $1; exit }')
+    echo $MON_POD
+    sudo kubectl exec -n ceph ${MON_POD} -- ceph -s