=====================
Troubleshooting Guide
=====================

This guide provides information on troubleshooting an Airship
environment. Debugging of any software component starts with gathering
more information about the failure, so the intention of this document
is not to describe specific issues that one can encounter, but to provide
a generic set of instructions that a user can follow to find the
root cause of a problem.

For additional support you can contact the Airship team via
`IRC or mailing list <https://www.airshipit.org/community/>`__, or
use the `Airship bug tracker <https://storyboard.openstack.org/#!/project_group/Airship>`__
to search for and create issues.

.. contents:: Table of Contents
   :depth: 3

---------------------
Perform Health Checks
---------------------

The first step in troubleshooting an Airship deployment is to identify unhealthy
services by performing health checks.

Verify Peering is Established
-----------------------------

::

    sudo /opt/cni/bin/calicoctl node status

    Calico process is running.
    IPv4 BGP status
    +--------------+-----------+-------+------------+-------------+
    | PEER ADDRESS | PEER TYPE | STATE |   SINCE    |    INFO     |
    +--------------+-----------+-------+------------+-------------+
    | 172.29.0.2   | global    | up    | 2018-05-22 | Established |
    | 172.29.0.3   | global    | up    | 2018-05-22 | Established |
    +--------------+-----------+-------+------------+-------------+
    IPv6 BGP status
    No IPv6 peers found.

Verify that **STATE** is ``up`` and **INFO** is ``Established``. If, instead,
**STATE** is ``start`` and **INFO** is ``Connect``, peering has failed.
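
When peering has failed, it can help to compare the configured BGP peers
with what the node actually reaches. A minimal sketch, assuming ``calicoctl``
is installed at the same path as above and using the peer address from the
example output:

::

    # List the BGP peers configured in Calico.
    sudo /opt/cni/bin/calicoctl get bgpPeer -o wide

    # Verify the peer address is reachable from the node at all.
    ping -c 3 172.29.0.2

    # BGP uses TCP port 179; verify it is not blocked.
    nc -zv 172.29.0.2 179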

For more information on Calico troubleshooting, visit the
`Calico Documentation <https://docs.projectcalico.org/introduction/>`__.

Verify the Health of Kubernetes
-------------------------------

::

    # Verify that for all nodes, STATE is Ready.
    #
    # Note: After a reboot, it may take as long as 30 minutes for
    # a node to stabilize and reach a Ready condition.
    kubectl get nodes

    # Verify that liveness probes for all pods are working.
    # This command exposes pods whose liveness probe is failing.
    kubectl get pods --all-namespaces | grep Running | grep 0/

    # Verify that all pods are in the Running or Completed state.
    # This command exposes pods that are neither running nor completed.
    kubectl get pods --all-namespaces | grep -v -e Running -e Completed

    # Look for crashed pods.
    kubectl get pods --all-namespaces -o wide | grep Crash

    # Check the health of core services.
    kubectl get pods --all-namespaces -o wide | grep core
    kubectl get services --all-namespaces | grep core

    # Check the health of proxy services.
    kubectl get pods --all-namespaces -o wide | grep proxy

    # Get all pod details.
    kubectl get pods --all-namespaces -o wide -w

    # Look for failed jobs.
    kubectl get jobs --all-namespaces -o wide | grep -v "1/1"
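
If a pod is crash-looping or stuck, the standard Kubernetes drill-down
applies. A generic sketch; the namespace and pod name are placeholders:

::

    # Show events and state transitions for a problem pod.
    kubectl describe pod -n {namespace} {pod-name}

    # Print the logs of the previous (crashed) container instance.
    kubectl logs -n {namespace} {pod-name} --previous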

Verify the Health of OpenStack
------------------------------

Check OpenStack's health by issuing the following commands at the terminal.
To do so, you must have sourced an OpenStack RC file; for details, see the
`OpenStack documentation <https://docs.openstack.org/mitaka/cli-reference/common/cli_set_environment_variables_using_openstack_rc.html#download-and-source-the-openstack-rc-file>`__.
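
A minimal sketch of what sourcing an RC file looks like; the file name is a
site-specific placeholder:

::

    # Source the RC file to export the OS_* variables used by the CLI.
    source openrc.sh

    # Confirm the environment is set.
    env | grep OS_AUTH_URL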

::

    # Verify Keystone by requesting a token.
    openstack token issue

    # Verify networks.
    openstack network list

    # Verify subnets.
    openstack subnet list

    # Verify VMs.
    openstack server list

    # Verify compute hypervisors.
    openstack hypervisor list

    # Verify images.
    openstack image list

Check for kube-proxy iptables NAT Issues
----------------------------------------

::

    # Check the iptables NAT table and make sure the IP addresses in the
    # coredns rules match the IP addresses of the coredns pods:
    iptables -n -t nat -L | grep coredns
    kubectl -n kube-system get -o wide pod | grep coredns
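
If the addresses have drifted apart (for example, after coredns pods were
rescheduled), kube-proxy is holding stale NAT rules. A common remedy, offered
here as an assumption rather than an official procedure, is to restart the
kube-proxy pod on the affected node so it re-syncs its rules:

::

    # Find the kube-proxy pod for the affected node.
    kubectl get pods -n kube-system -o wide | grep proxy

    # Delete it; Kubernetes recreates it and the NAT rules are rebuilt.
    kubectl delete pod -n kube-system {kube-proxy-pod-name}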

-----------------------
Configuring Airship CLI
-----------------------

Many commands in this guide use the Airship CLI; this section describes
how to configure it on your environment.

::

    git clone https://opendev.org/airship/treasuremap
    cd treasuremap/
    # List available tags.
    git tag --list
    # Switch to the version your site is using.
    git checkout {your-tag}
    # Go back to the previous directory.
    cd ..
    # Run it without arguments to get a help message.
    sudo ./treasuremap/tools/airship
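
As a quick sanity check, you can ask one of the wrapped tools for its help
text; the exact subcommands available depend on the tag you checked out:

::

    # Should print the Shipyard CLI usage if the CLI is set up correctly.
    sudo ./treasuremap/tools/airship shipyard --help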

---------------------
Manifests Preparation
---------------------

When you make any configuration changes to the manifests, there are a few
commands that you can use to validate the changes without uploading them
to the Airship environment.

Run the ``lint`` command for your site; it helps to catch errors related
to document duplication, broken references, etc.

Example:

::

    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
        lint {site-name}

If you create configuration overrides or make changes to substitutions,
it is recommended to run the ``render`` command; it merges the layers
and renders all substitutions. This allows finding what parameters are
passed to Helm as overrides for the charts' defaults.

Example:

::

    # Saves the result into the rendered.txt file.
    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
        render -o rendered.txt ${SITE}
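
The rendered output is plain YAML, so ordinary text tools are enough to
inspect it. For example, to check what value a particular override ends up
with (the parameter name here is only illustrative):

::

    # Find a rendered parameter and a few lines of its context.
    grep -n -A 3 "ucp_shipyard_keystone_password" rendered.txt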

------------------
Deployment Failure
------------------

During the deployment, it is important to identify the specific step
where the failure occurs. There are two major deployment steps:

1. **Drydock build**: deploys the operating system.
2. **Armada build**: deploys the Helm charts.

After `Configuring Airship CLI`_, set up credentials for accessing
Shipyard; the password is stored in the ``ucp_shipyard_keystone_password``
secret, which you can find in the
``site/seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml``
configuration file of your site.

::

    export OS_USERNAME=shipyard
    export OS_PASSWORD={shipyard_password}
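
A small convenience sketch for pulling the password out of the secrets file;
it assumes the passphrase document keeps the value under a top-level ``data``
key, so verify against your site's file before relying on it:

::

    # Extract the value of the "data" field from the passphrase document.
    awk '/^data:/ { print $2 }' \
        site/seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml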

Now you can use the following commands to access Shipyard:

::

    # Get all actions that were executed on your environment.
    sudo ./treasuremap/tools/airship shipyard get actions
    # Show all the steps within an action.
    sudo ./treasuremap/tools/airship shipyard describe action/{action_id}
    # Get more details on a specific step.
    sudo ./treasuremap/tools/airship shipyard describe step/{action_id}/armada_build
    # Print the logs from a step.
    sudo ./treasuremap/tools/airship shipyard logs step/{action_id}/armada_build

After the failed step is determined, you can access the logs of a specific
service (e.g., drydock-api/maas or armada-api) to get more information
on the failure. Note that there may be multiple pods of a single service
running; you need to check all of them to find where the most recent
logs are available.

Example of accessing Armada API logs:

::

    # Get all pods running on the cluster and find the name of the pod
    # you are interested in.
    kubectl get pods -o wide --all-namespaces

    # See the logs of a specific pod.
    kubectl logs -n ucp -f --tail 200 armada-api-d5f757d5-6z6nv

In some cases you may want to restart a pod; there is no dedicated command
for that in Kubernetes. However, you can delete the pod, and it will be
recreated by Kubernetes to satisfy the replication factor.

::

    # Restart the Armada API service.
    kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
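
When a service runs several replicas, deleting pods one by one is tedious.
A label selector can target all of them at once; the labels below follow the
usual OpenStack-Helm convention, so treat them as an assumption and check
the actual labels with ``kubectl get pods --show-labels`` first:

::

    # Restart every Armada API pod in one command (labels are assumed).
    kubectl delete pod -n ucp -l application=armada,component=api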

----
Ceph
----

Many stateful services in Airship rely on Ceph to function correctly.
For more information on Ceph debugging, follow the official
`Ceph debugging guide <http://docs.ceph.com/docs/mimic/rados/troubleshooting/log-and-debug/>`__.

Although Ceph tolerates failures of multiple OSDs, it is important
to make sure that your Ceph cluster is healthy.

::

    # Get the name of a Ceph Monitor pod.
    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
        grep ceph-mon | sed -n 1p | sed 's|pod/||')
    # Get the status of the Ceph cluster.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s

The cluster is in a healthy state when the ``health`` parameter is set to
``HEALTH_OK``.

When the cluster is unhealthy, and some Placement Groups are reported to be
in degraded or down states, determine the problem by inspecting the logs of
the Ceph OSD that is down using ``kubectl``.

::

    # Get the name of a Ceph Monitor pod.
    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
        grep ceph-mon | sed -n 1p | sed 's|pod/||')
    # List the hierarchy of OSDs in the cluster to see which OSDs are down.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
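
Once ``ceph osd tree`` shows which OSD is down, you can read that OSD's logs
directly from its pod. The grep pattern below is an assumption about OSD pod
naming in your deployment; adjust it to match your cluster:

::

    # Find the pod that backs the down OSD (pod naming may differ).
    kubectl get pods -n ceph -o wide | grep ceph-osd

    # Inspect the logs of the suspect OSD pod.
    kubectl logs -n ceph {ceph-osd-pod-name}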

There are a few other commands that may be useful during debugging:

::

    # Get the name of a Ceph Monitor pod.
    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
        grep ceph-mon | sed -n 1p | sed 's|pod/||')

    # Get detailed information on the status of every Placement Group.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump

    # List allocated block devices.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
    # See which client uses a device.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
        kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01

    # List all Ceph block devices mounted on a specific host.
    mount | grep rbd

    # Exec into the Monitor pod.
    MON_POD=$(sudo kubectl get --no-headers pods -n=ceph \
        -l="application=ceph,component=mon" | awk '{ print $1; exit }')
    echo $MON_POD
    sudo kubectl exec -n ceph ${MON_POD} -- ceph -s