Initial implementation of Troubleshooting Guide
Add an initial implementation of Airship Troubleshooting Guide that users can use when they encounter problems with their Airship installation.

Change-Id: I9c5546cbc5f12db81cc3fcc6a3be95e8dd6f52fe
parent 2b1095e1bd
commit 5c84aec587
@ -193,6 +193,7 @@ Process Flows
   :maxdepth: 2

   authoring_and_deployment
   troubleshooting_guide
   seaworthy
   airskiff
   airsloop
177
doc/source/troubleshooting_guide.rst
Normal file
@ -0,0 +1,177 @@
Troubleshooting Guide
=====================

This guide provides information on troubleshooting an Airship
environment. Debugging any software component starts with gathering
more information about the failure, so the intention of this document
is not to describe specific issues that one can encounter, but to provide
a generic set of instructions that a user can follow to find the
root cause of a problem.

For additional support you can contact the Airship team via
`IRC or mailing list <https://www.airshipit.org/community/>`__, or
use the `Airship bug tracker <https://storyboard.openstack.org/#!/project_group/Airship>`__
to search for and create issues.

Configuring Airship CLI
-----------------------

Many commands in this guide use the Airship CLI; this section describes
how to configure it in your environment.

::

    git clone https://opendev.org/airship/treasuremap
    cd treasuremap/
    # List available tags.
    git tag --list
    # Switch to the version of your site.
    git checkout {your-tag}
    # Go back to the parent directory.
    cd ..
    # Run it without arguments to get a help message.
    sudo ./treasuremap/tools/airship

Manifests Preparation
---------------------

When you make any configuration changes to the manifests, there are a few
commands that you can use to validate the changes without uploading them
to the Airship environment.

Run the ``lint`` command for your site; it helps to catch errors related
to duplicated documents, broken references, etc.

Example:

::

    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
        lint {site-name}

If you create configuration overrides or change substitutions,
it is recommended to run the ``render`` command, which merges the layers
and renders all substitutions. This lets you find which parameters are
passed to Helm as overrides for the Charts' defaults.

Example:

::

    # Save the result into the rendered.txt file.
    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
        render -o rendered.txt {site-name}
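
To inspect the overrides of a single chart in the rendered output, a small
filter can help. The sketch below is only an illustration: ``show_document``
is a hypothetical helper, and it assumes that ``rendered.txt`` is a
multi-document YAML file whose documents are separated by ``---`` lines and
that the document name appears on a line of the form ``name: <name>``. A
nested key with the same name and value would also match.

```shell
# Hypothetical helper: print every YAML document in FILE that contains
# a "name: NAME" line. Assumes documents are separated by "---" lines.
show_document() {
    awk -v name="$1" '
        function emit() {
            if (found) printf "%s", doc
            doc = ""; found = 0
        }
        # A separator line closes the current document.
        /^---[[:space:]]*$/ { emit(); next }
        {
            doc = doc $0 "\n"
            if ($1 == "name:" && $2 == name) found = 1
        }
        END { emit() }
    ' "$2"
}

# Example usage (the chart name is a placeholder):
#   show_document {chart-name} rendered.txt
```

The helper prints whole documents so that the ``values`` section can be read
in context rather than as isolated matching lines.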

Deployment Failure
------------------

During the deployment, it is important to identify the specific step
where it fails. There are two major deployment steps:

1. **Drydock build**: deploys the operating system.
2. **Armada build**: deploys Helm Charts.

After `Configuring Airship CLI`_, set up credentials for accessing
Shipyard. The password is stored in the ``ucp_shipyard_keystone_password``
secret, which you can find in the
``site/airship-seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml``
configuration file of your site.

::

    export OS_USERNAME=shipyard
    export OS_PASSWORD={shipyard_password}
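
Instead of copying the password by hand, it can be read from the passphrase
file. This is a minimal sketch, not part of the Airship tooling:
``read_passphrase`` is a hypothetical helper, and it assumes that the file is
stored in cleartext and keeps the password on a single unquoted top-level line
of the form ``data: <password>``. It will not produce a usable value for an
encrypted or quoted passphrase.

```shell
# Hypothetical helper: print the value of the first top-level "data:" line.
# Assumes a cleartext, unquoted, single-line value.
read_passphrase() {
    sed -n 's/^data:[[:space:]]*//p' "$1" | sed -n 1p
}

# Example usage, assuming the repository was cloned into ./treasuremap:
#   export OS_USERNAME=shipyard
#   export OS_PASSWORD="$(read_passphrase \
#       treasuremap/site/airship-seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml)"
```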

Now you can use the following commands to access Shipyard:

::

    # Get all actions that were executed on your environment.
    sudo ./treasuremap/tools/airship shipyard get actions
    # Show all the steps within the action.
    sudo ./treasuremap/tools/airship shipyard describe action/{action_id}
    # Get more details on a step.
    sudo ./treasuremap/tools/airship shipyard describe step/{action_id}/armada_build
    # Print the logs from the step.
    sudo ./treasuremap/tools/airship shipyard logs step/{action_id}/armada_build
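
When an action has many steps, filtering the ``describe action`` output can
point to the failing one faster. The sketch below rests on an assumption about
the output layout: each step is expected on its own row as
``step/<action_id>/<step_name> <index> <state>``. ``failed_steps`` is a
hypothetical helper; adjust the column numbers if your Shipyard version prints
the table differently.

```shell
# Hypothetical helper: read "shipyard describe action/..." output on stdin
# and print the identifiers of steps whose third column is "failed".
# Assumed row layout: step/<action_id>/<step_name> <index> <state>
failed_steps() {
    awk '$1 ~ /^step\// && $3 == "failed" { print $1 }'
}

# Example usage:
#   sudo ./treasuremap/tools/airship shipyard describe action/{action_id} | failed_steps
```

Each printed identifier can then be passed to ``shipyard describe`` or
``shipyard logs`` as shown above.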

After the failed step is determined, you can access the logs of a specific
service (e.g., drydock-api/maas or armada-api) to get more information
on the failure. Note that there may be multiple pods of a single service
running; you need to check all of them to find where the most recent
logs are available.

Example of accessing Armada API logs:

::

    # Get all pods running on the cluster and find the name of the pod you are
    # interested in.
    kubectl get pods -o wide --all-namespaces

    # See the logs of a specific pod.
    kubectl logs -n ucp -f --tail 200 armada-api-d5f757d5-6z6nv
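
Because several pods may serve one service, it can be convenient to print the
tail of each pod's log in one pass. The following is a sketch under the
assumption that all pods of a service share a name prefix such as
``armada-api-``; ``pods_of_service`` and ``tail_service_logs`` are
hypothetical helper names, not Airship or Kubernetes commands.

```shell
# Hypothetical helper: keep only pod names that start with "<service>-".
# Input: one "pod/<name>" per line, as printed by "kubectl get pods -o name".
pods_of_service() {
    sed -n "s|^pod/\($1-.*\)\$|\1|p"
}

# Hypothetical helper: print the last lines of the log of every matching pod.
# Arguments: namespace, service name prefix, number of lines (default 50).
tail_service_logs() {
    namespace=$1 service=$2 lines=${3:-50}
    kubectl get pods -n "$namespace" -o name | pods_of_service "$service" |
    while read -r pod; do
        echo "=== $pod ==="
        kubectl logs -n "$namespace" --tail "$lines" "$pod"
    done
}

# Example usage:
#   tail_service_logs ucp armada-api 200
```

The ``-f`` flag is left out on purpose, since following one pod's log would
block the loop before the remaining pods are shown.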

In some cases you may want to restart a pod. There is no dedicated command for
that in Kubernetes; however, you can delete the pod, and Kubernetes will
recreate it to satisfy the desired number of replicas.

::

    # Restart Armada API service.
    kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv

Ceph
----

Many stateful services in Airship rely on Ceph to function correctly.
For more information on Ceph debugging, follow the official
`Ceph debugging guide <http://docs.ceph.com/docs/mimic/rados/troubleshooting/log-and-debug/>`__.

Although Ceph tolerates failures of multiple OSDs, it is important
to make sure that your Ceph cluster is healthy.

Example:

::

    # Get the name of a Ceph Monitor pod.
    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
        grep ceph-mon | sed -n 1p | sed 's|pod/||')
    # Get the status of the Ceph cluster.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s

The cluster is in a healthy state when the ``health`` parameter is set to ``HEALTH_OK``.
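
For scripted checks, the ``health`` value can be extracted from the status
output. This is a minimal sketch that assumes the status text contains a line
of the form ``health: HEALTH_OK``; ``ceph_health`` and ``ceph_is_healthy`` are
hypothetical helper names.

```shell
# Hypothetical helper: print the value of the "health:" line from
# "ceph -s" output read on stdin.
ceph_health() {
    awk '$1 == "health:" { print $2; exit }'
}

# Hypothetical helper: succeed only when the reported health is HEALTH_OK.
ceph_is_healthy() {
    [ "$(ceph_health)" = "HEALTH_OK" ]
}

# Example usage (no -it, because the output is piped rather than shown
# on a terminal):
#   sudo kubectl exec -n ceph ${CEPH_MON} -- ceph -s | ceph_is_healthy \
#       && echo "Ceph is healthy"
```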

When the cluster is unhealthy and some Placement Groups are reported to be in
degraded or down states, determine the problem by using ``kubectl`` to inspect
the logs of the Ceph OSD that is down.

::

    # Get the name of a Ceph Monitor pod.
    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
        grep ceph-mon | sed -n 1p | sed 's|pod/||')
    # List the hierarchy of OSDs in the cluster to see which OSDs are down.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
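
On a large cluster the tree can be long, so a filter that prints only the OSDs
marked ``down`` may be useful. The sketch assumes that each OSD row contains
the OSD name (for example ``osd.1``) immediately followed by its status
column; ``down_osds`` is a hypothetical helper name.

```shell
# Hypothetical helper: read "ceph osd tree" output on stdin and print the
# names of OSDs whose status column is "down".
# Assumes the status follows the OSD name, e.g. "... osd.1 down ...".
down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Example usage (no -it, because the output is piped):
#   sudo kubectl exec -n ceph ${CEPH_MON} -- ceph osd tree | down_osds
```

Scanning every field instead of a fixed column keeps the filter working
whether or not the output includes a device class column.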

There are a few other commands that may be useful during debugging:

::

    # Get the name of a Ceph Monitor pod.
    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
        grep ceph-mon | sed -n 1p | sed 's|pod/||')

    # Get detailed information on the status of every Placement Group.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump

    # List allocated block devices.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
    # See which client uses the device.
    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
        kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01

    # List all Ceph block devices mounted on a specific host.
    mount | grep rbd