Initial implementation of Troubleshooting Guide

Add an initial implementation of Airship Troubleshooting Guide that users can use when they encounter problems with their Airship installation. Change-Id: I9c5546cbc5f12db81cc3fcc6a3be95e8dd6f52fe
2019-04-18 20:12:50 +00:00 · 2019-04-18 20:12:50 +00:00 · 5c84aec587
commit 5c84aec587
parent 2b1095e1bd
2 changed files with 178 additions and 0 deletions
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -193,6 +193,7 @@ Process Flows
   :maxdepth: 2

   authoring_and_deployment
+   troubleshooting_guide
   seaworthy
   airskiff
   airsloop
--- a/doc/source/troubleshooting_guide.rst
+++ b/doc/source/troubleshooting_guide.rst
@@ -0,0 +1,177 @@
+Troubleshooting Guide
+=====================
+
+This guide helps you troubleshoot an Airship environment. Debugging any
+software component starts with gathering more information about the
+failure. This document therefore does not describe specific issues that
+you may encounter; instead, it provides a generic set of steps that you
+can follow to find the root cause of a problem.
+
+For additional support, contact the Airship team via
+`IRC or mailing list <https://www.airshipit.org/community/>`__, or use the
+`Airship bug tracker <https://storyboard.openstack.org/#!/project_group/Airship>`__
+to search for and create issues.
+
+Configuring Airship CLI
+-----------------------
+
+Many commands in this guide use the Airship CLI. This section describes
+how to configure it in your environment.
+
+::
+
+    git clone https://opendev.org/airship/treasuremap
+    cd treasuremap/
+    # List available tags.
+    git tag --list
+    # Switch to the version of your site.
+    git checkout {your-tag}
+    # Go back to the parent directory.
+    cd ..
+    # Run it without arguments to get a help message.
+    sudo ./treasuremap/tools/airship
+
+Manifests Preparation
+---------------------
+
+When you make configuration changes to the manifests, there are a few
+commands that you can use to validate the changes without uploading them
+to the Airship environment.

+Run the ``lint`` command for your site; it helps catch errors such as
+duplicate documents and broken references.
+
+Example:
+
+::
+
+    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
+        lint {site-name}
+
+If you create configuration overrides or change substitutions, it is
+recommended to run the ``render`` command. This command merges the layers
+and renders all substitutions, which lets you see what parameters are
+passed to Helm as overrides for the Charts' defaults.
+
+Example:
+
+::
+
+    # Save the result to the rendered.txt file.
+    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
+        render -o rendered.txt {site-name}
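
The two checks above can be chained so that rendering only runs when linting succeeds. The following is a minimal sketch, not part of Airship: ``validate_site`` and the ``AIRSHIP`` variable are hypothetical names introduced here, and the repository path assumes the ``treasuremap/`` clone from `Configuring Airship CLI`_.

```shell
# Hypothetical helper, not shipped with Airship.
# AIRSHIP may be overridden; the default matches the commands in this guide.
AIRSHIP="${AIRSHIP:-sudo ./treasuremap/tools/airship}"

validate_site() {
    vs_site="$1"
    vs_out="${2:-rendered.txt}"
    if [ -z "$vs_site" ]; then
        echo "usage: validate_site {site-name} [output-file]" >&2
        return 2
    fi
    # Lint first; stop here if lint reports errors.
    $AIRSHIP pegleg site -r treasuremap/ lint "$vs_site" || return 1
    # Render only after a clean lint run.
    $AIRSHIP pegleg site -r treasuremap/ render -o "$vs_out" "$vs_site"
}

# Usage: validate_site {site-name}
```

``$AIRSHIP`` is left unquoted on purpose so that the default value splits into ``sudo`` and the script path; this relies on POSIX ``sh`` or ``bash`` word splitting.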
+
+Deployment Failure
+------------------
+
+During the deployment, it is important to identify the specific step
+where it fails. There are two major deployment steps:

+1. **Drydock build**: deploys the operating system.
+2. **Armada build**: deploys Helm Charts.
+
+After `Configuring Airship CLI`_, set up credentials for accessing
+Shipyard. The password is stored in the ``ucp_shipyard_keystone_password``
+secret, which you can find in the
+``site/airship-seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml``
+configuration file of your site.
+
+::
+
+    export OS_USERNAME=shipyard
+    export OS_PASSWORD={shipyard_password}
+
+Now you can use the following commands to access Shipyard:
+
+::
+
+    # Get all actions that were executed on your environment.
+    sudo ./treasuremap/tools/airship shipyard get actions
+    # Show all the steps within the action.
+    sudo ./treasuremap/tools/airship shipyard describe action/{action_id}
+    # Get more details on the step.
+    sudo ./treasuremap/tools/airship shipyard describe step/{action_id}/armada_build
+    # Print the logs from the step.
+    sudo ./treasuremap/tools/airship shipyard logs step/{action_id}/armada_build
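
When you inspect several steps in a row, the ``describe`` and ``logs`` calls above can be wrapped in a small function. This is only a sketch: ``inspect_step`` and the ``AIRSHIP`` variable are hypothetical names, and the step identifier follows the ``step/{action_id}/{step}`` form used in the commands above.

```shell
# Hypothetical helper, not shipped with Airship.
AIRSHIP="${AIRSHIP:-sudo ./treasuremap/tools/airship}"

inspect_step() {
    # $1 is the action ID, $2 is the step name (for example armada_build).
    is_id="step/$1/$2"
    echo "==> describe ${is_id}"
    $AIRSHIP shipyard describe "$is_id"
    echo "==> logs ${is_id}"
    $AIRSHIP shipyard logs "$is_id"
}

# Usage: inspect_step {action_id} armada_build
```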
+
+
+After you determine the failed step, you can access the logs of a specific
+service (e.g., drydock-api/maas or armada-api) to get more information
+on the failure. Note that multiple pods of a single service may be
+running; check all of them to find where the most recent logs are
+available.
+
+Example of accessing Armada API logs:
+
+::
+
+   # Get all pods running on the cluster and find the name of the pod you
+   # are interested in.
+   kubectl get pods -o wide --all-namespaces

+   # See the logs of a specific pod.
+   kubectl logs -n ucp -f --tail 200 armada-api-d5f757d5-6z6nv
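
Because a service may run as several pods, it can help to print the last few log lines of every matching pod in one pass. The sketch below assumes the ``ucp`` namespace and the ``armada-api`` name from the example above; ``tail_service_logs`` and the ``KUBECTL`` variable are hypothetical names. Pods with more than one container may additionally need the ``-c`` option, which this sketch does not handle.

```shell
# Hypothetical helper; KUBECTL may be overridden (for example "sudo kubectl").
KUBECTL="${KUBECTL:-kubectl}"

tail_service_logs() {
    tl_ns="$1"
    tl_service="$2"
    tl_lines="${3:-20}"
    # List pods as pod/<name> and keep those whose name contains the service.
    $KUBECTL get pods -n "$tl_ns" -o name | grep -- "$tl_service" |
    while read -r tl_pod; do
        echo "==> ${tl_pod} <=="
        # --timestamps prefixes each line with its time, which makes it
        # easier to see which pod logged most recently.
        $KUBECTL logs -n "$tl_ns" --tail="$tl_lines" --timestamps "$tl_pod" </dev/null
    done
}

# Usage: tail_service_logs ucp armada-api 50
```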
+
+In some cases you may want to restart a pod. Kubernetes has no dedicated
+command for that. However, you can delete the pod, and Kubernetes restarts
+it to satisfy the desired number of replicas.
+
+::
+
+    # Restart Armada API service.
+    kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
+
+Ceph
+----
+
+Many stateful services in Airship rely on Ceph to function correctly.
+For more information on Ceph debugging, see the official
+`Ceph debugging guide <http://docs.ceph.com/docs/mimic/rados/troubleshooting/log-and-debug/>`__.
+
+Although Ceph can tolerate the failure of multiple OSDs, it is important
+to make sure that your Ceph cluster is healthy.
+
+Example:
+
+::
+
+    # Get the name of a Ceph Monitor pod.
+    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
+        grep ceph-mon | sed -n 1p | sed 's|pod/||')
+    # Get the status of the Ceph cluster.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
+
+The cluster is in a healthy state when the ``health`` parameter is set to
+``HEALTH_OK``.
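
For scripted checks, the ``ceph health`` command prints only the health status line, which is easier to test than the full ``ceph -s`` report. The function below is a minimal sketch with a hypothetical name, ``ceph_is_healthy``; it reads that output on standard input and also strips the carriage returns that appear when ``kubectl exec`` allocates a terminal with ``-t``.

```shell
# Hypothetical helper: succeeds only if the first word on stdin is HEALTH_OK.
ceph_is_healthy() {
    ch_status=$(tr -d '\r' | head -n 1 | cut -d ' ' -f 1)
    [ "$ch_status" = "HEALTH_OK" ]
}

# Usage (CEPH_MON as set in the example above):
#   sudo kubectl exec -n ceph ${CEPH_MON} -- ceph health | ceph_is_healthy \
#       && echo "healthy" || echo "needs attention"
```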
+
+When the cluster is unhealthy and some Placement Groups are reported to be
+in degraded or down states, determine the problem by using ``kubectl`` to
+inspect the logs of the Ceph OSD that is down.
+
+::
+
+    # Get the name of a Ceph Monitor pod.
+    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
+        grep ceph-mon | sed -n 1p | sed 's|pod/||')
+    # List the hierarchy of OSDs in the cluster to see which OSDs are down.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
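
On a large cluster the tree can be long, so a small filter helps to pick out the OSDs that are down. The sketch below assumes the plain-text layout of ``ceph osd tree``, in which each OSD row contains a name of the form ``osd.N`` immediately followed by its status; ``list_down_osds`` is a hypothetical name.

```shell
# Hypothetical helper: reads `ceph osd tree` output on stdin and prints the
# name of every OSD whose status column is "down".
list_down_osds() {
    tr -d '\r' | awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Usage (CEPH_MON as set in the example above):
#   sudo kubectl exec -n ceph ${CEPH_MON} -- ceph osd tree | list_down_osds
```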
+
+A few other commands may be useful during debugging:
+
+::
+
+    # Get the name of a Ceph Monitor pod.
+    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
+        grep ceph-mon | sed -n 1p | sed 's|pod/||')

+    # Get detailed information on the status of every Placement Group.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump

+    # List allocated block devices.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
+    # See which client uses the device.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
+        kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01
+
+    # List all Ceph block devices mounted on a specific host.
+    mount | grep rbd