
Troubleshooting Guide

This guide provides information on troubleshooting an Airship environment. Debugging any software component starts with gathering more information about the failure, so the intention of this document is not to describe specific issues you may encounter, but to provide a generic set of instructions you can follow to find the root cause of a problem.

For additional support you can contact the Airship team via IRC or the mailing list, or use the Airship bug tracker to search for and create issues.

Configuring Airship CLI

Many commands in this guide use the Airship CLI; this section describes how to configure it in your environment.

git clone https://opendev.org/airship/treasuremap
cd treasuremap/
# List available tags.
git tag --list
# Switch to the version your site is using.
git checkout {your-tag}
# Go back to the previous directory.
cd ..
# Run it without arguments to get a help message.
sudo ./treasuremap/tools/airship
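
The wrapper passes the remaining arguments to the named tool, so each tool's own help should also be available, for example:

# Show the help message for an individual tool (pegleg in this case).
sudo ./treasuremap/tools/airship pegleg --help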

Manifests Preparation

When you make any configuration changes to the manifests, there are a few commands that you can use to validate the changes without uploading them to the Airship environment.

Run the lint command for your site; it helps catch errors related to document duplication, broken references, etc.

Example:

sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
    lint {site-name}

If you create configuration overrides or make changes to substitutions, it is recommended to run the render command; it merges the layers and renders all substitutions. This allows you to see what parameters are passed to Helm as overrides for the charts' defaults.

Example:

# Save the result into the rendered.txt file.
sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
    render -o rendered.txt {site-name}
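
The rendered file can be large. One way to narrow it down is to search for the chart documents you care about; the chart name used here is only an example.

# List all chart documents in the rendered output.
grep -n 'schema: armada/Chart' rendered.txt
# Show a specific chart with some surrounding context.
grep -n -A 20 'name: ucp-armada' rendered.txt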

Deployment Failure

During a deployment, it is important to identify the specific step where it fails. There are two major deployment steps:

  1. Drydock build: deploys Operating System.
  2. Armada build: deploys Helm Charts.

After Configuring Airship CLI, set up credentials for accessing Shipyard. The password is stored in the ucp_shipyard_keystone_password secret; you can find it in the site/airship-seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml configuration file of your site.

export OS_USERNAME=shipyard
export OS_PASSWORD={shipyard_password}
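
If you only need to look up the value quickly, you can read it straight from the site repository. This assumes the standard cleartext Pegleg passphrase document layout, where the value is stored under a top-level data key.

# Print the passphrase value (assumes the secret is kept under "data:").
grep 'data:' treasuremap/site/airship-seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml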

Now you can use the following commands to access Shipyard:

# Get all actions that were executed on your environment.
sudo ./treasuremap/tools/airship shipyard get actions
# Show all the steps within the action.
sudo ./treasuremap/tools/airship shipyard describe action/{action_id}
# Get more detail on a specific step.
sudo ./treasuremap/tools/airship shipyard describe step/{action_id}/armada_build
# Print the logs from the step.
sudo ./treasuremap/tools/airship shipyard logs step/{action_id}/armada_build

After the failed step is determined, you can access the logs of a specific service (e.g., drydock-api/maas or armada-api) to get more information on the failure. Note that there may be multiple pods of a single service running; check all of them to find where the most recent logs are available.

Example of accessing Armada API logs:

# Get all pods running on the cluster and find the name of the pod you
# are interested in.
kubectl get pods -o wide --all-namespaces

# See the logs of a specific pod.
kubectl logs -n ucp -f --tail 200 armada-api-d5f757d5-6z6nv
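
If the container has crashed and been restarted, the logs of the previous instance often contain the actual error; kubectl can show them with the --previous flag (the pod name follows the example above).

# See the logs of the previously terminated container in the pod.
kubectl logs -n ucp --previous --tail 200 armada-api-d5f757d5-6z6nv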

In some cases you may want to restart a pod; there is no dedicated command for that in Kubernetes. However, you can delete the pod, and Kubernetes will recreate it to satisfy the replication factor.

# Restart Armada API service.
kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
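
After the pod is deleted, you can confirm that Kubernetes scheduled a replacement; the new pod will have a different generated suffix.

# A new armada-api pod should appear and reach the Running state.
kubectl get pods -n ucp | grep armada-api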

Ceph

Many stateful services in Airship rely on Ceph to function correctly. For more information on Ceph debugging, follow the official Ceph troubleshooting guide.

Although Ceph tolerates failures of multiple OSDs, it is important to make sure that your Ceph cluster is healthy.

Example:

# Get the name of a Ceph Monitor pod.
CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
    grep ceph-mon | sed -n 1p | sed 's|pod/||')
# Get the status of the Ceph cluster.
sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s

The cluster is in a healthy state when the health parameter is set to HEALTH_OK.
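
When the status is anything other than HEALTH_OK, ceph health detail prints a verbose explanation of every warning and error, which usually points at the affected OSDs or Placement Groups.

# Get the name of a Ceph Monitor pod.
CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
    grep ceph-mon | sed -n 1p | sed 's|pod/||')
# Show the details of every health warning or error.
sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph health detail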

When the cluster is unhealthy and some Placement Groups are reported to be in degraded or down states, determine the problem by inspecting the logs of the Ceph OSD that is down using kubectl.

# Get the name of a Ceph Monitor pod.
CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
    grep ceph-mon | sed -n 1p | sed 's|pod/||')
# List the hierarchy of OSDs in the cluster to see which OSDs are down.
sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
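
Once you know which OSD is down, its pod logs usually explain why. The grep pattern and pod name below are assumptions based on a typical Airship Ceph deployment; use the names reported by your own cluster.

# Find the pod that runs the failed OSD.
kubectl get pods -n ceph -o wide | grep ceph-osd
# Inspect the logs of that pod (the pod name is only an example).
kubectl logs -n ceph ceph-osd-default-83945928-b7lc7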

There are a few other commands that may be useful during the debugging:

# Get the name of a Ceph Monitor pod.
CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
    grep ceph-mon | sed -n 1p | sed 's|pod/||')

# Get detailed information on the status of every Placement Group.
sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump

# List allocated block devices.
sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
# See what client uses the device.
sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
    kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01

# List all Ceph block devices mounted on a specific host.
mount | grep rbd