Add etcd troubleshooting

Add section on etcd to troubleshooting guide.

etcd provides key/value pair storage and management for
many services in the COE; therefore, if it fails, many
other services will fail as well.  In some cases, the
secondary failures can be rather mysterious; for instance,
kube-apiserver will simply fail to start with no specific
error message.

This section covers the basic failure scenario and gives some
pointers for verifying the correct configuration and operation.

Partially implements:  blueprint magnum-troubleshooting-guide
Change-Id: I602a9c3b8e54796c72848cf945107a319e45b973
@@ -379,8 +379,72 @@ In this case, check the following:
etcd service
------------

The etcd service is used by many other components for key/value pair
management; therefore, if it fails to start, these other components
will not run correctly either.
Check that etcd is running on the master nodes by::

    sudo service etcd status -l

If it is running correctly, you should see that the service is
successfully deployed::

    Active: active (running) since ....

The log message should show the service being published::

    etcdserver: published {Name:10.0.0.5 ClientURLs:[http://10.0.0.5:2379]} to cluster 3451e4c04ec92893
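
If the etcdctl client is installed on the master node (it may not be
available on every image), you can also confirm that the cluster is fully
operational by listing its members and checking their health::

    # show the members that have joined the etcd cluster
    etcdctl member list

    # report the health of each member
    etcdctl cluster-health

Each master node should appear in the output and be reported as healthy.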

In some cases, the service may show as *active* but may still be stuck
in discovery mode and not fully operational.  The log message may show
something like::

    discovery: waiting for other nodes: error connecting to https://discovery.etcd.io, retrying in 8m32s

If this condition persists, check for `Cluster internet access`_.
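
As a quick check from a master node, you can test whether the public
discovery endpoint answers at all (assuming the default discovery service
is being used)::

    curl -I https://discovery.etcd.io

If this request times out, etcd will remain stuck in discovery mode until
the node can reach the service.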

If the daemon is not running, the status will show the service as failed,
something like::

    Active: failed (Result: timeout)

In this case, try restarting etcd by::

    sudo service etcd start

If etcd continues to fail, check the following:

- Check the log for etcd::

    sudo journalctl -u etcd

- etcd requires discovery, and the default discovery method is the
  public discovery service provided by etcd.io; therefore, a common
  cause of failure is that this public discovery service is not
  reachable.  Check by running on the master nodes::

    source /etc/sysconfig/heat-params
    curl $ETCD_DISCOVERY_URL

  You should receive something like::

    {"action":"get",
     "node":{"key":"/_etcd/registry/00a6b00064174c92411b0f09ad5466c6",
       "dir":true,
       "nodes":[
         {"key":"/_etcd/registry/00a6b00064174c92411b0f09ad5466c6/7d8a68781a20c0a5",
          "value":"10.0.0.5=http://10.0.0.5:2380",
          "modifiedIndex":978239406,
          "createdIndex":978239406}],
       "modifiedIndex":978237118,
       "createdIndex":978237118}
    }

  The list of master IP addresses is provided by Magnum during cluster
  deployment, so it should match the current IP addresses of the master
  nodes.

  If the public discovery service is not reachable, check the
  `Cluster internet access`_.
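
Once etcd is running and discovery has completed, you can also verify that
the key/value store is serving requests by writing and reading back a test
key (the key name below is only an example, and the etcdctl client may not
be installed on every image)::

    etcdctl set /test/hello world
    etcdctl get /test/hello
    etcdctl rm /test/hello

If these commands succeed, etcd itself is healthy and any remaining
problems are more likely in the services that depend on it.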

flannel service
---------------