Avoid calico-etcd crashloop
Sometimes the calico-etcd pod crashloops while it is being bootstrapped; this occurs intermittently in the gates. Best guess: when the etcd-anchor pod initially creates the etcd static manifest, it waits for the anchor period (15 seconds) for the etcd pod to become ready. If the pod is not ready, the next iteration through the loop recreates an identical manifest. Because the manifest is a new file, kubelet terminates the original container and starts a new one. Kubelet and the container runtime then get out of sync, kubelet cannot determine the correct container id, and the pod ends up crashlooping forever. Manually removing and re-adding the manifest file does not clear the condition, although restarting kubelet does.

This fix only writes the updated manifest when its contents have actually changed, which should prevent the condition from occurring.

Change-Id: I4b6b1bf17fd8f0b36d24a741779505b38dba349f
parent 77c762463b
commit d161528ae8
```diff
@@ -36,7 +36,7 @@ create_manifest () {
 cp -f /anchor-etcd/{{ .Values.service.name }}.yaml $WIP
 sed -i -e 's#_ETCD_INITIAL_CLUSTER_STATE_#'$2'#g' $WIP
 sed -i -e 's#_ETCD_INITIAL_CLUSTER_#'$1'#g' $WIP
-mv -f "$WIP" "$3"
+sync_file "$WIP" "$3"
 }

 sync_configuration () {
```
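The body of the `sync_file` helper is not shown in this hunk. A minimal sketch of what a content-comparing replacement for `mv -f` could look like, assuming POSIX `cmp` is available in the anchor image (the implementation details here are illustrative, not the commit's actual code):

```shell
#!/bin/sh
# sync_file SRC DEST (hypothetical sketch): move SRC over DEST only if the
# contents differ, so kubelet never sees a spurious rewrite of an identical
# static-pod manifest and never restarts the container unnecessarily.
sync_file () {
  src="$1"
  dest="$2"
  if [ -e "$dest" ] && cmp -s "$src" "$dest"; then
    # Identical content: discard the work-in-progress file, leave DEST alone.
    rm -f "$src"
  else
    # Content changed (or DEST missing): move the new manifest into place.
    mv -f "$src" "$dest"
  fi
}
```

With this shape, each anchor-loop iteration that regenerates an identical manifest becomes a no-op on disk, so the file's modification is only observed by kubelet when the cluster-state substitutions actually produce different content.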