Fix: Avoid etcd bootstrap race

This adds a sleep to avoid a tight restart loop for etcd when running in bootstrap mode (e.g. to spin up etcd for Calico). I saw the problem while troubleshooting an environment yesterday, and I'm surprised it hasn't been reported before.

The issue manifests as repeated teardown and replacement of the bootstrapping <svc>-etcd-<hostname> pod put in place by the anchor. The log messages in the pod's etcd container say that etcd is terminating because it received SIGTERM, and a large number of pause containers are left behind, visible in `docker ps -a`.

The constant pod replacement was racing against how quickly Kubernetes would see the healthy (non-anchor) etcd pod, which is what allows the anchor to reach etcd over the Kubernetes service and check its health. A successful health check by the anchor ends the bootstrapping phase, exiting the race.

I'm confident there's a better way to clean up this section of code; however, this PS only aims to address the problematic tight loop, leaving a more rigorous improvement for later.

Change-Id: I0e3181194cfcd376967672b47a5e126103b4dfe4
2018-09-07 07:52:44 -05:00 · dcac36c8cf
commit dcac36c8cf
parent 0233c30ffb
1 changed file with 1 addition and 0 deletions
--- a/charts/etcd/templates/bin/_etcdctl_anchor.tpl
+++ b/charts/etcd/templates/bin/_etcdctl_anchor.tpl
@@ -76,6 +76,7 @@ while true; do
        ETCD_INITIAL_CLUSTER=${ETCD_NAME}=https://\$\(POD_IP\):{{ .Values.network.service_peer.target_port }}
        ETCD_INITIAL_CLUSTER_STATE=new
        create_manifest "$ETCD_INITIAL_CLUSTER" "$ETCD_INITIAL_CLUSTER_STATE" "$MANIFEST_PATH"
+        sleep {{ .Values.anchor.period }}
        continue
    fi
    {{- end }}
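
To illustrate the effect of the added line, here is a minimal, self-contained sketch of a bootstrap loop that pauses between manifest rewrites. It is not the anchor script itself: `etcd_healthy`, `HEALTHY_AFTER`, and the counters are hypothetical stand-ins, `create_manifest` is stubbed (the real one takes three arguments), and `ANCHOR_PERIOD` is assumed to play the role of `{{ .Values.anchor.period }}`.

```shell
#!/bin/sh
# Sketch only. Everything below except the loop shape is hypothetical.

ANCHOR_PERIOD="${ANCHOR_PERIOD:-1}"   # seconds; stands in for .Values.anchor.period
HEALTHY_AFTER="${HEALTHY_AFTER:-3}"   # hypothetical: check number at which etcd reports healthy

attempts=0
manifests_written=0

# Hypothetical health check: succeeds once it has been polled HEALTHY_AFTER times.
# The real anchor reaches etcd over the Kubernetes service instead.
etcd_healthy() {
    [ "$attempts" -ge "$HEALTHY_AFTER" ]
}

# Stub for the real create_manifest, which writes the bootstrap pod manifest.
create_manifest() {
    manifests_written=$((manifests_written + 1))
}

bootstrap_loop() {
    while true; do
        attempts=$((attempts + 1))
        if etcd_healthy; then
            echo "healthy after $attempts checks, $manifests_written manifest writes"
            return 0
        fi
        create_manifest
        # Without this pause the manifest is rewritten as fast as the loop can
        # spin, so the pod is replaced before it can ever be observed as healthy.
        sleep "$ANCHOR_PERIOD"
        continue
    done
}

bootstrap_loop
```

The pause bounds the rewrite rate to one per period, which gives the bootstrapped pod time to start and be seen as healthy before the next pass of the loop.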