Deleting ic-nginx-ingress-controller at restore

Once k8s comes up after the etcd restore, there is a window of around 20 seconds during which pod states have not yet been updated and are still reported as they were when the backup was taken. During this window the ic-nginx-ingress-ingress-nginx-controller-XXX pod is reported as "Ready" when it is not. In several runs during my tests, the pod was restarted 3-10 seconds after the task "Launch Armada with Helm v3" failed because it could not call the webhook.

The proposed solution is to delete the pod preemptively and wait for it to be recreated and become "Ready".

TEST PLAN
PASS restore on virtual AIO-SX (CentOS)

Closes-Bug: #1978899
Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
Change-Id: I20bec1fbbf809bfcf5d515ef55c6d47ab968dbf3
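
The selector-to-label conversion that the added task performs inline can be tried without a cluster. This is a minimal sketch, assuming the service selector is printed as a flat JSON object by `-o jsonpath="{.spec.selector}"`. The function name `selector_to_labels` and the sample selector are hypothetical and are not part of the change.

```shell
#!/bin/sh
# Convert a flat JSON selector such as {"k1":"v1","k2":"v2"}
# into a kubectl label selector such as k1=v1,k2=v2.
# Assumes keys and values contain none of the characters : { } "
selector_to_labels() {
    printf '%s' "$1" | tr -d '{}"' | tr ':' '='
}

# Hypothetical selector, for illustration only.
sample='{"app.kubernetes.io/component":"controller","app.kubernetes.io/name":"ingress-nginx"}'
selector_to_labels "$sample"
echo
```

The resulting string is what the task passes to `kubectl delete pod -n kube-system -l`. If the selector were printed in a different form, the `tr` pipeline would not yield a valid label selector.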
2022-08-09 18:34:43 -03:00 · c2e5db4305
commit c2e5db4305
parent 822540ac77
1 changed files with 7 additions and 0 deletions
--- a/playbookconfig/src/playbooks/roles/common/armada-helm/tasks/main.yml
+++ b/playbookconfig/src/playbooks/roles/common/armada-helm/tasks/main.yml
@@ -162,6 +162,13 @@
        register: nginx_webhook_service
        ignore_errors: true

+      - name: If on system restore mode, kill ingress validating webhook pod so it can be recreated
+        shell: >-
+          kubectl delete pod -n kube-system
+          -l $(kubectl get service -n kube-system {{ nginx_webhook_service.stdout }}
+          -o jsonpath="{.spec.selector}" | tr -d "{}\"" | tr ":" "=")
+        when: mode == 'restore' and armada_check.rc == 0 and nginx_webhook_service.rc == 0
+
      - name: Check ingress validating webhook service and pod status
        shell: >-
          kubectl wait pod -n kube-system