This script was set to always restart the local sriov device plugin pod
which could result in sriov pods not starting properly.
Originally, this sequence of commands would not work properly if the
device plugin was running:

    kubectl delete pods -n kube-system --selector=app=sriovdp \
        --field-selector=spec.nodeName=${HOST} --wait=false
    kubectl wait pods -n kube-system --selector=app=sriovdp \
        --field-selector=spec.nodeName=${HOST} --for=condition=Ready \
        --timeout=360s
Result when the device plugin is running:

    pod "kube-sriov-device-plugin-amd64-rbjpw" deleted
    pod/kube-sriov-device-plugin-amd64-rbjpw condition met
The wait command succeeds against the deleted pod and the script
continues. It then deletes labeled pods without having confirmed that
the device plugin is running, which can result in sriov pods not
starting properly.
Ensuring that we only restart the device plugin pod when it is not
already running prevents the wait condition from passing immediately.
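A minimal sketch of the intended guard, assuming the same selectors as
above and a hypothetical _sriovdp_is_running helper (the script's real
check may differ):

    # Hypothetical helper: succeed only if a Ready sriovdp pod exists
    # on this node.
    function _sriovdp_is_running {
        kubectl get pods -n kube-system --selector=app=sriovdp \
            --field-selector=spec.nodeName=${HOST} \
            -o jsonpath='{.items[*].status.containerStatuses[*].ready}' \
            | grep -q true
    }

    # Only delete the device plugin pod when it is not already running,
    # so the subsequent wait cannot pass against a deleted pod.
    if ! _sriovdp_is_running; then
        kubectl delete pods -n kube-system --selector=app=sriovdp \
            --field-selector=spec.nodeName=${HOST} --wait=false
        kubectl wait pods -n kube-system --selector=app=sriovdp \
            --field-selector=spec.nodeName=${HOST} \
            --for=condition=Ready --timeout=360s
    fi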
Closes-Bug: 1928965
Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: I1cc576b26a4bba4eba4a088d33f918bb07ef3b0d
This change modifies the k8s-pod-recovery service to wait for the
kube-sriov-device-plugin-amd64 pod on the local node to become
available before proceeding with the recovery of
restart-on-reboot=true labeled pods.
This is required because of a race condition where pods marked for
recovery would be restarted before the device plugin was ready and
the pods would then be stuck in "ContainerCreating".
The fix in this commit uses the kubectl wait ...
command to wait for the daemonset to be available. A timeout of 360s
has been set for this command in order to allow enough time on busy
systems for the device-plugin pod to come up. The wait command
completes as soon as the pod is ready.
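As a rough illustration only (the selector, node filter and the
_restart_labeled_pods helper are assumptions, not the script's actual
code), the gating could look like:

    # Wait for the local device plugin pod to become Ready before
    # recovering restart-on-reboot labeled pods; give up after 360s.
    if kubectl wait pods -n kube-system --selector=app=sriovdp \
        --field-selector=spec.nodeName=${HOST} \
        --for=condition=Ready --timeout=360s; then
        _restart_labeled_pods   # hypothetical recovery step
    else
        echo "Timed out waiting for the sriov device plugin pod"
    fi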
Closes-Bug: 1928965
Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: Ie1937cf0612827b28762049e2dc440e55726d4f3
This reverts commit 8abcbf6fb1951b25e9964933558b75b9aff88135.
Reason for revert:
After performing a backup and restore on an AIO-SX system, SRIOV pods do
not return to a running state and are instead stuck in "container
creating". The workaround for this is to restart SRIOV pods when the
system unlocks.
Reverting this commit to allow users to label SRIOV pods and have them
restarted by k8s-pod-recovery, so that labelled pods are running again
after backup and restore is completed.
This change has been tested by performing backup and restore on an
AIO-SX system. SRIOV pods now come up correctly when labelled with
restart-on-reboot=true.
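For illustration only (the pod and namespace names below are
placeholders), labelling could be done with:

    # Label an SRIOV pod so k8s-pod-recovery restarts it on unlock.
    kubectl label pods -n example-ns example-sriov-pod \
        restart-on-reboot=true --overwrite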
Closes-Bug: 1928965
Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: I9c520c0a47aabca7b96e50adf0f71742f4199c2f
Update the k8s pod recovery service to include the armada namespace
so that armada pods stuck in an unknown state after a host
lock/unlock or reboot can be recovered by the service.
Change-Id: Iacd92637a9b4fcaf4c0076e922e1bd739f69a584
Closes-Bug: 1928018
Signed-off-by: Angie Wang <angie.wang@windriver.com>
Labeling pods as "restart-on-reboot" was a workaround for kubernetes
being restarted by the worker manifest. Now that AIO runs a single
manifest that starts kubernetes only once, this operation is no
longer needed.
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/785736
Change-Id: I0d6c549199559b2bc19d8edff52f64ea0b08b50d
Closes-Bug: 1918139
Signed-off-by: Bin Qian <bin.qian@windriver.com>
At startup, there might be pods that are left in unknown states.
The k8s-pod-recovery service takes care of
recovering these unknown pods in specific namespaces.
To extend this to custom apps that are not part of starlingx,
we modify the service to look in the /etc/k8s-post-recovery.d
directory for conf files. Any app that needs to be recovered by this
service has to create a conf file, e.g. app-1 will create
/etc/k8s-post-recovery.d/APP_1.conf containing the following:

    namespace=app-1-namespace
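A minimal sketch of how the service could collect these namespaces,
assuming a single namespace= entry per conf file (the actual parsing
in the script may differ):

    # Gather extra namespaces to recover from drop-in conf files.
    EXTRA_NAMESPACES=""
    for conf in /etc/k8s-post-recovery.d/*.conf; do
        [ -e "$conf" ] || continue
        ns=$(grep '^namespace=' "$conf" | cut -d'=' -f2)
        EXTRA_NAMESPACES="${EXTRA_NAMESPACES} ${ns}"
    done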
Closes-Bug: 1917781
Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Change-Id: I8febdb685d506cff3c34946163612cafdab3e3a8
Pods that are in a k8s deployment, daemonset, etc. can be labeled as
restart-on-reboot="true", which will automatically cause them to be
restarted after the worker manifest has completed in an AIO system.
It may happen, however, that the k8s-pod-recovery service is started
before the pods are scheduled and created on the node the script is
running on, causing them not to be restarted. The proposed solution is
to wait for stabilization of labeled pods before restarting them.
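A hedged sketch of such a stabilization wait, polling until the count
of labeled pods on this node stops changing (the interval and retry
values are illustrative, not the script's):

    # Poll until the number of restart-on-reboot=true pods on this node
    # is stable across two consecutive checks, or retries run out.
    last_count=-1
    for i in $(seq 1 12); do
        count=$(kubectl get pods --all-namespaces \
            --selector=restart-on-reboot=true \
            --field-selector=spec.nodeName=${HOST} --no-headers | wc -l)
        if [ "$count" -eq "$last_count" ]; then
            break
        fi
        last_count=$count
        sleep 10
    done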
Closes-Bug: 1900920
Signed-off-by: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
Change-Id: I5c73bd838ab2be070bd40bea9e315dcf3852e47f
This commit adds a mechanism to the pod recovery service to restart
pods based on the restart-on-reboot label.
This is a mitigation for an issue seen on an AIO system using SR-IOV
interfaces on an N3000 FPGA device. Since the kubernetes services
start coming up after the controller manifest has completed, a race
can happen with the configuration of devices and the SR-IOV device
plugin in the worker manifest. The symptom of this would be the
SR-IOV device in the running pod disappearing as the FPGA device is
reset.
Notes:
- The pod recovery service only runs on controller nodes.
- The raciness between the kubernetes bring-up and worker configuration
should be fixed in the future by a re-organization of the manifests to
have either a separate AIO manifest or a separate kubernetes manifest.
This would require extensive feature work. In the meantime, this
mitigation will allow pods which experience this issue to recover.
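A simplified sketch of the restart step, assuming the label and node
filter described above (the service's actual implementation may
differ):

    # Restart pods labeled restart-on-reboot=true on this host by
    # deleting them; their controllers will recreate them.
    kubectl get pods --all-namespaces \
        --selector=restart-on-reboot=true \
        --field-selector=spec.nodeName=${HOST} --no-headers \
        | while read -r ns pod rest; do
            kubectl delete pod -n "$ns" "$pod" --wait=false
        done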
Change-Id: If84b66b3a632752bd08293105bb780ea8c7cf400
Closes-Bug: #1896631
Signed-off-by: Steven Webster <steven.webster@windriver.com>
Add a recovery service, started by systemd on a host boot, that waits
for pod transitions to stabilize and then takes corrective action for
the following set of conditions:
- Delete to restart pods stuck in an Unknown or Init:Unknown state for
the 'openstack' and 'monitor' namespaces.
- Delete to restart Failed pods stuck in a NodeAffinity state that occur
in any namespace.
- Delete to restart the libvirt pod in the 'openstack' namespace when
any of its conditions (Initialized, Ready, ContainersReady,
PodScheduled) are not True.
This will only recover pods specific to the host where the service is
installed.
This service is installed on all controller types. There is currently no
evidence that we need this on dedicated worker nodes.
Each of these conditions should be evaluated after the next k8s
component rebase to determine if any of these recovery actions can be
removed.
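As a rough sketch only (using the namespaces listed above; the service
may implement this differently), the Unknown-state recovery could look
like:

    # Delete local pods stuck in Unknown or Init:Unknown so that their
    # controllers restart them, limited to the listed namespaces.
    for ns in openstack monitor; do
        kubectl get pods -n "$ns" \
            --field-selector=spec.nodeName=${HOST} --no-headers \
            | awk '$3 == "Unknown" || $3 == "Init:Unknown" {print $1}' \
            | while read -r pod; do
                kubectl delete pod -n "$ns" "$pod" --wait=false
            done
    done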
Change-Id: I0e304d1a2b0425624881f3b2d9c77f6568844196
Closes-Bug: #1893977
Signed-off-by: Robert Church <robert.church@windriver.com>