284ea72e96
We saw in the field that the pci_devices table can end up in an inconsistent state after a compute node HW failure and re-deployment. There could be dependent devices where the parent PF is in 'available' state while the child VFs are in 'unavailable' state. (Before the HW fault the PF was allocated, hence the VFs were marked unavailable.) In this state the PF is still schedulable, but during the PCI claim the handling of dependent devices in the PCI tracker will fail with the error: "Attempt to consume PCI device XXX from empty pool". The reason for the failure is that when the PF is claimed, all the child VFs are marked unavailable; but if a VF is already unavailable, that step fails.

One way the deployer might try to recover from this state is to remove the VFs from the hypervisor and restart the compute agent. The compute startup already has logic to delete PCI devices that are unused and not reported by the hypervisor. However, this logic only removed devices in 'available' state and ignored devices in 'unavailable' state. If a device is unused and the hypervisor no longer reports it, then it is safe to delete that device from the PCI tracker. So this patch extends the logic to allow deleting 'unavailable' devices.

There is also a small window when a dependent PCI device is in 'unclaimable' state. From a cleanup perspective this is an analogous state, so it is added to the cleanup logic as well.

Related-Bug: #1969496
Change-Id: If9ab424cc7375a1f0d41b03f01c4a823216b3eb8
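The extended cleanup condition can be illustrated with a minimal sketch. This is not Nova's actual code; the function name `can_cleanup` and its parameters are hypothetical, but the set of states that qualify for deletion mirrors the commit message:

```python
# Hypothetical sketch of the startup cleanup decision for a tracked PCI
# device record. Before the fix, only 'available' devices qualified; the
# fix adds 'unavailable' and the transient 'unclaimable' state.
CLEANUP_STATES = frozenset({"available", "unavailable", "unclaimable"})


def can_cleanup(device_status, reported_by_hypervisor, in_use):
    """Return True if a tracked PCI device record may be deleted."""
    if reported_by_hypervisor:
        # The hypervisor still reports the device; keep tracking it.
        return False
    if in_use:
        # The device is still consumed by an instance; keep it.
        return False
    return device_status in CLEANUP_STATES
```

For example, a VF left 'unavailable' by the failure scenario above, no longer reported by the hypervisor and not in use, would now be purged, while a 'claimed' or still-reported device would be kept.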
__init__.py
devspec.py
manager.py
request.py
stats.py
utils.py
whitelist.py