nova/nova/pci
Balazs Gibizer 284ea72e96 Remove unavailable but not reported PCI devices at startup
We saw in the field that the pci_devices table can end up in an
inconsistent state after a compute node HW failure and re-deployment.
There could be dependent devices where the parent PF is in available
state while the children VFs are in unavailable state. (Before the HW
fault the PF was allocated, hence the VFs were marked unavailable.)

In this state the PF is still schedulable, but during the PCI claim
the handling of dependent devices in the PCI tracker will fail with
the error: "Attempt to consume PCI device XXX from empty pool".

The reason for the failure is that when the PF is claimed, all the
children VFs are marked unavailable. But if a VF is already
unavailable, that step fails.
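
The failure mode described above can be illustrated with a toy model.
This is only a sketch and not Nova's actual code: the pool is assumed to
be a plain set of available device addresses, and the names
EmptyPoolError, consume_from_pool and claim_pf are hypothetical.

```python
class EmptyPoolError(Exception):
    """Hypothetical error type standing in for the failure in the log."""


def consume_from_pool(pool, address):
    # Assumption: the pool is a set of addresses that are still available.
    if address not in pool:
        raise EmptyPoolError(
            f"Attempt to consume PCI device {address} from empty pool")
    pool.remove(address)


def claim_pf(pool, pf_address, vf_addresses):
    # Claiming the PF also takes every child VF out of the pool.
    consume_from_pool(pool, pf_address)
    for vf_address in vf_addresses:
        consume_from_pool(pool, vf_address)


# Consistent state: the PF and its VF are both in the pool, the claim works.
healthy_pool = {"0000:81:00.0", "0000:81:00.1"}
claim_pf(healthy_pool, "0000:81:00.0", ["0000:81:00.1"])

# Inconsistent state: the PF is available but the VF was left unavailable,
# so it is missing from the pool and the dependent-device step raises.
broken_pool = {"0000:81:00.0"}
try:
    claim_pf(broken_pool, "0000:81:00.0", ["0000:81:00.1"])
    claim_failed = False
except EmptyPoolError:
    claim_failed = True
```

In this model the PF passes the availability check, which is why it
remains schedulable, and the error only surfaces when its children are
consumed.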

One way the deployer might try to recover from this state is to remove
the VFs from the hypervisor and restart the compute agent. The compute
startup already has logic to delete PCI devices that are unused and
not reported by the hypervisor. However, this logic only removed devices
in 'available' state and ignored devices in 'unavailable' state.

If a device is unused and the hypervisor is not reporting the device any
more, then it is safe to delete that device from the PCI tracker. So this
patch extends the logic to allow deleting 'unavailable' devices. There
is a small window when a dependent PCI device is in 'unclaimable' state.
From a cleanup perspective this is an analogous state, so it is also
added to the cleanup logic.
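
The extended cleanup rule can be sketched as follows. This is a minimal
illustration under stated assumptions, not the code in manager.py: the
PciDevice dataclass, the instance_uuid field used as the "in use" marker,
and the function prune_stale_devices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# States from which a stale device may be dropped, per the text above.
REMOVABLE_STATES = {"available", "unavailable", "unclaimable"}


@dataclass
class PciDevice:
    address: str
    status: str
    # Assumption: a device counts as "used" when it is tied to an instance.
    instance_uuid: Optional[str] = None


def prune_stale_devices(tracked: Iterable[PciDevice],
                        reported_addresses: Set[str]) -> List[PciDevice]:
    """Return the tracked devices that should be kept after startup."""
    kept = []
    for dev in tracked:
        not_reported = dev.address not in reported_addresses
        unused = dev.instance_uuid is None
        if not_reported and unused and dev.status in REMOVABLE_STATES:
            # Unused and no longer reported by the hypervisor: drop it.
            continue
        kept.append(dev)
    return kept
```

Before the change, the equivalent of REMOVABLE_STATES would have
contained only 'available', so a stale VF stuck in 'unavailable' would
have survived every restart.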

Related-Bug: #1969496
Change-Id: If9ab424cc7375a1f0d41b03f01c4a823216b3eb8
2022-04-28 16:01:38 +02:00
__init__.py PCI utils 2013-08-23 14:21:12 +08:00
devspec.py Introduce remote_managed tag for PCI devs 2022-02-09 01:23:24 +03:00
manager.py Remove unavailable but not reported PCI devices at startup 2022-04-28 16:01:38 +02:00
request.py Introduce remote_managed tag for PCI devs 2022-02-09 01:23:24 +03:00
stats.py Filter computes without remote-managed ports early 2022-02-09 01:23:27 +03:00
utils.py Introduce remote_managed tag for PCI devs 2022-02-09 01:23:24 +03:00
whitelist.py Introduce remote_managed tag for PCI devs 2022-02-09 01:23:24 +03:00