========================================
Attaching physical PCI devices to guests
========================================
The PCI passthrough feature in OpenStack allows full access and direct control
of a physical PCI device in guests. This mechanism is generic for any kind of
PCI device, and works with a Network Interface Card (NIC), Graphics Processing
Unit (GPU), or any other device that can be attached to a PCI bus. Correct
driver installation is the only requirement for the guest to properly use the
devices.
Some PCI devices provide Single Root I/O Virtualization and Sharing (SR-IOV)
capabilities. When SR-IOV is used, a physical device is virtualized and appears
as multiple PCI devices. Virtual PCI devices are assigned to the same or
different guests. In the case of PCI passthrough, the full physical device is
assigned to only one guest and cannot be shared.
PCI devices are requested through flavor extra specs, specifically via the
:nova:extra-spec:`pci_passthrough:alias` flavor extra spec.
This guide demonstrates how to enable PCI passthrough for a type of PCI device
with a vendor ID of ``8086`` and a product ID of ``154d`` - an Intel X520
Network Adapter - by mapping it to the alias ``a1``.
You should adjust the instructions for other devices with potentially different
capabilities.
.. note::
For information on creating servers with SR-IOV network interfaces, refer to
the :neutron-doc:`Networking Guide <admin/config-sriov>`.
**Limitations**
* Attaching SR-IOV ports to existing servers was not supported until the
22.0.0 Victoria release. Due to various bugs in libvirt and qemu we
recommend using at least libvirt version 6.0.0 and at least qemu version
4.2.
* Cold migration (resize) of servers with SR-IOV devices attached was not
supported until the 14.0.0 Newton release, see
`bug 1512880 <https://bugs.launchpad.net/nova/+bug/1512880>`_ for details.
.. note::
Nova only supports PCI addresses where the fields are restricted to the
following maximum values:
* domain - 0xFFFF
* bus - 0xFF
* slot - 0x1F
* function - 0x7
Nova will ignore PCI devices reported by the hypervisor if the address is
outside of these ranges.
.. versionchanged:: 26.0.0 (Zed):
PCI passthrough device inventories now can be tracked in Placement.
For more information, refer to :ref:`pci-tracking-in-placement`.
.. versionchanged:: 26.0.0 (Zed):
The nova-compute service will refuse to start if both the parent PF and its
children VFs are configured in :oslo.config:option:`pci.device_spec`.
For more information, refer to :ref:`pci-tracking-in-placement`.
.. versionchanged:: 26.0.0 (Zed):
The nova-compute service will refuse to start with
:oslo.config:option:`pci.device_spec` configuration that uses the
``devname`` field.
.. versionchanged:: 27.0.0 (2023.1 Antelope):
Nova provides Placement based scheduling support for servers with flavor
based PCI requests. This support is disabled by default.
.. versionchanged:: 31.0.0 (2025.1 Epoxy):
* Added a ``managed`` tag to define whether the PCI device is managed
(attached to / detached from the host) by libvirt. This is required to
support SR-IOV devices using the new kernel variant driver interface.
* Added a ``live_migratable`` tag to the device spec to define whether a PCI
device supports live migration.
* Added a ``live_migratable`` tag to alias definitions to allow requesting
either a live-migratable or non-live-migratable device.
Enabling PCI passthrough
------------------------
Configure compute host
~~~~~~~~~~~~~~~~~~~~~~
To enable PCI passthrough on an x86, Linux-based compute node, the following
are required:
* VT-d enabled in the BIOS
* IOMMU enabled on the host OS, e.g. by adding the ``intel_iommu=on`` or
``amd_iommu=on`` parameter to the kernel parameters
* Assignable PCIe devices
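As a quick sanity check after rebooting, you can verify that the IOMMU kernel
parameter was applied and that the kernel initialized the IOMMU. The exact
messages vary by kernel version and hardware vendor, so treat this as a
sketch:

.. code-block:: console

   $ grep -E 'intel_iommu=on|amd_iommu=on' /proc/cmdline
   $ dmesg | grep -i -e DMAR -e IOMMU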
Configure ``nova-compute``
~~~~~~~~~~~~~~~~~~~~~~~~~~
Once PCI passthrough has been configured for the host, :program:`nova-compute`
must be configured to allow the PCI device to pass through to VMs. This is done
using the :oslo.config:option:`pci.device_spec` option. For example,
assuming our sample PCI device has a PCI address of ``41:00.0`` on each host:
.. code-block:: ini
[pci]
device_spec = { "address": "0000:41:00.0" }
Refer to :oslo.config:option:`pci.device_spec` for syntax information.
Alternatively, to enable passthrough of all devices with the same product and
vendor ID:
.. code-block:: ini
[pci]
device_spec = { "vendor_id": "8086", "product_id": "154d" }
If using vendor and product IDs, all PCI devices matching the ``vendor_id`` and
``product_id`` are added to the pool of PCI devices available for passthrough
to VMs.
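If you do not already know the PCI address or the vendor and product IDs of
the device, they can be read on the host with ``lspci``; the ``-nn`` flag
prints the ``[vendor_id:product_id]`` pair. The output below is illustrative
rather than captured from a real host:

.. code-block:: console

   $ lspci -nn | grep -i 'X520'
   41:00.0 Ethernet controller [0200]: Intel Corporation Ethernet 10G 2P X520 Adapter [8086:154d]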
In addition, it is necessary to configure the :oslo.config:option:`pci.alias`
option, which is a JSON-style configuration option that allows you to map a
given device type, identified by the standard PCI ``vendor_id`` and (optional)
``product_id`` fields, to an arbitrary name or *alias*. This alias can then be
used to request a PCI device using the :nova:extra-spec:`pci_passthrough:alias`
flavor extra spec, as discussed previously.
For our sample device with a vendor ID of ``0x8086`` and a product ID of
``0x154d``, this would be:
.. code-block:: ini
[pci]
alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1" }
It's important to note the addition of the ``device_type`` field. This is
necessary because this PCI device supports SR-IOV. The ``nova-compute`` service
categorizes devices into one of three types, depending on the capabilities the
devices report:
``type-PF``
The device supports SR-IOV and is the parent or root device.
``type-VF``
The device is a child device of a device that supports SR-IOV.
``type-PCI``
The device does not support SR-IOV.
By default, it is only possible to attach ``type-PCI`` devices using PCI
passthrough. If you wish to attach ``type-PF`` or ``type-VF`` devices, you must
specify the ``device_type`` field in the config option. If the device did not
support SR-IOV, the ``device_type`` field could be omitted.
Refer to :oslo.config:option:`pci.alias` for syntax information.
.. important::
This option must also be configured on controller nodes. This is discussed later
in this document.
Once configured, restart the :program:`nova-compute` service.
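For example, on a distribution where the service runs under systemd (the unit
name varies between distributions, so adjust accordingly):

.. code-block:: console

   # systemctl restart openstack-nova-compute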
Special Tags
^^^^^^^^^^^^
When specified in :oslo.config:option:`pci.device_spec` some tags
have special meaning:
``physical_network``
Associates a device with a physical network label which corresponds to the
``physical_network`` attribute of a network segment object in Neutron. For
virtual networks such as overlays a value of ``null`` should be specified
as follows: ``"physical_network": null``. In the case of physical networks,
this tag is used to supply the metadata necessary for identifying a switched
fabric to which a PCI device belongs and associate the port with the correct
network segment in the networking backend. Besides typical SR-IOV scenarios,
this tag can be used for remote-managed devices in conjunction with the
``remote_managed`` tag.
.. note::
The use of ``"physical_network": null`` is only supported in single segment
networks. This is due to Nova not supporting multisegment networks for
SR-IOV ports. See
`bug 1983570 <https://bugs.launchpad.net/nova/+bug/1983570>`_ for details.
``remote_managed``
Used to specify whether a PCI device is managed remotely or not. By default,
devices are implicitly tagged as ``"remote_managed": "false"`` and they
must be tagged as ``"remote_managed": "true"`` if ports with
``VNIC_TYPE_REMOTE_MANAGED`` are intended to be used. Once that is done,
those PCI devices will not be available for allocation for regular
PCI passthrough use. Specifying ``"remote_managed": "true"`` is only valid
for SR-IOV VFs and specifying it for PFs is prohibited.
.. important::
It is recommended that PCI VFs that are meant to be remote-managed
(e.g. the ones provided by SmartNIC DPUs) are tagged as remote-managed in
order to prevent them from being allocated for regular PCI passthrough since
they have to be programmed accordingly at the host that has access to the
NIC switch control plane. If this is not done, instances requesting regular
SR-IOV ports may get a device that will not be configured correctly and
will not be usable for sending network traffic.
.. important::
For the Libvirt virt driver, clearing a VLAN by programming VLAN 0 must not
result in errors in the VF kernel driver at the compute host. Before v8.1.0
Libvirt clears a VLAN before passing a VF through to the guest which may
result in an error depending on your driver and kernel version (see, for
example, `this bug <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1957753>`_
which discusses a case relevant to one driver). As of Libvirt v8.1.0, EPERM
errors encountered while programming a VLAN are ignored if VLAN clearing is
not explicitly requested in the device XML.
``trusted``
If a port is requested to be trusted by specifying an extra option during
port creation via ``--binding-profile trusted=true``, only devices tagged as
``trusted: "true"`` will be allocated to instances. Nova will then configure
those devices as trusted by the network controller through its PF device driver.
The specific set of features allowed by the trusted mode of a VF will differ
depending on the network controller itself, its firmware version and what a PF
device driver version allows to pass to the NIC. Common features to be affected
by this tag are changing the VF MAC address, enabling promiscuous mode or
multicast promiscuous mode.
.. important::
While the ``trusted`` tag does not directly conflict with the
``remote_managed`` tag, network controllers in SmartNIC DPUs may prohibit
setting the ``trusted`` mode on a VF via a PF device driver in the first
place. It is recommended to test specific devices, drivers and firmware
versions before assuming this feature can be used.
``managed``
Users must specify whether the PCI device is managed by libvirt to allow
detachment from the host and assignment to the guest, or vice versa.
The managed mode of a device depends on the specific device and the support
provided by its driver.
- ``managed='yes'`` means that nova will let libvirt detach the device
from the host before attaching it to the guest and re-attach it to the host
after the guest is deleted.
- ``managed='no'`` means that Nova will not request libvirt to attach
or detach the device from the host. Instead, Nova assumes that
the operator has pre-configured the host so that the devices are
already bound to vfio-pci or an appropriate variant driver. This
setup allows the devices to be directly usable by QEMU without
requiring any additional operations to enable passthrough.
.. note::
If not set, the default value is ``managed='yes'`` to preserve the existing
behavior, primarily for upgrade purposes.
.. warning::
Incorrect configuration of this parameter may result in compute
node crashes.
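To tie the special tags together, a hypothetical
:oslo.config:option:`pci.device_spec` configuration combining several of them
might look like the following; the addresses and physical network names are
illustrative only:

.. code-block:: ini

   [pci]
   # A VF on an overlay network, pre-bound to its driver by the operator
   device_spec = { "address": "0000:82:00.2", "physical_network": null, "managed": "no" }
   # A trusted VF on the physnet2 physical network
   device_spec = { "address": "0000:82:00.3", "physical_network": "physnet2", "trusted": "true" }
   # A remote-managed VF provided by a SmartNIC DPU
   device_spec = { "address": "0000:05:00.1", "physical_network": "physnet1", "remote_managed": "true" }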
Configure ``nova-scheduler``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :program:`nova-scheduler` service must be configured to enable the
``PciPassthroughFilter``. To do this, add this filter to the list of filters
specified in :oslo.config:option:`filter_scheduler.enabled_filters` and set
:oslo.config:option:`filter_scheduler.available_filters` to the default of
``nova.scheduler.filters.all_filters``. For example:
.. code-block:: ini
[filter_scheduler]
enabled_filters = ...,PciPassthroughFilter
available_filters = nova.scheduler.filters.all_filters
Once done, restart the :program:`nova-scheduler` service.
Configure ``nova-api``
~~~~~~~~~~~~~~~~~~~~~~
It is necessary to also configure the :oslo.config:option:`pci.alias` config
option on the controller. This configuration should match the configuration
found on the compute nodes. For example:
.. code-block:: ini
[pci]
alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1", "numa_policy":"preferred" }
Refer to :oslo.config:option:`pci.alias` for syntax information.
Refer to :ref:`Affinity <pci-numa-affinity-policy>` for ``numa_policy``
information.
Once configured, restart the :program:`nova-api-wsgi` service.
Configuring a flavor or image
-----------------------------
Once the alias has been configured, it can be used in a flavor extra spec.
For example, to request two of the PCI devices referenced by alias ``a1``, run:
.. code-block:: console
$ openstack flavor set m1.large --property "pci_passthrough:alias"="a1:2"
For more information about the syntax for ``pci_passthrough:alias``, refer to
:doc:`the documentation </configuration/extra-specs>`.
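A server booted with this flavor will be scheduled to a host with two free
devices from the ``a1`` pool; the image and network names below are
placeholders:

.. code-block:: console

   $ openstack server create --flavor m1.large \
       --image $IMAGE --network $NETWORK test-pci-server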
.. _pci-numa-affinity-policy:
PCI-NUMA affinity policies
--------------------------
By default, the libvirt driver enforces strict NUMA affinity for PCI devices,
be they PCI passthrough devices or neutron SR-IOV interfaces. This means that
by default a PCI device must be allocated from the same host NUMA node as at
least one of the instance's CPUs. This isn't always necessary, however, and you
can configure this policy using the
:nova:extra-spec:`hw:pci_numa_affinity_policy` flavor extra spec or equivalent
image metadata property. There are three possible values allowed:
**required**
This policy means that nova will boot instances with PCI devices **only**
if at least one of the NUMA nodes of the instance is associated with these
PCI devices. It means that if NUMA node info for some PCI devices could not
be determined, those PCI devices wouldn't be consumable by the instance.
This provides maximum performance.
**socket**
This policy means that the PCI device must be affined to the same host
socket as at least one of the guest NUMA nodes. For example, consider a
system with two sockets, each with two NUMA nodes, numbered node 0 and node
1 on socket 0, and node 2 and node 3 on socket 1. There is a PCI device
affined to node 0. An instance with two guest NUMA nodes and the
``socket`` policy can be affined to either:
* node 0 and node 1
* node 0 and node 2
* node 0 and node 3
* node 1 and node 2
* node 1 and node 3
The instance cannot be affined to node 2 and node 3, as neither of those
are on the same socket as the PCI device. If the other nodes are consumed
by other instances and only nodes 2 and 3 are available, the instance
will not boot.
**preferred**
This policy means that ``nova-scheduler`` will choose a compute host
with minimal consideration for the NUMA affinity of PCI devices.
``nova-compute`` will attempt a best effort selection of PCI devices
based on NUMA affinity, however, if this is not possible then
``nova-compute`` will fall back to scheduling on a NUMA node that is not
associated with the PCI device.
**legacy**
This is the default policy and it describes the current nova behavior.
Usually we have information about association of PCI devices with NUMA
nodes. However, some PCI devices do not provide such information. The
``legacy`` value will mean that nova will boot instances with PCI device
if either:
* The PCI device is associated with at least one of the NUMA nodes on which the
instance will be booted
* There is no information about PCI-NUMA affinity available
For example, to configure a flavor to use the ``preferred`` PCI NUMA affinity
policy for any neutron SR-IOV interfaces attached by the user:
.. code-block:: console
$ openstack flavor set $FLAVOR \
--property hw:pci_numa_affinity_policy=preferred
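Assuming the policy should instead follow the image, the equivalent image
metadata property can be set as follows, where ``$IMAGE`` is a placeholder:

.. code-block:: console

   $ openstack image set $IMAGE \
       --property hw_pci_numa_affinity_policy=preferred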
You can also configure this for PCI passthrough devices by specifying the
policy in the alias configuration via :oslo.config:option:`pci.alias`. For more
information, refer to :oslo.config:option:`the documentation <pci.alias>`.
.. _pci-tracking-in-placement:
PCI tracking in Placement
-------------------------
.. note::
The feature described below is optional and disabled by default in nova
26.0.0 (Zed). The legacy PCI tracker code path is still supported and
enabled. The Placement PCI tracking can be enabled via the
:oslo.config:option:`pci.report_in_placement` configuration.
.. warning::
Please note that once it is enabled on a given compute host
**it cannot be disabled there any more**.
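Enabling the tracking on a compute node is a single boolean option; a minimal
sketch:

.. code-block:: ini

   [pci]
   report_in_placement = True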
Since nova 26.0.0 (Zed) PCI passthrough device inventories are tracked in
Placement. If a PCI device exists on the hypervisor and
matches one of the device specifications configured via
:oslo.config:option:`pci.device_spec` then Placement will have a representation
of the device. Each PCI device of type ``type-PCI`` and ``type-PF`` will be
modeled as a Placement resource provider (RP) with the name
``<hypervisor_hostname>_<pci_address>``. A device of type ``type-VF`` is
represented by its parent PCI device, the PF, as a resource provider.
By default nova will use ``CUSTOM_PCI_<vendor_id>_<product_id>`` as the
resource class in PCI inventories in Placement. However the name of the
resource class can be customized via the ``resource_class`` tag in the
:oslo.config:option:`pci.device_spec` option. There is also a new ``traits``
tag in that configuration that allows specifying a list of placement traits to
be added to the resource provider representing the matching PCI devices.
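As an illustration, a device spec using a custom resource class name and
extra traits might look like the following; the class and trait names here
are invented for the example:

.. code-block:: ini

   [pci]
   device_spec = { "vendor_id": "8086", "product_id": "154d", "resource_class": "CUSTOM_INTEL_X520", "traits": "CUSTOM_FAST,CUSTOM_LOW_LATENCY" }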
.. note::
In nova 26.0.0 (Zed) the Placement resource tracking of PCI devices does not
support SR-IOV devices intended to be consumed via Neutron ports and
therefore having the ``physical_network`` tag in
:oslo.config:option:`pci.device_spec`. Such devices are supported via the
legacy PCI tracker code path in Nova.
.. note::
Having different resource class or traits configuration for VFs under the
same parent PF is not supported and the nova-compute service will refuse to
start with such configuration.
.. important::
While nova supported configuring both the PF and its children VFs for PCI
passthrough in the past, it only allowed consuming either the parent PF or
its children VFs. Since 26.0.0 (Zed) the nova-compute service will
enforce the same rule for the configuration as well and will refuse to
start if both the parent PF and its VFs are configured.
.. important::
While nova supported configuring PCI devices by device name via the
``devname`` parameter in :oslo.config:option:`pci.device_spec` in the past,
this proved to be problematic as the netdev name of a PCI device could
change for multiple reasons during hypervisor reboot. So since nova 26.0.0
(Zed) the nova-compute service will refuse to start with such configuration.
It is suggested to use the PCI address of the device instead.
.. important::
While nova supported configuring :oslo.config:option:`pci.alias` where an
alias name is repeated and therefore associated with multiple alias
specifications, such configuration is not supported when PCI tracking in
Placement is enabled.
The nova-compute service makes sure that existing instances with PCI
allocations in the nova DB will have a corresponding PCI allocation in
placement. This allocation healing also acts on any new instances regardless of
the status of the scheduling part of this feature to make sure that the nova
DB and placement are in sync. There is one limitation of the healing logic.
It assumes that there is no in-progress migration when the nova-compute service
is upgraded. If there is an in-progress migration then the PCI allocation on
the source host of the migration will not be healed. The placement view will be
consistent after such migration is completed or reverted.
Reconfiguring the PCI devices on the hypervisor or changing the
:oslo.config:option:`pci.device_spec` configuration option and restarting the
nova-compute service is supported in the following cases:
* new devices are added
* devices without allocation are removed
Removing a device that has allocations is not supported. If a device with any
allocation is removed, the nova-compute service will keep the device and its
allocation in the nova DB and in Placement, and will log a warning. If
a device with any allocation is reconfigured in a way that an allocated PF is
removed and VFs from the same PF are configured (or vice versa), then
nova-compute will refuse to start, as it would create a situation where both
the PF and its VFs are made available for consumption.
Since nova 27.0.0 (2023.1 Antelope) scheduling and allocation of PCI devices
in Placement can also be enabled via
:oslo.config:option:`filter_scheduler.pci_in_placement` config option set in
the nova-api, nova-scheduler, and nova-conductor configuration. Please note
that this should only be enabled after all the computes in the system are
configured to report PCI inventory in Placement via enabling
:oslo.config:option:`pci.report_in_placement`. In Antelope, flavor
based PCI requests are supported but Neutron port based PCI requests are not
handled in Placement.
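Once every compute reports its PCI inventory to Placement, the scheduling
side can be enabled on the controller services; a minimal sketch:

.. code-block:: ini

   [filter_scheduler]
   pci_in_placement = True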
If you are upgrading from an earlier version with already existing servers with
PCI usage then you must enable :oslo.config:option:`pci.report_in_placement`
first on all your computes having PCI allocations and then restart the
nova-compute service, before you enable
:oslo.config:option:`filter_scheduler.pci_in_placement`. The compute service
will heal the missing PCI allocation in placement during startup and will
continue healing missing allocations for future servers until the scheduling
support is enabled.
If a flavor requests multiple ``type-VF`` devices via
:nova:extra-spec:`pci_passthrough:alias` then it is important to consider the
value of :nova:extra-spec:`group_policy` as well. The value ``none``
allows nova to select VFs from the same parent PF to fulfill the request. The
value ``isolate`` restricts nova to select each VF from a different parent PF
to fulfill the request. If :nova:extra-spec:`group_policy` is not provided in
such a flavor then it defaults to ``none``.
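For example, to request two VFs and force each onto a different parent PF,
where ``$FLAVOR`` is a placeholder:

.. code-block:: console

   $ openstack flavor set $FLAVOR \
       --property pci_passthrough:alias=a1:2 \
       --property group_policy=isolate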
Symmetrically with the ``resource_class`` and ``traits`` fields of
:oslo.config:option:`pci.device_spec` the :oslo.config:option:`pci.alias`
configuration option supports requesting devices by Placement resource class
name via the ``resource_class`` field and also supports requesting traits to
be present on the selected devices via the ``traits`` field in the alias. If
the ``resource_class`` field is not specified in the alias then it is defaulted
by nova to ``CUSTOM_PCI_<vendor_id>_<product_id>``. Either the ``product_id``
and ``vendor_id`` or the ``resource_class`` field must be provided in each
alias.
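A hypothetical alias selecting devices by resource class and traits instead
of vendor and product ID might then look like this, reusing the invented
names from the device spec example above:

.. code-block:: ini

   [pci]
   alias = { "name": "a1", "device_type": "type-PF", "resource_class": "CUSTOM_INTEL_X520", "traits": "CUSTOM_FAST" }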
For deeper technical details please read the `nova specification <https://specs.openstack.org/openstack/nova-specs/specs/zed/approved/pci-device-tracking-in-placement.html>`_.
Support for multiple types of VFs
---------------------------------
SR-IOV devices, such as GPUs, can be configured to provide VFs with various
characteristics under the same vendor ID and product ID.
To enable Nova to model this, if you configure the VFs with different
resource allocations, you will need to use a separate resource class for each
VF type. This can be achieved by following the steps below:
- Enable PCI in Placement: This is necessary to track PCI devices with
custom resource classes in the placement service.
- Define Device Specifications: Use a custom resource class to represent
a specific VF type and ensure that the VFs existing on the hypervisor are
matched via the VF's PCI address.
- Specify Type-Specific Flavors: Define flavors with an alias that matches
the resource class to ensure proper allocation.
Examples:
.. note::
The following example demonstrates device specifications and alias
configurations, utilizing resource classes as part of the "PCI in
placement" feature.
.. code-block:: shell
[pci]
device_spec = { "vendor_id": "10de", "product_id": "25b6", "address": "0000:25:00.4", "resource_class": "CUSTOM_A16_16A", "managed": "no" }
device_spec = { "vendor_id": "10de", "product_id": "25b6", "address": "0000:25:00.5", "resource_class": "CUSTOM_A16_8A", "managed": "no" }
alias = { "device_type": "type-VF", "resource_class": "CUSTOM_A16_16A", "name": "A16_16A" }
alias = { "device_type": "type-VF", "resource_class": "CUSTOM_A16_8A", "name": "A16_8A" }
Configuring Live Migration for PCI devices
------------------------------------------
Live migration of instances with PCI devices requires specific configuration
at both the device and alias levels to ensure that the migration can succeed.
This section explains how to configure PCI passthrough to support live
migration.
Configuring PCI Device Specification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Administrators must explicitly define whether a PCI device supports live
migration.
This is done by adding the ``live_migratable`` attribute to the device
specification in the :oslo.config:option:`pci.device_spec` configuration.
.. note::
Of course, this requires hardware support, as well as proper system
and hypervisor configuration.
Example Configuration:
.. code-block:: ini
[pci]
device_spec = { "vendor_id": "8086", "product_id": "1515", "live_migratable": "yes" }
device_spec = { "vendor_id": "8086", "product_id": "1516", "live_migratable": "no" }
Configuring PCI Aliases for Users
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI devices can be requested through flavor extra specs. To request a live
migratable PCI device, the PCI alias definition in
the :oslo.config:option:`pci.alias` configuration must include
the ``live_migratable`` key.
Example Configuration:
.. code-block:: ini
[pci]
alias = { "name": "vf_live", "vendor_id": "8086", "product_id": "1515", "device_type": "type-VF", "live_migratable": "yes" }
alias = { "name": "vf_no_migrate", "vendor_id": "8086", "product_id": "1516", "device_type": "type-VF", "live_migratable": "no" }
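A flavor can then request a live-migratable device through the alias as
usual; ``$FLAVOR`` is a placeholder:

.. code-block:: console

   $ openstack flavor set $FLAVOR \
       --property pci_passthrough:alias=vf_live:1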
Virtual IOMMU support
---------------------
When the :nova:extra-spec:`hw:viommu_model` flavor extra spec or the
equivalent image metadata property ``hw_viommu_model`` is provided, and the
guest CPU architecture and OS allow it, the libvirt driver can enable a
virtual IOMMU (vIOMMU) for the guest.
.. note::
Enabling vIOMMU might introduce significant performance overhead.
You can see a performance comparison table in the
`AMD vIOMMU session on KVM Forum 2021`_.
For this reason, vIOMMU should only be enabled for workloads that
require it.
.. _`AMD vIOMMU session on KVM Forum 2021`: https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf
Here are four possible values allowed for ``hw:viommu_model``
(and ``hw_viommu_model``):
**virtio**
Supported on Libvirt since 8.3.0, for Q35 and ARM virt guests.
**smmuv3**
Supported on Libvirt since 5.5.0, for ARM virt guests.
**intel**
Supported for Q35 guests.
**auto**
This option will translate to ``virtio`` if supported by libvirt,
else to ``intel`` on X86 (Q35) and ``smmuv3`` on AArch64.
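For example, to request a virtio vIOMMU via the flavor, where ``$FLAVOR`` is
a placeholder:

.. code-block:: console

   $ openstack flavor set $FLAVOR \
       --property hw:viommu_model=virtio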
For the vIOMMU attributes:
* The ``intremap``, ``caching_mode``, and ``iotlb``
options for the vIOMMU (these are driver attributes defined in
`Libvirt IOMMU Domain`_) will be enabled directly.
* ``eim`` will be enabled directly if the machine type is Q35.
``eim`` is a driver attribute defined in `Libvirt IOMMU Domain`_.
.. note::
The ``eim`` (Extended Interrupt Mode) attribute (with possible values on and
off) can be used to configure Extended Interrupt Mode.
A q35 domain with split I/O APIC (as described in hypervisor features),
and both interrupt remapping and EIM turned on for the IOMMU, will be
able to use more than 255 vCPUs. Since 3.4.0 (QEMU/KVM only).
* The ``aw_bits`` attribute can be used to set the address width to allow
mapping larger IOVA addresses in the guest. Since the currently supported
QEMU values are 39 and 48, nova sets this directly to the larger width (48)
if libvirt supports it.
``aw_bits`` is a driver attribute defined in `Libvirt IOMMU Domain`_.
.. _`Libvirt IOMMU Domain`: https://libvirt.org/formatdomain.html#iommu-devices
Known Issues
------------
A known issue exists where the ``live_migratable`` flag is ignored for
devices that include the ``physical_network`` tag.
As a result, instances using such devices are not treated as
non-live-migratable and instead continue to migrate using the legacy VIF
unplug/live migrate/VIF plug procedure.
Example configuration where the ``live_migratable`` flag is ignored:
.. code-block:: ini
[pci]
device_spec = { "vendor_id":"8086", "product_id":"10ca", "address": "0000:06:", "physical_network": "physnet2", "live_migratable": false}
A fix for this issue is planned in a follow-up for the **Epoxy** release.
The upstream bug report is `here`__.
.. __: https://bugs.launchpad.net/nova/+bug/2102161
One-Time-Use Devices
--------------------
Certain devices may need attention after they are released from one user and
before they are attached to another. This is especially true of direct
passthrough devices because the instance has full control over them while
attached, and Nova doesn't know specifics about the device itself, unlike
regular more cloudy resources. Examples include:
* Securely erasing NVMe devices to ensure data residue is not passed from one
user to the other unintentionally
* Reinstalling known-good firmware to the device to avoid a hijack attack
* Updating firmware to the latest release before each user
* Checking a property of the device to determine if it needs repair or
replacement before giving it to another user (i.e. NVMe write-wear indicator)
* Some custom behavior, reset, etc
Nova's scope does not cover the above, but it does support a feature that makes
it easier for the operator to orchestrate tasks like this. By marking a device
as "one time use" (hereafter referred to as OTU), Nova will allocate a device
once, after which it will remain in a "reserved" state to avoid being
allocated to another instance. After the operator's workflow is performed and
the device should be returned to the pool of available resources, the reserved
flag can be dropped and Nova will consider it usable again.
.. note:: This feature requires :ref:`pci-tracking-in-placement` in order to
work. The compute configuration is required; the scheduler configuration is
optional during the transition, but required for safety.
A device can be marked as OTU by adding a tag in the ``device_spec`` like this:
.. code-block:: shell
device_spec = {"address": "0000:00:1.0", "one_time_use": true}
By marking the device as such, Nova will set the ``reserved`` inventory value
on the placement provider to fully cover the device (i.e. ``reserved=total``)
at the point at which the instance is assigned the PCI device on the compute
node. When the instance is deleted, the ``used`` value will return to zero but
``reserved`` will remain. It is the operator's responsibility to return the
``reserved`` value to zero when the device is ready for re-assignment.
The best way to handle this would be to listen to Nova's notifications for the
``instance.delete.end`` event so that the post-processing workflow can happen
immediately. However, since notifications could be dropped or missed, regular
polling should be performed. Providers that represent devices that Nova is
applying the OTU behavior to will have the ``HW_PCI_ONE_TIME_USE`` trait,
making it easier to identify them. For example:
.. code-block:: shell
$ openstack resource provider list --required HW_PCI_ONE_TIME_USE
+--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
| b9e67d7d-43db-49c7-8ce8-803cad08e656 | jammy_0000:00:01.0 | 39 | 2ee402e8-c5c6-4586-9ac7-58e7594d27d1 | 2ee402e8-c5c6-4586-9ac7-58e7594d27d1 |
+--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
This will find all such providers. For each of those, checking the inventory
to find
ones with ``used=0`` and ``reserved=1`` will identify devices in need of
processing. To use the above example:
.. code-block:: shell
$ openstack resource provider inventory list b9e67d7d-43db-49c7-8ce8-803cad08e656
+----------------------+------------------+----------+----------+----------+-----------+-------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------------+------------------+----------+----------+----------+-----------+-------+------+
| CUSTOM_PCI_1B36_0100 | 1.0 | 1 | 1 | 1 | 1 | 1 | 0 |
+----------------------+------------------+----------+----------+----------+-----------+-------+------+
To return the above device back to the pool of allocatable resources, we can
set the reserved count back to zero:
.. code-block:: shell
$ openstack resource provider inventory set --amend \
--resource CUSTOM_PCI_1B36_0100:reserved=0 \
b9e67d7d-43db-49c7-8ce8-803cad08e656
+----------------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------------+------------------+----------+----------+----------+-----------+-------+
| CUSTOM_PCI_1B36_0100 | 1.0 | 1 | 1 | 0 | 1 | 1 |
+----------------------+------------------+----------+----------+----------+-----------+-------+