SR-IOV NICs Tracking In Placement
https://blueprints.launchpad.net/nova/+spec/track-sriov-nics-in-placement
In the Zed and 2023.1 (Antelope) releases support was added for tracking PCI devices that do not have the physical_network tag in Placement. This enables generic PCI devices that are consumed via flavor-based PCI passthrough to be tracked in Placement. PCI devices that are consumed via a Neutron port, however, are not tracked in Placement. This spec aims to address that gap and enable tracking of Neutron-managed PCI devices.
Problem description
Nova has supported generic stateless PCI passthrough for many releases using a dedicated PCI tracker in conjunction with a PciPassthroughFilter scheduler post filter.
The PCI tracker is responsible for tracking which PCI devices are available, claimed, and allocated, the capabilities of each device, its consumer when claimed or allocated, and the type and location of the PCI device. The PciPassthroughFilter is responsible for ensuring that the devices requested by the VM exist on a host during scheduling.
These PCI requests come from two sources: flavor-based PCI requests that are generated using the pci_passthrough:alias flavor extra spec, and Neutron-based PCI requests generated from SR-IOV backed Neutron ports.
Currently Nova has the capability to model the availability of flavor-managed PCI devices in Placement but lacks the same capability for devices consumed via Neutron ports. All instances with SR-IOV, VDPA, hardware-offloaded OVS, or DPU ports rely on the PciPassthroughFilter to select hosts. While the current approach to SR-IOV NIC tracking works, there are limitations in the current design and room for optimization.
Limitations
- The current implementation is functional but slow.
- While Nova today tracks the capabilities of network interfaces in the extra_info field of the pci_devices table and the PciPassthroughFilter could match on those capabilities, there is no user-facing way to express a request for an SR-IOV Neutron port with a specific network capability, e.g. TSO.
Optimizations
- Use Placement to track SR-IOV NICs.
Use Cases
- As an operator, I want to use Placement API to track the used SR-IOV resources for Neutron managed ports.
- As an operator, I want to schedule my VMs to the correct PF/VF even if multiple device choices are available on the host.
- As an operator, I want to associate quotas with SR-IOV Neutron ports.
Note
Device quotas would require unified limits to be implemented. Implementing quotas is out of the scope of this spec beyond enabling the use case by modeling PCI devices in Placement.
This spec will also only focus on Neutron SR-IOV ports.
Proposed change
PCI device_spec configuration
There is a list of Neutron port vnic_type values (e.g. direct, direct-physical, vdpa, etc.) where the port needs to be backed by a VF or PF PCI device.
In the simple case, when a port only requires a PCI device but no other resources (e.g. bandwidth), Nova needs to create Placement request groups for each Neutron port by extending the prefilter introduced in the generic PCI in Placement implementation. Compared to the PCI alias case, in the SR-IOV NIC case neither the name of the resource class nor the vendor ID / product ID pair is known at scheduling time. Therefore, the prefilter does not know what resource class needs to be requested in the Placement request group.
To resolve this, PCI devices that are intended to be used for Neutron-based SR-IOV, hardware-offloaded OVS, or VDPA should not use the resource_class tag in the [pci]device_spec. Instead Nova will use standard resource classes to model these resources. The resource classes are explained in a later section.
Today Nova allows consuming type-PCI or type-VF devices for direct ports. This is mostly there for historical reasons and should be cleaned up.
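A minimal sketch of such a device_spec entry follows; the PCI address and physnet name are hypothetical, and the key point is the absence of a resource_class tag so that Nova can derive a standard resource class itself:

```ini
[pci]
report_in_placement = True
# Illustrative entry: the physical_network tag marks this PF (and its VFs)
# as Neutron-consumable; no resource_class tag is given, so Nova picks the
# standard resource class (PCI_NETDEV, SRIOV_NET_VF, VDPA_NETDEV).
device_spec = { "address": "0000:81:00.0", "physical_network": "physnet0" }
```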
Modeling SR-IOV devices in Placement
PCI device modeling in Placement is already implemented. Each PCI device of type type-PCI and type-PF will be modeled as a Placement resource provider (RP) with the name <hypervisor_hostname>_<pci_address>. The hypervisor_hostname prefix will be the same string as the name of the root RP. The pci_address part of the name will be the full PCI address in the DDDD:BB:AA.FF format.
Each SR-IOV NIC device RP will have an inventory of resource class and traits derived by Nova based on device categorization explained below:
- A device in the device_spec will be consumable only via a PCI alias if it does not have the physical_network tag attached.
- A device that has the physical_network tag attached will be considered a network device and will be modelled as a PCI_NETDEV resource.
- A device that has the physical_network tag and also has the capability to provide VFs will have the trait HW_NIC_SRIOV but still use the PCI_NETDEV resource class.
- A device that has the physical_network tag and is a VF will be modelled as an SRIOV_NET_VF resource.
- A device that has the physical_network tag and is a VF with a vdpa device will be modelled as a VDPA_NETDEV resource.
The actual implementation will reuse the existing logic that determines the device_type and which resource class to use, rather than implementing this in the device_spec parsing.
PCI placement tracking will only take effect if the [pci]report_in_placement config option is set to True.
Every Neutron vnic_type can be mapped to a single resource class by Nova. The following vnic_type -> resource class mapping is suggested:
- direct-physical -> PCI_NETDEV
- direct, macvtap, virtio-forwarder, remote-managed -> SRIOV_NET_VF
- vdpa -> VDPA_NETDEV
Nova will use these resource classes to report device inventories to Placement. The prefilter can then translate the vnic_type of the ports to request the specific resource class during scheduling.
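The suggested mapping is just a static lookup table; as a sketch:

```python
# The suggested vnic_type -> resource class mapping from the spec text.
VNIC_TYPE_TO_RC = {
    "direct-physical": "PCI_NETDEV",
    "direct": "SRIOV_NET_VF",
    "macvtap": "SRIOV_NET_VF",
    "virtio-forwarder": "SRIOV_NET_VF",
    "remote-managed": "SRIOV_NET_VF",
    "vdpa": "VDPA_NETDEV",
}
```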
Another specialty of Neutron-based SR-IOV is that the devices listed in the device_spec always have a physical_network tag. This information needs to be reported as a trait on the PF RP in Placement. Also, the port's requested physnet needs to be included in the Placement request group by the prefilter. If an SR-IOV device matches a device_spec entry with a physical_network tag then an inventory of 1 is reported with the resource_class derived by Nova.
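A hedged sketch of how the prefilter might build such a request group from a port. The CUSTOM_PHYSNET_<name> trait format is an assumption here, mirroring the convention used for bandwidth providers; the function name is illustrative:

```python
# Sketch: build a Placement request group for a Neutron SR-IOV port.
def port_request_group(vnic_type: str, physnet: str) -> dict:
    # vnic_type -> resource class mapping from the spec; SRIOV_NET_VF is
    # the default for the remaining VF-backed vnic_types.
    rc = {"direct-physical": "PCI_NETDEV", "vdpa": "VDPA_NETDEV"}.get(
        vnic_type, "SRIOV_NET_VF")
    return {
        "resources": {rc: 1},
        # Assumed physnet trait naming convention (not confirmed by the spec).
        "required": ["CUSTOM_PHYSNET_" + physnet.upper().replace("-", "_")],
    }
```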
There is a more complex case when the Neutron port not only requests a PCI device but also requests additional resources (e.g. bandwidth) via the port resource_request attribute. Supporting this is currently out of scope for this spec but is intended to be supported in the future.
Nova will detect and refuse to boot an instance with an SR-IOV type port that contains additional resource requests. This will be done by returning an HTTP 409 error until support for this is added. Attaching SR-IOV ports with additional resources will also be detected and rejected.
Neutron SR-IOV ports with QoS (out of scope)
When a Neutron port requests additional resources, Nova generates Placement request groups from the resource_request and, as in the simple case, will generate a request group from the PCI request. The resource requests of these groups of a Neutron port need to be correlated to ensure that a port gets the PCI device and the bandwidth from the same physical device. However, today the bandwidth is modeled under the Neutron RP subtree while PCI devices will be modeled right under the root RP. So the two RPs to allocate from are not within the same subtree. (Note that Placement always fulfills a named request group from a single RP but allows correlating such request groups within the same subtree.) We have multiple options here:
- Create a scheduler filter that removes allocation candidates where these request groups are fulfilled from different physical devices.
- Report the bandwidth and the PCI device resource on the same RP. This breaks the clear ownership of a single RP as the bandwidth is reported by the Neutron agent while the PCI device is reported by Nova.
- Move the two RPs (bandwidth and PCI dev) into the same subtree. This needs an agreement between Nova and Neutron devs where to move the RPs and needs an extra reshape to implement the move.
- Enhance Placement to allow sharing of resources between RPs within the same RP tree. By that, we could make the bandwidth RP a sharing RP that shares resources with the PCI device RP representing the physical device.
To enable forward progress with a minimum of dependencies and allow incremental progress, the preferred short-term solution is to enhance the existing PciPassthroughFilter to remove allocation candidates where these request groups are fulfilled from different physical devices, or to add a new scheduler filter that does so.
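The shape of that candidate pruning can be sketched as follows; the candidate representation (a group-name to RP mapping) and the device-correlation callback are assumptions, since the real data structures live in the Nova scheduler:

```python
# Sketch: drop allocation candidates whose PCI and bandwidth request groups
# resolve to RPs that do not represent the same physical device.
def filter_candidates(candidates, same_physical_device):
    """candidates: list of {group_name: rp_uuid} mappings.

    same_physical_device: callback deciding whether the RP fulfilling the
    PCI group and the RP fulfilling the bandwidth group belong to the same
    physical device (hypothetical helper).
    """
    return [
        c for c in candidates
        if same_physical_device(c.get("pci_group"), c.get("bandwidth_group"))
    ]
```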
Requesting PCI devices
Nova will continue using the InstancePCIRequest to track the requested SR-IOV NIC devices for a VM.
VM lifecycle operations
The initial scheduling is very similar to the later scheduling done due to move operations. So, the existing implementation can be reused. Also, the current logic that switches the source node Placement allocation to be held by the migration UUID can be reused.
Attaching and detaching PCI devices will continue to be supported via Neutron SR-IOV ports.
Alternatives
- We could keep using the legacy tracking with all its good and bad properties.
- We could have Nova create the resource providers for the SR-IOV devices under the Neutron SR-IOV NIC agent resource provider. However, Nova does not know which network backend is in use, and this would create a startup dependency loop between Nova and Neutron: Neutron needs the compute RP to be created before it can create the SR-IOV NIC agent RP, and Nova would need the agent RP to exist before it could finish reporting the PCI devices.
- We could defer the Nova support for tracking SR-IOV devices in Placement until we reshape how bandwidth is tracked or develop a new Placement feature to correlate the relevant RPs.
Data model impact
The InstancePCIRequest object will be extended to include the required and forbidden traits and the resource class generated by Nova.
REST API impact
None
Security impact
None
Notifications impact
None
Other end user impact
None
Performance Impact
In general, this is expected to improve the scheduling performance but should have no runtime performance impact on guests.
The introduction of new RequestGroup objects will make the computation of the Placement query slightly longer, and the resulting execution time may increase for instances with SR-IOV NIC requests, but should have no effect for instances without such requests. This added complexity is expected to be offset by offloading the filtering to Placement and by the removal of reschedules due to racing for the last PCI device on a host; overall, performance is expected to improve.
Other deployer impact
To utilize the new feature the operator will have to set two new config options: one to enable the reporting of the PCI devices to Placement, and a second to enable the Placement scheduling logic.
Developer impact
None
Upgrade impact
The new Placement based PCI tracking will be disabled by default.
Deployments already using PCI devices can freely upgrade to the new Nova
version without any impact. At this state the PCI device management will
be done by the PciPassthroughFilter
in the scheduler and
the PCI claim in the PCI device tracker in the compute service same as
in the previous version of Nova. Then after the upgrade the new PCI
device tracking can be enabled in two phases.
First, the PCI inventory reporting needs to be enabled via [pci]report_in_placement on each compute host. During the startup of the nova-compute service with [pci]report_in_placement = True the service will reshape the provider tree and start reporting PCI device inventory to Placement. Nova compute will also heal the PCI allocations of the existing instances in Placement. This healing will be done for new instances with PCI requests until a future release where the prefilter is enabled by default. This is needed to keep the resource usage in sync in Placement even if the instance scheduling is done without the prefilter requesting PCI allocations in Placement.
Note
Once [pci]report_in_placement is enabled for a compute host it cannot be disabled any more.
Second, after every compute has been configured to report PCI inventories to Placement, the scheduling logic needs to be enabled in the nova-scheduler configuration via the [filter_scheduler]pci_in_placement configuration option.
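The two phases can be summarized as the following config fragments (a sketch; apply phase 1 to every compute node before phase 2):

```ini
# Phase 1: on each nova-compute host; triggers the reshape and inventory
# reporting, and cannot be turned off once enabled.
[pci]
report_in_placement = True

# Phase 2: on nova-scheduler, only after all computes report inventories.
[filter_scheduler]
pci_in_placement = True
```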
Implementation
Assignee(s)
- Primary assignee:
-
kpawar
Feature Liaison
- Feature liaison:
-
balazs-gibizer
Work Items
- translate InstancePCIRequest objects to RequestGroup objects in the RequestSpec
- support adding resource class and required traits to SR-IOV NIC requests
- filter and reject SR-IOV NIC requests with resources both during VM creation and SR-IOV port attachment
Dependencies
The unified limits feature exists in an opt-in, experimental state and will allow defining limits for the new PCI resources if enabled.
Testing
As this is a PCI passthrough related feature it cannot be tested in upstream tempest. Testing will be primarily done via the extensive unit and functional test suites that exist for instances with PCI devices.
Documentation Impact
The PCI passthrough doc will have to be rewritten to document the new resource_class and trait tags for SR-IOV NIC devices.
References
- _`CPU resource tracking spec`: https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
- _`Unified Limits Integration in Nova`: https://specs.openstack.org/openstack/nova-specs/specs/ussuri/approved/unified-limits-nova.html
- _`Support virtual GPU resources`: https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html
- _`PCI device tracking in Placement`: https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/pci-device-tracking-in-placement.html
History
Release Name | Description
---|---
2023.2 Bobcat | Introduced