
SR-IOV NICs Tracking In Placement

https://blueprints.launchpad.net/nova/+spec/track-sriov-nics-in-placement

In the zed and 2023.1 (antelope) releases support was added for tracking PCI devices that do not contain the physical_network tag in Placement. This enables generic PCI devices that are consumed via flavor-based PCI passthrough to be tracked in Placement. PCI devices that are consumed via Neutron ports, however, are not tracked in Placement. This spec aims to address that gap and enable tracking of Neutron-managed PCI devices.

Problem description

Nova has supported generic stateless PCI passthrough for many releases using a dedicated PCI tracker in conjunction with a PciPassthroughFilter scheduler post filter.

The PCI tracker is responsible for tracking which PCI devices are available, claimed, and allocated, the capabilities of the device, its consumer when claimed or allocated as well as the type of PCI device and location.

The PciPassthroughFilter is responsible for ensuring that devices, requested by the VM, exist on a host during scheduling. These PCI requests come from two sources: flavor-based PCI requests that are generated using the pci_passthrough:alias flavor extra specs and Neutron based PCI requests generated from SR-IOV backed Neutron ports.

Currently Nova has the capability to model the availability of flavor-managed PCI devices in Placement but lacks the same capability for devices consumed via Neutron ports. All instances with SR-IOV, VDPA, hardware offloaded OVS, or DPU ports rely on the PciPassthroughFilter to select hosts. While the current approach to SR-IOV NIC tracking works, there are some limitations in the current design and there is room for optimization.

Limitations

  • The current scheduler-filter based implementation is slow.
  • While Nova today tracks the capabilities of network interfaces in the extra_info field of the pci_devices table, and the PciPassthroughFilter could match on those capabilities, there is no user-facing way to express a request for an SR-IOV Neutron port with a specific network capability, e.g. TSO.

Optimizations

  • Use placement to track SR-IOV nics.

Use Cases

  • As an operator, I want to use Placement API to track the used SR-IOV resources for Neutron managed ports.
  • As an operator, I want to schedule my VMs to the correct PF/VF even if multiple device choices are available on the host.
  • As an operator, I want to associate quotas with SR-IOV Neutron ports.

Note

Device quotas would require unified limits to be implemented. Implementing quotas is out of the scope of this spec beyond enabling the use case by modeling PCI devices in Placement.

This spec will also only focus on Neutron SR-IOV ports.

Proposed change

PCI device_spec configuration

There is a list of Neutron port vnic_types (e.g. direct, direct-physical, vdpa, etc.) where the port needs to be backed by a VF or PF PCI device.

In a simple case, when a port only requires a PCI device but does not require any other resources (e.g. bandwidth), Nova needs to create Placement request groups for each Neutron port by extending the prefilter introduced in the generic PCI in Placement implementation. Compared to the PCI alias case, in the SR-IOV NIC case neither the name of the resource class nor the vendor ID, product ID pair is known at scheduling time. Therefore, the prefilter does not know what resource class needs to be requested in the Placement request group.

To resolve this, PCI devices that are intended to be used for Neutron-based SR-IOV, hardware offloaded OVS, or VDPA should not use the resource_class tag in [pci]device_spec. Instead Nova will use standard resource classes to model these resources. The resource classes are explained in a later section.
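To illustrate, a [pci]device_spec configuration might look like the following. The vendor and product IDs and the physnet name are hypothetical examples, not values mandated by this spec:

```ini
[pci]
# Hypothetical example: a NIC's VFs on physical network "physnet0" exposed
# for Neutron-managed SR-IOV. Note there is no resource_class tag; Nova
# derives a standard resource class for the device itself.
device_spec = { "vendor_id": "8086", "product_id": "154c", "physical_network": "physnet0" }
```

A device intended only for flavor-based passthrough would instead omit the physical_network tag and may carry a custom resource_class tag.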

Today Nova allows consuming a type-PCI or type-VF device for direct ports. This is mostly there for historical reasons and it should be cleaned up.

Modeling SR-IOV devices in Placement

PCI device modeling in Placement is already implemented. Each PCI device of type type-PCI and type-PF will be modeled as a Placement resource provider (RP) with the name <hypervisor_hostname>_<pci_address>. The hypervisor_hostname prefix will be the same string as the name of the root RP. The pci_address part of the name will be the full PCI address in the DDDD:BB:DD.FF (domain:bus:device.function) format.
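The naming scheme above can be sketched as follows (a minimal illustration; the helper name and inputs are hypothetical, not Nova's actual code):

```python
def pci_device_rp_name(hypervisor_hostname: str, pci_address: str) -> str:
    """Build the Placement resource provider name for a PCI device.

    The prefix is the root RP name (the hypervisor hostname) and the
    suffix is the full PCI address in domain:bus:device.function format.
    """
    return f"{hypervisor_hostname}_{pci_address}"

print(pci_device_rp_name("compute-1", "0000:81:00.0"))
# compute-1_0000:81:00.0
```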

Each SR-IOV NIC device RP will have an inventory of resource class and traits derived by Nova based on device categorization explained below:

  • A device in the device_spec will be consumable only via PCI alias if it does not have the physical_network tag attached.
  • A device that has the physical_network tag attached will be considered a network device and it will be modelled as a PCI_NETDEV resource.
  • A device that has the physical_network tag and also has the capability to provide VFs will have the trait HW_NIC_SRIOV but still use the PCI_NETDEV resource class.
  • A device that has the physical_network tag and is a VF will be modelled as a SRIOV_NET_VF resource.
  • A device that has the physical_network tag and is a VF with a vdpa device will be modelled as a VDPA_NETDEV resource.
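The categorization rules above can be sketched as a small decision function. The helper name, its signature, and the boolean flags are illustrative only, not Nova's actual implementation:

```python
def resource_class_and_traits(dev_type, physical_network,
                              is_vdpa=False, sriov_capable=False):
    """Derive the (resource class, traits) pair for one PCI device.

    dev_type is one of "type-PCI", "type-PF", "type-VF" as used by the
    PCI tracker. A device without a physical_network tag is consumable
    only via a flavor-based PCI alias, so no network resource class is
    derived for it here.
    """
    traits = set()
    if physical_network is None:
        return None, traits
    if dev_type == "type-VF":
        if is_vdpa:
            return "VDPA_NETDEV", traits
        return "SRIOV_NET_VF", traits
    # A type-PCI or type-PF network device.
    if sriov_capable:
        traits.add("HW_NIC_SRIOV")
    return "PCI_NETDEV", traits
```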

The actual implementation of this will reuse the existing logic to determine the device_type and to determine which resource class to use rather than implementing this in the device_spec parsing. PCI placement tracking will only take effect if the [pci]report_in_placement config option is set to True.

Every Neutron vnic_type can be mapped to one single resource class by Nova. The following vnic_type -> resource class mapping is suggested:

  • direct-physical -> PCI_NETDEV
  • direct, macvtap, virtio-forwarder, remote-managed -> SRIOV_NET_VF
  • vdpa -> VDPA_NETDEV

Nova will use these resource classes to report device inventories to Placement. Then the prefilter can translate the vnic_type of the ports to request the specific resource class during scheduling.
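The translation the prefilter performs can be sketched as a simple lookup (the dict and function names are illustrative, not Nova's actual code):

```python
# Suggested vnic_type -> resource class mapping from this spec.
VNIC_TYPE_TO_RC = {
    "direct-physical": "PCI_NETDEV",
    "direct": "SRIOV_NET_VF",
    "macvtap": "SRIOV_NET_VF",
    "virtio-forwarder": "SRIOV_NET_VF",
    "remote-managed": "SRIOV_NET_VF",
    "vdpa": "VDPA_NETDEV",
}

def rc_for_port(vnic_type: str) -> str:
    """Translate a Neutron port's vnic_type to the resource class the
    prefilter would request from Placement."""
    return VNIC_TYPE_TO_RC[vnic_type]
```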

Another speciality of Neutron-based SR-IOV is that the devices listed in the device_spec always have a physical_network tag. This information needs to be reported as a trait to the PF RP in Placement. Also, the port's requested physnet needs to be included in the Placement request group by the prefilter. If a SR-IOV device is matching a device_spec entry with a physical_network tag then an inventory of 1 is reported of the resource_class derived by Nova.
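What Nova would report for one matching device can be sketched as below. The helper is hypothetical, and the CUSTOM_PHYSNET_ trait prefix mirrors the convention used by the existing bandwidth feature (an assumption here, not something this spec fixes):

```python
def provider_payload(resource_class: str, physnet: str) -> dict:
    """Sketch the Placement payload for one device matching a
    device_spec entry with a physical_network tag: an inventory of 1 of
    the derived resource class plus a physnet trait on the device RP.
    """
    trait = "CUSTOM_PHYSNET_" + physnet.upper().replace("-", "_")
    return {
        "inventories": {resource_class: {"total": 1}},
        "traits": [trait],
    }

print(provider_payload("SRIOV_NET_VF", "physnet0"))
```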

There is a more complex case when the Neutron port not only requests a PCI device but also requests additional resources (e.g. bandwidth) via the port resource_request attribute. Supporting this is currently out of scope of this spec but is intended to be supported in the future.

Nova will detect and refuse to boot an instance with an SR-IOV type port that contains additional resource requests. This will be done by returning an HTTP 409 error until support for this is added. Attaching SR-IOV ports with additional resources will also be detected and rejected.
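The rejection check could look roughly like the following sketch. The exception class and function are hypothetical stand-ins for the API-level 409 response, not Nova's actual code:

```python
class PortResourceRequestNotSupported(Exception):
    """Illustrative stand-in for the HTTP 409 the API would return."""


SRIOV_VNIC_TYPES = {"direct", "direct-physical", "macvtap",
                    "virtio-forwarder", "remote-managed", "vdpa"}


def validate_sriov_port(port: dict) -> None:
    """Reject an SR-IOV-type port that also carries a resource_request,
    until QoS support for PCI tracking in Placement is added."""
    if (port.get("binding:vnic_type") in SRIOV_VNIC_TYPES
            and port.get("resource_request")):
        raise PortResourceRequestNotSupported(
            "SR-IOV ports with additional resource requests are not yet "
            "supported with PCI tracking in Placement")
```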

Neutron SR-IOV ports with QoS (out of scope)

When a Neutron port requests additional resources, Nova generates Placement request groups from the resource_request and as in the simple case will generate a request group from the PCI request. The resource request of these groups of a Neutron port needs to be correlated to ensure that a port gets the PCI device and the bandwidth from the same physical device. However, today the bandwidth is modeled under the Neutron RP subtree while PCI devices will be modeled right under the root RP. So the two RPs to allocate from are not within the same subtree. (Note that Placement always fulfills a named request group from a single RP but allows correlating such request groups within the same subtree.) We have multiple options here:

  • Create a scheduler filter that removes allocation candidates where these request groups are fulfilled from different physical devices.
  • Report the bandwidth and the PCI device resource on the same RP. This breaks the clear ownership of a single RP as the bandwidth is reported by the Neutron agent while the PCI device is reported by Nova.
  • Move the two RPs (bandwidth and PCI dev) into the same subtree. This needs an agreement between Nova and Neutron devs where to move the RPs and needs an extra reshape to implement the move.
  • Enhance Placement to allow sharing of resources between RPs within the same RP tree. By that, we could make the bandwidth RP a sharing RP that shares resources with the PCI device RP representing the physical device.

To enable forward progress with a minimum of dependencies and incremental progress, the preferred short-term solution is to enhance the existing PciPassthroughFilter to remove allocation candidates where these request groups are fulfilled from different physical devices, or to add a new scheduler filter that does so.
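The scheduler-filter option can be sketched as a post-processing step over allocation candidates. The candidate shape below is a hypothetical simplification (group name -> the physical device its chosen RP belongs to), not the real Placement data model:

```python
def same_device_candidates(candidates):
    """Keep only allocation candidates whose PCI request group and
    bandwidth request group are fulfilled by the same physical device.

    Each candidate maps a request group to the device backing the RP
    chosen for it, correlated here by PCI address (illustrative only).
    """
    return [
        cand for cand in candidates
        if cand["pci_group"]["device"] == cand["bandwidth_group"]["device"]
    ]
```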

Requesting PCI devices

Nova will continue using the InstancePCIRequest to track the requested SR-IOV NIC devices for a VM.

VM lifecycle operations

The initial scheduling is very similar to the later scheduling done due to move operations. So, the existing implementation can be reused. Also, the current logic that switches the source node Placement allocation to be held by the migration UUID can be reused.

Attaching and detaching PCI devices will continue to be supported via Neutron SR-IOV ports.

Alternatives

  • We could keep using the legacy tracking with all its good and bad properties.
  • We could have Nova create the resource providers for the SR-IOV devices under the Neutron SR-IOV NIC agent resource provider. However, Nova does not know which network backend is in use, and this would create a start-up dependency loop between Nova and Neutron: Neutron needs the compute RP to be created before it can create the SR-IOV NIC agent RP, and Nova would need the agent RP to exist before it could finish reporting the PCI devices.
  • We could defer the Nova support for tracking SR-IOV devices in Placement until we reshape how bandwidth is tracked or develop a new Placement feature to correlate the relevant RPs.

Data model impact

InstancePCIRequest object will be extended to include the required and forbidden traits and the resource class generated by Nova.

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

In general, this is expected to improve the scheduling performance but should have no runtime performance impact on guests.

The introduction of new RequestGroup objects will make the computation of the Placement query slightly longer, and the resulting execution time may increase for instances with SR-IOV NIC requests, but should have no effect on instances without such requests. This added complexity is expected to be offset by offloading the filtering to Placement and by removing reschedules caused by racing for the last PCI device on a host, so the overall performance is expected to improve.

Other deployer impact

To utilize the new feature the operator will have to define two new config options. One to enable the placement scheduling logic and a second to enable the reporting of the PCI devices to Placement.

Developer impact

None

Upgrade impact

The new Placement based PCI tracking will be disabled by default. Deployments already using PCI devices can freely upgrade to the new Nova version without any impact. In this state the PCI device management will be done by the PciPassthroughFilter in the scheduler and the PCI claim in the PCI device tracker in the compute service, the same as in the previous version of Nova. Then, after the upgrade, the new PCI device tracking can be enabled in two phases.

First, the PCI inventory reporting needs to be enabled via [pci]report_in_placement on each compute host. During the startup of the nova-compute service with [pci]report_in_placement = True the service will reshape the provider tree and start reporting PCI device inventory to Placement. Nova compute will also heal the PCI allocations of the existing instances in Placement. This healing will be done for new instances with PCI requests until a future release where the prefilter is enabled by default. This is needed to keep the resource usage in Placement in sync even if the instance scheduling is done without the prefilter requesting PCI allocations in Placement.

Note

Once [pci]report_in_placement is enabled for a compute host it cannot be disabled any more.

Second, after every compute has been configured to report PCI inventories to Placement the scheduling logic needs to be enabled in the nova-scheduler configuration via the [filter_scheduler]pci_in_placement configuration option.
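The two-phase rollout above amounts to the following nova.conf changes (phase 1 on every compute host, phase 2 on the scheduler hosts only after phase 1 is complete everywhere):

```ini
# Phase 1: enable inventory reporting and the provider-tree reshape on
# each nova-compute host. Note: this cannot be disabled once enabled.
[pci]
report_in_placement = True

# Phase 2: enable the Placement-based scheduling logic on the
# nova-scheduler hosts.
[filter_scheduler]
pci_in_placement = True
```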

Implementation

Assignee(s)

Primary assignee:

kpawar

Feature Liaison

Feature liaison:

balazs-gibizer

Work Items

  • translate InstancePCIRequest objects to RequestGroup objects in the RequestSpec
  • support adding resource class and required traits to SR-IOV NIC requests.
  • filter and reject SR-IOV NIC requests with resources both during VM creation and SR-IOV port attachment.

Dependencies

The unified limits feature exists in an opt-in, experimental state and will allow defining limits for the new PCI resources if enabled.

Testing

As this is a PCI passthrough related feature it cannot be tested in upstream tempest. Testing will be primarily done via the extensive unit and functional test suites that exist for instances with PCI devices.

Documentation Impact

The PCI passthrough doc will have to be rewritten to document the new resource_class and trait tags for SR-IOV NIC devices.

References

History

Revisions
Release Name Description
2023.2 Bobcat Introduced