SR-IOV NIC device tracking in Placement

This change introduces a spec to describe modeling of
SR-IOV NIC devices using Placement.

Blueprint: track-sriov-nics-in-placement
Change-Id: Idb188abde2781827a35c354db34be66b100251e1

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=================================
SR-IOV NICs Tracking In Placement
=================================

https://blueprints.launchpad.net/nova/+spec/track-sriov-nics-in-placement

In the zed and 2023.1 (antelope) releases support was added for tracking
PCI devices that do not contain the ``physical_network`` tag in Placement.
This enables generic PCI devices that are consumed via flavor based PCI
passthrough to be tracked in Placement. PCI devices that are consumed
via Neutron ports, however, are not tracked in Placement. This spec aims to
address that gap and enable tracking of Neutron managed PCI devices.

Problem description
===================

Nova has supported generic stateless PCI passthrough for many releases using a
dedicated PCI tracker in conjunction with the ``PciPassthroughFilter``
scheduler filter.

The PCI tracker is responsible for tracking which PCI devices are available,
claimed, and allocated, the capabilities of each device, its consumer when
claimed or allocated, as well as the type and location of the PCI device.

The ``PciPassthroughFilter`` is responsible for ensuring that the devices
requested by the VM exist on a host during scheduling. These PCI requests come
from two sources: flavor-based PCI requests that are generated using the
``pci_passthrough:alias`` `flavor extra specs`_ and Neutron based PCI requests
generated from SR-IOV backed Neutron ports.

.. _`flavor extra specs`: https://docs.openstack.org/nova/latest/configuration/extra-specs.html#pci_passthrough:alias

Currently Nova has the capability to model the availability of flavor managed
PCI devices in Placement but lacks the same capability for devices consumed via
Neutron ports. All instances with SR-IOV, VDPA, hardware offloaded OVS or DPU
ports rely on the ``PciPassthroughFilter`` to select hosts. While the current
approach to SR-IOV NIC tracking works, there are some limitations in the
current design and there is room for optimization.

.. rubric:: Limitations

* The current implementation is functional but slow.
* While Nova today tracks the capabilities of network interfaces in the
  ``extra_info`` field of the ``pci_devices`` table, and the
  ``PciPassthroughFilter`` could match on those capabilities, there is no
  user-facing way to express a request for an SR-IOV Neutron port with a
  specific network capability, e.g. TSO.

.. rubric:: Optimizations

* Use Placement to track SR-IOV NICs.

Use Cases
---------

- As an operator, I want to use the Placement API to track the used SR-IOV
  resources for Neutron managed ports.
- As an operator, I want to schedule my VMs to the correct PF/VF even if
  multiple device choices are available on the host.
- As an operator, I want to associate quotas with SR-IOV Neutron ports.

.. note::

   Device quotas would require unified limits to be implemented. Implementing
   quotas is out of the scope of this spec beyond enabling the use case by
   modeling PCI devices in Placement.

This spec will also focus only on Neutron SR-IOV ports.

Proposed change
===============

PCI device_spec configuration
-----------------------------

There is a list of Neutron port ``vnic_type`` values (e.g. ``direct``,
``direct-physical``, ``vdpa``, etc.) where the port needs to be backed by a VF
or PF PCI device.

In a simple case, when a port only requires a PCI device but does not require
any other resources (e.g. bandwidth), Nova needs to create Placement
request groups for each Neutron port by extending the prefilter introduced in
the generic PCI in Placement implementation. Compared to the PCI alias case,
in the SR-IOV NIC case neither the name of the resource class nor the vendor
ID, product ID pair is known at scheduling time. Therefore, the prefilter does
not know what resource class needs to be requested in the Placement request
group.

To resolve this, PCI devices that are intended to be used for Neutron based
SR-IOV, hardware offloaded OVS or VDPA should not use the ``resource_class``
tag in the ``[pci]device_spec``. Instead Nova will use standard resource
classes to model these resources. The resource classes are explained in a
later section.

Today Nova allows consuming ``type-PCI`` or ``type-VF`` devices for ``direct``
ports. This is mostly there due to historical reasons and it should be cleaned
up.
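
For illustration, a ``device_spec`` entry for such a Neutron managed device
carries the ``physical_network`` tag and omits the ``resource_class`` tag. The
PCI address, vendor ID and product ID below are hypothetical examples:

.. code-block:: ini

   [pci]
   device_spec = { "vendor_id": "8086", "product_id": "1572", "address": "0000:81:00.0", "physical_network": "physnet0" }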

Modeling SR-IOV devices in Placement
------------------------------------

PCI device modeling in Placement is already implemented. Each PCI device of
type ``type-PCI`` and ``type-PF`` will be modeled as a Placement resource
provider (RP) with the name ``<hypervisor_hostname>_<pci_address>``. The
``hypervisor_hostname`` prefix will be the same string as the name of the root
RP. The ``pci_address`` part of the name will be the full PCI address in the
``<domain>:<bus>:<slot>.<function>`` format.
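
For example, assuming a hypothetical compute node named ``compute-1`` and a PF
at PCI address ``0000:81:00.0``, the resulting RP name would be::

   compute-1_0000:81:00.0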

Each SR-IOV NIC device RP will have an inventory of a resource class and traits
derived by Nova based on the device categorization explained below:

* A device in the ``device_spec`` will be consumable only via a PCI alias
  if it does not have a ``physical_network`` tag attached.
* A device that has a ``physical_network`` tag attached will be considered a
  network device and it will be modeled as a ``PCI_NETDEV`` resource.
* A device that has a ``physical_network`` tag and also has the capability to
  provide VFs will have the trait ``HW_NIC_SRIOV`` but still use the
  ``PCI_NETDEV`` resource class.
* A device that has a ``physical_network`` tag and is a VF will be modeled
  as a ``SRIOV_NET_VF`` resource.
* A device that has a ``physical_network`` tag and is a VF with a vdpa device
  will be modeled as a ``VDPA_NETDEV`` resource.

The actual implementation will reuse the existing logic to determine the
device type and which resource class to use, rather than implementing this in
the ``device_spec`` parsing. PCI Placement tracking will only take effect if
the ``[pci]report_in_placement`` config option is set to ``True``.

Every Neutron ``vnic_type`` can be mapped to a single resource class by
Nova. The following ``vnic_type`` -> resource class mapping is suggested:

* ``direct-physical`` -> ``PCI_NETDEV``
* ``direct``, ``macvtap``, ``virtio-forwarder``, ``remote-managed`` ->
  ``SRIOV_NET_VF``
* ``vdpa`` -> ``VDPA_NETDEV``

Nova will use these resource classes to report device inventories to
Placement. Then the prefilter can translate the ``vnic_type`` of the ports to
request the specific resource class during scheduling, as sketched below.
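
The following minimal sketch illustrates the translation the prefilter could
perform for each Neutron SR-IOV port. The function name is hypothetical, and
the ``CUSTOM_PHYSNET_`` trait naming is an assumption used only for
illustration:

.. code-block:: python

   # Hypothetical sketch of the vnic_type -> resource class translation the
   # prefilter could perform for each Neutron SR-IOV backed port.
   VNIC_TYPE_TO_RC = {
       "direct-physical": "PCI_NETDEV",
       "direct": "SRIOV_NET_VF",
       "macvtap": "SRIOV_NET_VF",
       "virtio-forwarder": "SRIOV_NET_VF",
       "remote-managed": "SRIOV_NET_VF",
       "vdpa": "VDPA_NETDEV",
   }


   def request_group_for_port(vnic_type, physnet):
       """Return the resources and traits requested for a single port.

       The physnet is expressed as a required trait so that the allocation
       candidate comes from a device on the right physical network. The
       exact trait naming scheme is an assumption here.
       """
       return {
           "resources": {VNIC_TYPE_TO_RC[vnic_type]: 1},
           "required": ["CUSTOM_PHYSNET_" + physnet.upper()],
       }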

Another special property of Neutron-based SR-IOV is that the devices listed in
the ``device_spec`` always have a ``physical_network`` tag. This information
needs to be reported as a trait on the PF RP in Placement. Also, the port's
requested physnet needs to be included in the Placement request group by
the prefilter. If an SR-IOV device matches a ``device_spec`` entry with a
``physical_network`` tag then an inventory of 1 of the ``resource_class``
derived by Nova is reported.
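
As an illustration (the host name, PCI address, VF count and the exact physnet
trait name here are assumptions rather than part of this spec), a PF on
``physnet0`` whose VFs are consumed via Neutron ports could be reported
roughly as::

   compute-1 (root RP)
     +-- compute-1_0000:81:00.0 (PF RP)
           inventory: SRIOV_NET_VF = 8
           traits: CUSTOM_PHYSNET_PHYSNET0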

There is a more complex case when the Neutron port not only requests a PCI
device but also requests additional resources (e.g. bandwidth) via the port
``resource_request`` attribute. Supporting this is currently out of scope
of this spec but is intended to be supported in the future.

Nova will detect and refuse to boot an instance with an SR-IOV type port that
contains additional resource requests. This will be done by returning an HTTP
409 error until support for this is added. Attaching SR-IOV ports with
additional resources will also be detected and rejected.

Neutron SR-IOV ports with QoS (out of scope)
--------------------------------------------

When a Neutron port requests additional resources, Nova generates Placement
request groups from the ``resource_request`` and, as in the simple case,
generates a request group from the PCI request. These request groups of a
Neutron port need to be correlated to ensure that the port gets the PCI device
and the bandwidth from the same physical device. However, today the bandwidth
is modeled under the Neutron RP subtree while PCI devices will be modeled
directly under the root RP, so the two RPs to allocate from are not within the
same subtree. (Note that Placement always fulfills a named request group from
a single RP but allows correlating such request groups within the same
subtree.) We have multiple options here:

* Create a scheduler filter that removes allocation candidates where these
  request groups are fulfilled from different physical devices.
* Report the bandwidth and the PCI device resource on the same RP. This breaks
  the clear ownership of a single RP as the bandwidth is reported by the
  Neutron agent while the PCI device is reported by Nova.
* Move the two RPs (bandwidth and PCI dev) into the same subtree. This
  needs an agreement between Nova and Neutron devs where to move the RPs and
  needs an extra reshape to implement the move.
* Enhance Placement to allow sharing of resources between RPs within the same
  RP tree. By that, we could make the bandwidth RP a sharing RP that shares
  resources with the PCI device RP representing the physical device.

To enable forward progress with a minimum of dependencies and to allow
incremental progress, the preferred short term solution is to enhance the
existing ``PciPassthroughFilter`` to remove allocation candidates where these
request groups are fulfilled from different physical devices, or to add a new
scheduler filter that removes such candidates, as sketched below.
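
A minimal sketch of that candidate pruning follows. The data structures and
the ``same_physical_device`` helper are hypothetical and only illustrate the
idea of correlating the PCI and the bandwidth request groups of a port:

.. code-block:: python

   # Hypothetical sketch: drop allocation candidates where the PCI request
   # group and the bandwidth request group of the same port are satisfied by
   # resource providers that do not belong to the same physical device.
   def prune_candidates(candidates, pci_group, bw_group, same_physical_device):
       """Keep only candidates where both groups map to the same device.

       :param candidates: list of dicts mapping request group name -> RP uuid
       :param pci_group: name of the PCI request group of the port
       :param bw_group: name of the bandwidth request group of the port
       :param same_physical_device: callable(rp_a, rp_b) -> bool
       """
       return [
           c for c in candidates
           if same_physical_device(c[pci_group], c[bw_group])
       ]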

Requesting PCI devices
----------------------

Nova will continue using the ``InstancePCIRequest`` object to track the
requested SR-IOV NIC devices for a VM.

VM lifecycle operations
-----------------------

The initial scheduling is very similar to the later scheduling done due to
move operations, so the existing implementation can be reused. Also, the
current logic that switches the source node Placement allocation to be held by
the migration UUID can be reused.

Attaching and detaching PCI devices will continue to be supported via Neutron
SR-IOV ports.

Alternatives
------------

* We could keep using the legacy tracking with all its good and bad properties.
* We could have Nova create the resource providers for the SR-IOV devices under
  the Neutron SR-IOV NIC agent resource provider. However Nova does not know
  which network backend is in use and this would create a start up dependency
  loop between Nova and Neutron: Neutron needs the compute RP to be created
  before it can create the SR-IOV NIC agent RP, and Nova would need the agent
  RP to finish reporting the PCI devices.
* We could defer the Nova support for tracking SR-IOV devices in Placement
  until we reshape how bandwidth is tracked or we develop a new Placement
  feature to correlate the relevant RPs.

Data model impact
-----------------

The ``InstancePCIRequest`` object will be extended to include the required and
forbidden traits and the resource class generated by Nova.
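
A minimal sketch of such an extension (the field names are illustrative
assumptions, not the final object schema) could look like this:

.. code-block:: python

   from oslo_versionedobjects import fields

   # Illustrative only: possible new Placement related attributes that could
   # be added to the ``fields`` dict of the InstancePCIRequest object.
   placement_fields = {
       'resource_class': fields.StringField(nullable=True),
       'required_traits': fields.ListOfStringsField(nullable=True),
       'forbidden_traits': fields.ListOfStringsField(nullable=True),
   }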

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

In general, this is expected to improve the scheduling performance but
should have no runtime performance impact on guests.

The introduction of new ``RequestGroup`` objects will make the computation
of the Placement query slightly longer and the resulting execution time may
increase for instances with SR-IOV NIC requests, but should have no effect for
instances without such requests. This added complexity is expected to be offset
by offloading the filtering to Placement and by removing reschedules caused by
racing for the last PCI device on a host, so the overall performance is
expected to improve.

Other deployer impact
---------------------

To utilize the new feature the operator will have to set two new config
options: one to enable the Placement based scheduling logic and a second to
enable the reporting of the PCI devices to Placement.

Developer impact
----------------

None

Upgrade impact
--------------

The new Placement based PCI tracking will be disabled by default. Deployments
already using PCI devices can freely upgrade to the new Nova version without
any impact. At this stage the PCI device management will be done by the
``PciPassthroughFilter`` in the scheduler and by the PCI claim in the PCI
device tracker in the compute service, the same as in the previous version of
Nova. Then, after the upgrade, the new PCI device tracking can be enabled in
two phases.

First, the PCI inventory reporting needs to be enabled via
``[pci]report_in_placement`` on each compute host. During the startup of the
nova-compute service with the ``[pci]report_in_placement = True`` config the
service will do the reshape of the provider tree and start reporting PCI device
inventory to Placement. Nova compute will also heal the PCI allocations of the
existing instances in Placement. This healing will be done for new
instances with PCI requests until a future release where the prefilter is
enabled by default. This is needed to keep the resource usage in sync in
Placement even if the instance scheduling is done without the prefilter
requesting PCI allocations in Placement.

.. note::

   Once ``[pci]report_in_placement`` is enabled for a compute host it cannot
   be disabled any more.

Second, after every compute host has been configured to report PCI inventories
to Placement, the scheduling logic needs to be enabled in the nova-scheduler
configuration via the ``[filter_scheduler]pci_in_placement`` configuration
option.
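
To illustrate the two phases with the option names from the sections above:

.. code-block:: ini

   # Phase 1: on every nova-compute host, enable PCI inventory reporting.
   [pci]
   report_in_placement = True

   # Phase 2: once all compute hosts report PCI inventories, enable the
   # Placement based scheduling logic for nova-scheduler.
   [filter_scheduler]
   pci_in_placement = True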

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  kpawar

Feature Liaison
---------------

Feature liaison:
  balazs-gibizer

Work Items
----------

* Translate ``InstancePCIRequest`` objects to ``RequestGroup`` objects in the
  ``RequestSpec``.
* Support adding a resource class and required traits to SR-IOV NIC requests.
* Filter and reject SR-IOV NIC requests with additional resource requests both
  during VM creation and SR-IOV port attachment.

Dependencies
============

The unified limits feature exists in an opt-in, experimental state and will
allow defining limits for the new PCI resources if enabled.

Testing
=======

As this is a PCI passthrough related feature it cannot be tested in upstream
Tempest. Testing will primarily be done via the extensive unit and functional
test suites that exist for instances with PCI devices.

Documentation Impact
====================

The PCI passthrough doc will have to be rewritten to document the new
``resource_class`` and ``trait`` tags for SR-IOV NIC devices.

References
==========

* `CPU resource tracking spec <https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html>`_
* `Unified Limits Integration in Nova <https://specs.openstack.org/openstack/nova-specs/specs/ussuri/approved/unified-limits-nova.html>`_
* `Support virtual GPU resources <https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html>`_
* `PCI device tracking in Placement <https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/pci-device-tracking-in-placement.html>`_

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2023.2 Bobcat
     - Introduced