SR-IOV NIC device tracking in Placement
This change introduces a spec to describe modeling of SR-IOV NIC device using Placement.

Blueprint: track-sriov-nics-in-placement
Change-Id: Idb188abde2781827a35c354db34be66b100251e1
parent e0ef37d5d0
commit 8dbd287c0d
@@ -0,0 +1,401 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=================================
SR-IOV NICs Tracking In Placement
=================================

https://blueprints.launchpad.net/nova/+spec/track-sriov-nics-in-placement

In the Zed and 2023.1 (Antelope) releases support was added for tracking
PCI devices that did not contain the ``physical_network`` tag in Placement.
This enables generic PCI devices that are consumed via flavor-based PCI
passthrough to be tracked in Placement. PCI devices that are consumed
via Neutron ports, however, are not tracked in Placement. This spec aims to
address that gap and enable tracking of Neutron-managed PCI devices.


Problem description
===================

Nova has supported generic stateless PCI passthrough for many releases using a
dedicated PCI tracker in conjunction with a ``PciPassthroughFilter`` scheduler
post filter.

The PCI tracker is responsible for tracking which PCI devices are available,
claimed, and allocated, the capabilities of each device, its consumer when
claimed or allocated, as well as the type and location of the PCI device.

The ``PciPassthroughFilter`` is responsible for ensuring that the devices
requested by the VM exist on a host during scheduling. These PCI requests come
from two sources: flavor-based PCI requests that are generated using the
``pci_passthrough:alias`` `flavor extra specs`_ and Neutron-based PCI requests
generated from SR-IOV backed Neutron ports.

.. _`flavor extra specs`: https://docs.openstack.org/nova/latest/configuration/extra-specs.html#pci_passthrough:alias

Currently Nova has the capability to model the availability of flavor-managed
PCI devices in Placement but lacks the same capability for devices consumed via
Neutron ports. All requests for VMs with SR-IOV, VDPA, hardware
offloaded OVS or DPU ports rely on the ``PciPassthroughFilter`` to select
hosts. While the current approach to SR-IOV NIC tracking works, there are some
limitations in the current design and there is room for optimization.

.. rubric:: Limitations

* The current implementation is slow: every candidate host returned by
  Placement still has to be checked by the ``PciPassthroughFilter``, and
  requests racing for the last PCI device on a host cause reschedules.

* While Nova today tracks the capabilities of network interfaces in the
  ``extra_info`` field of the ``pci_devices`` table and the
  ``PciPassthroughFilter`` could match on those capabilities, there is no
  user-facing way to express a request for an SR-IOV Neutron port with a
  specific network capability, e.g. TSO.

.. rubric:: Optimizations

* Use Placement to track SR-IOV NICs.


Use Cases
---------

- As an operator, I want to use the Placement API to track the used SR-IOV
  resources for Neutron-managed ports.

- As an operator, I want to schedule my VMs to the correct PF/VF even if
  multiple device choices are available on the host.

- As an operator, I want to associate quotas with SR-IOV Neutron ports.

.. note::

   Device quotas would require unified limits to be implemented. Implementing
   quotas is out of the scope of this spec beyond enabling the use case by
   modeling PCI devices in Placement.

This spec will also only focus on Neutron SR-IOV ports.

Proposed change
===============


PCI device_spec configuration
-----------------------------

Several Neutron port ``vnic_type`` values (e.g. ``direct``,
``direct-physical``, ``vdpa``) require the port to be backed by a VF or PF
PCI device.

In the simple case, when a port only requires a PCI device but does not require
any other resources (e.g. bandwidth), Nova needs to create a Placement
request group for each Neutron port by extending the prefilter introduced in
the generic PCI in Placement implementation. Compared to the PCI alias case, in
the SR-IOV NIC case neither the name of the resource class nor the vendor ID,
product ID pair is known at scheduling time. Therefore, the prefilter does not
know what resource class needs to be requested in the Placement request group.

To resolve this, PCI devices that are intended to be used for Neutron-based
SR-IOV, hardware offloaded OVS or VDPA should not use the ``resource_class``
tag in the ``[PCI]device_spec``. Instead Nova will use standard resource
classes to model these resources. The resource classes are explained in a
later section.

Today Nova allows consuming ``type-PCI`` or ``type-VF`` devices for ``direct``
ports. This exists mostly for historical reasons and should be cleaned up.


Modeling SR-IOV devices in Placement
------------------------------------

PCI device modeling in Placement is already implemented. Each PCI device of
type ``type-PCI`` and ``type-PF`` will be modeled as a Placement resource
provider (RP) with the name ``<hypervisor_hostname>_<pci_address>``. The
``hypervisor_hostname`` prefix will be the same string as the name of the root
RP. The ``pci_address`` part of the name will be the full PCI address in the
``DDDD:BB:AA.FF`` format.
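
The naming rule can be illustrated with a short sketch. The helper name
``pci_rp_name`` and the address validation are assumptions made for this
example and are not part of Nova; the check accepts the usual hexadecimal
``domain:bus:slot.function`` form.

.. code-block:: python

    import re

    # Hypothetical validation of a full PCI address such as 0000:81:00.0.
    _PCI_ADDRESS_RE = re.compile(
        r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]{1,2}$"
    )


    def pci_rp_name(hypervisor_hostname, pci_address):
        """Return the RP name ``<hypervisor_hostname>_<pci_address>``."""
        if not _PCI_ADDRESS_RE.match(pci_address):
            raise ValueError("not a full PCI address: %r" % pci_address)
        return "%s_%s" % (hypervisor_hostname, pci_address)

For example, ``pci_rp_name("compute1", "0000:81:00.0")`` gives
``compute1_0000:81:00.0``.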

Each SR-IOV NIC device RP will have an inventory of a resource class and traits
derived by Nova based on the device categorization explained below:

* A device in the ``device_spec`` will be consumable only via PCI alias
  if it does not have a ``physical_network`` tag attached.

* A device that has a ``physical_network`` tag attached will be considered a
  network device and it will be modelled as a ``PCI_NETDEV`` resource.

* A device that has a ``physical_network`` tag and also has the capability to
  provide VFs will have the trait ``HW_NIC_SRIOV`` but still use the
  ``PCI_NETDEV`` resource class.

* A device that has a ``physical_network`` tag and is a VF will be modelled
  as a ``SRIOV_NET_VF`` resource.

* A device that has a ``physical_network`` tag and is a VF with a vdpa device
  will be modelled as a ``VDPA_NETDEV`` resource.


The actual implementation will reuse the existing logic that determines
the ``device_type`` to also determine which resource class to use, rather than
implementing this in the ``device_spec`` parsing. PCI placement tracking
will only take effect if the ``[pci]report_in_placement`` config option
is set to ``True``.
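
The categorization rules listed above could be expressed roughly as follows.
This is only a sketch of the decision logic, not the Nova implementation: the
``Device`` class, its ``has_vdpa`` flag and the ``classify`` function are
hypothetical names introduced for illustration.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass(frozen=True)
    class Device:
        """Hypothetical, simplified view of a device matched by device_spec."""
        dev_type: str                         # "type-PCI", "type-PF" or "type-VF"
        physical_network: Optional[str] = None
        has_vdpa: bool = False                # assumed flag, relevant for VFs


    def classify(dev):
        """Return (resource_class, traits), or None for alias-only devices."""
        if dev.physical_network is None:
            # No physical_network tag: consumable only via PCI alias.
            return None
        if dev.dev_type == "type-VF":
            rc = "VDPA_NETDEV" if dev.has_vdpa else "SRIOV_NET_VF"
            return rc, frozenset()
        # PFs are VF-capable network devices and get the extra trait.
        traits = {"HW_NIC_SRIOV"} if dev.dev_type == "type-PF" else set()
        return "PCI_NETDEV", frozenset(traits)

With these assumptions, a ``type-PF`` device tagged with a physnet maps to
``PCI_NETDEV`` with the ``HW_NIC_SRIOV`` trait, while an untagged device is
left to the PCI alias path.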

Every Neutron ``vnic_type`` can be mapped to one single resource class by
Nova. The following ``vnic_type`` -> resource class mapping is suggested:

* ``direct-physical`` -> ``PCI_NETDEV``
* ``direct``, ``macvtap``, ``virtio-forwarder``, ``remote-managed`` ->
  ``SRIOV_NET_VF``
* ``vdpa`` -> ``VDPA_NETDEV``

Nova will use these resource classes to report device inventories to
Placement. Then the prefilter can translate the ``vnic_type`` of the ports to
request the specific resource class during scheduling.
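
As an illustration, the suggested mapping fits in a simple lookup table. The
names ``VNIC_TYPE_TO_RESOURCE_CLASS`` and ``resource_class_for_port`` are
hypothetical; the sketch assumes the port is a dict as returned by the Neutron
API, with the ``binding:vnic_type`` key.

.. code-block:: python

    # Suggested vnic_type -> resource class mapping from this spec.
    VNIC_TYPE_TO_RESOURCE_CLASS = {
        "direct-physical": "PCI_NETDEV",
        "direct": "SRIOV_NET_VF",
        "macvtap": "SRIOV_NET_VF",
        "virtio-forwarder": "SRIOV_NET_VF",
        "remote-managed": "SRIOV_NET_VF",
        "vdpa": "VDPA_NETDEV",
    }


    def resource_class_for_port(port):
        """Return the resource class to request, or None if no PCI device
        is needed for this port (e.g. vnic_type ``normal``)."""
        vnic_type = port.get("binding:vnic_type", "normal")
        return VNIC_TYPE_TO_RESOURCE_CLASS.get(vnic_type)

A port without a PCI-backed ``vnic_type`` yields ``None``, so the prefilter
would not add a request group for it.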

Another specific aspect of Neutron-based SR-IOV is that the devices listed in
the ``device_spec`` always have a ``physical_network`` tag. This information
needs to be reported as a trait on the PF RP in Placement. Also, the port's
requested physnet needs to be included in the Placement request group by
the prefilter. If an SR-IOV device matches a ``device_spec`` entry with a
``physical_network`` tag then an inventory of 1 is reported for the
resource class derived by Nova.
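
A request group for one port could then look like the sketch below. The trait
name format ``CUSTOM_PHYSNET_<NAME>`` is an assumption (this spec does not
define it), and the helper names are hypothetical. The ``resources<suffix>``
and ``required<suffix>`` keys follow the granular request group syntax of the
Placement ``GET /allocation_candidates`` API.

.. code-block:: python

    import re
    from urllib.parse import urlencode


    def physnet_trait(physnet):
        """Assumed trait naming; custom traits only allow A-Z, 0-9 and _."""
        return "CUSTOM_PHYSNET_" + re.sub(r"[^A-Z0-9_]", "_", physnet.upper())


    def port_request_group(suffix, resource_class, physnet):
        """Build the query parameters of one named request group."""
        return {
            "resources%s" % suffix: "%s:1" % resource_class,
            "required%s" % suffix: physnet_trait(physnet),
        }


    params = port_request_group("_port1", "SRIOV_NET_VF", "physnet-a")
    query = urlencode(params, safe=":")

Here ``query`` is
``resources_port1=SRIOV_NET_VF:1&required_port1=CUSTOM_PHYSNET_PHYSNET_A``.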

There is a more complex case when the Neutron port not only requests a PCI
device but also requests additional resources (e.g. bandwidth) via the port
``resource_request`` attribute. Supporting this is currently out of scope
of this spec but is intended to be supported in the future.

Nova will detect and refuse to boot an instance with an SR-IOV type port that
contains additional resource requests. This will be done by returning an
HTTP 409 response until support for this is added. Attaching SR-IOV ports with
additional resources will also be detected and rejected.
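
The rejection check might be sketched as follows. The exception class, the
set of PCI-backed ``vnic_type`` values and the function name are hypothetical;
the sketch only assumes that the port dict carries the ``binding:vnic_type``
and ``resource_request`` attributes.

.. code-block:: python

    class PortResourceRequestNotSupported(Exception):
        """Hypothetical error that the API layer would turn into HTTP 409."""
        code = 409


    # vnic_types that this spec maps to a PCI resource class.
    PCI_BACKED_VNIC_TYPES = frozenset({
        "direct", "direct-physical", "macvtap",
        "virtio-forwarder", "remote-managed", "vdpa",
    })


    def validate_port(port):
        """Reject SR-IOV type ports that also carry a resource_request."""
        is_pci_backed = port.get("binding:vnic_type") in PCI_BACKED_VNIC_TYPES
        if is_pci_backed and port.get("resource_request"):
            raise PortResourceRequestNotSupported(
                "port %s requests additional resources" % port.get("id"))

The same check would apply both at boot time and when a port is attached to
an existing instance.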


Neutron SR-IOV ports with QoS (out of scope)
--------------------------------------------

When a Neutron port requests additional resources, Nova generates Placement
request groups from the ``resource_request`` and, as in the simple case, will
generate a request group from the PCI request. The request groups
of a Neutron port need to be correlated to ensure that a port gets
the PCI device and the bandwidth from the same physical device. However,
today the bandwidth is modeled under the Neutron RP subtree while PCI devices
will be modeled right under the root RP. So the two RPs to allocate from are
not within the same subtree. (Note that Placement always fulfills a named
request group from a single RP but allows correlating such request groups
within the same subtree.) We have multiple options here:

* Create a scheduler filter that removes allocation candidates where these
  request groups are fulfilled from different physical devices.

* Report the bandwidth and the PCI device resource on the same RP. This breaks
  the clear ownership of a single RP as the bandwidth is reported by the
  Neutron agent while the PCI device is reported by Nova.

* Move the two RPs (bandwidth and PCI dev) into the same subtree. This
  needs an agreement between Nova and Neutron devs on where to move the RPs and
  needs an extra reshape to implement the move.

* Enhance Placement to allow sharing of resources between RPs within the same
  RP tree. By that, we could make the bandwidth RP a sharing RP that shares
  resources with the PCI device RP representing the physical device.

To enable incremental progress with a minimum of dependencies, the preferred
short term solution is either to enhance the existing
``PciPassthroughFilter`` to remove allocation candidates where these request
groups are fulfilled from different physical devices, or to add a new scheduler
filter that removes such allocation candidates.
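
The filtering idea can be sketched in a few lines. Everything here is
hypothetical: allocation candidates are reduced to a mapping of request group
suffix to RP UUID, and ``rp_to_device`` stands for whatever lookup would tell
which physical device an RP represents. It is not the Placement or Nova data
model.

.. code-block:: python

    def same_device_candidates(candidates, rp_to_device, correlated_groups):
        """Keep candidates whose correlated groups hit one physical device.

        candidates: list of dicts, request group suffix -> RP UUID.
        rp_to_device: dict, RP UUID -> physical device identifier.
        correlated_groups: list of (pci_group, bandwidth_group) suffix pairs,
            one pair per Neutron port.
        """
        kept = []
        for candidate in candidates:
            if all(
                rp_to_device[candidate[pci_group]]
                == rp_to_device[candidate[bw_group]]
                for pci_group, bw_group in correlated_groups
            ):
                kept.append(candidate)
        return kept

A candidate that takes the VF from one PF and the bandwidth from the RP of a
different PF would be dropped, while a candidate that uses matching RPs is
kept.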


Requesting PCI devices
----------------------

Nova will continue using the ``InstancePCIRequest`` to track the requested
SR-IOV NIC devices for a VM.


VM lifecycle operations
-----------------------

The initial scheduling is very similar to the later scheduling done due to
move operations. So, the existing implementation can be reused. Also, the
current logic that switches the source node Placement allocation to be held by
the migration UUID can be reused.

Attaching and detaching PCI devices will continue to be supported via Neutron
SR-IOV ports.


Alternatives
------------

* We could keep using the legacy tracking with all its good and bad properties.

* We could have Nova create the resource providers for the SR-IOV devices under
  the Neutron SR-IOV NIC agent resource provider. However, Nova does not know
  which network backend is in use and this would create a startup dependency
  loop between Nova and Neutron. Neutron needs the compute RP to be created
  before it can create the SR-IOV NIC agent RP, and Nova would need the agent
  RP to finish reporting the PCI devices.

* We could defer the Nova support for tracking SR-IOV devices in Placement
  until we reshape how bandwidth is tracked or we develop a new Placement
  feature to correlate the relevant RPs.


Data model impact
-----------------

The ``InstancePCIRequest`` object will be extended to include the required and
forbidden traits and the resource class generated by Nova.


REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

In general, this is expected to improve the scheduling performance but
should have no runtime performance impact on guests.

The introduction of new ``RequestGroup`` objects will make the computation
of the Placement query slightly longer, and the resulting execution time may
increase for instances with SR-IOV NIC requests, but there should be no effect
for instances without such requests. This added complexity is expected to be
offset by offloading the filtering to Placement and by removing
reschedules due to racing for the last PCI device on a host, so the overall
performance is expected to improve.

Other deployer impact
---------------------

To utilize the new feature the operator will have to set two config
options: one to enable the reporting of the PCI devices to Placement and a
second to enable the Placement scheduling logic.

Developer impact
----------------

None

Upgrade impact
--------------

The new Placement-based PCI tracking will be disabled by default. Deployments
already using PCI devices can freely upgrade to the new Nova version without
any impact. In this state the PCI device management will be done by the
``PciPassthroughFilter`` in the scheduler and the PCI claim in the PCI device
tracker in the compute service, the same as in the previous version of Nova.
Then after the upgrade the new PCI device tracking can be enabled in two
phases.

First, the PCI inventory reporting needs to be enabled by
``[pci]report_in_placement`` on each compute host. During the startup of the
nova-compute service with the ``[pci]report_in_placement = True`` config the
service will do the reshape of the provider tree and start reporting PCI device
inventory to Placement. Nova compute will also heal the PCI allocation of the
existing instances in Placement. This healing will also be done for new
instances with PCI requests until a future release where the prefilter is
enabled by default. This is needed to keep the resource usage in sync in
Placement even if the instance scheduling is done without the prefilter
requesting PCI allocations in Placement.


.. note::

   Once ``[pci]report_in_placement`` is enabled for a compute host it cannot be
   disabled any more.

Second, after every compute has been configured to report PCI inventories to
Placement, the scheduling logic needs to be enabled in the nova-scheduler
configuration via the ``[filter_scheduler]pci_in_placement`` configuration
option.


Implementation
==============

Assignee(s)
-----------


Primary assignee:
  kpawar


Feature Liaison
---------------


Feature liaison:
  balazs-gibizer


Work Items
----------

* Translate ``InstancePCIRequest`` objects to ``RequestGroup`` objects in the
  ``RequestSpec``.
* Support adding resource class and required traits to SR-IOV NIC requests.
* Filter and reject SR-IOV NIC requests with additional resource requests both
  during VM creation and SR-IOV port attachment.

Dependencies
============

The unified limits feature exists in an opt-in, experimental state and will
allow defining limits for the new PCI resources if enabled.


Testing
=======

As this is a PCI passthrough related feature it cannot be tested in upstream
tempest. Testing will be primarily done via the extensive unit and functional
test suites that exist for instances with PCI devices.


Documentation Impact
====================

The PCI passthrough doc will have to be rewritten to document the new
``resource_class`` and ``trait`` tags for SR-IOV NIC devices.


References
==========

* _`CPU resource tracking spec`: https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
* _`Unified Limits Integration in Nova`: https://specs.openstack.org/openstack/nova-specs/specs/ussuri/approved/unified-limits-nova.html
* _`Support virtual GPU resources`: https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html
* _`PCI device tracking in Placement`: https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/pci-device-tracking-in-placement.html

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2023.2 Bobcat
     - Introduced