Merge "Re-proposes multiple vGPU types in libvirt"
This commit is contained in:
commit
bf34619cc0
243
specs/stein/approved/vgpu-stein.rst
Normal file
243
specs/stein/approved/vgpu-stein.rst
Normal file
@ -0,0 +1,243 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=========================================================
|
||||
libvirt: Supporting multiple vGPU types for a single pGPU
|
||||
=========================================================
|
||||
|
||||
https://blueprints.launchpad.net/nova/+spec/vgpu-rocky
|
||||
|
||||
`Virtual GPUs in Nova`_ was implemented in Queens but only with one supported
|
||||
GPU type per compute node. Now that `Nested Resource Providers`_ is a thing,
|
||||
this spec is discussing about how to have one supported vGPU type *per physical
|
||||
GPU* (given a GPU card can have multiple physical GPUs).
|
||||
|
||||
.. note::
|
||||
As Xen provides a specific feature where physical GPUs supporting same vGPU
|
||||
type are within a single pGPU group, that virt driver doesn't need to know
|
||||
which exact pGPUs need to support a specific type, hence this spec only
|
||||
targets the libvirt driver.
|
||||
|
||||
.. _`Virtual GPUs in Nova`: https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html
|
||||
.. _`Nested Resource Providers`: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/nested-resource-providers.html
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Hardware vendors tell us that a physical GPU device can support multiple types.
|
||||
That said, Intel (to be confirmed) and NVidia vendor drivers only accept one
|
||||
type for all the virtual devices *per graphical processing unit*.
|
||||
|
||||
For example, NVidia GRID physical cards can accept a list of different GPU
|
||||
types, but the driver can only support `one type per physical GPU`_.
|
||||
|
||||
.. figure:: http://docs.nvidia.com/grid/5.0/grid-vgpu-user-guide/graphics/sample-vgpu-configurations-grid-2gpus-on-card.png
|
||||
|
||||
.. _`one type per physical GPU`: http://docs.nvidia.com/grid/5.0/grid-vgpu-user-guide/index.html#homogeneous-grid-vgpus
|
||||
|
||||
Consequently, we require a way to instruct the libvirt driver which vGPU types
|
||||
an NVIDIA or Intel physical GPU is configured to accept.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
An operator needs a way to inform the libvirt driver which vGPU types an
|
||||
NVIDIA or Intel physical GPU is configured to accept.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
We already have ``[devices]/enabled_vgpu_types`` that define which types the
|
||||
Nova compute node can use:
|
||||
|
||||
.. code::
|
||||
|
||||
[devices]
|
||||
enabled_vgpu_types = [str_vgpu_type_1, str_vgpu_type_2, ...]
|
||||
|
||||
Now we propose that libvirt will accept configuration sections that are related
|
||||
to the [devices]/enabled_vgpu_types and specifies which exact pGPUs are related
|
||||
to the enabled vGPU types and will have a ``device_addresses`` option defined
|
||||
like this:
|
||||
|
||||
.. code::
|
||||
|
||||
cfg.ListOpt('device_addresses',
|
||||
default=[],
|
||||
help="""
|
||||
List of physical PCI addresses to associate with a specific GPU type.
|
||||
|
||||
The particular physical GPU device address needs to be mapped to the vendor
|
||||
vGPU type which that physical GPU is configured to accept. In order to
|
||||
provide this mapping, there will be a CONF section with a name corresponding
|
||||
to the following template: "vgpu_type_%(vgpu_type_name)s
|
||||
|
||||
The vGPU type to associate with the PCI devices has to be the section name
|
||||
prefixed by ``vgpu_``. For example, for 'nvidia-11', you would declare
|
||||
``[vgpu_nvidia-11]/device_addresses``.
|
||||
|
||||
Each vGPU type also has to be declared in ``[devices]/enabled_vgpu_types``.
|
||||
|
||||
Related options:
|
||||
|
||||
* ``[devices]/enabled_vgpu_types``
|
||||
"""),
|
||||
|
||||
For example, it would be set in nova.conf:
|
||||
|
||||
.. code::
|
||||
|
||||
[devices]
|
||||
enabled_vgpu_types = nvidia-35,nvidia-36
|
||||
[vgpu_nvidia-35]
|
||||
device_addresses = 0000:84:00.0,0000:85:00.0
|
||||
[vgpu_nvidia-36]
|
||||
device_addresses = 0000:86:00.0
|
||||
|
||||
|
||||
In that case, the ``nvidia-35`` vGPU type would be supported by the physical
|
||||
GPUs that are in the PCI addresses ``0000:84:00.0`` and ``0000:85:00.0``, while
|
||||
``nvidia-36`` vGPU would only be supported by ``0000:86:00.0``.
|
||||
|
||||
If some operator messes up and provides two types for the same pGPU, an
|
||||
InvalidLibvirtGPUConfig exception will be raised. If the operator forgets to
|
||||
provide a type for a specific pGPU, then the first type given in
|
||||
``enabled_vgpu_types`` will be supported, like the existing situation.
|
||||
If the operator fat-fingers the PCI IDs, then when creating the inventory, it
|
||||
will return an exception.
|
||||
|
||||
|
||||
As one single compute could now support multiple vGPU types, asking operators
|
||||
to provide host aggregates for grouping computes having the same vGPU type
|
||||
becomes irrelevant. Instead, we need to ask operators to amend their flavors
|
||||
for specific GPU capabilities if they care of such things, or Placement will
|
||||
just randomly pick one of the available vGPU types.
|
||||
For this, we propose to standardize GPU capabilities that are unfortunately
|
||||
very vendor specific (eg. a CUDA library version support) by having a
|
||||
nova.virt.vgpu_capabilities module that would translate a vendor-specific vGPU
|
||||
type into a set of os-traits traits.
|
||||
If operators want vendor-specific traits, it's their responsibility to provide
|
||||
custom traits on the resource providers or ask the community to find a standard
|
||||
trait that would fit their needs.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Instead of creating only one inventory per pGPU, we could try to have children
|
||||
resource providers for a pGPU being GPU types. But then, once an instance
|
||||
would create a vGPU, all the other inventories but the one related to the used
|
||||
type would be having a total=0.
|
||||
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Operators need to either look at the sysfs (for libvirt) for knowing the
|
||||
existing pGPUs and which types are supported.
|
||||
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Upgrade impact
|
||||
--------------
|
||||
|
||||
Support for multiple types won't be supported by the libvirt driver until
|
||||
reshaping existing single vGPU type into separate nested provider is fully
|
||||
done.
|
||||
Consequently, the recommended upgrade path is to first upgrade to Stein with a
|
||||
single supported vGPU type and only later modify the nova-compute service for
|
||||
asking for more types.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
bauzas
|
||||
|
||||
Other contributors:
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Create the config option
|
||||
* Modify the libvirt virt driver code to make use of that option for creating
|
||||
the nested Resource Provider inventories.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Classic unittests and functional tests.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
A release note will be added with a 'feature' section, and the
|
||||
`Virtual GPU`_ documentation will be modified to explain the new feature.
|
||||
|
||||
.. _`Virtual GPU`: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None.
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- Approved
|
||||
* - Stein
|
||||
- Reproposed
|
Loading…
Reference in New Issue
Block a user