
Merge "Support different vGPU types per pGPU"

tags/21.0.0.0rc1
Zuul 3 months ago
committed by Gerrit Code Review
parent
commit
77482b05af
10 changed files with 528 additions and 64 deletions
  1. +77
    -4
      doc/source/admin/virtual-gpu.rst
  2. +2
    -0
      nova/compute/manager.py
  3. +40
    -4
      nova/conf/devices.py
  4. +4
    -0
      nova/exception.py
  5. +10
    -4
      nova/tests/functional/libvirt/test_reshape.py
  6. +34
    -0
      nova/tests/unit/conf/test_devices.py
  7. +2
    -0
      nova/tests/unit/conf_fixture.py
  8. +229
    -13
      nova/tests/unit/virt/libvirt/test_driver.py
  9. +122
    -39
      nova/virt/libvirt/driver.py
  10. +8
    -0
      releasenotes/notes/vgpu-multiple-types-2b1ded7d1cc28880.yaml

+ 77
- 4
doc/source/admin/virtual-gpu.rst

@@ -35,15 +35,33 @@ Enable GPU types (Compute)
[devices]
enabled_vgpu_types = nvidia-35

.. note::
If you want to support more than a single GPU type, you need to provide a
separate configuration section for each type. For example:

As of the Queens release, Nova only supports a single type. If more
than one vGPU type is specified (as a comma-separated list), only the
first one will be used.
.. code-block:: ini

[devices]
enabled_vgpu_types = nvidia-35, nvidia-36

[vgpu_nvidia-35]
device_addresses = 0000:84:00.0,0000:85:00.0

[vgpu_nvidia-36]
device_addresses = 0000:86:00.0

where you have to define which physical GPUs are supported per GPU type.

If the same PCI address is provided for two different types, nova-compute
will refuse to start and issue a specific error in the logs.

To know which specific type(s) to mention, please refer to `How to discover
a GPU type`_.

.. versionchanged:: 21.0.0

Multiple GPU types are only supported in the Ussuri release and later
versions.

#. Restart the ``nova-compute`` service.
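
Before restarting the service, the per-type sections described above can be sanity-checked offline. The following is only a minimal sketch using the Python standard library, not Nova's own code: the helper names ``build_pgpu_type_mapping`` and ``_split`` are hypothetical, and the sketch only models the duplicate-address rule (it does not reproduce Nova's fallback to the first type when a section is missing).

```python
import configparser

# Sample configuration taken from the documentation example above.
SAMPLE = """
[devices]
enabled_vgpu_types = nvidia-35, nvidia-36

[vgpu_nvidia-35]
device_addresses = 0000:84:00.0,0000:85:00.0

[vgpu_nvidia-36]
device_addresses = 0000:86:00.0
"""


def _split(value):
    """Split a comma-separated option value into stripped items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_pgpu_type_mapping(ini_text):
    """Return {pci_address: vgpu_type}; raise ValueError on duplicates.

    Hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    types = _split(parser.get("devices", "enabled_vgpu_types", fallback=""))
    mapping = {}
    for vgpu_type in types:
        section = "vgpu_%s" % vgpu_type
        addresses = _split(
            parser.get(section, "device_addresses", fallback=""))
        for address in addresses:
            if address in mapping:
                # Same physical GPU listed for two different types.
                raise ValueError(
                    "duplicate types for PCI ID %s" % address)
            mapping[address] = vgpu_type
    return mapping


if __name__ == "__main__":
    for address, vgpu_type in sorted(build_pgpu_type_mapping(SAMPLE).items()):
        print(address, vgpu_type)
```

Running the check against the real ``nova.conf`` content would flag a PCI address that appears under two types before ``nova-compute`` refuses to start for the same reason.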


@@ -269,6 +287,59 @@ OpenStackClient. For details on specific commands, see its documentation.
physical GPU having the PCI ID ``0000:85:00.0``.


(Optional) Provide custom traits for multiple GPU types
-------------------------------------------------------

When a compute node supports different GPU types, flavors may need to ask
for a specific GPU type. This is possible by decorating the child Resource
Providers that correspond to physical GPUs with custom traits.

.. note::

Possible improvements in a future release could consist of providing
automatic tagging of Resource Providers with standard traits corresponding
to versioned mapping of public GPU types. For the moment, this has to be
done manually.

#. Get the list of resource providers

See `Checking allocations and inventories for virtual GPUs`_ first for getting
the list of Resource Providers that support a ``VGPU`` resource class.

#. Define custom traits, each corresponding to a GPU type

.. code-block:: console

$ openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_11

In this example, we ask to create a custom trait named ``CUSTOM_NVIDIA_11``.

#. Add the corresponding trait to the Resource Provider matching the GPU

.. code-block:: console

$ openstack --os-placement-api-version 1.6 resource provider trait set \
--trait CUSTOM_NVIDIA_11 e2f8607b-0683-4141-a8af-f5e20682e28c

In this case, the trait ``CUSTOM_NVIDIA_11`` will be added to the Resource
Provider with the UUID ``e2f8607b-0683-4141-a8af-f5e20682e28c`` that
corresponds to the PCI address ``0000:85:00.0`` as shown above.

#. Amend the flavor to add a requested trait

.. code-block:: console

$ openstack flavor set --property trait:CUSTOM_NVIDIA_11=required vgpu_1

In this example, we add the ``CUSTOM_NVIDIA_11`` trait as a required
trait for the ``vgpu_1`` flavor we created earlier.

This will allow the Placement service to only return the Resource Providers
matching this trait, so only the GPUs that were decorated with it will be
checked for this flavor.
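
The effect of a required trait can be pictured with a toy model. This is only an illustrative sketch, not how the Placement service is implemented: the function ``providers_matching`` and the second provider name are hypothetical, and only the first UUID comes from the example above.

```python
def providers_matching(providers, required_traits):
    """Return the provider IDs whose traits include every required trait.

    ``providers`` maps a resource provider ID to its set of traits.
    Hypothetical helper for illustration only.
    """
    required = set(required_traits)
    return [
        provider_id
        for provider_id, traits in providers.items()
        if required <= set(traits)
    ]


# The first provider was decorated with the custom trait; the second one
# (a made-up name) was not.
providers = {
    "e2f8607b-0683-4141-a8af-f5e20682e28c": {"CUSTOM_NVIDIA_11"},
    "hypothetical-undecorated-provider": set(),
}

if __name__ == "__main__":
    print(providers_matching(providers, ["CUSTOM_NVIDIA_11"]))
```

A flavor without the ``trait:CUSTOM_NVIDIA_11=required`` property corresponds to an empty list of required traits, in which case every provider matches.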


Caveats
-------

@@ -324,6 +395,8 @@ For XenServer:
resize. If you want to migrate an instance, make sure to rebuild it after the
migration.

* Multiple GPU types per compute node are not supported by the XenServer driver.

.. [#] https://bugs.launchpad.net/nova/+bug/1762688

.. Links


+ 2
- 0
nova/compute/manager.py

@@ -1384,6 +1384,8 @@ class ComputeManager(manager.Manager):
whitelist.Whitelist(CONF.pci.passthrough_whitelist)

nova.conf.neutron.register_dynamic_opts(CONF)
# Even if only libvirt uses them, make it available for all drivers
nova.conf.devices.register_dynamic_opts(CONF)

# Override the number of concurrent disk operations allowed if the
# user has specified a limit.


+ 40
- 4
nova/conf/devices.py

@@ -24,12 +24,33 @@ The vGPU types enabled in the compute node.

Some pGPUs (e.g. NVIDIA GRID K1) support different vGPU types. User can use
this option to specify a list of enabled vGPU types that may be assigned to a
guest instance. But please note that Nova only supports a single type in the
Queens release. If more than one vGPU type is specified (as a comma-separated
list), only the first one will be used. An example is as the following::
guest instance.

If more than one vGPU type is provided, then for each *vGPU type* an
additional section, ``[vgpu_$(VGPU_TYPE)]``, must be added to the configuration
file. Each section then **must** be configured with a single configuration
option, ``device_addresses``, which should be a list of PCI addresses
corresponding to the physical GPU(s) to assign to this type.

If one or more sections are missing (meaning that a specific type should not
be used for at least one physical GPU) or if no device addresses are provided,
then Nova will only use the first type listed in
``[devices]/enabled_vgpu_types``.

If the same PCI address is provided for two different types, nova-compute will
raise an InvalidLibvirtGPUConfig exception at restart.

For example::

[devices]
enabled_vgpu_types = GRID K100,Intel GVT-g,MxGPU.2,nvidia-11
enabled_vgpu_types = nvidia-35, nvidia-36

[vgpu_nvidia-35]
device_addresses = 0000:84:00.0,0000:85:00.0

[vgpu_nvidia-36]
device_addresses = 0000:86:00.0

""")
]

@@ -39,5 +60,20 @@ def register_opts(conf):
conf.register_opts(vgpu_opts, group=devices_group)


def register_dynamic_opts(conf):
"""Register dynamically-generated options and groups.

This must be called by the service that wishes to use the options **after**
the initial configuration has been loaded.
"""
opt = cfg.ListOpt('device_addresses', default=[],
item_type=cfg.types.String())

# Register the '[vgpu_$(VGPU_TYPE)]/device_addresses' opts, implicitly
# registering the '[vgpu_$(VGPU_TYPE)]' groups in the process
for vgpu_type in conf.devices.enabled_vgpu_types:
conf.register_opt(opt, group='vgpu_%s' % vgpu_type)


def list_opts():
return {devices_group: vgpu_opts}

+ 4
- 0
nova/exception.py

@@ -2302,3 +2302,7 @@ class DeviceProfileError(NovaException):

class AcceleratorRequestOpFailed(NovaException):
msg_fmt = _("Failed to %(op)s accelerator requests: %(msg)s")


class InvalidLibvirtGPUConfig(NovaException):
msg_fmt = _('Invalid configuration for GPU devices: %(reason)s')

+ 10
- 4
nova/tests/functional/libvirt/test_reshape.py

@@ -87,6 +87,16 @@ class VGPUReshapeTests(base.ServersTestBase):
'/resource_providers/%s/inventories' % compute_rp_uuid,
inventories)

# enabled vgpu support
self.flags(
enabled_vgpu_types=fakelibvirt.NVIDIA_11_VGPU_TYPE,
group='devices')
# We don't want to restart the compute service or it would call for
# a reshape but we still want to accept some vGPU types so we call
# directly the needed method
self.compute.driver.supported_vgpu_types = (
self.compute.driver._get_supported_vgpu_types())

# now we boot two servers with vgpu
extra_spec = {"resources:VGPU": 1}
flavor_id = self._create_flavor(extra_spec=extra_spec)
@@ -139,10 +149,6 @@ class VGPUReshapeTests(base.ServersTestBase):
{'DISK_GB': 20, 'MEMORY_MB': 2048, 'VCPU': 2, 'VGPU': 1},
allocations[compute_rp_uuid]['resources'])

# enabled vgpu support
self.flags(
enabled_vgpu_types=fakelibvirt.NVIDIA_11_VGPU_TYPE,
group='devices')
# restart compute which will trigger a reshape
self.compute = self.restart_compute_service(self.compute)



+ 34
- 0
nova/tests/unit/conf/test_devices.py

@@ -0,0 +1,34 @@
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

import nova.conf
from nova import test


CONF = nova.conf.CONF


class DevicesConfTestCase(test.NoDBTestCase):

def test_register_dynamic_opts(self):
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')

self.assertNotIn('vgpu_nvidia-11', CONF)
self.assertNotIn('vgpu_nvidia-12', CONF)

nova.conf.devices.register_dynamic_opts(CONF)

self.assertIn('vgpu_nvidia-11', CONF)
self.assertIn('vgpu_nvidia-12', CONF)
self.assertEqual([], getattr(CONF, 'vgpu_nvidia-11').device_addresses)
self.assertEqual([], getattr(CONF, 'vgpu_nvidia-12').device_addresses)

+ 2
- 0
nova/tests/unit/conf_fixture.py

@@ -17,6 +17,7 @@
from oslo_config import fixture as config_fixture
from oslo_policy import opts as policy_opts

from nova.conf import devices
from nova.conf import neutron
from nova.conf import paths
from nova import config
@@ -64,3 +65,4 @@ class ConfFixture(config_fixture.Config):
init_rpc=False)
policy_opts.set_defaults(self.conf)
neutron.register_dynamic_opts(self.conf)
devices.register_dynamic_opts(self.conf)

+ 229
- 13
nova/tests/unit/virt/libvirt/test_driver.py

@@ -23275,7 +23275,8 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
'._get_mediated_devices')
@mock.patch('nova.virt.libvirt.driver.LibvirtDriver'
'._get_mdev_capable_devices')
def test_get_gpu_inventories(self, get_mdev_capable_devs,
def _test_get_gpu_inventories(self, drvr, expected, expected_types,
get_mdev_capable_devs,
get_mediated_devices):
get_mdev_capable_devs.return_value = [
{"dev_id": "pci_0000_06_00_0",
@@ -23290,6 +23291,9 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
"types": {'nvidia-11': {'availableInstances': 7,
'name': 'GRID M60-0B',
'deviceAPI': 'vfio-pci'},
'nvidia-12': {'availableInstances': 10,
'name': 'GRID M60-8Q',
'deviceAPI': 'vfio-pci'},
}
},
]
@@ -23303,13 +23307,19 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
'parent': "pci_0000_07_00_0",
'type': 'nvidia-11',
'iommu_group': 1}]
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)

self.assertEqual(expected, drvr._get_gpu_inventories())
get_mdev_capable_devs.assert_called_once_with(types=expected_types)
get_mediated_devices.assert_called_once_with(types=expected_types)

def test_get_gpu_inventories_with_a_single_type(self):
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
# If the operator doesn't provide GPU types
self.assertEqual({}, drvr._get_gpu_inventories())

# Now, set a specific GPU type
# Now, set a specific GPU type and restart the driver
self.flags(enabled_vgpu_types=['nvidia-11'], group='devices')
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
expected = {
# the first GPU also has one mdev allocated against it
'pci_0000_06_00_0': {'total': 15 + 1,
@@ -23328,9 +23338,158 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
'allocation_ratio': 1.0,
},
}
self.assertEqual(expected, drvr._get_gpu_inventories())
get_mdev_capable_devs.assert_called_once_with(types=['nvidia-11'])
get_mediated_devices.assert_called_once_with(types=['nvidia-11'])
self._test_get_gpu_inventories(drvr, expected, ['nvidia-11'])

@mock.patch('nova.virt.libvirt.driver.LibvirtDriver'
'._get_mdev_capable_devices')
def test_get_gpu_inventories_with_two_types(self, get_mdev_capable_devs):
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:06:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:07:00.0'], group='vgpu_nvidia-12')
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
expected = {
# the first GPU supports nvidia-11 and has one mdev with this type
'pci_0000_06_00_0': {'total': 15 + 1,
'max_unit': 15 + 1,
'min_unit': 1,
'step_size': 1,
'reserved': 0,
'allocation_ratio': 1.0,
},
# the second GPU supports nvidia-12 but the existing mdev is not
# using this type, so we only count the availableInstances value
# for nvidia-12.
'pci_0000_07_00_0': {'total': 10,
'max_unit': 10,
'min_unit': 1,
'step_size': 1,
'reserved': 0,
'allocation_ratio': 1.0,
},
}
self._test_get_gpu_inventories(drvr, expected, ['nvidia-11',
'nvidia-12'])

@mock.patch.object(libvirt_driver.LOG, 'warning')
def test_get_supported_vgpu_types(self, mock_warning):
# Verify that by default we don't support vGPU types
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
self.assertEqual([], drvr._get_supported_vgpu_types())

# Now, provide only one supported vGPU type
self.flags(enabled_vgpu_types=['nvidia-11'], group='devices')
self.assertEqual(['nvidia-11'], drvr._get_supported_vgpu_types())
# Given we only support one vGPU type, we don't have any map for PCI
# devices *yet*
self.assertEqual({}, drvr.pgpu_type_mapping)
# Since the operator wanted to only support one type, it's fine to not
# provide config groups
mock_warning.assert_not_called()
# For further checking
mock_warning.reset_mock()

# Now two types without forgetting to provide the pGPU addresses
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-11')
self.assertEqual(['nvidia-11'], drvr._get_supported_vgpu_types())
self.assertEqual({}, drvr.pgpu_type_mapping)
msg = ("The vGPU type '%(type)s' was listed in '[devices] "
"enabled_vgpu_types' but no corresponding "
"'[vgpu_%(type)s]' group or "
"'[vgpu_%(type)s] device_addresses' "
"option was defined. Only the first type '%(ftype)s' "
"will be used." % {'type': 'nvidia-12',
'ftype': 'nvidia-11'})
mock_warning.assert_called_once_with(msg)
# For further checking
mock_warning.reset_mock()

# And now do it correctly !
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:85:00.0'], group='vgpu_nvidia-12')
self.assertEqual(['nvidia-11', 'nvidia-12'],
drvr._get_supported_vgpu_types())
self.assertEqual({'0000:84:00.0': 'nvidia-11',
'0000:85:00.0': 'nvidia-12'}, drvr.pgpu_type_mapping)
mock_warning.assert_not_called()

def test_get_supported_vgpu_types_with_duplicate_types(self):
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
# Provide the same pGPU PCI ID for two different types
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-12')
self.assertRaises(exception.InvalidLibvirtGPUConfig,
libvirt_driver.LibvirtDriver,
fake.FakeVirtAPI(), False)

def test_get_supported_vgpu_types_with_invalid_pci_address(self):
self.flags(enabled_vgpu_types=['nvidia-11'], group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
# Fat-finger the PCI address
self.flags(device_addresses=['whoops'], group='vgpu_nvidia-11')
self.assertRaises(exception.InvalidLibvirtGPUConfig,
libvirt_driver.LibvirtDriver,
fake.FakeVirtAPI(), False)

def test_get_vgpu_type_per_pgpu(self):
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
device = 'pci_0000_84_00_0'
self.assertIsNone(drvr._get_vgpu_type_per_pgpu(device))

# By default, we return the first type if we only support one.
self.flags(enabled_vgpu_types=['nvidia-11'], group='devices')
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
self.assertEqual('nvidia-11', drvr._get_vgpu_type_per_pgpu(device))

# Now, make sure we provide the right vGPU type for the device
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:85:00.0'], group='vgpu_nvidia-12')
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
# the libvirt name pci_0000_84_00_0 matches 0000:84:00.0
self.assertEqual('nvidia-11', drvr._get_vgpu_type_per_pgpu(device))

def test_get_vgpu_type_per_pgpu_with_incorrect_pci_addr(self):
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:85:00.0'], group='vgpu_nvidia-12')
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
# 'whoops' is not a correct libvirt name corresponding to a PCI address
self.assertIsNone(drvr._get_vgpu_type_per_pgpu('whoops'))

def test_get_vgpu_type_per_pgpu_with_unconfigured_pgpu(self):
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:84:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:85:00.0'], group='vgpu_nvidia-12')
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
# 0000:86:00.0 wasn't configured
self.assertIsNone(drvr._get_vgpu_type_per_pgpu('pci_0000_86_00_0'))

@mock.patch.object(host.Host, 'device_lookup_by_name')
@mock.patch.object(host.Host, 'list_mdev_capable_devices')
@@ -23528,7 +23687,13 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
unallocated_mdevs,
get_mdev_capable_devs,
privsep_create_mdev):
self.flags(enabled_vgpu_types=['nvidia-11'], group='devices')
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:06:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:07:00.0'], group='vgpu_nvidia-12')
allocations = {
uuids.rp1: {
'resources': {
@@ -23540,7 +23705,12 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
get_mdev_capable_devs.return_value = [
{"dev_id": "pci_0000_06_00_0",
"vendor_id": 0x10de,
"types": {'nvidia-11': {'availableInstances': 16,
# This pGPU can support both types but the operator only wanted
# to use nvidia-11 for it.
"types": {'nvidia-10': {'availableInstances': 16,
'name': 'GRID M60-8Q',
'deviceAPI': 'vfio-pci'},
'nvidia-11': {'availableInstances': 16,
'name': 'GRID M60-0B',
'deviceAPI': 'vfio-pci'},
}
@@ -23621,11 +23791,13 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
"pGPU device name %(name)s can't be guessed from the ProviderTree "
"roots %(roots)s", {'name': 'oops_I_did_it_again', 'roots': 'cn'})

@mock.patch.object(libvirt_driver.LibvirtDriver, '_get_vgpu_type_per_pgpu')
@mock.patch.object(libvirt_driver.LibvirtDriver, '_get_mediated_devices')
@mock.patch.object(libvirt_driver.LibvirtDriver,
'_get_all_assigned_mediated_devices')
def test_get_existing_mdevs_not_assigned(self, get_all_assigned_mdevs,
get_mediated_devices):
get_mediated_devices,
get_vgpu_type_per_pgpu):
# mdev2 is assigned to instance1
get_all_assigned_mdevs.return_value = {uuids.mdev2: uuids.inst1}
# there is a total of 2 mdevs, mdev1 and mdev2
@@ -23638,10 +23810,22 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
'uuid': uuids.mdev2,
'parent': "pci_some",
'type': 'nvidia-11',
'iommu_group': 1}]
'iommu_group': 1},
{'dev_id': 'mdev_some_uuid3',
'uuid': uuids.mdev3,
'parent': "pci_some",
'type': 'nvidia-12',
'iommu_group': 1},
]

def _fake_get_vgpu_type_per_pgpu(parent_addr):
# Always return the same vGPU type so we avoid mdev3
return 'nvidia-11'
get_vgpu_type_per_pgpu.side_effect = _fake_get_vgpu_type_per_pgpu

drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
# Since mdev2 is assigned to inst1, only mdev1 is available
# Since mdev3 type is not supported and mdev2 is assigned to inst1,
# only mdev1 is available
self.assertEqual(set([uuids.mdev1]),
drvr._get_existing_mdevs_not_assigned(parent=None))

@@ -23658,7 +23842,13 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
def test_recreate_mediated_device_on_init_host(
self, get_all_assigned_mdevs, exists, mock_get_mdev_info,
get_mdev_capable_devs, privsep_create_mdev):
self.flags(enabled_vgpu_types=['nvidia-11'], group='devices')
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
# we need to call the below again to ensure the updated
# 'device_addresses' value is read and the new groups created
nova.conf.devices.register_dynamic_opts(CONF)
self.flags(device_addresses=['0000:06:00.0'], group='vgpu_nvidia-11')
self.flags(device_addresses=['0000:07:00.0'], group='vgpu_nvidia-12')
get_all_assigned_mdevs.return_value = {uuids.mdev1: uuids.inst1,
uuids.mdev2: uuids.inst2}

@@ -23675,7 +23865,7 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):
exists.side_effect = _exists
mock_get_mdev_info.side_effect = [
{"dev_id": "mdev_fake",
"uuid": uuids.mdev1,
"uuid": uuids.mdev2,
"parent": "pci_0000_06_00_0",
"type": "nvidia-11",
"iommu_group": 12
@@ -23691,9 +23881,35 @@ class LibvirtDriverTestCase(test.NoDBTestCase, TraitsComparisonMixin):

drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), True)
drvr.init_host(host='foo')
# Only mdev2 will be recreated as mdev1 already exists.
privsep_create_mdev.assert_called_once_with(
"0000:06:00.0", 'nvidia-11', uuid=uuids.mdev2)

@mock.patch('nova.virt.libvirt.driver.LibvirtDriver.'
'_get_mediated_device_information')
@mock.patch.object(os.path, 'exists')
@mock.patch.object(libvirt_driver.LibvirtDriver,
'_get_all_assigned_mediated_devices')
def test_recreate_mediated_device_on_init_host_with_wrong_config(
self, get_all_assigned_mdevs, exists, mock_get_mdev_info):
self.flags(enabled_vgpu_types=['nvidia-11', 'nvidia-12'],
group='devices')
get_all_assigned_mdevs.return_value = {uuids.mdev1: uuids.inst1}
# We pretend this mdev doesn't exist hence it needs recreation
exists.return_value = False
mock_get_mdev_info.side_effect = [
{"dev_id": "mdev_fake",
"uuid": uuids.mdev1,
"parent": "pci_0000_06_00_0",
"type": "nvidia-99",
"iommu_group": 12
}]
drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), True)
# mdev1 was originally created for nvidia-99 but the operator messed up
# the configuration by removing this type, we want to hardstop.
self.assertRaises(exception.InvalidLibvirtGPUConfig,
drvr.init_host, host='foo')

@mock.patch.object(libvirt_guest.Guest, 'detach_device')
def _test_detach_mediated_devices(self, side_effect, detach_device):



+ 122
- 39
nova/virt/libvirt/driver.py

@@ -429,6 +429,10 @@ class LibvirtDriver(driver.ComputeDriver):
self._vpmems_by_name, self._vpmems_by_rc = self._discover_vpmems(
vpmem_conf=CONF.libvirt.pmem_namespaces)

# We default to not support vGPUs unless the configuration is set.
self.pgpu_type_mapping = collections.defaultdict(str)
self.supported_vgpu_types = self._get_supported_vgpu_types()

def _discover_vpmems(self, vpmem_conf=None):
"""Discover vpmems on host and configuration.

@@ -791,6 +795,21 @@ class LibvirtDriver(driver.ComputeDriver):
dev_name = libvirt_utils.mdev_uuid2name(mdev_uuid)
dev_info = self._get_mediated_device_information(dev_name)
parent = dev_info['parent']
parent_type = self._get_vgpu_type_per_pgpu(parent)
if dev_info['type'] != parent_type:
# NOTE(sbauza): The mdev was created by using a different
# vGPU type. We can't recreate the mdev until the operator
# modifies the configuration.
parent = "{}:{}:{}.{}".format(*parent[4:].split('_'))
msg = ("The instance UUID %(inst)s uses a VGPU that "
"its parent pGPU %(parent)s no longer "
"supports as the instance vGPU type %(type)s "
"is not accepted for the pGPU. Please correct "
"the configuration accordingly." %
{'inst': instance_uuid,
'parent': parent,
'type': dev_info['type']})
raise exception.InvalidLibvirtGPUConfig(reason=msg)
self._create_new_mediated_device(parent, uuid=mdev_uuid)

def _set_multiattach_support(self):
@@ -6481,12 +6500,80 @@ class LibvirtDriver(driver.ComputeDriver):
def _get_supported_vgpu_types(self):
if not CONF.devices.enabled_vgpu_types:
return []
# TODO(sbauza): Move this check up to compute_manager.init_host
if len(CONF.devices.enabled_vgpu_types) > 1:
LOG.warning('libvirt only supports one GPU type per compute node,'
' only first type will be used.')
requested_types = CONF.devices.enabled_vgpu_types[:1]
return requested_types

for vgpu_type in CONF.devices.enabled_vgpu_types:
group = getattr(CONF, 'vgpu_%s' % vgpu_type, None)
if group is None or not group.device_addresses:
first_type = CONF.devices.enabled_vgpu_types[0]
if len(CONF.devices.enabled_vgpu_types) > 1:
# Only provide the warning if the operator provided more
# than one type as it's not needed to provide groups
# if you only use one vGPU type.
msg = ("The vGPU type '%(type)s' was listed in '[devices] "
"enabled_vgpu_types' but no corresponding "
"'[vgpu_%(type)s]' group or "
"'[vgpu_%(type)s] device_addresses' "
"option was defined. Only the first type "
"'%(ftype)s' will be used." % {'type': vgpu_type,
'ftype': first_type})
LOG.warning(msg)
# Reset the mapping table, which may already hold entries from
# previously processed vGPU types: since this vGPU type is
# misconfigured, we only want to support the first type.
self.pgpu_type_mapping.clear()
return [first_type]
for device_address in group.device_addresses:
if device_address in self.pgpu_type_mapping:
raise exception.InvalidLibvirtGPUConfig(
reason="duplicate types for PCI ID %s" % device_address
)
# Just checking whether the operator fat-fingered the address.
# If it's wrong, it will raise an exception
try:
pci_utils.parse_address(device_address)
except exception.PciDeviceWrongAddressFormat:
raise exception.InvalidLibvirtGPUConfig(
reason="incorrect PCI address: %s" % device_address
)
self.pgpu_type_mapping[device_address] = vgpu_type
return CONF.devices.enabled_vgpu_types

def _get_vgpu_type_per_pgpu(self, device_address):
"""Provides the vGPU type the pGPU supports.

:param device_address: the libvirt PCI device name,
e.g. 'pci_0000_84_00_0'
"""
# Bail out quickly if we don't support vGPUs
if not self.supported_vgpu_types:
return

if len(self.supported_vgpu_types) == 1:
# The operator wanted to only support one single type so we can
# blindly return it for every single pGPU
return self.supported_vgpu_types[0]
# The libvirt name is like 'pci_0000_84_00_0'
try:
device_address = "{}:{}:{}.{}".format(
*device_address[4:].split('_'))
# Validates whether it's a PCI ID...
pci_utils.parse_address(device_address)
# .format() can raise IndexError
except (exception.PciDeviceWrongAddressFormat, IndexError):
# this is not a valid PCI address
LOG.warning("The PCI address %s was invalid for getting the "
"related vGPU type", device_address)
return
vgpu_type = self.pgpu_type_mapping.get(device_address)
if vgpu_type is None:
# dict.get() never raises KeyError, so test the result instead.
LOG.warning("No vGPU type was configured for PCI address: %s",
device_address)
# We accept to return None instead of raising an exception
# because we prefer the callers to return the existing exceptions
# in case we can't find a specific pGPU
return vgpu_type

def _count_mediated_devices(self, enabled_vgpu_types):
"""Counts the sysfs objects (handles) that represent a mediated device
@@ -6502,6 +6589,12 @@ class LibvirtDriver(driver.ComputeDriver):
counts_per_parent = collections.defaultdict(int)
mediated_devices = self._get_mediated_devices(types=enabled_vgpu_types)
for mdev in mediated_devices:
parent_vgpu_type = self._get_vgpu_type_per_pgpu(mdev['parent'])
if mdev['type'] != parent_vgpu_type:
# Even if some mdevs were created for another vGPU type, only
# count the mdevs whose type matches the one configured for
# their pGPU
continue
counts_per_parent[mdev['parent']] += 1
return counts_per_parent

@@ -6520,10 +6613,13 @@ class LibvirtDriver(driver.ComputeDriver):
# dev_id is the libvirt name for the PCI device,
# eg. pci_0000_84_00_0 which matches a PCI address of 0000:84:00.0
dev_name = dev['dev_id']
dev_supported_type = self._get_vgpu_type_per_pgpu(dev_name)
for _type in dev['types']:
if _type != dev_supported_type:
# This is not the type the operator wanted to support for
# this physical GPU
continue
available = dev['types'][_type]['availableInstances']
# TODO(sbauza): Once we support multiple types, check which
# PCI devices are set for this type
# NOTE(sbauza): Even if we support multiple types, Nova will
# only use one per physical GPU.
counts_per_dev[dev_name] += available
@@ -6546,7 +6642,7 @@ class LibvirtDriver(driver.ComputeDriver):
"""

# Bail out early if operator doesn't care about providing vGPUs
enabled_vgpu_types = self._get_supported_vgpu_types()
enabled_vgpu_types = self.supported_vgpu_types
if not enabled_vgpu_types:
return {}
inventories = {}
@@ -6907,6 +7003,11 @@ class LibvirtDriver(driver.ComputeDriver):
mdevs = self._get_mediated_devices(requested_types)
available_mdevs = set()
for mdev in mdevs:
parent_vgpu_type = self._get_vgpu_type_per_pgpu(mdev['parent'])
if mdev['type'] != parent_vgpu_type:
# This mdev is using a vGPU type that is not supported by the
# configuration option for its pGPU parent, so we can't use it.
continue
# FIXME(sbauza): No longer accept the parent value to be nullable
# once we fix the reshape functional test
if parent is None or mdev['parent'] == parent:
@@ -6924,7 +7025,7 @@ class LibvirtDriver(driver.ComputeDriver):

:returns: the newly created mdev UUID or None if not possible
"""
supported_types = self._get_supported_vgpu_types()
supported_types = self.supported_vgpu_types
# Try to see if we can still create a new mediated device
devices = self._get_mdev_capable_devices(supported_types)
for device in devices:
@@ -6935,19 +7036,15 @@ class LibvirtDriver(driver.ComputeDriver):
# The device is not the one that was called, not creating
# the mdev
continue
# For the moment, the libvirt driver only supports one
# type per host
# TODO(sbauza): Once we support more than one type, make
# sure we lookup which type the parent pGPU supports.
asked_type = supported_types[0]
if device['types'][asked_type]['availableInstances'] > 0:
dev_supported_type = self._get_vgpu_type_per_pgpu(dev_name)
if dev_supported_type and device['types'][
dev_supported_type]['availableInstances'] > 0:
# That physical GPU has enough room for a new mdev
# We need the PCI address, not the libvirt name
# The libvirt name is like 'pci_0000_84_00_0'
pci_addr = "{}:{}:{}.{}".format(*dev_name[4:].split('_'))
chosen_mdev = nova.privsep.libvirt.create_mdev(pci_addr,
asked_type,
uuid=uuid)
chosen_mdev = nova.privsep.libvirt.create_mdev(
pci_addr, dev_supported_type, uuid=uuid)
return chosen_mdev

@utils.synchronized(VGPU_RESOURCE_SEMAPHORE)
@@ -6968,12 +7065,8 @@ class LibvirtDriver(driver.ComputeDriver):
vgpu_allocations = self._vgpu_allocations(allocations)
if not vgpu_allocations:
return
# TODO(sbauza): Once we have nested resource providers, find which one
# is having the related allocation for the specific VGPU type.
# For the moment, we should only have one allocation for
# ResourceProvider.
# TODO(sbauza): Iterate over all the allocations once we have
# nested Resource Providers. For the moment, just take the first.
# TODO(sbauza): For the moment, we only support allocations for only
# one pGPU.
if len(vgpu_allocations) > 1:
LOG.warning('More than one allocation was passed over to libvirt '
'while at the moment libvirt only supports one. Only '
@@ -7022,7 +7115,7 @@ class LibvirtDriver(driver.ComputeDriver):
raise exception.ComputeResourcesUnavailable(
reason='vGPU resource is not available')

supported_types = self._get_supported_vgpu_types()
supported_types = self.supported_vgpu_types
# Which mediated devices are created but not assigned to a guest ?
mdevs_available = self._get_existing_mdevs_not_assigned(
parent_device, supported_types)
@@ -7383,12 +7476,9 @@ class LibvirtDriver(driver.ComputeDriver):
'reserved': self._get_reserved_host_disk_gb_from_config(),
}

# NOTE(sbauza): For the moment, the libvirt driver only supports
# providing the total number of virtual GPUs for a single GPU type. If
# you have multiple physical GPUs, each of them providing multiple GPU
# types, only one type will be used for each of the physical GPUs.
# If one of the pGPUs doesn't support this type, it won't be used.
# TODO(sbauza): Use traits to make a better world.
# TODO(sbauza): Use traits to provide vGPU types. For the moment,
# this is only covered by documentation explaining how to use
# osc-placement to create custom traits for each of the pGPU RPs.
self._update_provider_tree_for_vgpu(
provider_tree, nodename, allocations=allocations)

@@ -7527,13 +7617,6 @@ class LibvirtDriver(driver.ComputeDriver):
representing that resource provider in the tree
"""
# Create the VGPU child providers if they do not already exist.
# TODO(mriedem): For the moment, _get_supported_vgpu_types() only
# returns one single type but that will be changed once we support
# multiple types.
# Note that we can't support multiple vgpu types until a reshape has
# been performed on the vgpu resources provided by the root provider,
# if any.

# Dict of PGPU RPs keyed by their libvirt PCI name
pgpu_rps = {}
for pgpu_dev_id, inventory in inventories_dict.items():
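
The per-pGPU lookup introduced in ``nova/virt/libvirt/driver.py`` above can be illustrated on its own. The sketch below is a simplified, standalone approximation and not the driver code: the function names are hypothetical, and a regular expression stands in for Nova's ``pci_utils.parse_address`` validation, which is an assumption made to keep the example free of Nova imports.

```python
import re

# Libvirt node device names look like 'pci_0000_84_00_0', which corresponds
# to the PCI address '0000:84:00.0'. This pattern is an assumption used in
# place of Nova's own address validation.
_LIBVIRT_PCI_NAME = re.compile(
    r"^pci_([0-9a-fA-F]{4})_([0-9a-fA-F]{2})_([0-9a-fA-F]{2})_([0-7])$")


def libvirt_name_to_pci_address(name):
    """Convert a libvirt PCI device name to a PCI address, or None."""
    match = _LIBVIRT_PCI_NAME.match(name)
    if match is None:
        return None
    return "{}:{}:{}.{}".format(*match.groups())


def vgpu_type_for_pgpu(name, supported_types, pgpu_type_mapping):
    """Return the vGPU type configured for a pGPU, or None.

    Hypothetical standalone version of the lookup logic.
    """
    if not supported_types:
        # vGPUs are not enabled at all.
        return None
    if len(supported_types) == 1:
        # A single type applies to every physical GPU.
        return supported_types[0]
    address = libvirt_name_to_pci_address(name)
    if address is None:
        # Not a valid libvirt PCI device name.
        return None
    # None when the pGPU was not listed under any type.
    return pgpu_type_mapping.get(address)


if __name__ == "__main__":
    mapping = {"0000:84:00.0": "nvidia-11", "0000:85:00.0": "nvidia-12"}
    types = ["nvidia-11", "nvidia-12"]
    print(vgpu_type_for_pgpu("pci_0000_84_00_0", types, mapping))
```

The cases mirror the unit tests in this change: no enabled type, a single type, a correctly configured pGPU, an invalid device name, and a pGPU that was not configured.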


+ 8
- 0
releasenotes/notes/vgpu-multiple-types-2b1ded7d1cc28880.yaml

@@ -0,0 +1,8 @@
---
features:
- |
The libvirt driver now supports defining different virtual GPU types for
each physical GPU. See the ``[devices]/enabled_vgpu_types`` configuration
option for details on how to configure this. Please refer to
https://docs.openstack.org/nova/latest/admin/virtual-gpu.html for further
documentation.
