nova/doc/source/admin/virtual-gpu.rst
Sylvain Bauza 8abc7b47fd Modify the mdevs in the migrate XML
Now the destination returns the list of the needed mdevs for the
migration, we can change the XML.

Note: this is the last patch of the feature branch.
I'll work on adding mtty support in the next patches in the series
but that's not a feature usage.

Change-Id: Ib448444be09df50c3db5ccda8a49bfd882c18edf
Implements: blueprint libvirt-mdev-live-migrate
2024-02-28 15:53:49 +01:00

18 KiB

Attaching virtual GPU devices to guests

Important

The functionality described below is only supported by the libvirt/KVM driver.

The virtual GPU feature in Nova allows a deployment to provide specific GPU types for instances using physical GPUs that can provide virtual devices.

For example, a single Intel GVT-g or a NVIDIA GRID vGPU physical Graphics Processing Unit (pGPU) can be virtualized as multiple virtual Graphics Processing Units (vGPUs) if the hypervisor supports the hardware driver and has the capability to create guests using those virtual devices.

This feature is highly dependent on the version of libvirt and the physical devices present on the host. In addition, the vendor's vGPU driver software must be installed and configured on the host at the same time.

Caveats are mentioned in the Caveats section.

To enable virtual GPUs, follow the steps below:

  1. Enable GPU types (Compute)
  2. Configure a flavor (Controller)

Enable GPU types (Compute)

  1. Specify which specific GPU type(s) the instances would get.

    Edit :oslo.configdevices.enabled_mdev_types:

    [devices]
    enabled_mdev_types = nvidia-35

    If you want to support more than a single GPU type, you need to provide a separate configuration section for each device. For example:

    [devices]
    enabled_mdev_types = nvidia-35, nvidia-36
    
    [mdev_nvidia-35]
    device_addresses = 0000:84:00.0,0000:85:00.0
    
    [mdev_nvidia-36]
    device_addresses = 0000:86:00.0

    where you have to define which physical GPUs are supported per GPU type.

    If the same PCI address is provided for two different types, nova-compute will refuse to start and issue a specific error in the logs.

    To know which specific type(s) to mention, please refer to How to discover a GPU type.

    21.0.0

    Supporting multiple GPU types is only supported by the Ussuri release and later versions.

  2. Restart the nova-compute service.

    Warning

    Changing the type is possible but since existing physical GPUs can't address multiple guests having different types, that will make Nova return you a NoValidHost if existing instances with the original type still exist. Accordingly, it's highly recommended to instead deploy the new type to new compute nodes that don't already have workloads and rebuild instances on the nodes that need to change types.

Configure a flavor (Controller)

Configure a flavor to request one virtual GPU:

$ openstack flavor set vgpu_1 --property "resources:VGPU=1"

Note

As of the Queens release, all hypervisors that support virtual GPUs only accept a single virtual GPU per instance.

The enabled vGPU types on the compute hosts are not exposed to API users. Flavors configured for vGPU support can be tied to host aggregates as a means to properly schedule those flavors onto the compute hosts that support them. See /admin/aggregates for more information.

Create instances with virtual GPU devices

The nova-scheduler selects a destination host that has vGPU devices available by calling the Placement API for a specific VGPU resource class provided by compute nodes.

$ openstack server create --flavor vgpu_1 --image cirros-0.3.5-x86_64-uec --wait test-vgpu

How to discover a GPU type

Virtual GPUs are seen as mediated devices. Physical PCI devices (the graphic card here) supporting virtual GPUs propose mediated device (mdev) types. Since mediated devices are supported by the Linux kernel through sysfs files after installing the vendor's virtual GPUs driver software, you can see the required properties as follows:

$ ls /sys/class/mdev_bus/*/mdev_supported_types
/sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45

/sys/class/mdev_bus/0000:85:00.0/mdev_supported_types:
nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45

/sys/class/mdev_bus/0000:86:00.0/mdev_supported_types:
nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45

/sys/class/mdev_bus/0000:87:00.0/mdev_supported_types:
nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45

Checking allocations and inventories for virtual GPUs

Note

The information below is only valid from the 19.0.0 Stein release. Before this release, inventories and allocations related to a VGPU resource class are still on the root resource provider related to the compute node. If upgrading from Rocky and using the libvirt driver, VGPU inventory and allocations are moved to child resource providers that represent actual physical GPUs.

The examples you will see are using the osc-placement plugin for OpenStackClient. For details on specific commands, see its documentation.

  1. Get the list of resource providers

    $ openstack resource provider list
    +--------------------------------------+---------------------------------------------------------+------------+
    | uuid                                 | name                                                    | generation |
    +--------------------------------------+---------------------------------------------------------+------------+
    | 5958a366-3cad-416a-a2c9-cfbb5a472287 | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx                  |          7 |
    | fc9b9287-ef5e-4408-aced-d5577560160c | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_86_00_0 |          2 |
    | e2f8607b-0683-4141-a8af-f5e20682e28c | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_85_00_0 |          3 |
    | 85dd4837-76f9-41f2-9f19-df386017d8a0 | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_87_00_0 |          2 |
    | 7033d860-8d8a-4963-8555-0aa902a08653 | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_84_00_0 |          2 |
    +--------------------------------------+---------------------------------------------------------+------------+

    In this example, we see the root resource provider 5958a366-3cad-416a-a2c9-cfbb5a472287 with four other resource providers that are its children and where each of them corresponds to a single physical GPU.

  2. Check the inventory of each resource provider to see resource classes

    $ openstack resource provider inventory list 5958a366-3cad-416a-a2c9-cfbb5a472287
    +----------------+------------------+----------+----------+-----------+----------+-------+
    | resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
    +----------------+------------------+----------+----------+-----------+----------+-------+
    | VCPU           |             16.0 |       48 |        0 |         1 |        1 |    48 |
    | MEMORY_MB      |              1.5 |    65442 |      512 |         1 |        1 | 65442 |
    | DISK_GB        |              1.0 |       49 |        0 |         1 |        1 |    49 |
    +----------------+------------------+----------+----------+-----------+----------+-------+
    $ openstack resource provider inventory list e2f8607b-0683-4141-a8af-f5e20682e28c
    +----------------+------------------+----------+----------+-----------+----------+-------+
    | resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
    +----------------+------------------+----------+----------+-----------+----------+-------+
    | VGPU           |              1.0 |       16 |        0 |         1 |        1 |    16 |
    +----------------+------------------+----------+----------+-----------+----------+-------+

    Here you can see a VGPU inventory on the child resource provider while other resource class inventories are still located on the root resource provider.

  3. Check allocations for each server that is using virtual GPUs

    $ openstack server list
    +--------------------------------------+-------+--------+---------------------------------------------------------+--------------------------+--------+
    | ID                                   | Name  | Status | Networks                                                | Image                    | Flavor |
    +--------------------------------------+-------+--------+---------------------------------------------------------+--------------------------+--------+
    | 5294f726-33d5-472a-bef1-9e19bb41626d | vgpu2 | ACTIVE | private=10.0.0.14, fd45:cdad:c431:0:f816:3eff:fe78:a748 | cirros-0.4.0-x86_64-disk | vgpu   |
    | a6811fc2-cec8-4f1d-baea-e2c6339a9697 | vgpu1 | ACTIVE | private=10.0.0.34, fd45:cdad:c431:0:f816:3eff:fe54:cc8f | cirros-0.4.0-x86_64-disk | vgpu   |
    +--------------------------------------+-------+--------+---------------------------------------------------------+--------------------------+--------+
    
    $ openstack resource provider allocation show 5294f726-33d5-472a-bef1-9e19bb41626d
    +--------------------------------------+------------+------------------------------------------------+
    | resource_provider                    | generation | resources                                      |
    +--------------------------------------+------------+------------------------------------------------+
    | 5958a366-3cad-416a-a2c9-cfbb5a472287 |          8 | {u'VCPU': 1, u'MEMORY_MB': 512, u'DISK_GB': 1} |
    | 7033d860-8d8a-4963-8555-0aa902a08653 |          3 | {u'VGPU': 1}                                   |
    +--------------------------------------+------------+------------------------------------------------+
    
    $ openstack resource provider allocation show a6811fc2-cec8-4f1d-baea-e2c6339a9697
    +--------------------------------------+------------+------------------------------------------------+
    | resource_provider                    | generation | resources                                      |
    +--------------------------------------+------------+------------------------------------------------+
    | e2f8607b-0683-4141-a8af-f5e20682e28c |          3 | {u'VGPU': 1}                                   |
    | 5958a366-3cad-416a-a2c9-cfbb5a472287 |          8 | {u'VCPU': 1, u'MEMORY_MB': 512, u'DISK_GB': 1} |
    +--------------------------------------+------------+------------------------------------------------+

    In this example, two servers were created using a flavor asking for 1 VGPU, so when looking at the allocations for each consumer UUID (which is the server UUID), you can see that VGPU allocation is against the child resource provider while other allocations are for the root resource provider. Here, that means that the virtual GPU used by a6811fc2-cec8-4f1d-baea-e2c6339a9697 is actually provided by the physical GPU having the PCI ID 0000:85:00.0.

(Optional) Provide custom traits for multiple GPU types

Since operators want to support different GPU types per compute, it would be nice to have flavors asking for a specific GPU type. This is now possible using custom traits by decorating child Resource Providers that correspond to physical GPUs.

Note

Possible improvements in a future release could consist of providing automatic tagging of Resource Providers with standard traits corresponding to versioned mapping of public GPU types. For the moment, this has to be done manually.

  1. Get the list of resource providers

    See Checking allocations and inventories for virtual GPUs first for getting the list of Resource Providers that support a VGPU resource class.

  2. Define custom traits that will correspond for each to a GPU type

    $ openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_11

    In this example, we ask to create a custom trait named CUSTOM_NVIDIA_11.

  3. Add the corresponding trait to the Resource Provider matching the GPU

    $ openstack --os-placement-api-version 1.6 resource provider trait set \
        --trait CUSTOM_NVIDIA_11 e2f8607b-0683-4141-a8af-f5e20682e28c

    In this case, the trait CUSTOM_NVIDIA_11 will be added to the Resource Provider with the UUID e2f8607b-0683-4141-a8af-f5e20682e28c that corresponds to the PCI address 0000:85:00:0 as shown above.

  4. Amend the flavor to add a requested trait

    $ openstack flavor set --property trait:CUSTOM_NVIDIA_11=required vgpu_1

    In this example, we add the CUSTOM_NVIDIA_11 trait as a required information for the vgpu_1 flavor we created earlier.

    This will allow the Placement service to only return the Resource Providers matching this trait so only the GPUs that were decorated with will be checked for this flavor.

Caveats

Note

This information is correct as of the 17.0.0 Queens release. Where improvements have been made or issues fixed, they are noted per item.

  • When live-migrating an instance using vGPUs, the libvirt guest domain XML isn't updated with the new mediated device UUID to use for the target.

    29.0.0

    In the 2024.2 Caracal release, Nova now supports vGPU live-migrations. In order to do this, both the source and target compute service need to have minimum versions of libvirt-8.6.0, QEMU-8.1.0 and Linux kernel 5.18.0. You need to ensure that either you use only single common vGPU type between two computes. Where multiple mdev types are configured on the source and destination host, custom traits or custom resource classes must be configured, reported by the host and requested by the instance to make sure that the Placement API correctly returns the supported GPU using the right vGPU type for a migration. Last but not least, if you want to live-migrate nVidia mediated devices, you need to update :oslo.configlibvirt.live_migration_downtime, :oslo.configlibvirt.live_migration_downtime_steps and :oslo.configlibvirt.live_migration_downtime_delay:

    live_migration_downtime = 500000
    live_migration_downtime_steps = 3
    live_migration_downtime_delay = 3

    You can see an example of a working live-migration here__.

  • Suspending a guest that has vGPUs doesn't yet work because of a libvirt limitation (it can't hot-unplug mediated devices from a guest). Workarounds using other instance actions (like snapshotting the instance or shelving it) are recommended until libvirt gains mdev hot-unplug support. If a user attempts to suspend the instance, the libvirt driver will raise an exception that will cause the instance to be set back to ACTIVE. The suspend action in the os-instance-actions API will have an Error state.

    25.0.0

    This has been resolved in the Yoga release. See bug 1948705.

  • Resizing an instance with a new flavor that has vGPU resources doesn't allocate those vGPUs to the instance (the instance is created without vGPU resources). The proposed workaround is to rebuild the instance after resizing it. The rebuild operation allocates vGPUS to the instance.

    21.0.0

    This has been resolved in the Ussuri release. See bug 1778563.

  • Cold migrating an instance to another host will have the same problem as resize. If you want to migrate an instance, make sure to rebuild it after the migration.

    21.0.0

    This has been resolved in the Ussuri release. See bug 1778563.

  • Rescue images do not use vGPUs. An instance being rescued does not keep its vGPUs during rescue. During that time, another instance can receive those vGPUs. This is a known issue. The recommended workaround is to rebuild an instance immediately after rescue. However, rebuilding the rescued instance only helps if there are other free vGPUs on the host.

    18.0.0

    This has been resolved in the Rocky release. See bug 1762688.

For nested vGPUs:

Note

This information is correct as of the 21.0.0 Ussuri release. Where improvements have been made or issues fixed, they are noted per item.

  • If creating servers with a flavor asking for vGPUs and the user wants multi-create (i.e. say --max 2) then the scheduler could be returning a NoValidHosts exception even if each physical GPU can support at least one specific instance, if the total wanted capacity is not supported by only one physical GPU. (See bug 1874664.)

    For example, creating servers with a flavor asking for vGPUs, if two children RPs have 4 vGPU inventories each:

    • You can ask for a flavor with 2 vGPU with --max 2.
    • But you can't ask for a flavor with 4 vGPU and --max 2.