=======================================
Attaching virtual GPU devices to guests
=======================================

.. important::

   The functionality described below is only supported by the libvirt/KVM
   driver.

The virtual GPU feature in Nova allows a deployment to provide specific GPU
types for instances using physical GPUs that can provide virtual devices.

For example, a single `Intel GVT-g`_ or a `NVIDIA GRID vGPU`_ physical
Graphics Processing Unit (pGPU) can be virtualized as multiple virtual Graphics
Processing Units (vGPUs) if the hypervisor supports the hardware driver and has
the capability to create guests using those virtual devices.

This feature is highly dependent on the version of libvirt and the physical
devices present on the host. In addition, the vendor's vGPU driver software
must be installed and configured on the host at the same time.

Caveats are mentioned in the `Caveats`_ section.

To enable virtual GPUs, follow the steps below:

#. `Enable GPU types (Compute)`_

#. `Configure a flavor (Controller)`_


Enable GPU types (Compute)
--------------------------

#. For NVIDIA GPUs that support SR-IOV, enable the virtual functions.

   .. code-block:: bash

      $ /usr/lib/nvidia/sriov-manage -e domain:bus:slot.function

   For example, to enable the virtual functions for the GPU with
   domain ``0000``, bus ``41``, slot ``00``, and function ``0``:

   .. code-block:: bash

      $ /usr/lib/nvidia/sriov-manage -e 0000:41:00.0

   You may want to automate this process as it has to be done on each boot of
   the host.

   Given an example ``systemd`` template unit file named
   ``nvidia-sriov-manage@.service``:

   .. code-block:: text

      [Unit]
      After = nvidia-vgpu-mgr.service
      After = nvidia-vgpud.service
      Description = Enable Nvidia GPU virtual functions

      [Service]
      Type = oneshot
      User = root
      Group = root
      ExecStart = /usr/lib/nvidia/sriov-manage -e %i
      # Give a reasonable amount of time for the server to start up/shut down
      TimeoutSec = 120
      # This creates a specific slice which all services will operate from
      # The accounting options give us the ability to see resource usage
      # through the `systemd-cgtop` command.
      Slice = system.slice
      # Set Accounting
      CPUAccounting = True
      BlockIOAccounting = True
      MemoryAccounting = True
      TasksAccounting = True
      RemainAfterExit = True
      ExecStartPre = /usr/bin/sleep 30

      [Install]
      WantedBy = multi-user.target

   To enable the virtual functions for the GPU with domain ``0000``, bus
   ``41``, slot ``00``, and function ``0``:

   .. code-block:: bash

      $ systemctl enable nvidia-sriov-manage@0000:41:00.0.service
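
   Note that ``systemctl enable`` only registers the unit for the next boot.
   To bring the virtual functions up immediately as well, you can also start
   the unit instance (a standard ``systemd`` pattern, shown here for the same
   example device):

   .. code-block:: bash

      $ systemctl enable --now nvidia-sriov-manage@0000:41:00.0.service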

   .. note::

      This is only an example and it is important to consult the relevant
      vendor documentation for the specific devices that you have.
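
   Once the virtual functions are enabled, you can sanity-check that they
   exist by listing the ``virtfn*`` links that the kernel creates under the
   physical function's sysfs directory (a sketch, assuming the example PCI
   address above):

   .. code-block:: bash

      $ ls -d /sys/bus/pci/devices/0000:41:00.0/virtfn*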

#. Specify which GPU type(s) instances can request.

   Edit :oslo.config:option:`devices.enabled_mdev_types`:

   .. code-block:: ini

      [devices]
      enabled_mdev_types = nvidia-35

   If you want to support more than a single GPU type, you need to provide a
   separate configuration section for each device. For example:

   .. code-block:: ini

      [devices]
      enabled_mdev_types = nvidia-35, nvidia-36

      [mdev_nvidia-35]
      device_addresses = 0000:84:00.0,0000:85:00.0

      [mdev_nvidia-36]
      device_addresses = 0000:86:00.0

   where you have to define which physical GPUs are supported per GPU type.
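
   One way to find the PCI addresses to use in ``device_addresses`` is to
   list the NVIDIA devices present on the host (a sketch; ``10de`` is the
   NVIDIA PCI vendor ID, and the addresses will differ on your hardware):

   .. code-block:: bash

      $ lspci -nn -d 10de: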

   If the same PCI address is provided for two different types, nova-compute
   will refuse to start and issue a specific error in the logs.

   To find out which type(s) to enable, please refer to `How to discover a
   GPU type`_.

   .. versionchanged:: 21.0.0

      Support for multiple GPU types is only available in the Ussuri (21.0.0)
      and later releases.

#. Restart the ``nova-compute`` service.

   .. warning::

      Changing the type is possible, but since an existing physical GPU can't
      serve guests of different types at the same time, Nova will return a
      ``NoValidHost`` error as long as instances using the original type
      still exist. It is therefore highly recommended to instead deploy the
      new type on fresh compute nodes that don't already have workloads, and
      to rebuild instances on the nodes that need to change types.


Configure a flavor (Controller)
-------------------------------

Configure a flavor to request one virtual GPU:

.. code-block:: console

   $ openstack flavor set vgpu_1 --property "resources:VGPU=1"

.. note::

   As of the Queens release, all hypervisors that support virtual GPUs
   only accept a single virtual GPU per instance.

The enabled vGPU types on the compute hosts are not exposed to API users.
Flavors configured for vGPU support can be tied to host aggregates as a means
to properly schedule those flavors onto the compute hosts that support them.
See :doc:`/admin/aggregates` for more information.
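
You can verify the flavor's extra specs afterwards (a quick check using the
standard client):

.. code-block:: console

   $ openstack flavor show vgpu_1 -c properties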


Create instances with virtual GPU devices
-----------------------------------------

The ``nova-scheduler`` selects a destination host that has vGPU devices
available by calling the Placement API for a specific VGPU resource class
provided by compute nodes.

.. code-block:: console

   $ openstack server create --flavor vgpu_1 --image cirros-0.3.5-x86_64-uec --wait test-vgpu
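
Once the server is built, you can check that it is running and where it
landed (an illustrative check; the host column requires admin credentials):

.. code-block:: console

   $ openstack server show test-vgpu -c status -c OS-EXT-SRV-ATTR:host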


How to discover a GPU type
--------------------------

Virtual GPUs are seen as mediated devices. Physical PCI devices (the graphics
card here) that support virtual GPUs expose mediated device (mdev) types.
Since mediated devices are supported by the Linux kernel through sysfs files,
once the vendor's vGPU driver software is installed you can see the supported
types as follows:

.. code-block:: console

   $ ls /sys/class/mdev_bus/*/mdev_supported_types
   /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

   /sys/class/mdev_bus/0000:85:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

   /sys/class/mdev_bus/0000:86:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45

   /sys/class/mdev_bus/0000:87:00.0/mdev_supported_types:
   nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-45
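
Each type directory also contains standard mdev attributes such as ``name``,
``description`` and ``available_instances``, which can help you map a type ID
to a marketed vGPU profile (a sketch, reusing one of the example addresses
above):

.. code-block:: console

   $ cat /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types/nvidia-35/name
   $ cat /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types/nvidia-35/available_instances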


Checking allocations and inventories for virtual GPUs
-----------------------------------------------------

.. note::

   The information below is only valid starting with the 19.0.0 Stein
   release. Before this release, inventories and allocations related to a
   ``VGPU`` resource class are on the root resource provider for the compute
   node. When upgrading from Rocky with the libvirt driver, ``VGPU``
   inventory and allocations are moved to child resource providers that
   represent the actual physical GPUs.

The examples below use the `osc-placement plugin`_ for OpenStackClient. For
details on specific commands, see its documentation.

#. Get the list of resource providers

   .. code-block:: console

      $ openstack resource provider list
      +--------------------------------------+---------------------------------------------------------+------------+
      | uuid                                 | name                                                    | generation |
      +--------------------------------------+---------------------------------------------------------+------------+
      | 5958a366-3cad-416a-a2c9-cfbb5a472287 | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx                  | 7          |
      | fc9b9287-ef5e-4408-aced-d5577560160c | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_86_00_0 | 2          |
      | e2f8607b-0683-4141-a8af-f5e20682e28c | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_85_00_0 | 3          |
      | 85dd4837-76f9-41f2-9f19-df386017d8a0 | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_87_00_0 | 2          |
      | 7033d860-8d8a-4963-8555-0aa902a08653 | virtlab606.xxxxxxxxxxxxxxxxxxxxxxxxxxx_pci_0000_84_00_0 | 2          |
      +--------------------------------------+---------------------------------------------------------+------------+

   In this example, we see the root resource provider
   ``5958a366-3cad-416a-a2c9-cfbb5a472287`` with four other resource
   providers that are its children, each of which corresponds to a single
   physical GPU.

#. Check the inventory of each resource provider to see resource classes

   .. code-block:: console

      $ openstack resource provider inventory list 5958a366-3cad-416a-a2c9-cfbb5a472287
      +----------------+------------------+----------+----------+-----------+----------+-------+
      | resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
      +----------------+------------------+----------+----------+-----------+----------+-------+
      | VCPU           | 16.0             |       48 |        0 |         1 |        1 |    48 |
      | MEMORY_MB      | 1.5              |    65442 |      512 |         1 |        1 | 65442 |
      | DISK_GB        | 1.0              |       49 |        0 |         1 |        1 |    49 |
      +----------------+------------------+----------+----------+-----------+----------+-------+
      $ openstack resource provider inventory list e2f8607b-0683-4141-a8af-f5e20682e28c
      +----------------+------------------+----------+----------+-----------+----------+-------+
      | resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
      +----------------+------------------+----------+----------+-----------+----------+-------+
      | VGPU           | 1.0              |       16 |        0 |         1 |        1 |    16 |
      +----------------+------------------+----------+----------+-----------+----------+-------+

   Here you can see a ``VGPU`` inventory on the child resource provider while
   other resource class inventories are still located on the root resource
   provider.

#. Check allocations for each server that is using virtual GPUs

   .. code-block:: console

      $ openstack server list
      +--------------------------------------+-------+--------+---------------------------------------------------------+--------------------------+--------+
      | ID                                   | Name  | Status | Networks                                                | Image                    | Flavor |
      +--------------------------------------+-------+--------+---------------------------------------------------------+--------------------------+--------+
      | 5294f726-33d5-472a-bef1-9e19bb41626d | vgpu2 | ACTIVE | private=10.0.0.14, fd45:cdad:c431:0:f816:3eff:fe78:a748 | cirros-0.4.0-x86_64-disk | vgpu   |
      | a6811fc2-cec8-4f1d-baea-e2c6339a9697 | vgpu1 | ACTIVE | private=10.0.0.34, fd45:cdad:c431:0:f816:3eff:fe54:cc8f | cirros-0.4.0-x86_64-disk | vgpu   |
      +--------------------------------------+-------+--------+---------------------------------------------------------+--------------------------+--------+

      $ openstack resource provider allocation show 5294f726-33d5-472a-bef1-9e19bb41626d
      +--------------------------------------+------------+------------------------------------------------+
      | resource_provider                    | generation | resources                                      |
      +--------------------------------------+------------+------------------------------------------------+
      | 5958a366-3cad-416a-a2c9-cfbb5a472287 | 8          | {u'VCPU': 1, u'MEMORY_MB': 512, u'DISK_GB': 1} |
      | 7033d860-8d8a-4963-8555-0aa902a08653 | 3          | {u'VGPU': 1}                                   |
      +--------------------------------------+------------+------------------------------------------------+

      $ openstack resource provider allocation show a6811fc2-cec8-4f1d-baea-e2c6339a9697
      +--------------------------------------+------------+------------------------------------------------+
      | resource_provider                    | generation | resources                                      |
      +--------------------------------------+------------+------------------------------------------------+
      | e2f8607b-0683-4141-a8af-f5e20682e28c | 3          | {u'VGPU': 1}                                   |
      | 5958a366-3cad-416a-a2c9-cfbb5a472287 | 8          | {u'VCPU': 1, u'MEMORY_MB': 512, u'DISK_GB': 1} |
      +--------------------------------------+------------+------------------------------------------------+

   In this example, two servers were created using a flavor asking for 1
   ``VGPU``, so when looking at the allocations for each consumer UUID (which
   is the server UUID), you can see that the ``VGPU`` allocation is against
   the child resource provider while the other allocations are on the root
   resource provider. That means the virtual GPU used by
   ``a6811fc2-cec8-4f1d-baea-e2c6339a9697`` is actually provided by the
   physical GPU with the PCI ID ``0000:85:00.0``.
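
If you only want the resource providers that can still serve a ``VGPU``
request, you can filter the listing by resource (a sketch; the ``--resource``
filter requires placement API microversion 1.4 or later):

.. code-block:: console

   $ openstack --os-placement-api-version 1.4 resource provider list --resource VGPU=1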


(Optional) Provide custom traits for multiple GPU types
-------------------------------------------------------

Operators who support different GPU types per compute host may want flavors
that request a specific GPU type. This is possible using custom traits, by
decorating the child Resource Providers that correspond to physical GPUs.

.. note::

   A possible improvement in a future release could be automatic tagging of
   Resource Providers with standard traits corresponding to a versioned
   mapping of public GPU types. For the moment, this has to be done manually.

#. Get the list of resource providers

   See `Checking allocations and inventories for virtual GPUs`_ first for
   getting the list of Resource Providers that support a ``VGPU`` resource
   class.

#. Define a custom trait for each GPU type

   .. code-block:: console

      $ openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_11

   In this example, we ask to create a custom trait named
   ``CUSTOM_NVIDIA_11``.

#. Add the corresponding trait to the Resource Provider matching the GPU

   .. code-block:: console

      $ openstack --os-placement-api-version 1.6 resource provider trait set \
          --trait CUSTOM_NVIDIA_11 e2f8607b-0683-4141-a8af-f5e20682e28c

   In this case, the trait ``CUSTOM_NVIDIA_11`` will be added to the Resource
   Provider with the UUID ``e2f8607b-0683-4141-a8af-f5e20682e28c`` that
   corresponds to the PCI address ``0000:85:00.0`` as shown above.

#. Amend the flavor to add a requested trait

   .. code-block:: console

      $ openstack flavor set --property trait:CUSTOM_NVIDIA_11=required vgpu_1

   In this example, we add the ``CUSTOM_NVIDIA_11`` trait as required for the
   ``vgpu_1`` flavor we created earlier.

   This will allow the Placement service to only return the Resource
   Providers matching this trait, so only the GPUs that were decorated with
   it will be considered for this flavor.
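
You can confirm that the trait was applied to the Resource Provider (a quick
check using the osc-placement plugin):

.. code-block:: console

   $ openstack --os-placement-api-version 1.6 resource provider trait list e2f8607b-0683-4141-a8af-f5e20682e28c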


Caveats
-------

.. note::

   This information is correct as of the 17.0.0 Queens release. Where
   improvements have been made or issues fixed, they are noted per item.

* After installing the NVIDIA driver on compute nodes, if mdev devices are
  not visible but VF devices are present under a path like
  ``/sys/bus/pci/devices/0000:25:00.4/nvidia``, this indicates that the
  **kernel variant driver** is in use.

  This most likely occurs on **Ubuntu Noble** or **RHEL 10**.

  .. versionchanged:: 31.0.0

     Please refer to the `PCI passthrough documentation`_ for proper
     configuration.
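
  To check which kernel driver is actually bound to the device, you can
  inspect it with ``lspci`` (an illustrative check, reusing the example VF
  address above):

  .. code-block:: console

     $ lspci -k -s 0000:25:00.4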

* When using recent NVIDIA GPU architectures like Ampere or newer, which
  provide SR-IOV, Nova can't know how many vGPUs a specific type can create.
  You then need to create the virtual functions and provide the list of
  virtual function addresses that can be used per GPU by setting
  ``device_addresses``.

  .. versionchanged:: 29.0.0

     As of the 2024.1 Caracal release, if you use such hardware, you need to
     set a new configuration option named ``max_instances`` in the related
     mdev type group (e.g. ``mdev_nvidia-35``), where the value of that
     option is the number of vGPUs that the type can create.

  As an example, for the `A40-2Q NVIDIA GPU type`__, which can create up to
  24 vGPUs, provide the configuration below:

  .. __: https://docs.nvidia.com/vgpu/16.0/grid-vgpu-user-guide/index.html#vgpu-types-nvidia-a40

  .. code-block:: ini

     [devices]
     enabled_mdev_types = nvidia-558

     [mdev_nvidia-558]
     max_instances = 24

  As a side note, you can see that we don't use ``device_addresses`` in the
  ``mdev_nvidia-558`` section, as we don't need to specify which exact
  virtual functions we want to use for that type.
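
  If you want to sanity-check the capacity yourself, each virtual function
  exposes its supported mdev types in sysfs, so you can see whether a VF
  still advertises a free instance for a type (a sketch; the VF address and
  type are hypothetical placeholders):

  .. code-block:: console

     $ cat /sys/bus/pci/devices/0000:41:00.4/mdev_supported_types/nvidia-558/available_instances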

* When live-migrating an instance using vGPUs, the libvirt guest domain XML
  isn't updated with the new mediated device UUID to use for the target.

  .. versionchanged:: 29.0.0

     In the 2024.1 Caracal release, Nova now `supports vGPU
     live-migrations`_. In order to do this, both the source and target
     compute services need minimum versions of libvirt 8.6.0, QEMU 8.1.0 and
     Linux kernel 5.18.0. You need to either use only a single common vGPU
     type between the two computes, or, where multiple mdev types are
     configured on the source and destination hosts, configure custom traits
     or custom resource classes that are reported by the hosts and requested
     by the instance, so that the Placement API correctly returns the
     supported GPU using the right vGPU type for a migration. Last but not
     least, if you want to live-migrate NVIDIA mediated devices, you need to
     update :oslo.config:option:`libvirt.live_migration_downtime`,
     :oslo.config:option:`libvirt.live_migration_downtime_steps` and
     :oslo.config:option:`libvirt.live_migration_downtime_delay`:

     .. code-block:: ini

        live_migration_downtime = 500000
        live_migration_downtime_steps = 3
        live_migration_downtime_delay = 3

     You can see an example of a working live-migration `here`__.

     .. __: http://sbauza.github.io/vgpu/vgpu_live_migration.html

* Suspending a guest that has vGPUs doesn't yet work because of a libvirt
  limitation (it can't hot-unplug mediated devices from a guest). Workarounds
  using other instance actions (like snapshotting the instance or shelving
  it) are recommended until libvirt gains mdev hot-unplug support. If a user
  attempts to suspend the instance, the libvirt driver will raise an
  exception that will cause the instance to be set back to ACTIVE. The
  ``suspend`` action in the ``os-instance-actions`` API will have an *Error*
  state.

  .. versionchanged:: 25.0.0

     This has been resolved in the Yoga release. See `bug 1948705`_.

* Resizing an instance with a new flavor that has vGPU resources doesn't
  allocate those vGPUs to the instance (the instance is created without
  vGPU resources). The proposed workaround is to rebuild the instance after
  resizing it. The rebuild operation allocates vGPUs to the instance.

  .. versionchanged:: 21.0.0

     This has been resolved in the Ussuri release. See `bug 1778563`_.

* Cold migrating an instance to another host will have the same problem as
  resize. If you want to migrate an instance, make sure to rebuild it after
  the migration.

  .. versionchanged:: 21.0.0

     This has been resolved in the Ussuri release. See `bug 1778563`_.

* Rescue images do not use vGPUs. An instance being rescued does not keep
  its vGPUs during rescue. During that time, another instance can receive
  those vGPUs. This is a known issue. The recommended workaround is to
  rebuild an instance immediately after rescue. However, rebuilding the
  rescued instance only helps if there are other free vGPUs on the host.

  .. versionchanged:: 18.0.0

     This has been resolved in the Rocky release. See `bug 1762688`_.

For nested vGPUs:

.. note::

   This information is correct as of the 21.0.0 Ussuri release. Where
   improvements have been made or issues fixed, they are noted per item.

* When creating servers with a flavor asking for vGPUs using multi-create
  (e.g. ``--max 2``), the scheduler can return a ``NoValidHost`` exception
  if the total requested capacity is not supported by a single physical GPU,
  even though each physical GPU could support at least one of the instances
  on its own.
  (See `bug 1874664 <https://bugs.launchpad.net/nova/+bug/1874664>`_.)

  For example, when creating servers with a flavor asking for vGPUs, if two
  child RPs each have an inventory of 4 vGPUs:

  - You can ask for a flavor with 2 vGPUs with ``--max 2``.
  - But you can't ask for a flavor with 4 vGPUs and ``--max 2``.

.. _bug 1778563: https://bugs.launchpad.net/nova/+bug/1778563
.. _bug 1762688: https://bugs.launchpad.net/nova/+bug/1762688
.. _bug 1948705: https://bugs.launchpad.net/nova/+bug/1948705
.. _supports vGPU live-migrations: https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/libvirt-mdev-live-migrate.html

.. Links
.. _Intel GVT-g: https://01.org/igvt-g
.. _NVIDIA GRID vGPU: http://docs.nvidia.com/grid/5.0/pdf/grid-vgpu-user-guide.pdf
.. _osc-placement plugin: https://docs.openstack.org/osc-placement/latest/index.html
.. _PCI passthrough documentation: https://docs.openstack.org/nova/latest/admin/pci-passthrough.html