nova/releasenotes/notes/add-support-for-vgpu-libvir...

---
features:
  - |
    The libvirt driver now supports booting instances by asking for virtual
    GPUs.
    In order to support that, the operators should specify the enabled vGPU
    types in the nova-compute configuration file by using the configuration
    option ``[devices]/enabled_vgpu_types``. Only the enabled vGPU types can be
    used by instances.

    For knowing which types the physical GPU driver supports for libvirt, the
    operator can look at the sysfs by doing::

      ls /sys/class/mdev_bus/<device>/mdev_supported_types

    Operators can specify a VGPU resource in a flavor by adding in the flavor's
    extra specs::

      nova flavor-key <flavor-id> set resources:VGPU=1

    That said, Nova currently has some caveats for using vGPUs.

    * For the moment, only a single type can be supported across one compute
      node, which means that libvirt will create the vGPU by using that
      specific type only. It's also possible to have two compute nodes having
      different types but there is no possibility yet to specify in the flavor
      which specific type we want to use for that instance.

    * Suspending a guest having vGPUs doesn't work yet given a libvirt concern
      (it can't hot-unplug mediated devices from a guest). Workarounds using
      other instance actions (like snapshotting the instance or shelving it)
      are recommended until libvirt supports that. If a user asks to suspend
      the instance, Nova will get an exception that will set the instance state
      back to ``ACTIVE``, and you can see the suspend action in
      ``os-instance-action`` API will be Error.

    * Resizing an instance with a new flavor that has vGPU resources doesn't
      allocate those vGPUs to the instance (the instance is created without
      vGPU resources). We propose to work around this problem by rebuilding the
      instance once it has been resized so then it will have allocated vGPUs.

    * Migrating an instance to another host will have the same problem as
      resize. In case you want to migrate an instance, make sure to rebuild
      it.

    * Rescuing an instance having vGPUs will mean that the rescue image won't
      use the existing vGPUs. When unrescuing, it will use again the existing
      vGPUs that were allocated to the instance. That said, given Nova looks
      at all the allocated vGPUs when trying to find unallocated ones, there
      could be a race condition if an instance is rescued at the moment a new
      instance asking for vGPUs is created, because both instances could use
      the same vGPUs. If you want to rescue an instance, make sure to disable
      the host until we fix that in Nova.

    * Mediated devices that are created by the libvirt driver are not persisted
      upon reboot. Consequently, a guest startup would fail since the virtual
      device wouldn't exist. In order to prevent that issue, when restarting
      the compute service, the libvirt driver now looks at all the guest XMLs
      to check if they have mediated devices, and if the mediated device no
      longer exists, then Nova recreates it by using the same UUID.

    * If you use NVIDIA GRID cards, please know that there is a limitation with
      the NVIDIA driver that prevents one guest to have more than one virtual
      GPU from the same physical card. One guest can have two or more virtual
      GPUs but then it requires each vGPU to be hosted by a separate physical
      card. Until that limitation is removed, please avoid creating flavors
      asking for more than one vGPU.

    We are working actively to remove or workaround those caveats, but please
    understand that for the moment this feature is experimental given all the
    above.