Support virtual GPU resources

Add spec to support virtual GPU resources. Change-Id: I103da607c70d6097988fec5329819e5e89f8b4b7 Blueprint: add-support-for-vgpu
2017-03-27 08:30:08 +00:00
parent f49397be06
commit 5906ce4bad
1 changed files with 370 additions and 0 deletions
--- a/specs/queens/approved/virt-add-support-for-vgpu.rst
+++ b/specs/queens/approved/virt-add-support-for-vgpu.rst
@@ -0,0 +1,370 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=============================
+Support virtual GPU resources
+=============================
+
+https://blueprints.launchpad.net/nova/+spec/add-support-for-vgpu
+
+Add support for virtual GPU (vGPU) resources.
+
+Problem description
+===================
+
+With some graphics virtualization solutions e.g. `Intel's GVT-g`_ and
+`NVIDIA GRID vGPU`_, a single physical Graphics Processing Unit (pGPU)
+can be virtualized as multiple virtual Graphics Processing Units (vGPU).
+Some hypervisors support to boot VMs with vGPU to accelerate graphics
+processing. But presently Nova can't support vGPU.
+
+The compute node may have one or multiple pGPUs and each pGPU could support
+multiple vGPUs. Some pGPUs (e.g. NVIDIA GRID K1) support several different
+vGPU types and each vGPU type has a fixed amount of frame buffer, number of
+supported display heads and maximum resolutions and are targeted at different
+classes of workload. Due to their different resource requirements, the maximum
+number of vGPUs that can be created simultaneously on a pGPU varies
+according to the vGPU type.
+
+The following are examples for different vGPU types:
+
+.. rubric:: Example 1: vGPUs on NVIDIA GRID K1
+
+::
+
+ +----------------+---------------------------------------+
+ | Card Type      | NVIDIA GRID K1                        |
+ +----------------+---------------------------------------+
+ | No. of pGPUs   | 4                                     |
+ +----------------+---------------------------------------+
+ | FB size (MB)   | 4096  | 2048  | 1024  | 512   | 256   |
+ +----------------+-------+-------+-------+-------+-------+
+ | Max heads      |   4   |  4    |  2    |  2    |  2    |
+ +----------------+-------+-------+-------+-------+-------+
+ | vGPU model     | K180Q | K160Q | K140Q | K120Q | K100  |
+ +----------------+-----------------------+---------------+
+ | Max Resolution |    2560x1600          | 1920×1200     |
+ +----------------+-----------------------+---------------+
+ | vGPUs per GPU  |  1    |  2    | 4     | 8     | 8     |
+ +----------------+-------+-------+-------+-------+-------+
+
+.. rubric:: Example 2: Intel GVT-g vGPUs on Intel(R) Xeon(R) CPU E3-1285 v4
+
+::
+
+ +----------------+------------------------------------+
+ | pGPU model     | Iris Pro Graphics P6300            |
+ +----------------+------------------------------------+
+ | vGPU model     | Intel GVT-g                        |
+ +----------------+------------------------------------+
+ |Framebuffer size| 128 MB                             |
+ +----------------+------------------------------------+
+ | Max heads      | 1                                  |
+ +----------------+------------------------------------+
+ | Max Resolution | 1920x1080                          |
+ +----------------+------------------------------------+
+ | No. of vGPUs   |                                    |
+ |    per GPU     | 7                                  |
+ +----------------+------------------------------------+
+
+In this spec, we will define a model to track vGPU resources.
+
+Use Cases
+----------
+
+* As a cloud administrator, I should be able to define flavors which request
+  an amount of vGPU resources.
+
+* As a cloud administrator, I should be able to specify the supported display
+  heads number and resolutions for vGPUs defined in the flavors; end users can
+  choose a proper flavor with the expected performance.
+
+* As a cloud administrator, I should be able to define flavors which request
+  vGPUs that support some special features e.g. `OpenGL`_ to achieve
+  hardware-accelerated rendering.
+
+* As an end user, I should be allowed to boot VMs which have vGPUs by using
+  the pre-defined flavor.
+
+Proposed change
+===============
+
+# Define resource tracking model for vGPU: There are both **quantitative**
+  and **qualitative** aspects need to be tracked for vGPU resources:
+
+  * Tracking **quantitative** aspects of the vGPU resource:
+
+    * Define a new standard resource class `resource-classes`_ to track the
+      amount of vGPUs (``ResourceClass.VGPU``) and define another resource
+      class (``ResourceClass.VGPU_DISPLAY_HEAD``) to track the display
+      heads in the resource providers.
+
+    * Generate the resource provider(RP) tree to track the amount of vGPUs
+      available and the number of vGPU display heads.
+      The resource tracking model is as the following::
+
+       resource provider:                  compute_node
+                                       /        |          \
+       resource provider:            RP_1      RP_2   ...  RP_n
+                                    /           |              \
+       inventory:          vGPU_inv_1       vGPU_inv_2   ...    vGPU_inv_n
+                           DIS_HEAD_inv_1  DIS_HEAD_inv_2 ... DIS_HEAD_inv_n
+
+      In virt driver (in the function of ``get_inventory()``), it would ask
+      the hypervisor to get the existing pGPUs, their capacity for vGPUs and
+      the number of vGPU display heads. Note: if the hypervisor doesn't return
+      the number of display heads, then the VGPU_DISPLAY_HEAD inventory record
+      should not be created.
+      With the inventory data, virt driver makes resource providers for each
+      pGPU or each pGPU group (depend on how the pGPUs are managed by
+      hypervisors). These resource providers will be associated as the
+      compute_node's children `nested-resource-providers`_.
+
+       * *RP for GPU*: For example, libvirt will report the available vGPU
+         number for each pGPU. In this way, if there are multiple pGPUs (same
+         model), it can create one type of vGPUs on a pGPU and create other
+         types of vGPUs on the remaining pGPUs.
+
+       * *RP for pGPU group*: XenServer uses pGPU groups to manage pGPUs. A
+         pGPU group is a collection of pGPUs which belong to the same model.
+         On creating vGPU, it will search the target group for a GPU which can
+         supply the requested vGPU. In another word, **it is not possible to
+         specify which pGPU the vGPU to be created on**. So XenAPI (the virt
+         of XenServer) should make RP for each pGPU group. And the amount of
+         in the inventory should be total number of vGPUs which can be supplied
+         by pGPUs belong the group.
+
+      As described above, some pGPUs (e.g. NVIDIA GRID K1) support different
+      sized vGPU types. The capacity for different vGPU types varies. In order
+      to make resource tracking easier, we need to make sure the number of the
+      vGPU is predictable. So we will add a new whitelist in nova.conf to
+      specify the enabled vGPU types to ensure each resource provider of vGPUs
+      only has one type of vGPUs. The whitelist is defined as the following::
+
+       enabled_vgpu_types = [ str_vgpu_type_1, str_vgpu_type_2, ... ]
+
+      Note: the str_vgpu_type_x is a string representing a vGPU type. Different
+      hypervisors may expose the vGPU types with different strings. The virt
+      driver should handle that properly and map the whitelist to the correct
+      vGPUs types.
+
+      For example, NVIDIA's vGPU type M60-0B is exposed with the type id:
+      "nvidia-11" in libvirt; but that's exposed in XenServer with the type name:
+      "GRID M60-0B". If we want to enable this vGPU type::
+
+      * the whitelist when libvirt is the hypervisor should be:
+
+        enabled_vgpu_types = [ "nvidia-11" ]
+
+      * the whitelist when XenServer is the hypervisor should be:
+
+        enabled_vgpu_types = [ "GRID M60-0B" ]
+
+      The vGPU resource number should be 8 (4 GPU per card * 2 vGPU per GPU);
+      The display heads's total number is 32 (4 heads per vGPU * 8 vGPUs).
+      And the inventory data for the resource provider for vGPUs should be as::
+
+       {
+           obj_fields.ResourceClass.vGPU: {
+               "total": 8,
+               "reserved": 0,
+               "min_unit": 1,
+               "max_unit": 1,
+               "step_size": 1,
+               "allocation_ratio": 1.0
+           },
+           obj_fields.ResourceClass.GPU_DISPLAY_HEADS: {
+               "total":32,
+               "reserved": 0,
+               "min_unit": 4,
+               "max_unit": 4,
+               "step_size": 4,
+               "allocation_ratio": 1.0
+           }
+       }
+
+  * Tracking **qualitative** aspects of the vGPU resources:
+    The feature of traits is targeted to support representing *qualitative*
+    aspects for resources to differentiate their characteristics(`os-traits`_)
+    GPUs also have different characteristics: e.g. the maximum resolutions,
+    supported features.
+    We need define traits for GPUs. In virt driver (in the function of
+    ``get_inventory()``), it should query for the **qualitative** aspects of
+    the vGPU resources; map them to the defined traits and associate these
+    traits to the resource providers.
+
+    * Define traits in os-traits
+      Note: `_gpu-traits` the following two trait types for vGPU have already
+      be merged to os-traits.
+
+      * `supported-resolutions`_
+
+      * `supported-features`_
+
+# Define flavor: allow the cloud administrator to create different flavors
+  to specify the required amount of vGPU and/or a set of required traits to
+  meet different users' demands.
+
+# Scheduler: Basing on the amount of vGPU and the required traits, the resource
+  providers which can meet the conditions will be filtered out.
+
+# At spawning an instance, the virt drivers should retrieve the vGPU
+  resource specs from the instance request specs and map them to the proper
+  information (e.g. the GPU group in XenAPI) which is needed to create a vGPU;
+  then create and/or associating vGPU to the instance.
+
+Alternatives
+------------
+
+* It has been attempted to support vGPU by creating fake SRIOV-VF PCIs for
+  vGPUs and then passthrough PCI devices `vGPU-passthrough-PCI`_. But there is
+  problem to populate the fake PCI's address. And it can't reflect the real
+  situation that some vGPUs are not really PCI devices.
+
+Data model impact
+-----------------
+
+No particular data model changes needed, but it depends on the data model
+defined in `custom-resource-classes`_ and `nested-resource-providers`_.
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+In order to enable the vGPU feature:
+
+* the operators should change the nova configure settings to enable the vGPU
+  type for each pGPU model which will provide vGPU capabilites.
+
+* the operators should create new or update existing flavors to specify the
+  amount of vGPU to be requested, the expected amount of display heads, and
+  other expected traits (e.g. the dispaly resolutions, features), so that users
+  can use different flavor to request vGPUs basing on their graphics processing
+  demands.
+
+* for rolling upgrads, the operators should create or update flavors requesting
+  vGPU after they rolled out all of their nodes into release where this spec
+  got implemented.
+
+Developer impact
+----------------
+
+None
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  jianghuaw
+
+Other contributors:
+
+Work Items
+----------
+
+# Define standard traits into os-traits for GPUs;
+# In virt driver, add code to:
+
+  * add whitelist for enabled vGPU types in the config file
+  * query needed data for enabled vGPU types
+  * generate the nested resource providers
+  * generate the inventory data in resource providers
+  * mapping GPU characteristics to the traits defined in os-traits
+  * associate these traits to the resource providers
+  * mapping traits in the boot request spec to the required metadata
+  * create or/and attach vGPU to the instance basing on the metadata
+
+Dependencies
+============
+This spec depends on the following specs to be implemented:
+
+* custom-resource-classes-pike: https://blueprints.launchpad.net/nova/+spec/custom-resource-classes-pike
+
+* nested-resource-providers: https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/nested-resource-providers.html
+
+* resource-provider-traits: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/resource-provider-traits.html
+
+Testing
+=======
+
+* Unit tests.
+
+Documentation Impact
+====================
+
+Need document the configuration for vGPU.
+
+References
+==========
+
+.. _Intel's GVT-g: https://01.org/igvt-g
+
+.. _NVIDIA GRID vGPU: http://images.nvidia.com/content/grid/pdf/GRID-vGPU-User-Guide.pdf
+
+.. _resource-classes: http://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/resource-classes.html
+
+.. _custom-resource-classes: https://blueprints.launchpad.net/nova/+spec/custom-resource-classes
+
+.. _resource-provider: https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/resource-providers.html
+
+.. _resource-provider-traits: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/resource-provider-traits.html
+
+
+.. _Resource-providers-scheduler: https://blueprints.launchpad.net/nova/+spec/resource-providers-scheduler-db-filters
+
+.. _nested-resource-providers: https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/nested-resource-providers.html
+
+.. _OpenGL: https://en.wikipedia.org/wiki/OpenGL
+
+.. _vGPU-passthrough-PCI: https://review.openstack.org/#/c/280099/17
+
+.. _os-traits: http://docs.openstack.org/developer/os-traits
+
+.. _gpu-traits: https://github.com/openstack/os-traits/tree/master/os_traits/hw/gpu
+
+.. _supported-resolutions: https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/resolution.py
+
+.. _supported-features: https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/api.py
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Queens
+     - Introduced
+