Maintain approved and completed specs for the Antelope release
Change-Id: I821474c6e8c19765b68cc35d5a4822c8c89e9919
This commit is contained in:
parent
e4b326e4fe
commit
eb76508050
301
specs/2023.1/approved/attribute-api-support.rst
Normal file
@ -0,0 +1,301 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================
|
||||
Support attribute API
|
||||
=====================
|
||||
|
||||
This spec adds a new group of APIs to manage the lifecycle of accelerator's
|
||||
attributes.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Attributes describe customized information about an accelerator. Currently
they are generated only by drivers, and users cannot add, delete, or update
them, which does not fit our scenarios.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
An admin or operator needs a group of APIs to manage the attributes of their
accelerators.
Here are some useful scenarios:

* For a NIC accelerator, we need to add a ``phys_net`` attribute, which should
  be created by the deployer or by other components.
* For some Function Volatile Accelerators, we can record the function name as
  an attribute.
* Machine-readable information, such as ``Function_UUID``, can also be stored
  as an attribute.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
None
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
* Add attribute object to deployable object.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
URL: ``/v2/deployable/{uuid}/attribute``
|
||||
|
||||
METHOD: ``GET``
|
||||
|
||||
List all attributes of the specified deployable.
|
||||
|
||||
Normal response code (200) and body::
|
||||
|
||||
    {
        "attributes": [
            {
                "key": "key1",
                "value": "value1",
                "uuid": "uuid1"
            }
        ]
    }
|
||||
|
||||
Error response code and body:
|
||||
|
||||
* 401 (Unauthorized): Unauthorized
|
||||
|
||||
* 403 (Forbidden): RBAC check failed
|
||||
|
||||
* No response body
|
||||
|
||||
|
||||
URL: ``/v2/deployable/{uuid}/attribute/{uuid_or_key}``
|
||||
|
||||
METHOD: ``GET``
|
||||
|
||||
Get the specified attribute of the specified deployable.
|
||||
|
||||
Query Parameters: None
|
||||
|
||||
Normal response code (200) and body::
|
||||
|
||||
    {
        "attribute": {
            "key": "key1",
            "value": "value1",
            "uuid": "uuid1",
            "created_at": "2020-05-28T03:03:20",
            "updated_at": "2020-05-28T03:03:20"
        }
    }
|
||||
|
||||
Error response code and body:
|
||||
|
||||
* 401 (Unauthorized): Unauthorized
|
||||
|
||||
* 403 (Forbidden): RBAC check failed
|
||||
|
||||
* 404 (NotFound): No deployable of that UUID or no attribute of that UUID
|
||||
exists
|
||||
|
||||
* No response body
|
||||
|
||||
|
||||
URL: ``/v2/deployable/{uuid}/attribute``
|
||||
|
||||
METHOD: ``POST``
|
||||
|
||||
Create one or more deployable attribute(s).
|
||||
|
||||
Request body::
|
||||
|
||||
    [
        {
            "key": "key1",
            "value": "value1"
        },
        {
            "key": "key2",
            "value": "value2"
        },
        ...
    ]
|
||||
|
||||
Normal response code and body:
|
||||
|
||||
* 204 (No content)
|
||||
|
||||
* No response body
|
||||
|
||||
Error response code:
|
||||
|
||||
* 401 (Unauthorized): Unauthorized
|
||||
|
||||
* 403 (Forbidden): RBAC check failed
|
||||
|
||||
* 409 (Conflict): Bad input or key is not unique
|
||||
|
||||
Error response body::
|
||||
|
||||
{"error": "error-string"}
|
||||
|
||||
|
||||
URL: ``/v2/deployable/{uuid}/attribute/{uuid_or_key}``
|
||||
|
||||
METHOD: ``DELETE``
|
||||
|
||||
Delete an existing deployable attribute.
|
||||
|
||||
Query Parameters: None
|
||||
|
||||
Normal response code and body:
|
||||
|
||||
* 204 (No content)
|
||||
|
||||
* No response body
|
||||
|
||||
Error response code:
|
||||
|
||||
* 401 (Unauthorized): Unauthorized
|
||||
|
||||
* 403 (Forbidden): RBAC check failed
|
||||
|
||||
* 404 (NotFound): No deployable of that UUID or no attribute of that UUID
|
||||
exists
|
||||
|
||||
Error response body::
|
||||
|
||||
{"error": "error-string"}
|
||||
|
||||
|
||||
URL: ``/v2/deployable/{uuid}/attribute``
|
||||
|
||||
METHOD: ``DELETE``
|
||||
|
||||
Delete all attributes of a deployable.
|
||||
|
||||
Query Parameters: None
|
||||
|
||||
Normal response code and body:
|
||||
|
||||
* 204 (No content)
|
||||
|
||||
* No response body
|
||||
|
||||
Error response code:
|
||||
|
||||
* 401 (Unauthorized): Unauthorized
|
||||
|
||||
* 403 (Forbidden): RBAC check failed
|
||||
|
||||
Error response body::
|
||||
|
||||
{"error": "error-string"}
|
||||
|
||||
|
||||
URL: ``/v2/deployable/{uuid}/attribute/{uuid_or_key}``
|
||||
|
||||
METHOD: ``PUT``
|
||||
|
||||
Update an existing deployable attribute.
|
||||
|
||||
Query Parameters: None
|
||||
|
||||
Request body (Value of deployable attribute)::
|
||||
|
||||
{"value": "value1"}
|
||||
|
||||
Normal response code and body:
|
||||
|
||||
* 204 (No content)
|
||||
|
||||
* No response body
|
||||
|
||||
Error response code and body:
|
||||
|
||||
* 401 (Unauthorized): Unauthorized
|
||||
|
||||
* 403 (Forbidden): RBAC check failed
|
||||
|
||||
* 404 (NotFound): No deployable of that UUID or no attribute of that UUID
|
||||
exists
|
||||
|
||||
Error response body::
|
||||
|
||||
{"error": "error-string"}
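
As a rough illustration of the endpoints above, the following sketch builds the request path for each operation and checks a POST body for duplicate keys before it is sent. The helper names ``attribute_path`` and ``validate_create_body`` are hypothetical and are not part of Cyborg or cyborgclient; only the URL layout and the ``key``/``value`` body shape are taken from this spec.

```python
"""Illustrative helpers for the attribute API described in this spec.

The function names are hypothetical; only the URL layout and the
key/value request body shape come from the spec text.
"""


def attribute_path(deployable_uuid, uuid_or_key=None):
    """Return the path of the attribute collection or of one attribute."""
    path = "/v2/deployable/%s/attribute" % deployable_uuid
    if uuid_or_key is not None:
        path += "/%s" % uuid_or_key
    return path


def validate_create_body(attributes):
    """Check a POST body: a list of {"key": ..., "value": ...} dicts.

    Raises ValueError on malformed items or duplicate keys, which are the
    cases the spec maps to a 409 (Conflict) response.
    """
    seen = set()
    for item in attributes:
        if set(item) != {"key", "value"}:
            raise ValueError("each attribute needs exactly 'key' and 'value'")
        if item["key"] in seen:
            raise ValueError("duplicate key: %s" % item["key"])
        seen.add(item["key"])
    return attributes
```

A client could call ``validate_create_body`` before issuing the POST to avoid a round trip that would end in a 409; the server still has to enforce uniqueness against attributes that already exist.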
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
* Change Cyborg Attribute table.
|
||||
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
* Users who want to use this feature should upgrade Cyborg to the latest
  release, which supports these changes.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
hejunli
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Change Cyborg REST APIs.
|
||||
* Change Cyborg Attribute table.
|
||||
* Change Cyborg deployable object.
|
||||
* Change cyborgclient to support Attribute management action.
|
||||
* Add related tests.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
Appropriate unit and functional tests should be added.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
* Documentation is needed to record the microversion history.
* Documentation is needed to explain the API usage.
|
||||
|
||||
References
|
||||
==========
|
||||
None
|
||||
|
||||
History
|
||||
=======
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Antelope
|
||||
- Introduced
|
221
specs/2023.1/approved/disable-enable-device.rst
Normal file
@ -0,0 +1,221 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=============================
|
||||
Add disable/enable device API
|
||||
=============================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/disable-enable-device
|
||||
|
||||
Currently, Cyborg discovers the devices on a compute node through each enabled
driver. All devices matching a driver's spec are discovered and reported to
the Placement service as accelerator resources.
This spec proposes a set of new APIs which allow admin users to
disable/enable a device.
|
||||
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Cyborg maintains a configuration file that lists the enabled drivers. Once
a driver is enabled, the agent discovers all devices whose vendor ID and
device ID match the driver's requirements. If an admin user does not want all
devices to be used by virtual machines, there is currently no way to disable
a device.
|
||||
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
* Alice is an admin user who wants to reserve some FPGAs for her own use and
  prevent them from being allocated to a VM for the time being. For example,
  she wants to program an FPGA device and use it as the OVS agent running on
  the host.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
We propose to add new APIs to enable/disable a device. If a device is
disabled, Cyborg reports it to Placement as a reserved resource, so that Nova
cannot schedule to it. Conversely, if the device is enabled, it becomes
available again and its 'reserved' field in Placement should be set to 0.

* Since the API layer is modified, a new microversion should be introduced.
* A new field "is_maintaining" is also needed in the Device object and data
  model to indicate whether the device is disabled. If a device is disabled,
  the "is_maintaining" field should be set to "True", and if the device is
  enabled, the field should be set to "False". The default value should be
  "False".
* Cyborg needs to call the Placement API to update the "reserved" field for
  the device in this API.
* Add a check of the "is_maintaining" field's value during the conductor's
  periodic report.
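
The relationship between the "is_maintaining" flag and the Placement "reserved" value can be sketched as below. This is a minimal illustration under stated assumptions, not Cyborg code: the function names and the plain dictionaries standing in for the Device object and the Placement inventory are hypothetical. The rule that a disabled device reserves its whole ``total`` follows the Work Items of this spec.

```python
def reserved_for(total, is_maintaining):
    """Return the Placement 'reserved' value for a device inventory.

    A disabled (maintaining) device reserves its whole inventory so that
    nothing can be scheduled onto it; an enabled device reserves nothing.
    """
    return total if is_maintaining else 0


def set_maintaining(device, inventory, disable):
    """Apply a disable/enable request to hypothetical dict stand-ins.

    'device' and 'inventory' are plain dicts used only for illustration;
    the real objects and the Placement call are outside this sketch.
    """
    device["is_maintaining"] = bool(disable)
    inventory["reserved"] = reserved_for(inventory["total"], disable)
    return device, inventory
```

The same ``reserved_for`` helper could be reused by the conductor's periodic report, so that a disabled device is not silently re-enabled when its inventory is refreshed.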
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
A new column `is_maintaining` should be added in Device's data model.
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
A new microversion needs to be introduced since the Device API changes.
|
||||
|
||||
List Device API
|
||||
^^^^^^^^^^^^^^^
|
||||
* Return a device list
|
||||
URL: ``/devices``
|
||||
METHOD: ``GET``
|
||||
Return: 200
|
||||
|
||||
.. code-block::
|
||||
|
||||
    {
        "devices": [
            {
                "uuid": "d2446439-0142-40b7-9eee-82d855f453d9",
                "type": "FPGA",
                "vendor": "0xABCD",
                "model": "miss model info",
                "std_board_info": "{\"device_id\": \"0xabcd\", \"class\": \"Fake class\"}",
                "vendor_board_info": "fake_vendor_info",
                "hostname": "devstack01",
                "links": [
                    {
                        "href": "http://172.23.97.140/accelerator/v2/devices/d2446439-0142-40b7-9eee-82d855f453d9",
                        "rel": "self"
                    }
                ],
                "created_at": "2021-11-03T08:48:43+00:00",
                "updated_at": null
            }
        ]
    }
|
||||
|
||||
Get Device API
|
||||
^^^^^^^^^^^^^^
|
||||
* Get a device by uuid and return the details
|
||||
URL: ``/devices/{uuid}``
|
||||
METHOD: ``GET``
|
||||
Return: 200
|
||||
|
||||
.. code-block::
|
||||
|
||||
    {
        "uuid": "d2446439-0142-40b7-9eee-82d855f453d9",
        "type": "FPGA",
        "vendor": "0xABCD",
        "model": "miss model info",
        "std_board_info": "{\"device_id\": \"0xabcd\", \"class\": \"Fake class\"}",
        "vendor_board_info": "fake_vendor_info",
        "hostname": "devstack01",
        "links": [
            {
                "href": "http://172.23.97.140/accelerator/v2/devices/d2446439-0142-40b7-9eee-82d855f453d9",
                "rel": "self"
            }
        ],
        "created_at": "2021-11-03T08:48:43+00:00",
        "updated_at": null
    }
|
||||
|
||||
|
||||
Disable Device API
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
* Disable a device
|
||||
URL: ``/devices/disable/{device_uuid}``
|
||||
METHOD: ``POST``
|
||||
Return: 200
|
||||
Error Code: 404 (the device is not found), 403 (the role is not admin)
|
||||
|
||||
Enable Device API
|
||||
^^^^^^^^^^^^^^^^^
|
||||
* Enable a device
|
||||
URL: ``/devices/enable/{device_uuid}``
|
||||
METHOD: ``POST``
|
||||
Return: 200
|
||||
Error Code: 404 (the device is not found), 403 (the role is not admin)
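
The two action endpoints share one URL shape, which the following sketch captures. The helper name ``device_action_path`` and the ``ERROR_MEANINGS`` table are hypothetical conveniences for illustration; the paths and error codes are the ones listed in this spec.

```python
# The two actions defined by this spec.
DEVICE_ACTIONS = ("disable", "enable")

# Error codes and their meanings as listed in this spec.
ERROR_MEANINGS = {
    404: "the device is not found",
    403: "the role is not admin",
}


def device_action_path(action, device_uuid):
    """Return the POST path for disabling or enabling a device."""
    if action not in DEVICE_ACTIONS:
        raise ValueError("unsupported action: %s" % action)
    return "/devices/%s/%s" % (action, device_uuid)
```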
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
The deployer needs to upgrade Cyborg to a version whose microversion supports
the disable/enable API. Otherwise, disable/enable requests will be rejected.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
None
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
Xinran Wang(xin-ran.wang@intel.com)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
* Add a new column `is_maintaining` to the device table.
* Add the disable/enable APIs to DeviceController.
* Update the RP `reserved` field according to the operation. For the `disable`
  operation, the `reserved` field needs to be set to the same value as the
  `total` field, and for the `enable` operation, the `reserved` field will be
  set to zero.
* Update the GET/LIST device APIs to include the `is_maintaining` field in the
  returned value.
* Add the disable/enable operations to cyborgclient.
* Add unit tests.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
None
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
Unit tests need to be added, and Tempest tests if needed.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
Related documentation needs to be added.
|
||||
|
||||
References
|
||||
==========
|
||||
None
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
Optional section intended to be used each time the spec is updated to describe
|
||||
new design, API or any database schema updated. Useful to let reader understand
|
||||
what's happened along the time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Xena
|
||||
- Introduced
|
||||
* - Yoga
|
||||
- Reproposed
|
195
specs/2023.1/approved/pmem-namespace-support.rst
Normal file
@ -0,0 +1,195 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=================================
|
||||
Cyborg Intel PMEM Driver Proposal
|
||||
=================================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/add-pmem-driver
|
||||
|
||||
This spec proposes to provide the initial design for Cyborg's Intel PMEM
|
||||
driver.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
This spec adds an Intel PMEM driver to Cyborg to manage specific Intel
PMEM devices.

PMEM devices can be used as a large pool of low-latency, high-bandwidth memory
in which data for computation can be stored. This can improve the performance
of the instance.
|
||||
|
||||
PMEM must be partitioned into PMEM namespaces [1]_ for applications to use.
This vPMEM feature only uses PMEM namespaces in devdax mode as QEMU vPMEM
backends [2]_. For more background on the related concepts, the NVDIMM
Linux kernel documentation [3]_ is recommended.
|
||||
|
||||
Starting in the 20.0.0 (Train) release, the virtual persistent memory (vPMEM)
|
||||
feature in Nova allows a deployment using the libvirt compute driver to provide
|
||||
vPMEMs for instances using physical persistent memory (PMEM) that can provide
|
||||
virtual devices [4]_.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
* As an operator, I would like the Cyborg agent to manage PMEM resources and
  check them periodically. The Cyborg Intel PMEM driver should provide a
  ``discover()`` function that enumerates the Intel PMEM devices and reports
  the details of all available Intel PMEM accelerators on the host, such as
  PID (product ID), VID (vendor ID), and device ID.

* As a user, I would like to boot a VM with an Intel PMEM device attached in
  order to accelerate computation. Cyborg should be able to manage this kind
  of acceleration resource and assign it to the VM (binding).
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
1. In general, the goal is to develop an Intel PMEM device driver that supports
   the discover interfaces for the Intel PMEM accelerator framework. The driver
   should include the ``discover()`` function. This function works by executing
   the "ndctl list" command, which reports the devices' raw info. A sample
   follows::

     [
       {
         "vendor": "8086",
         "product": "ns200_0",
         "device": "dax0.0"
       }
     ]
|
||||
|
||||
2. Generate Cyborg-specific driver objects and resource provider modeling
   for the PMEM device. Below are the objects that describe a PMEM device,
   which comply with the Cyborg database model and the Placement data model.
|
||||
|
||||
::
|
||||
|
||||
    Hardware        Driver objects           Placement data model
       |                  |                          |
    1 PMEM            1 device                       |
       |                  |                          |
       |            1 deployable       --->   resource_provider
       |                  |            --->   parent resource_provider: compute node
       |                  |                          |
    n Namespace     n attach_handle    --->   inventories(total:n)
|
||||
|
||||
3. Add "enable_driver=intel_pmem_driver" to the Cyborg agent configuration
   file.

4. Add "pmem_namespaces=$LABEL:$NSNAME|$NSNAME,$LABEL:$NSNAME|$NSNAME" to the
   Cyborg agent configuration file, for example:
   "pmem_namespaces = 6GB:ns0|ns1|ns2,LARGE:ns3"

5. The resource class follows the standard resource class format:
   "CUSTOM_PMEM_NAMESPACE_$LABEL"

6. Traits follow the Placement custom trait format. The Cyborg driver
   reports two traits for a PMEM accelerator using the format below:
   trait1:"CUSTOM_PMEM_NAMESPACE_$LABEL1"
   trait2:"CUSTOM_PMEM_NAMESPACE_$LABEL2"

7. The namespaces must be created before Cyborg discovers them. For how to
   create a namespace, see [5]_ and [6]_.
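
The ``pmem_namespaces`` format and the resource class naming in the steps above can be sketched as follows. This is only an illustration under stated assumptions: ``parse_pmem_namespaces`` and ``resource_class`` are hypothetical helper names, not functions of the proposed driver, and the parsing rules are inferred from the single example value given in this spec.

```python
def parse_pmem_namespaces(value):
    """Parse "LABEL:ns|ns,LABEL:ns" into {label: [namespace, ...]}.

    Groups are separated by ",", a label is separated from its namespaces
    by ":", and namespaces within a group are separated by "|".
    """
    mapping = {}
    for group in value.split(","):
        group = group.strip()
        if not group:
            continue
        label, sep, names = group.partition(":")
        if not sep or not label.strip() or not names.strip():
            raise ValueError("expected $LABEL:$NSNAME|$NSNAME, got %r" % group)
        mapping.setdefault(label.strip(), []).extend(
            name.strip() for name in names.split("|") if name.strip())
    return mapping


def resource_class(label):
    """Build the custom resource class name for a namespace label."""
    return "CUSTOM_PMEM_NAMESPACE_%s" % label
```

With the example value from this spec, ``parse_pmem_namespaces("6GB:ns0|ns1|ns2,LARGE:ns3")`` groups three namespaces under the ``6GB`` label and one under ``LARGE``, and the length of each list is what would be reported as the inventory total for the matching resource class.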
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
A new type, such as PMEM, needs to be added to the devices and attach_handle
tables.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
Users can manage Intel PMEM devices through the Cyborg Intel PMEM driver, for
example by listing the Intel PMEM devices, reporting the details of all
available Intel PMEM accelerators on the host, and binding Intel PMEM devices.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
qiujunting(qiujunting@inspur.com)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement Intel PMEM driver in Cyborg
|
||||
* Add related test cases.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
========
|
||||
|
||||
* Unit tests will be added to test this driver.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Document the Intel PMEM driver in the Cyborg project.
Add a test report to the Cyborg wiki.
|
||||
|
||||
References
|
||||
==========
|
||||
.. [1] https://pmem.io/ndctl/ndctl-create-namespace.html
|
||||
.. [2] https://github.com/qemu/qemu/blob/19b599f7664b2ebfd0f405fb79c14dd241557452/docs/nvdimm.txt#L145
|
||||
.. [3] https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt
|
||||
.. [4] https://docs.openstack.org/nova/latest/admin/virtual-persistent-memory.html
|
||||
.. [5] https://docs.openstack.org/nova/latest/admin/virtual-persistent-memory.html#configure-pmem-namespaces-compute
|
||||
.. [6] https://pmem.io/ndctl/ndctl-create-namespace.html
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Yoga
|
||||
- Introduced
|
||||
|
337
specs/2023.1/implemented/vgpu-driver-proposal.rst
Normal file
@ -0,0 +1,337 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================================================
|
||||
Cyborg NVIDIA GPU Driver support vGPU management
|
||||
================================================
|
||||
|
||||
The Cyborg NVIDIA GPU Driver implemented pGPU management in the Train
release. This spec proposes supporting vGPU management in the same driver.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
GPU devices can provide supercomputing capabilities, and can replace the CPU
|
||||
to provide users with more efficient computing power at a lower cost. GPU cloud
|
||||
servers have great value in the following application scenarios, including:
|
||||
video encoding and decoding, scientific research and artificial intelligence
|
||||
(deep learning, machine learning).
|
||||
|
||||
In the OpenStack ecosystem, users can currently use Nova to pass GPU resources
to a guest in two ways:
|
||||
|
||||
* Pass the GPU hardware to the guest (PCI pass-through).
|
||||
|
||||
* Pass the Mediated Device(vGPU) to the guest.
|
||||
|
||||
With the long-term goal that Cyborg will manage heterogeneous accelerators
including GPUs, Cyborg needs to support GPU management and integrate with Nova
to allocate GPU resources to users through both of the methods above.
The existing Cyborg GPU driver, the NVIDIA GPU Driver, already supports the
first method (PCI pass-through), while the second method is not yet supported.
Please see [1]_ for the Nova-Cyborg vGPU integration spec.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
* A user who manages GPU devices with Cyborg wants to boot a VM with an
  NVIDIA GPU (pGPU or vGPU) attached in order to accelerate video encoding
  and decoding. Cyborg should be able to manage this kind of acceleration
  resource and assign it to the VM (binding).
|
||||
|
||||
Proposed changes
|
||||
================
|
||||
|
||||
For clarity, the following describes the whole process by which the NVIDIA GPU
Driver discovers vGPU devices, generates Cyborg-specific driver objects for
them (complying with the Cyborg database model), and reports them to
cyborg-db and Placement through cyborg-conductor. Features that are already
supported in the current branch are marked as DONE; new changes are marked as
NEW CHANGE.
|
||||
|
||||
1. Collect raw info of GPU devices from the compute node with "lspci" and grep
   for NVIDIA-related keywords.(DONE)

2. Parse details from each record, including ``vendor_id``, ``product_id``
   and ``pci_address``.(DONE)

3. Generate Cyborg-specific driver objects and resource provider modeling
   for the GPU device as well as its mediated devices. Below are the objects
   that describe a vGPU device, which comply with the Cyborg database
   model [4]_ and the Placement data model [5]_.(NEW CHANGE)
|
||||
|
||||
::
|
||||
|
||||
    Hardware        Driver objects           Placement data model
       |                  |                          |
     1 GPU            1 device                       |
       |                  |                          |
       |            1 deployable       --->   resource_provider
       |                  |            --->   parent resource_provider: compute node
       |                  |                          |
    4 vGPUs        4 attach_handles    --->   inventories(total:4)
|
||||
|
||||
4. Support setting the vGPU type for a specific GPU device in cyborg.conf. The
   implementation is similar to that in Nova [9]_.(NEW CHANGE)

* First, we propose ``[gpu_devices]/enabled_vgpu_types`` to define which vGPU
  types the Cyborg driver can use:
|
||||
|
||||
::
|
||||
|
||||
[gpu_devices]
|
||||
enabled_vgpu_types = [str_vgpu_type_1, str_vgpu_type_2, ...]
|
||||
|
||||
* In addition, we propose that the Cyborg driver accept configuration sections
  that relate to ``[gpu_devices]/enabled_vgpu_types`` and specify exactly which
  pGPUs are associated with each enabled vGPU type. Each such section has a
  ``device_addresses`` option defined like this:
|
||||
|
||||
::
|
||||
|
||||
    cfg.ListOpt('device_addresses',
                default=[],
                help="""
    List of physical PCI addresses to associate with a specific vGPU type.

    The particular physical GPU device address needs to be mapped to the vendor
    vGPU type which that physical GPU is configured to accept. In order to
    provide this mapping, there will be a CONF section with a name
    corresponding to the following template: "vgpu_%(vgpu_type_name)s".

    The vGPU type to associate with the PCI devices has to be the section name
    prefixed by ``vgpu_``. For example, for 'nvidia-11', you would declare
    ``[vgpu_nvidia-11]/device_addresses``.

    Each vGPU type also has to be declared in ``[gpu_devices]/enabled_vgpu_types``.

    Related options:

    * ``[gpu_devices]/enabled_vgpu_types``
    """),
|
||||
|
||||
For example, it would be set in cyborg.conf
|
||||
|
||||
::
|
||||
|
||||
    [gpu_devices]
    enabled_vgpu_types = nvidia-223,nvidia-224
    [vgpu_nvidia-223]
    device_addresses = 0000:af:00.0,0000:86:00.0
    [vgpu_nvidia-224]
    device_addresses = 0000:87:00.0
|
||||
|
||||
5. Generate resource_class and traits for device, which later will also be
|
||||
reported to Placement, and used by nova-scheduler to filter appropriate
|
||||
accelerators.(NEW CHANGE)
|
||||
|
||||
* ``resource class`` follows the standard resource classes used by
  OpenStack [6]_. A pass-through GPU device reports 'PGPU' as its resource
  class, and a virtualized GPU device reports 'VGPU' as its resource class.

* ``traits`` follow the Placement custom trait format [7]_. The Cyborg
  driver reports two traits for a vGPU accelerator using the format
  below:
|
||||
|
||||
trait1: **OWNER_CYBORG**.
|
||||
|
||||
trait2: **CUSTOM_<VENDOR_NAME>_<PRODUCT_ID>_<Virtual_GPU_Type>**.
|
||||
|
||||
Meaning of each parameter is listed below.
|
||||
|
||||
* OWNER_CYBORG: a new namespace in os-traits that marks a device as
  reported by Cyborg when its inventory is reported to Placement. It is used
  to distinguish such devices from GPU devices reported by Nova.
|
||||
|
||||
* VENDOR_NAME: vendor name of the GPU device.
|
||||
|
||||
* PRODUCT_ID: product ID of the GPU device.
|
||||
|
||||
* Virtual_GPU_Type: this parameter is another form of the
  ``enabled_vgpu_types`` value set by the admin for a specific device in
  cyborg.conf. To generate it, the driver first retrieves
  ``enabled_vgpu_type`` and then maps it to Virtual_GPU_Type as shown
  below. The resulting name is exactly the Virtual_GPU_Type that is
  reported in traits. For more details about the valid virtual GPU types
  for supported GPUs, please refer to [8]_.
|
||||
|
||||
::
|
||||
|
||||
# find mapping relation between Virtual_GPU_Type and enabled_vgpu_type.
|
||||
# The value in "name" file contains its corresponding Virtual_GPU_Type.
|
||||
cat /sys/class/mdev_bus/{device_address}/mdev_supported_types/{enabled_vgpu_type}/name
|
||||
|
||||
* Here is an example showing the traits of a GPU device in the real world.

* An NVIDIA Tesla T4 device has been successfully installed on the host, and
  its device address is 0000:af:00.0. In addition, the vendor's vGPU driver
  software must be installed and configured on the host.
|
||||
|
||||
::
|
||||
|
||||
[vtu@ubuntudbs ~]# lspci -nnn -D|grep 1eb8
|
||||
0000:af:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
|
||||
|
||||
* Enable GPU types (Accelerator)
|
||||
|
||||
1. Specify which specific GPU type(s) the instances would get from this
|
||||
specific device.
|
||||
|
||||
Edit ``[gpu_devices]/enabled_vgpu_types`` and ``device_addresses`` in
cyborg.conf:

::

    [gpu_devices]
    enabled_vgpu_types=nvidia-223
    [vgpu_nvidia-223]
    device_addresses = 0000:af:00.0
|
||||
|
||||
2. Restart the cyborg-agent service.
|
||||
|
||||
* Finally, traits reported for this device(RP) will be:
|
||||
|
||||
**OWNER_CYBORG** and **CUSTOM_NVIDIA_1EB8_T4_2B**
|
||||
|
||||
.. NOTE::
|
||||
|
||||
   For the last parameter "T4_2B" (<Virtual_GPU_type>), we can validate the
   mapping between "nvidia-223" and "T4_2B" by checking the mdev sysfs path:
|
||||
|
||||
::
|
||||
|
||||
[vtu@ubuntudbs mdev_supported_types]$ pwd
|
||||
/sys/class/mdev_bus/0000:af:00.0/mdev_supported_types
|
||||
[vtu@ubuntudbs mdev_supported_types]$ ls
|
||||
nvidia-222 nvidia-225 nvidia-228 nvidia-231 nvidia-234 nvidia-320
|
||||
nvidia-223 nvidia-226 nvidia-229 nvidia-232 nvidia-252 nvidia-321
|
||||
nvidia-224 nvidia-227 nvidia-230 nvidia-233 nvidia-319
|
||||
[vtu@ubuntudbs mdev_supported_types]$ cat nvidia-223/name
|
||||
GRID T4-2B
|
||||
|
||||
6. Generate ``controlpath_id``, ``deployable``, ``attach_handle``,
|
||||
``attribute`` for vGPU.(NEW CHANGE)
|
||||
|
||||
7. Create an mdev device in sysfs by echoing its UUID (which is actually the
   attach_handle UUID) to the create file when a vGPU is bound to a
   VM.(NEW CHANGE)

   create_file_path=
   /sys/class/mdev_bus/{pci_address}/mdev_supported_types/{type-id}/create

8. Delete an mdev device from sysfs by echoing "1" to the remove file when a
   vGPU is unbound from a VM.(NEW CHANGE)

   remove_file_path=
   /sys/class/mdev_bus/{pci_address}/mdev_supported_types/{type-id}/UUID/remove
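
The parsing and trait-naming steps above can be sketched as follows. This is a hedged illustration, not the driver's implementation: ``parse_lspci_line``, ``vgpu_type_name``, ``vgpu_trait`` and the ``VENDOR_NAMES`` table are hypothetical names, and the rule for turning the mdev ``name`` content into Virtual_GPU_Type (take the last word and replace non-alphanumeric characters with underscores) is an assumption inferred from the single "GRID T4-2B" example in this spec.

```python
import re

# Matches an "lspci -nnn -D" record such as:
# 0000:af:00.0 3D controller [0302]: NVIDIA ... [Tesla T4] [10de:1eb8] (rev a1)
LSPCI_RE = re.compile(
    r"^(?P<pci_address>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*"
    r"\[(?P<vendor_id>[0-9a-f]{4}):(?P<product_id>[0-9a-f]{4})\]",
    re.IGNORECASE)

# Hypothetical lookup table; only the NVIDIA entry appears in this spec.
VENDOR_NAMES = {"10de": "NVIDIA"}


def parse_lspci_line(line):
    """Extract pci_address, vendor_id and product_id from one lspci record."""
    match = LSPCI_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized lspci record: %r" % line)
    return match.groupdict()


def vgpu_type_name(mdev_name):
    """Map mdev 'name' file content, e.g. "GRID T4-2B", to "T4_2B".

    Assumption: the last word is the type, and characters that are not
    valid in a trait name are replaced with underscores.
    """
    model = mdev_name.strip().split()[-1]
    return re.sub(r"[^A-Za-z0-9]", "_", model).upper()


def vgpu_trait(vendor_id, product_id, mdev_name):
    """Build CUSTOM_<VENDOR_NAME>_<PRODUCT_ID>_<Virtual_GPU_Type>."""
    vendor_name = VENDOR_NAMES[vendor_id.lower()]
    return "CUSTOM_%s_%s_%s" % (
        vendor_name, product_id.upper(), vgpu_type_name(mdev_name))
```

Fed with the Tesla T4 record and the "GRID T4-2B" name shown earlier in this spec, these helpers reproduce the **CUSTOM_NVIDIA_1EB8_T4_2B** trait from the example.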
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Use Nova to manage vGPU devices [10]_.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
This feature is highly dependent on the version of libvirt and the physical
|
||||
devices present on the host.
|
||||
|
||||
For vGPU management, deployers need to make sure that the GPU device has been
|
||||
successfully virtualized. Otherwise, Cyborg will report it as a pGPU device.
|
||||
|
||||
Please see ref [2]_ and [3]_ for how to install the Virtual GPU Manager package
|
||||
to virtualize your GPU devices.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
<yumeng-bao>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement NVIDIA GPU Driver enhancement in Cyborg
|
||||
* Add related test cases.
|
||||
* Add test report to wiki and update the supported driver doc page
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
========
|
||||
|
||||
* Unit tests will be added to test this driver.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Document the NVIDIA GPU driver in the Cyborg project.
|
||||
|
||||
References
|
||||
==========
|
||||
.. [1] https://review.opendev.org/#/c/750116/
|
||||
.. [2] https://docs.nvidia.com/grid/6.0/grid-vgpu-user-guide/index.html
|
||||
.. [3] https://docs.nvidia.com/grid/6.0/grid-vgpu-user-guide/index.html#install-vgpu-package-generic-linux-kvm
|
||||
.. [4] https://specs.openstack.org/openstack/cyborg-specs/specs/stein/implemented/cyborg-database-model-proposal.html
|
||||
.. [5] https://docs.openstack.org/nova/rocky/user/placement.html#references
|
||||
.. [6] https://github.com/openstack/os-resource-classes/blob/master/os_resource_classes/__init__.py#L41
|
||||
.. [7] https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/resource-provider-traits.html
|
||||
.. [8] https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
|
||||
.. [9] https://specs.openstack.org/openstack/nova-specs/specs/ussuri/implemented/vgpu-multiple-types.html
|
||||
.. [10] https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Wallaby
|
||||
- Introduced
|