Add spec for PCI Groups

blueprint pci-passthrough-groups Change-Id: I22635e453ab1ddfdec65844053ababc6931df200
2023-10-31 17:00:27 +00:00 · 2023-10-31 17:00:27 +00:00 · 4bd55ded70
commit 4bd55ded70
parent f1ec291df7
1 changed files with 246 additions and 0 deletions
--- a/specs/2024.1/approved/pci-passthrough-groups.rst
+++ b/specs/2024.1/approved/pci-passthrough-groups.rst
@ -0,0 +1,246 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+PCI Passthrough Groups
+==========================================
+
+https://blueprints.launchpad.net/nova/+spec/pci-passthrough-groups
+
+This spec allows operators to create a flavor using a PCI alias to
+request a group of PCI devices. These groups of PCI devices are tracked
+as a single indivisible unit within Placement. The default custom
+resource class used to track these PCI groups is derived from the
+PCI group type name, and the name of the inventory is derived from
+the PCI group name. The pci_alias config already supports mapping
+to a specific placement resource class.
+
+Problem description
+===================
+
+Some PCI devices only make sense to be consumed as a group.
+When you assign the grouped PCI devices to a VM, all of the
+devices in the group as always consumed together by a single VM.
+Currently Nova does not understand any grouping other than
+NUMA affinity.
+
+While there are some cases where a device could be consumed by
+multiple different groups, that are dynamically picked based on
+demand, we are ignoring these use cases for now.
+In particular, we make the simplifying restriction
+that a tracked PCI device can only be a member of a single group,
+and when a PCI device is a member of a group, it can only be used
+as part of that PCI group.
+
+Use Cases
+---------
+
+Some GPUs expose both a graphics physical function and an audio
+function. In order to support passing through both devices, we need
+to ensure that we pass through a matching pair of devices.
+This spec would allow a device group to be created such that
+operators configure the matching pairs of audio and graphics
+devices, and users can request one (or more) of those pairs via
+the usual PCI alias.
+
+Note, we are currently excluding the use case of users requesting
+either the pair of devices or just the graphics device, as that
+would result in additional complexity that should be considered
+in a separate follow on specification.
+
+Let us consider the specific case of the Graphcore C200 device,
+where a set of PCI cards are connected together via IPU-Link:
+https://docs.graphcore.ai/projects/C600-datasheet/en/latest/product-description.html#ipu-link-cables
+
+Each physical card presents two PCI devices. The card can be used
+independently of other cards if a matched pair of devices are
+presented to the VM. PCI groups allows this device to be correctly
+passed through to VMs by ensuring a matched pair of PCI devices are
+always assigned to each VM.
+
+In addition, some servers can be statically configured to group
+either two devices, four devices or eight devices as a single group.
+These can all be statically configured using PCI group to ensure
+we always respect the non-PCI connectivity between those PCI devices.
+
+Proposed change
+===============
+
+The key parts of this change include:
+
+* extend `[pci]device_spec` to model groups of PCI devices
+* devices are linked by both a group type name, and a specific group name
+* the group type name is used to generate a custom resource class,
+  i.e. `CUSTOM_PCI_GROUP<group_type_name>`. Note this is just the default
+  that changes when you specify a group type name, and it can be
+  overrriden by explicitly specifying a different resource_class tag.
+* Each group is registered in placement, in a similar way to a device.
+  Each group being a separate resource provider with a single inventory
+  item for the associated group type custom resource type, with a name
+  that is generated from the group_name rather than the PCI device address
+* extend `[pci]alias` simply mapps to the resource class mentioned
+  above, such as `CUSTOM_PCI_GROUP_<group_type_name>`.
+* PCI tracker will have the group_name and group_type_name added to
+  each device that is being tracked, such that we can look up a group
+  of devices associated with each specific named group tracked
+  in placement.
+
+There will be configuration validation checks:
+
+* pci groups are only supported when PCI devices are tracked in placement
+* all device groups must have two or more PCI devices
+* each physical PCI device can only be in one group,
+  and must only be tracked in placement once
+
+For example, lets consider the following PCI devices:
+
+* 4e:00.0 Processing accelerators: Graphcore Ltd Device 0003
+* 4f:00.0 Processing accelerators: Graphcore Ltd Device 0003
+* 89:00.0 Processing accelerators: Graphcore Ltd Device 0003
+* 8a:00.0 Processing accelerators: Graphcore Ltd Device 0003
+
+The two physical cards, spread across two NUMA nodes can be presented
+in two possible ways: either two groups or a single group, depending on
+the use cases. For example, two separate devices would be:::
+
+    [pci]
+    device_spec = {"address": ":4e:00.0", group_name:"graphcore_1", group_type:"c200_x1"}
+    device_spec = {"address": ":4f:00.0", group_name:"graphcore_1", group_type:"c200_x1"}
+    device_spec = {"address": ":4e:00.0", group_name:"graphcore_2", group_type:"c200_x1"}
+    device_spec = {"address": ":4f:00.0", group_name:"graphcore_2", group_type:"c200_x1"}
+    alias = {"name":"c200_x1", resource_class:"CUSTOM_PCI_GROUP_C200_X1"}
+
+But exposing the two cards, exposed as four PCI devices,
+as a single unit of 4 PCI devices, would look like this:::
+
+    [pci]
+    device_spec = {"address": ":4e:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
+    device_spec = {"address": ":4f:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
+    device_spec = {"address": ":4e:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
+    device_spec = {"address": ":4f:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
+    alias = {"name":"c200_x2", resource_class:"CUSTOM_PCI_GROUP_C200_X2"}
+
+Alternatives
+------------
+
+For some simple cases, NUMA affinity can simulate what is required.
+But currently hardware like Graphcore C200 does not work well with Nova.
+
+Data model impact
+-----------------
+
+PCI tracker needs to be extended to include group_name and group_type
+for each PCI device.
+
+REST API impact
+---------------
+
+No impact
+
+Security impact
+---------------
+
+No impact
+
+Notifications impact
+--------------------
+
+No impact
+
+Other end user impact
+---------------------
+
+No impact
+
+Performance Impact
+------------------
+
+No impact
+
+Other deployer impact
+---------------------
+
+The device spec configuration gets some extra options to help
+define groups, and the default resource class changes when you
+use the new device_group tags, as discussed above.
+
+Developer impact
+----------------
+
+None
+
+Upgrade impact
+--------------
+
+Devices that are exposed as a group must be not currently
+tracked in placement when starting to expose them as a group.
+
+Once new compute nodes will report the new resoruce classes,
+which should naturally gate the need for older compute nodes
+to know what to do with the new PCI device configuration.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  johngarbutt
+
+Other contributors:
+  nathanharper
+
+Feature Liaison
+---------------
+
+Feature liaison:
+  gibi?
+
+Work Items
+----------
+
+* Update pci device config to support pci groups
+* Update PCI device tracker to know about pci groups
+* Attach groups of devices when device alias requests
+  a resource class that maps to a PCI device group
+* Update placement with the avilable resources
+  from the described pci groups
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+Add a functional test, similar to vgpu tests.
+
+Documentation Impact
+====================
+
+Configuration changes need to be documented correctly.
+
+References
+==========
+
+None
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to describe
+new design, API or any database schema updated. Useful to let reader understand
+what's happened along the time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - 2024.1 Caracal
+     - Introduced