Deployers want to be able to attach accelerators to their VMs. Today in Nova this is possible only in very restricted circumstances. The goal of this blueprint is to enable generic passthrough of devices for consumers of the nova-powervm driver. blueprint: device-passthrough Change-Id: Iba1757fe2e62611def4882aad45508a3a1f1dfb1
Device Passthrough
https://blueprints.launchpad.net/nova-powervm/+spec/device-passthrough
Provide a generic way to identify hardware devices such as GPUs and attach them to VMs.
Problem description
Deployers want to be able to attach accelerators and other adapters to their VMs. Today in Nova this is possible only in very restricted circumstances. The goal of this blueprint is to enable generic passthrough of devices for consumers of the nova-powervm driver.
While these efforts may enable more, and should be extensible going forward, the primary goal for the current release is to pass through entire physical GPUs. That is, we are not attempting to pass through:
- Physical functions, virtual functions, regions, etc. I.e. granularity smaller than "whole adapter". This requires device type-specific support at the platform level to perform operations such as discovery/inventorying, configuration, and attach/detach.
- Devices with "a wire out the back" - i.e. those which are physically connected to anything (networks, storage, etc.) external to the host. These will require the operator to understand and be able to specify/select specific connection parameters for proper placement.
Use Cases
As an admin, I wish to be able to configure my host and flavors to allow passthrough of whole physical GPUs to VMs.
As a user, I wish to make use of appropriate flavors to create VMs with GPUs attached.
Proposed change
Device Identification and Whitelisting
The administrator can identify and allow (explicitly) or deny (by omission) passthrough of devices by way of a YAML file per compute host.
Note
Future: We may someday figure out a way to support a config file on the controller. This would allow e.g. cloud-wide whitelisting and specification for particular device types by vendor/product ID, which could then be overridden (or not) by the files on the compute nodes.
The path to the config will be hardcoded as
/etc/nova/inventory.yaml.
The file shall contain paragraphs, each of which will:
- Identify zero or more devices based on information available on the IOSlot NovaLink REST object. In pypowervm, given a ManagedSystem wrapper sys_w, a list of IOSlot wrappers is available via sys_w.asio_config.io_slots. See identification. Any device not identified by any paragraph in the file is denied for passthrough. But see the allow section for future plans around supporting explicit denials.
- Name the resource class to associate with the resource provider inventory unit by which the device will be exposed in the driver. If not specified, CUSTOM_IOSLOT is used. See resource_class.
- List traits to include on the resource provider in addition to those generated automatically. See traits.
A formal schema is proposed for review.
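As a rough illustration of the paragraph structure just described, the sketch below shows what a parsed config file might look like as a Python mapping, plus a minimal structural check. It is not the proposed schema: the paragraph names, the ID values, and the validate_paragraph helper are all hypothetical.

```python
# Hypothetical parsed form of the config file (e.g. the result of a YAML
# load). Paragraph names and ID values are made up for illustration.
EXAMPLE_CONFIG = {
    "example gpus": {
        "identification": {"vendor_id": "ABCD", "device_id": "1234"},
        "resource_class": "CUSTOM_GPU",
        "traits": ["CUSTOM_CAPABILITY_WHIZBANG"],
    },
    "one specific slot": {
        "identification": {"drc_index": "1C0FFEE1"},
    },
}

# YAML keys permitted in an identification section.
IDENT_KEYS = {
    "vendor_id", "device_id", "subsys_vendor_id", "subsys_device_id",
    "class", "revision_id", "drc_index", "drc_name",
}
DEFAULT_RESOURCE_CLASS = "CUSTOM_IOSLOT"


def validate_paragraph(name, para):
    """Check one paragraph's structure and fill in defaults (a sketch)."""
    ident = para.get("identification")
    if not ident:
        raise ValueError("%s: identification needs at least one key" % name)
    unknown = set(ident) - IDENT_KEYS
    if unknown:
        raise ValueError("%s: unknown keys %s" % (name, sorted(unknown)))
    if "traits" in para and not para["traits"]:
        raise ValueError("%s: traits, if present, must not be empty" % name)
    return {
        "identification": dict(ident),
        "resource_class": para.get("resource_class", DEFAULT_RESOURCE_CLASS),
        "traits": list(para.get("traits", [])),
    }
```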
Here is a summary description of each section.
Name
Each paragraph will be introduced by a key which is a human-readable name for the paragraph. The name has no programmatic significance other than to separate paragraphs. Each paragraph's name must be unique within the file.
identification
Each paragraph will have an identification section,
which is an object containing one or more keys corresponding to
IOSlot properties, as follows:
| YAML key | IOSlot property | Description |
|---|---|---|
| vendor_id | pci_vendor_id | X{4} (four uppercase hex digits) |
| device_id | pci_dev_id | X{4} (four uppercase hex digits) |
| subsys_vendor_id | pci_subsys_vendor_id | X{4} (four uppercase hex digits) |
| subsys_device_id | pci_subsys_dev_id | X{4} (four uppercase hex digits) |
| class | pci_class | X{4} (four uppercase hex digits) |
| revision_id | pci_rev_id | X{2} (two uppercase hex digits) |
| drc_index | drc_index | X{8} (eight uppercase hex digits) |
| drc_name | drc_name | String (physical location code) |
The values are expected to match those produced by
pvmctl ioslot list -d <property> for a given
property.
The identification section is required, and must contain
at least one of the above keys.
When multiple keys are provided in a paragraph, they are matched with
AND logic.
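The AND semantics can be sketched as follows. This is only an illustration: it assumes each discovered slot has already been reduced to a dict keyed by the YAML key names with string values, and slot_matches is a hypothetical helper, not part of pypowervm.

```python
def slot_matches(identification, slot):
    """Return True only if every key in the identification section matches.

    Both arguments are plain dicts keyed by YAML key name (an assumption of
    this sketch). A key missing from the slot never matches.
    """
    return all(slot.get(key) == value
               for key, value in identification.items())


# Illustrative slot; all values are made up.
slot = {"vendor_id": "ABCD", "device_id": "1234", "drc_index": "1C0FFEE1"}
print(slot_matches({"vendor_id": "ABCD", "device_id": "1234"}, slot))
print(slot_matches({"vendor_id": "ABCD", "device_id": "FFFF"}, slot))
```

The first call matches on both keys; the second fails because one of the two keys differs.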
Note
It is a stretch goal of this blueprint to allow wildcards in (some
of) the values. E.g. drc_name: U78CB.001.WZS0JZB-P1-* would
allow everything on the P1 planar of the
U78CB.001.WZS0JZB enclosure. If we get that far, a spec
amendment will be proposed with the specifics (what syntax, which
fields, etc.).
allow
Note
The allow section will not be supported initially, but
is documented here because we thought through what it should look like.
In the initial implementation, any device encompassed by a paragraph is
allowed for passthrough.
Each paragraph will support a boolean allow keyword.
If omitted, the default is true - i.e. devices
identified by this paragraph's identification section are
permitted for passthrough. (Note, however, that devices not encompassed
by the union of all the identification paragraphs in the
file are denied for passthrough.)
If allow is false, the only other section
allowed is identification, since the rest don't make
sense.
A given device can only be represented once across all
allow=true paragraphs (implicit or explicit); an "allowed"
device found more than once will result in an error.
A given device can be represented zero or more times across all
allow=false paragraphs.
We will first apply the allow=true paragraphs to
construct a preliminary list of devices; and then apply each
allow=false paragraph and remove explicitly denied devices
from that list.
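A minimal sketch of this two-pass approach, assuming paragraphs and slots are plain dicts and that a device is keyed by its drc_index. The select_devices helper and the data shapes are hypothetical.

```python
def _matches(identification, slot):
    # AND logic across all identification keys.
    return all(slot.get(k) == v for k, v in identification.items())


def select_devices(paragraphs, slots):
    """Return {drc_index: paragraph name} for devices to pass through."""
    allowed = {}
    # Pass 1: allow=true paragraphs (explicit or implicit) build the list.
    for name, para in paragraphs.items():
        if not para.get("allow", True):
            continue
        for slot in slots:
            if _matches(para["identification"], slot):
                idx = slot["drc_index"]
                if idx in allowed:
                    raise ValueError(
                        "device %s allowed by both %r and %r"
                        % (idx, allowed[idx], name))
                allowed[idx] = name
    # Pass 2: allow=false paragraphs remove explicitly denied devices.
    for name, para in paragraphs.items():
        if para.get("allow", True):
            continue
        for slot in slots:
            if _matches(para["identification"], slot):
                allowed.pop(slot["drc_index"], None)
    return allowed
```

Denials are applied after all allowances, so the order of paragraphs in the file does not affect the result.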
Note
Again, we're not going to support the allow section at
all initially. It will be a stretch goal to add it as part of this
release, or it may be added in a subsequent release.
resource_class
If allow is omitted or true, an optional
resource_class key is supported. Its string value allows
the author to designate the resource class to be used for the inventory
unit representing the device on the resource provider. If omitted,
CUSTOM_IOSLOT will be used as the default.
Note
Future: We may be able to get smarter about dynamically defaulting the resource class based on inspecting the device metadata. For now, we have to rely on the author of the config file to tell us what kind of device we're looking at.
traits
If allow is omitted or true, an optional
traits subsection is supported. Its value is an array of
strings, each of which is the name of a trait to be added to the
resource providers of each device represented by this paragraph. If the
traits section is included, it must have at least one value
in the list. (If no additional traits are desired, omit the
section.)
The values must be valid trait names (either standard from
os-traits or custom, matching
CUSTOM_[A-Z0-9_]*). These will be in addition to the traits
automatically added by the driver - see Generated Traits below. Traits which
conflict with automatically-generated traits will result in an error:
the driver must be the single source of truth for the traits it
generates.
Traits may be used to indicate any static attribute of a device - for
example, a capability (CUSTOM_CAPABILITY_WHIZBANG) not
otherwise indicated by Generated
Traits.
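One way the trait checks might look is sketched below. It makes two assumptions that the text does not spell out: the set of standard trait names is supplied by the caller rather than looked up in os-traits, and a "conflict" is approximated as any name carrying the CUSTOM_POWERVM_IOSLOT_ prefix used by the generated traits. check_traits is a hypothetical helper.

```python
import re

# Pattern for custom traits, as given in the text.
CUSTOM_RE = re.compile(r"^CUSTOM_[A-Z0-9_]*$")
# Assumption: anything in this namespace is reserved for the driver.
GENERATED_PREFIX = "CUSTOM_POWERVM_IOSLOT_"


def check_traits(traits, standard_traits=frozenset()):
    """Validate config-specified traits; raise ValueError on a bad one."""
    for trait in traits:
        if trait.startswith(GENERATED_PREFIX):
            raise ValueError("%s conflicts with generated traits" % trait)
        if trait not in standard_traits and not CUSTOM_RE.match(trait):
            raise ValueError("%s is not a valid trait name" % trait)
    return list(traits)
```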
Resource Providers
The driver shall create nested resource providers, one per device (slot), as children of the compute node provider generated by Nova.
The provider name shall be generated as
PowerVM IOSlot %(drc_index)08X e.g.
PowerVM IOSlot 1C0FFEE1. We shall let the placement service
generate the UUID. This naming scheme allows us to identify the full set
of providers we "own". This includes identifying providers we may have
created on a previous iteration (potentially in a different process)
which now need to be purged (e.g. because the slot no longer exists on
the system). It also helps us provide a clear migration path in the
future, if, for example, Cyborg takes over generating these providers.
It also paves the way for providers corresponding to things smaller than
a slot; e.g. PFs might be namespaced
PowerVM PF %(drc_index)08X.
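For illustration, the naming scheme can be expressed with ordinary Python string formatting. The sketch assumes the DRC index is available as an integer; provider_name is a hypothetical helper.

```python
PROVIDER_NAME_FMT = "PowerVM IOSlot %(drc_index)08X"


def provider_name(drc_index):
    """Build the resource provider name for a slot's integer DRC index."""
    return PROVIDER_NAME_FMT % {"drc_index": drc_index}


print(provider_name(0x1C0FFEE1))  # PowerVM IOSlot 1C0FFEE1
print(provider_name(0x2A))        # PowerVM IOSlot 0000002A
```

Zero-padding to eight uppercase hex digits keeps every name the same length, which is what lets the DRC index be recovered later from the last eight characters.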
Inventory
Each device RP shall have an inventory of:
total: 1
reserved: 0
min_unit: 1
max_unit: 1
step_size: 1
allocation_ratio: 1.0
of the resource_class specified in the
config file for the paragraph matching this device
(CUSTOM_IOSLOT by default).
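As a sketch, the inventory for one device could be represented as a mapping from resource class to the fields listed above. The inventory_for helper and the exact dict shape are assumptions made for illustration.

```python
def inventory_for(resource_class="CUSTOM_IOSLOT"):
    """Return a single-unit inventory keyed by resource class (a sketch)."""
    return {
        resource_class: {
            "total": 1,
            "reserved": 0,
            "min_unit": 1,
            "max_unit": 1,
            "step_size": 1,
            "allocation_ratio": 1.0,
        }
    }


print(inventory_for("CUSTOM_GPU"))
```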
Note
Future: Some day we will provide SR-IOV VFs, vGPUs, FPGA regions/functions, etc. At that point we will conceivably have inventory of multiple units of multiple resource classes, etc.
Generated Traits
The provider for a device shall be decorated with the following automatically-generated traits:
- CUSTOM_POWERVM_IOSLOT_VENDOR_ID_%(vendor_id)04X
- CUSTOM_POWERVM_IOSLOT_DEVICE_ID_%(device_id)04X
- CUSTOM_POWERVM_IOSLOT_SUBSYS_VENDOR_ID_%(subsys_vendor_id)04X
- CUSTOM_POWERVM_IOSLOT_SUBSYS_DEVICE_ID_%(subsys_device_id)04X
- CUSTOM_POWERVM_IOSLOT_CLASS_%(class)04X
- CUSTOM_POWERVM_IOSLOT_REVISION_ID_%(revision_id)02X
- CUSTOM_POWERVM_IOSLOT_DRC_INDEX_%(drc_index)08X
- CUSTOM_POWERVM_IOSLOT_DRC_NAME_%(drc_name)s where drc_name is normalized via os_traits.normalize_name.
In addition, the driver shall decorate the provider with any traits specified in the config file paragraph identifying this device. If that paragraph specifies any of the above generated traits, an exception shall be raised (we'll blow up the compute service).
update_provider_tree
The above provider tree structure/data shall be provided to Nova by
overriding the ComputeDriver.update_provider_tree method.
The algorithm shall be as follows:
- Parse the config file.
- Discover devices (GET /ManagedSystem, pull out .asio_config.io_slots).
- Merge the config data with the discovered devices to produce a list of devices to pass through, along with inventory of the appropriate resource class name, and traits (generated and specified).
- Ensure the tree contains entries according to this calculated passthrough list, with appropriate inventory and traits.
- Set-subtract the names of the providers in the calculated passthrough list from those in the provider tree whose names are prefixed with PowerVM IOSlot and delete the resulting "orphans".
This is in addition to the standard update_provider_tree
contract of ensuring appropriate VCPU,
MEMORY_MB, and DISK_GB resources on the
compute node provider.
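The orphan-detection step of the algorithm above can be sketched with plain sets. The sketch works on provider names only and does not touch Nova's provider tree object; find_orphans is a hypothetical helper.

```python
OWNED_PREFIX = "PowerVM IOSlot "


def find_orphans(tree_provider_names, passthrough_names):
    """Return names of providers we own that are no longer wanted.

    "Owned" means the name carries the PowerVM IOSlot prefix; anything
    else in the tree (e.g. the compute node provider) is left alone.
    """
    owned = {name for name in tree_provider_names
             if name.startswith(OWNED_PREFIX)}
    return owned - set(passthrough_names)


# Illustrative names only.
in_tree = ["compute1", "PowerVM IOSlot 00000001", "PowerVM IOSlot 00000002"]
wanted = ["PowerVM IOSlot 00000002", "PowerVM IOSlot 00000003"]
print(find_orphans(in_tree, wanted))
```

Here only the provider for slot 00000001 is an orphan: it is in the tree under our prefix but absent from the calculated passthrough list.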
Note
It is a stretch goal of this blueprint to implement caching and/or other enhancements to the above algorithm to optimize performance by minimizing the need to call PowerVM REST and/or process whitelist files every time.
Flavor Support
Existing Nova support for generic resource specification via flavor extra specs should "just work". For example, a flavor requesting two GPUs might look like:
resources:VCPU=1
resources:MEMORY_MB=2048
resources:DISK_GB=100
resources1:CUSTOM_GPU=1
traits1:CUSTOM_POWERVM_IOSLOT_VENDOR_ID_G00D=required
traits1:CUSTOM_POWERVM_IOSLOT_DEVICE_ID_F00D=required
resources2:CUSTOM_GPU=1
traits2:CUSTOM_POWERVM_IOSLOT_DRC_INDEX_1C0FFEE1=required
PowerVMDriver
spawn
During spawn, we will query placement to retrieve the
resource provider records listed in the allocations
parameter. Any provider names which are prefixed with
PowerVM IOSlot will be parsed to extract the DRC index (the
last eight characters of the provider name). The corresponding slots
will be extracted from the ManagedSystem payload and added
to the LogicalPartition payload for the instance as it is
being created.
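Extracting DRC indexes from provider names might look like the following. The sketch assumes the provider names have already been retrieved from placement and takes them as a plain list; drc_indexes_from_providers is a hypothetical helper.

```python
OWNED_PREFIX = "PowerVM IOSlot "


def drc_indexes_from_providers(provider_names):
    """Return integer DRC indexes for providers named as IOSlots."""
    indexes = []
    for name in provider_names:
        if name.startswith(OWNED_PREFIX):
            # The DRC index is the last eight characters, in hex.
            indexes.append(int(name[-8:], 16))
    return indexes


names = ["compute1", "PowerVM IOSlot 1C0FFEE1"]
print(["%08X" % idx for idx in drc_indexes_from_providers(names)])
```

Providers without the prefix, such as the compute node provider, are ignored.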
destroy
IOSlots are detached automatically when we DELETE the
LogicalPartition, so no changes should be required
here.
Live Migration
Since we can't migrate the state of an active GPU, we will block live migration of a VM with an attached IOSlot.
Cold Migration, Rebuild, Remote Restart
We should get these for free, but need to make sure they're tested.
Hot plug/unplug
This is not in the scope of the current effort. For now, attaching/detaching devices to/from existing VMs can only be accomplished via resize (Cold Migration).
Alternatives
Use Nova's PCI passthrough subsystem. We've all agreed this sucks and is not the way forward.
Use oslo.config instead of a YAML file. Experience with the
[pci]passthrough_whitelist has led us to conclude that its
config format is too restrictive/awkward. The direction for Nova (as
discussed at the Queens PTG in Denver) will be toward some kind of YAML
format; we're going to be the pioneers on this front.
Security impact
It is the operator's responsibility to ensure that the passthrough YAML config file has appropriate permissions, and lists only devices which do not themselves pose a security risk if attached to a malicious VM.
End user impact
Users get acceleration for their workloads o/
Performance Impact
Discovery
For the update_provider_tree
flow, we're adding the step of loading and parsing the passthrough YAML
config file. This should be negligible compared to e.g. retrieving the
ManagedSystem object (which we're already doing, so no
impact there).
spawn/destroy
There's no impact from the community side. It may take longer to create or destroy a LogicalPartition with attached IOSlots.
Deployer impact
None.
Developer impact
None.
Upgrade impact
None.
Implementation
Assignee(s)
- Primary assignee: efried
- Other contributors: edmondsw, mdrabe
Work Items
See Proposed change.
Dependencies
os-traits 0.9.0 to pick up the normalize_name
method.
Testing
Testing this in the CI will be challenging, given that we are not likely to score GPUs for all of our nodes.
We will likely need to rely on manual testing and PowerVC to cover the code paths described under PowerVMDriver with a handful of various device configurations.
Documentation Impact
- Add a section to our support matrix for generic device passthrough.
- User documentation for:
- How to build the passthrough YAML file.
- How to construct flavors accordingly.
References
None.
History
| Release Name | Description |
|---|---|
| Rocky | Introduced |