From a11b0d5b4f33c277d97bf0c70aa789956b712c78 Mon Sep 17 00:00:00 2001
From: Sundar Nadathur
Date: Tue, 17 Sep 2019 07:58:26 -0700
Subject: [PATCH] Re-proposed Nova Cyborg interaction specification.

Describes the Nova - Cyborg interaction needed to create and manage
instances with accelerators, and the changes needed in Nova to
accomplish that.

No changes from Train version except History.

Previously-approved: Train
Change-Id: I42d1829ab02db5f927a6bd63235cec667d416264
Blueprint: nova-cyborg-interaction
---
 .../approved/nova-cyborg-interaction.rst | 475 ++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 specs/ussuri/approved/nova-cyborg-interaction.rst

diff --git a/specs/ussuri/approved/nova-cyborg-interaction.rst b/specs/ussuri/approved/nova-cyborg-interaction.rst
new file mode 100644
index 000000000..6a9bc10ff
--- /dev/null
+++ b/specs/ussuri/approved/nova-cyborg-interaction.rst
@@ -0,0 +1,475 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=========================
+Nova - Cyborg Interaction
+=========================
+
+https://blueprints.launchpad.net/nova/+spec/nova-cyborg-interaction
+
+This specification describes the Nova - Cyborg interaction needed to create
+and manage instances with accelerators, and the changes needed in Nova to
+accomplish that.
+
+Problem description
+===================
+
+Scope
+-----
+
+Nova and Cyborg need to interact in many areas for handling instances with
+accelerators. While this spec covers the gamut, specific areas are covered in
+detail in other specs. We list all the areas below, identify which specific
+parts are covered by other specs, and describe what is covered in this spec.
+
+* Representation: Cyborg shall represent devices as nested resource providers
+  under the compute node (except possibly for disaggregated servers),
+  accelerator types as resource classes and accelerators as inventory in
+  Placement. The properties needed for scheduling are represented as traits.
+  This is specified by [#cy-nova-place]_. This spec does not
+  dwell on this topic.
+
+* Discovery and Updates: Among the devices discovered in a host, Cyborg
+  intends to claim only those that are not included under the PCI Whitelisting
+  mechanism. Cyborg shall update Placement in a way that is compatible with
+  the virt driver's update of Placement. These aspects are addressed in
+  sections `Coexistence with PCI whitelists`_ and `Placement update`_
+  respectively.
+
+* User requests for accelerators: Users usually request compute resources via
+  flavors. However, since the requests for devices may be highly varied,
+  placing them in flavors may result in flavor explosion. We avoid that by
+  expressing device requests in a device profile [#dev-prof]_. The
+  relationship between device profiles and flavors is explored in Section
+  `User requests`_.
+
+  When an instance creation (boot) request is made, the contents of a device
+  profile shall be translated to request groups in the request spec; the
+  syntax in request groups is covered in Section `Updating the Request Spec`_.
+
+* Instance scheduling: Nova shall use the Placement data populated by Cyborg
+  to schedule instances. This spec does not dwell on this topic.
+
+* Assignment of accelerators: We introduce the concept of Accelerator Request
+  objects in Section `Accelerator Requests`_. The workflow to create and use
+  them is summarized in Section `Nova changes for Assignment workflow`_. The
+  same section also highlights the Nova changes needed. The details of the
+  Cyborg API implementation for this workflow are covered in Cyborg specs
+  ([#cy-api-impl]_).
+
+* Instance operations: The behavior with respect to accelerators for all
+  standard instance operations is defined in [#inst-ops]_.
+  This spec does not dwell on this topic.
+
+Use Cases
+---------
+
+* A user requests an instance with one or more accelerators of different
+  types assigned to it.
+* An operator may provide users with both Device as a Service and
+  Accelerated Function as a Service in the same cluster (see
+  [#cy-nova-place]_).
+
+The following use cases are not addressed in Train but are of long-term
+interest:
+
+* A user requests to add one or more accelerators to an existing instance.
+* Live migration with accelerators.
+
+Proposed change
+===============
+
+Coexistence with PCI whitelists
+-------------------------------
+The operator tells Nova which PCI devices to claim and use by configuring the
+PCI Whitelists mechanism. In addition, the operator installs Cyborg drivers in
+compute nodes and configures/enables them. Those drivers may then discover and
+report some PCI devices. The operator must ensure that both configurations
+are compatible.
+
+Ideally, there should be a single way for the operator to identify which PCI
+devices should be claimed by Nova and which by Cyborg. This could be along the
+lines suggested in [#generic-dev-disc]_ or [#kosamara]_. If such a mechanism
+could be agreed upon by all stakeholders, Cyborg could adopt it.
+
+Until that point, the operator tells Cyborg which devices to claim by
+using Cyborg's configuration file. The operator must ensure that this is
+compatible with the PCI whitelists configured in Nova.
+
+Placement update
+----------------
+Cyborg shall call the Placement API directly to represent devices and
+accelerators. Some of the intended use cases for the API invocation are:
+
+* Create or delete child RPs under the compute node RP.
+* Create or delete custom RCs and custom traits.
+* Associate traits with RPs or remove such association.
+* Update RP inventory.
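A minimal sketch of how the Placement requests listed above might be composed. It only builds the HTTP method, path and JSON body for each call and sends nothing. The helper names, the provider name, the UUIDs and the ``CUSTOM_*`` names are hypothetical and are not taken from Cyborg; the provider generation would, in a real exchange, come from the previous Placement response.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class PlacementRequest(NamedTuple):
    """One REST call to Placement: method, path and optional JSON body."""
    method: str
    path: str
    body: Optional[Dict[str, Any]] = None


def _require_custom(name: str) -> str:
    # Custom resource classes and traits carry the CUSTOM_ prefix.
    if not name.startswith("CUSTOM_"):
        raise ValueError(f"not a custom name: {name}")
    return name


def create_child_rp(name: str, rp_uuid: str, parent_uuid: str) -> PlacementRequest:
    # A nested provider is tied to the compute node RP via parent_provider_uuid.
    body = {"name": name, "uuid": rp_uuid, "parent_provider_uuid": parent_uuid}
    return PlacementRequest("POST", "/resource_providers", body)


def create_custom_rc(name: str) -> PlacementRequest:
    return PlacementRequest("PUT", f"/resource_classes/{_require_custom(name)}")


def create_custom_trait(name: str) -> PlacementRequest:
    return PlacementRequest("PUT", f"/traits/{_require_custom(name)}")


def set_traits(rp_uuid: str, generation: int, traits: List[str]) -> PlacementRequest:
    # Replaces the full trait set of the provider; generation guards against races.
    body = {"resource_provider_generation": generation, "traits": sorted(traits)}
    return PlacementRequest("PUT", f"/resource_providers/{rp_uuid}/traits", body)


def set_inventory(rp_uuid: str, generation: int, resource_class: str,
                  total: int) -> PlacementRequest:
    # Replaces the full inventory of the provider with a single resource class.
    body = {"resource_provider_generation": generation,
            "inventories": {resource_class: {"total": total}}}
    return PlacementRequest("PUT", f"/resource_providers/{rp_uuid}/inventories", body)


# Hypothetical plan for one device under a compute node RP.
plan = [
    create_custom_rc("CUSTOM_ACCEL_EXAMPLE"),
    create_custom_trait("CUSTOM_EXAMPLE_FUNCTION"),
    create_child_rp("host1_dev0", "dev-rp-uuid", "compute-rp-uuid"),
    set_traits("dev-rp-uuid", 0, ["CUSTOM_EXAMPLE_FUNCTION"]),
    set_inventory("dev-rp-uuid", 1, "CUSTOM_ACCEL_EXAMPLE", 1),
]
```

Every request in the plan targets the device RP or a custom RC or trait; none of them addresses the compute node RP itself, in line with the rule that Cyborg leaves providers created by other components untouched.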
+
+Cyborg shall not modify the RPs created by any other component, such
+as Nova virt drivers.
+
+User requests
+-------------
+
+The user request for accelerators is encapsulated in a device profile
+[#dev-prof]_, which is created and managed by the admin via the Cyborg API.
+
+A device profile may be viewed as a 'flavor for devices'. Accordingly, the
+instance request should include both a flavor and a device profile. However,
+that requires a change to the Nova API for instance creation. To mitigate the
+impact of such changes on users and operators, we propose to do this
+in phases.
+
+In the initial phase, the Nova API remains as it is today. The operator folds
+the device profile into the flavor as an extra spec, as below::
+
+   openstack flavor set --property 'accel:device_profile=' flavor
+
+Thus the standard Nova API can be used to create an instance with only the
+flavor (without device profiles), like this::
+
+   openstack server create --flavor f .... # instance creation
+
+In the future, a device profile may be used by itself to specify accelerator
+resources for the instance creation API.
+
+Updating the Request Spec
+-------------------------
+When the user submits a request to create an instance, as described in Section
+`User requests`_, Nova needs to call a Cyborg API to get back the resource
+request groups in the device profile and merge them into the request spec.
+(This is along the lines of the scheme proposed for Neutron
+[#req-spec-groups]_.)
+
+.. _cyborg-client-module:
+
+This call, like all the others that Nova would make to Cyborg APIs, is done
+through a Keystone-based adapter that would locate the Cyborg service, similar
+to the way Nova calls Placement. A new Cyborg client module shall be added to
+Nova, to encapsulate such calls and to provide Cyborg-specific functionality.
+
+VM images in Glance may be associated with image properties (other than image
+traits), such as bitstream/function IDs needed for that image. So, Nova should
+pass the VM image UUID from the request spec to Cyborg. This is TBD.
+
+The groups in the device profile are numbered by Cyborg. The request groups
+that are merged into the request spec are numbered by Nova. These numberings
+would not be the same in general, i.e., the N-th device profile group may not
+correspond to the N-th request group in the request spec.
+
+When the device profile request groups are added to other request groups in
+the flavor, the ``group_policy`` of the flavor shall govern the overall
+semantics of all request groups.
+
+Accelerator Requests
+--------------------
+An accelerator request (ARQ) is an object that represents
+the state of the request for an accelerator to be assigned to an instance.
+The creation and management of ARQs are handled by Cyborg, and ARQs are
+persisted in the Cyborg database.
+
+An ARQ, by definition, represents a request for a single accelerator. The
+device profile in the user request may have N request groups, each asking for
+M accelerators; then ``N * M`` ARQs will be created for that device profile.
+
+When an ARQ is initially created by Cyborg, it is not yet associated with a
+specific host name or a device resource provider. So it is said to be in an
+unbound state. Subsequently, Nova calls Cyborg to bind the ARQ to a host name,
+a device RP UUID and an instance UUID. If the instance fails to spawn, Nova
+would unbind the ARQ without deleting it. On instance termination, Nova would
+delete the ARQs after unbinding them.
+
+.. _match-rp:
+
+Each ARQ needs to be matched to the specific RP in the allocation candidate
+that Nova has chosen, before the ARQ is bound. Since Placement does not match
+RPs to request groups, this must be done in the Cyborg client module of Nova
+(`cyborg-client-module`_). The matching is done using the requester_id field
+in the ``RequestGroup`` object ([#requester-id]_) as below:
+
+* The order of request groups in a device profile is not significant, but it
+  is preserved by Cyborg. Thus, each device profile request group has a unique
+  index.
+* When the device profile request groups returned by Cyborg are added to the
+  request spec, the requester_id field is set to 'device_profile_' for the
+  N-th device profile request group (starting from zero). The device profile
+  name need not be included here because there is only one device profile per
+  request spec.
+* When Cyborg creates an ARQ for a device profile, it embeds the device
+  profile request group index in the ARQ before returning it to Nova.
+* The matching is done in two steps:
+
+  * Each ARQ is mapped to a specific request group in the request spec using
+    the requester_id field.
+  * Each request group is mapped to a specific RP using the same logic as the
+    Neutron bandwidth provider ([#map-rg-to-rp]_).
+
+Nova changes for Assignment workflow
+------------------------------------
+This section summarizes the workflow details for Phase 1. The changes needed
+in Nova are marked with NEW.
+
+NEW: A Cyborg client module is added to Nova (`cyborg-client-module`_). All
+Cyborg API calls are routed through that.
+
+#. The Nova API server receives a ``POST /servers`` API request with a flavor
+   that includes a device profile name.
+
+#. NEW: The Nova API server calls the Cyborg API ``GET
+   /v2/device_profiles?name=$device_profile_name`` and gets back the device
+   profile request groups. These are added to the request spec.
+
+#. The Nova scheduler invokes Placement and gets a list of allocation
+   candidates. It selects one of those candidates and makes
+   claim(s) in Placement. The Nova conductor then sends an RPC message
+   ``build_and_run_instances`` to the Nova compute manager.
+
+#. NEW: Nova calls the Cyborg API ``POST /v2/accelerator_requests`` with the
+   device profile name. Cyborg creates a set of unbound ARQs for that device
+   profile and returns them to Nova. (The call may originate from Nova
+   conductor or the compute manager; that will be settled in code review.)
+
+#. NEW: The Cyborg client in Nova matches each ARQ to the resource provider
+   picked for that accelerator. See `match-rp`_.
+
+#. NEW: The Nova compute manager calls the Cyborg API ``PATCH
+   /v2/accelerator_requests`` to bind the ARQ with the host name, device's RP
+   UUID and instance UUID. This is an asynchronous call which prepares or
+   reconfigures the device in the background.
+
+#. NEW: Cyborg, on completion of the bindings (successfully or otherwise),
+   calls Nova's ``POST /os-server-external-events`` API with::
+
+       {
+          "events": [
+             { "name": "arq_resolved",
+               "tag": $arq_uuid,
+               "server_uuid": $instance_uuid,
+               "status": "ok" # or "failed"
+             },
+             ...
+          ]
+       }
+
+#. NEW: The Nova virt driver waits for the notification, subject to the
+   timeout mentioned in Section `Other deployer impact`_. It then calls
+   the Cyborg REST API ``GET
+   /v2/accelerator_requests?instance=&bind_state=resolved``.
+
+#. NEW: The Nova virt driver uses the attach handles returned from the Cyborg
+   call to compose PCI passthrough devices into the VM's definition.
+
+#. NEW: If there is any error after binding has been initiated, Nova
+   must unbind the relevant ARQs by calling Cyborg API. It may then retry on
+   another host or delete the (unbound) ARQs for the instance.
+
+This flow is captured by the following sequence diagram, in which the Nova
+conductor and scheduler are together represented as the Nova controller. The
+ARQ creation is shown to happen in Nova compute manager only for concreteness;
+it may be in the controller instead.
+
+.. seqdiag::
+
+   seqdiag {
+      edge_length = 200;
+      span_height = 15;
+      activation = none;
+      default_note_color = white;
+      'Nova Controller'; 'Placement'; 'Cyborg'; 'Nova Compute';
+
+      'Nova Controller' -> 'Cyborg' [label =
+          "GET /v2/device_profiles?name=mydp"];
+      'Nova Controller' <- 'Cyborg' [label =
+          '{"device_profiles": $device_profile}'];
+      'Nova Controller' -> 'Nova Controller' [label=
+          'Merge request groups into request_spec'];
+      'Nova Controller' -> 'Placement' [label=
+          'GET /allocation_candidates'];
+      'Nova Controller' <- 'Placement' [label=
+          'allocation candidates with nested RPs'];
+      'Nova Controller' -> 'Nova Controller' [label=
+          'Select a candidate'];
+      'Nova Controller' -> 'Nova Compute' [label=
+          'build_and_run_instances()'];
+      'Nova Compute' -> 'Cyborg' [label=
+          'POST /v2/accelerator_requests'];
+      'Nova Compute' <- 'Cyborg' [label=
+          '{"arqs": [$arq, ...]}'];
+      'Nova Compute' -> 'Cyborg' [label=
+          'PATCH /v2/accelerator_requests'];
+      'Nova Compute' <- 'Cyborg' [label=
+          '{"arqs": [$arq, ...]}'];
+      'Cyborg' -> 'Nova Controller' [label=
+          'POST /os-server-external-events'];
+      'Nova Compute' -> 'Nova Compute' [label=
+          'Wait for notification from Cyborg'];
+      'Nova Compute' -> 'Cyborg' [label=
+          'GET /v2/accelerator_requests?
+           instance=$uuid&bind_state=resolved'];
+      'Nova Compute' <- 'Cyborg' [label=
+          '{"arqs": [$arq, ...]}'];
+   }
+
+
+Alternatives
+------------
+It is possible to have an external agent create ARQs from device profiles
+by calling Cyborg, and then feed those pre-created ARQs to the Nova instance
+creation API, analogous to Neutron ports. We do not take that approach yet
+because it requires changes to the Nova instance creation API.
+
+It is possible to have the Nova virt driver poll for the Cyborg ARQ binding
+completion. That is not preferable, partly because that is not the pattern of
+interaction with other services like Neutron.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None.
+A new extra_spec key ``accel:device_profile`` is added to the flavor.
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+Nova may choose to add additional notifications for Cyborg API calls.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+The extra calls to the Cyborg REST API may impact Nova conductor/scheduler
+throughput. This has been mitigated by making some critical Cyborg
+operations asynchronous.
+
+Other deployer impact
+---------------------
+
+The deployer needs to set up the ``clouds.yaml`` file so that Nova
+can call the Cyborg REST API.
+
+The deployer needs to configure a new tunable in ``nova-cpu.conf``::
+
+  * arq_binding_timeout (integer): Time in seconds for Nova compute
+    manager to wait for Cyborg to notify that ARQ binding is done.
+    Timeout is fatal, i.e., VM startup is aborted with an exception.
+    Default: 300.
+
+Developer impact
+----------------
+
+Define two new standard resource classes: FPGA and PGPU.
+
+We have VGPU and VGPU_DISPLAY_HEAD RCs defined already. But we propose a
+PGPU as a different RC for the following reasons:
+
+ * Both VGPU and VGPU_DISPLAY_HEAD RCs specifically refer to virtual GPUs.
+   We need a different one for physical GPUs.
+ * It will be subject to separate quotas/limits in Keystone.
+ * Using PCI_DEVICE RC is too general: we want quotas for GPU RC
+   specifically.
+
+Upgrade impact
+--------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Sundar Nadathur
+
+Work Items
+----------
+
+See the steps marked NEW in Section `Nova changes for Assignment workflow`_.
+
+Dependencies
+============
+
+* Specification for device profiles [#dev-prof]_.
+* Cyborg API specification [#cy-api]_.
+
+Testing
+=======
+There need to be unit tests and functional tests for the Nova changes.
+Specifically, there needs to be a functional test fixture that mocks the
+Cyborg API calls.
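A minimal sketch of what such a fixture might look like, assuming an in-memory stand-in for Cyborg. The class name, method names, ARQ field names and the shape of the device profile table are all hypothetical and are not Cyborg's actual API. As a simplification, binding resolves immediately, whereas the real call is asynchronous and completes with an external event.

```python
import uuid
from typing import Dict, List


class FakeCyborg:
    """In-memory stand-in for the Cyborg calls Nova would make (hypothetical)."""

    def __init__(self, device_profiles: Dict[str, List[int]]):
        # Maps a device profile name to the number of accelerators requested
        # by each of its request groups, in group order.
        self._profiles = device_profiles
        self._arqs: Dict[str, dict] = {}

    def create_arqs(self, dp_name: str) -> List[dict]:
        # One ARQ per requested accelerator: N groups of M yield N * M ARQs.
        created = []
        for group_index, count in enumerate(self._profiles[dp_name]):
            for _ in range(count):
                arq = {
                    "uuid": str(uuid.uuid4()),
                    "device_profile_name": dp_name,
                    "device_profile_group_id": group_index,
                    "state": "unbound",
                    "hostname": None,
                    "device_rp_uuid": None,
                    "instance_uuid": None,
                }
                self._arqs[arq["uuid"]] = arq
                created.append(arq)
        return created

    def bind_arq(self, arq_uuid: str, hostname: str, device_rp_uuid: str,
                 instance_uuid: str) -> None:
        # Simplification: the fake resolves the binding synchronously.
        self._arqs[arq_uuid].update(
            state="resolved", hostname=hostname,
            device_rp_uuid=device_rp_uuid, instance_uuid=instance_uuid)

    def unbind_arq(self, arq_uuid: str) -> None:
        # Unbinding keeps the ARQ but clears its host, RP and instance.
        self._arqs[arq_uuid].update(
            state="unbound", hostname=None,
            device_rp_uuid=None, instance_uuid=None)

    def get_resolved_arqs(self, instance_uuid: str) -> List[dict]:
        return [a for a in self._arqs.values()
                if a["instance_uuid"] == instance_uuid
                and a["state"] == "resolved"]

    def delete_arqs(self, arq_uuids: List[str]) -> None:
        for arq_uuid in arq_uuids:
            self._arqs.pop(arq_uuid, None)


# A profile with two request groups of three accelerators each gives six ARQs.
fake = FakeCyborg({"mydp": [3, 3]})
arqs = fake.create_arqs("mydp")
fake.bind_arq(arqs[0]["uuid"], "host1", "dev-rp-uuid", "instance-uuid")
```

A test could then drive the bind, unbind and delete paths against this object without a running Cyborg service; failure modes could be modelled by adding a state other than ``resolved``.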
+
+There need to be tempest tests for the end-to-end flow, including failure
+modes. The tempest tests should be targeted at a fake driver (in addition to
+real hardware, if any) and tied to the Nova Zuul gate.
+
+Documentation Impact
+====================
+Device profile creation needs to be documented in Cyborg, as noted in
+[#dev-prof]_.
+
+The need for the operator to fold the device profile into the flavor needs to
+be documented.
+
+References
+==========
+
+.. [#cy-nova-place] `Specification for Cyborg Nova Placement
+   interaction `_
+
+.. [#dev-prof] `Device profiles specification
+   `_
+
+.. [#cy-api-impl] `Specification for Cyborg API implementation
+   `_
+
+.. [#inst-ops] `Specification for instance operations with accelerators
+   `_
+
+.. [#generic-dev-disc] `Generic device discovery
+   `_
+
+.. [#kosamara] `Modelling passthrough devices for report to placement
+   `_
+
+.. [#req-spec-groups] `Store RequestGroup objects in RequestSpec
+   `_
+
+.. [#requester-id] `Requester_id field in RequestGroup
+   `_
+
+.. [#map-rg-to-rp] `Map request groups to resource providers
+   `_
+
+.. [#cy-api] `Specification for Cyborg API Version 2
+   `_
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Train
+     - Introduced
+   * - Ussuri
+     - Re-proposed