From a11b0d5b4f33c277d97bf0c70aa789956b712c78 Mon Sep 17 00:00:00 2001
From: Sundar Nadathur <sundar.nadathur@intel.com>
Date: Tue, 17 Sep 2019 07:58:26 -0700
Subject: [PATCH] Re-proposed Nova Cyborg interaction specification.

Describes the Nova - Cyborg interaction needed to create and manage
instances with accelerators, and the changes needed in Nova to
accomplish that.

No changes from Train version except History.

Previously-approved: Train
Change-Id: I42d1829ab02db5f927a6bd63235cec667d416264
Blueprint: nova-cyborg-interaction
---
 .../approved/nova-cyborg-interaction.rst      | 475 ++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 specs/ussuri/approved/nova-cyborg-interaction.rst

diff --git a/specs/ussuri/approved/nova-cyborg-interaction.rst b/specs/ussuri/approved/nova-cyborg-interaction.rst
new file mode 100644
index 000000000..6a9bc10ff
--- /dev/null
+++ b/specs/ussuri/approved/nova-cyborg-interaction.rst
@@ -0,0 +1,475 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=========================
+Nova - Cyborg Interaction
+=========================
+
+https://blueprints.launchpad.net/nova/+spec/nova-cyborg-interaction
+
+This specification describes the Nova - Cyborg interaction needed to create
+and manage instances with accelerators, and the changes needed in Nova to
+accomplish that.
+
+Problem description
+===================
+
+Scope
+-----
+
+Nova and Cyborg need to interact in many areas for handling instances with
+accelerators. While this spec covers the gamut, specific areas are covered in
+detail in other specs. We list all the areas below, identify which specific
+parts are covered by other specs, and describe what is covered in this spec.
+
+* Representation: Cyborg shall represent devices as nested resource providers
+  under the compute node (except possibly for disaggregated servers),
+  accelerator types as resource classes and accelerators as inventory in
+  Placement. The properties needed for scheduling are represented as traits.
+  This is specified by [#cy-nova-place]_. This spec does not
+  dwell on this topic.
+
+* Discovery and Updates: Among the devices discovered in a host, Cyborg
+  intends to claim only those that are not included under the PCI Whitelisting
+  mechanism. Cyborg shall update Placement in a way that is compatible with
+  the virt driver's update of Placement. These aspects are addressed in
+  sections `Coexistence with PCI whitelists`_ and `Placement update`_
+  respectively.
+
+* User requests for accelerators: Users usually request compute resources via
+  flavors. However, since the requests for devices may be highly varied,
+  placing them in flavors may result in flavor explosion. We avoid that by
+  expressing device requests in a device profile [#dev-prof]_. The
+  relationship between device profiles and flavors is explored in Section
+  `User requests`_.
+
+  When an instance creation (boot) request is made, the contents of a device
+  profile shall be translated to request groups in the request spec; the
+  syntax in request groups is covered in Section `Updating the Request Spec`_.
+
+* Instance scheduling: Nova shall use the Placement data populated by Cyborg
+  to schedule instances. This spec does not dwell on this topic.
+
+* Assignment of accelerators: We introduce the concept of Accelerator Request
+  objects in Section `Accelerator Requests`_.  The workflow to create and use
+  them is summarized in Section `Nova changes for Assignment workflow`_. The
+  same section also highlights the Nova changes needed. The details of the
+  Cyborg API implementation for this workflow are covered in Cyborg specs
+  ([#cy-api-impl]_).
+
+* Instance operations: The behavior with respect to accelerators for all
+  standard instance operations is defined in [#inst-ops]_.
+  This spec does not dwell on this topic.
+
+Use Cases
+---------
+
+* A user requests an instance with one or more accelerators of different
+  types assigned to it.
+* An operator may provide users with both Device as a Service and
+  Accelerated Function as a Service in the same cluster (see
+  [#cy-nova-place]_).
+
+The following use cases are not addressed in Train but are of long term
+interest:
+
+* A user requests to add one or more accelerators to an existing instance.
+* Live migration with accelerators.
+
+Proposed change
+===============
+
+Coexistence with PCI whitelists
+-------------------------------
+The operator tells Nova which PCI devices to claim and use by configuring the
+PCI Whitelists mechanism. In addition, the operator installs Cyborg drivers in
+compute nodes and configures/enables them. Those drivers may then discover and
+report some PCI devices. The operator must ensure that both configurations
+are compatible.
+
+Ideally, there should be a single way for the operator to identify which PCI
+devices should be claimed by Nova and which by Cyborg. This could be along the
+lines suggested in [#generic-dev-disc]_ or [#kosamara]_. If such a mechanism
+could be agreed upon by all stakeholders, Cyborg could adopt it.
+
+Until that point, the operator tells Cyborg which devices to claim by
+using Cyborg's configuration file. The operator must ensure that this is
+compatible with the PCI whitelists configured in Nova.
+
+Placement update
+----------------
+Cyborg shall call the Placement API directly to represent devices and
+accelerators. Some of the intended use cases for the API invocation are:
+
+* Create or delete child RPs under the compute node RP.
+* Create or delete custom RCs and custom traits.
+* Associate traits with RPs or remove such association.
+* Update RP inventory.
+
+Cyborg shall not modify the RPs created by any other component, such
+as Nova virt drivers.
+
+User requests
+-------------
+
+The user request for accelerators is encapsulated in a device profile
+[#dev-prof]_, which is created and managed by the admin via the Cyborg API.
+
+A device profile may be viewed as a 'flavor for devices'. Accordingly, the
+instance request should include both a flavor and a device profile. However,
+that requires a change to the Nova API for instance creation. To mitigate the
+impact of such changes on users and operators, we propose to do this
+in phases.
+
+In the initial phase, the Nova API remains as it is today. The device profile
+is folded into the flavor as an extra spec by the operator, as below::
+
+ openstack flavor set --property 'accel:device_profile=<profile_name>' flavor
+
+Thus the standard Nova API can be used to create an instance with only the
+flavor (without device profiles), like this::
+
+ openstack server create --flavor f ....  # instance creation
+
+In the future, a device profile may be used by itself to specify accelerator
+resources for the instance creation API.
+
+Updating the Request Spec
+-------------------------
+When the user submits a request to create an instance, as described in Section
+`User requests`_, Nova needs to call a Cyborg API to get back the resource
+request groups in the device profile and merge them into the request spec.
+(This is along the lines of the scheme proposed for Neutron
+[#req-spec-groups]_.)
+
+..  _cyborg-client-module:
+
+This call, like all the others that Nova would make to Cyborg APIs, is done
+through a Keystone-based adapter that would locate the Cyborg service, similar
+to the way Nova calls Placement. A new Cyborg client module shall be added to
+Nova, to encapsulate such calls and to provide Cyborg-specific functionality.
+
+VM images in Glance may be associated with image properties (other than image
+traits), such as bitstream/function IDs needed for that image. So, Nova should
+pass the VM image UUID from the request spec to Cyborg. This is TBD.
+
+The groups in the device profile are numbered by Cyborg. The request groups
+that are merged into the request spec are numbered by Nova. These numberings
+would not be the same in general, i.e., the N-th device profile group may not
+correspond to the N-th request group in the request spec.
+
+When the device profile request groups are added to other request groups in
+the flavor, the ``group_policy`` of the flavor shall govern the overall
+semantics of all request groups.
+
+Accelerator Requests
+--------------------
+An accelerator request (ARQ) is an object that represents
+the state of the request for an accelerator to be assigned to an instance.
+The creation and management of ARQs are handled by Cyborg, and ARQs are
+persisted in the Cyborg database.
+
+An ARQ, by definition, represents a request for a single accelerator. The
+device profile in the user request may have N request groups, each asking for
+M accelerators; then ``N * M`` ARQs will be created for that device profile.
+
+When an ARQ is initially created by Cyborg, it is not yet associated with a
+specific host name or a device resource provider. So it is said to be in an
+unbound state. Subsequently, Nova calls Cyborg to bind the ARQ to a host name,
+a device RP UUID and an instance UUID. If the instance fails to spawn, Nova
+would unbind the ARQ without deleting it. On instance termination, Nova would
+delete the ARQs after unbinding them.
+
+.. _match-rp:
+
+Each ARQ needs to be matched to the specific RP in the allocation candidate
+that Nova has chosen, before the ARQ is bound. Since Placement does not match
+RPs to request groups, this must be done in the Cyborg client module of Nova
+(`cyborg-client-module`_). The matching is done using the requester_id field
+in the ``RequestGroup`` object ([#requester-id]_) as below:
+
+* The order of request groups in a device profile is not significant, but it
+  is preserved by Cyborg. Thus, each device profile request group has a unique
+  index.
+* When the device profile request groups returned by Cyborg are added to the
+  request spec, the requester_id field is set to 'device_profile_<N>' for the
+  N-th device profile request group (starting from zero). The device profile
+  name need not be included here because there is only one device profile per
+  request spec.
+* When Cyborg creates an ARQ for a device profile, it embeds the device
+  profile request group index in the ARQ before returning it to Nova.
+* The matching is done in two steps:
+
+  * Each ARQ is mapped to a specific request group in the request spec using
+    the requester_id field.
+  * Each request group is mapped to a specific RP using the same logic as the
+    Neutron bandwidth provider ([#map-rg-to-rp]_).
+
+Nova changes for Assignment workflow
+------------------------------------
+This section summarizes the workflow details for Phase 1. The changes needed
+in Nova are marked with NEW.
+
+NEW: A Cyborg client module is added to Nova (`cyborg-client-module`_). All
+Cyborg API calls are routed through that.
+
+#. The Nova API server receives a ``POST /servers`` API request with a flavor
+   that includes a device profile name.
+
+#. NEW: The Nova API server calls the Cyborg API ``GET
+   /v2/device_profiles?name=$device_profile_name`` and gets back the device
+   profile request groups. These are added to the request spec.
+
+#. The Nova scheduler invokes Placement and gets a list of allocation
+   candidates. It selects one of those candidates and makes
+   claim(s) in Placement. The Nova conductor then sends an RPC message
+   ``build_and_run_instances`` to the Nova compute manager.
+
+#. NEW: Nova calls the Cyborg API ``POST /v2/accelerator_requests`` with the
+   device profile name. Cyborg creates a set of unbound ARQs for that device
+   profile and returns them to Nova. (The call may originate from Nova
+   conductor or the compute manager; that will be settled in code review.)
+
+#. NEW: The Cyborg client in Nova matches each ARQ to the resource provider
+   picked for that accelerator. See `match-rp`_.
+
+#. NEW: The Nova compute manager calls the Cyborg API ``PATCH
+   /v2/accelerator_requests`` to bind the ARQ with the host name, device's RP
+   UUID and instance UUID. This is an asynchronous call which prepares or
+   reconfigures the device in the background.
+
+#. NEW: Cyborg, on completion of the bindings (successfully or otherwise),
+   calls Nova's ``POST /os-server-external-events`` API with::
+
+    {
+       "events": [
+          { "name": "arq_resolved",
+            "tag": $arq_uuid,
+            "server_uuid": $instance_uuid,
+            "status": "ok" # or "failed"
+          },
+          ...
+       ]
+    }
+
+#. NEW: The Nova virt driver waits for the notification, subject to the
+   timeout mentioned in Section `Other deployer impact`_. It then calls
+   the Cyborg REST API ``GET
+   /v2/accelerator_requests?instance=<uuid>&bind_state=resolved``.
+
+#. NEW: The Nova virt driver uses the attach handles returned from the Cyborg
+   call to compose PCI passthrough devices into the VM's definition.
+
+#. NEW: If there is any error after binding has been initiated, Nova
+   must unbind the relevant ARQs by calling Cyborg API. It may then retry on
+   another host or delete the (unbound) ARQs for the instance.
+
+This flow is captured by the following sequence diagram, in which the Nova
+conductor and scheduler are together represented as the Nova controller. The
+ARQ creation is shown to happen in Nova compute manager only for concreteness;
+it may be in the controller instead.
+
+.. seqdiag::
+
+     seqdiag {
+         edge_length = 200;
+         span_height = 15;
+         activation = none;
+         default_note_color = white;
+         'Nova Controller'; 'Placement'; 'Cyborg'; 'Nova Compute';
+
+         'Nova Controller' -> 'Cyborg' [label =
+             "GET /v2/device_profiles?name=mydp"];
+         'Nova Controller' <- 'Cyborg' [label =
+             '{"device_profiles": $device_profile}'];
+         'Nova Controller' -> 'Nova Controller' [label=
+             'Merge request groups into request_spec'];
+         'Nova Controller' -> 'Placement' [label=
+             'GET /allocation_candidates'];
+         'Nova Controller' <- 'Placement' [label=
+             'allocation candidates with nested RPs'];
+         'Nova Controller' -> 'Nova Controller' [label=
+             'Select a candidate'];
+         'Nova Controller' -> 'Nova Compute' [label=
+             'build_and_run_instances()'];
+         'Nova Compute' -> 'Cyborg' [label=
+             'POST /v2/accelerator_requests'];
+         'Nova Compute' <- 'Cyborg' [label=
+             '{"arqs": [$arq, ...]}'];
+         'Nova Compute' -> 'Cyborg' [label=
+             'PATCH /v2/accelerator_requests'];
+         'Nova Compute' <- 'Cyborg' [label=
+             '{"arqs": [$arq, ...]}'];
+         'Cyborg' -> 'Nova Controller' [label=
+             'POST /os-server-external-events'];
+         'Nova Compute' -> 'Nova Compute' [label=
+             'Wait for notification from Cyborg'];
+         'Nova Compute' -> 'Cyborg' [label=
+             'GET /v2/accelerator_requests?
+             instance=$uuid&bind_state=resolved'];
+         'Nova Compute' <- 'Cyborg' [label=
+             '{"arqs": [$arq, ....]}'];
+     }
+
+
+Alternatives
+------------
+It is possible to have an external agent create ARQs from device profiles
+by calling Cyborg, and then feed those pre-created ARQs to the Nova instance
+creation API, analogous to Neutron ports. We do not take that approach yet
+because it requires changes to the Nova instance creation API.
+
+It is possible to have the Nova virt driver poll for the Cyborg ARQ binding
+completion. That is not preferable, partly because that is not the pattern of
+interaction with other services like Neutron.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None. A new extra_spec key ``accel:device_profile`` is added to
+the flavor.
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+Nova may choose to add additional notifications for Cyborg API calls.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+The extra calls to Cyborg REST API may potentially impact Nova
+conductor/scheduler throughput. This has been mitigated by running some
+critical Cyborg operations as asynchronous tasks.
+
+Other deployer impact
+---------------------
+
+The deployer needs to set up the ``clouds.yaml`` file so that Nova
+can call the Cyborg REST API.
+
+The deployer needs to configure a new tunable in ``nova-cpu.conf``::
+
+ * arq_binding_timeout (integer): Time in seconds for Nova compute
+   manager to wait for Cyborg to notify that ARQ binding is done.
+   Timeout is fatal, i.e., VM startup is aborted with an exception.
+   Default: 300.
+
+Developer impact
+----------------
+
+Define two new standard resource classes: FPGA and PGPU.
+
+We have VGPU and VGPU_DISPLAY_HEAD RCs defined already. But we propose a
+PGPU as a different RC for the following reasons:
+
+ * Both VGPU and VGPU_DISPLAY_HEAD RCs specifically refer to virtual GPUs.
+   We need a different one for physical GPUs.
+ * It will be subject to separate quotas/limits in Keystone.
+ * Using PCI_DEVICE RC is too general: we want quotas for GPU RC
+   specifically.
+
+Upgrade impact
+--------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Sundar Nadathur
+
+Work Items
+----------
+
+See the steps marked NEW in `Nova changes for Assignment workflow`_ section.
+
+Dependencies
+============
+
+* Specification for device profiles [#dev-prof]_.
+* Cyborg API specification [#cy-api]_.
+
+Testing
+=======
+There need to be unit tests and functional tests for the Nova changes.
+Specifically, there needs to be a functional test fixture that mocks the
+Cyborg API calls.
+
+There need to be tempest tests for the end-to-end flow, including failure
+modes. The tempest tests should be targeted at a fake driver (in addition to
+real hardware, if any) and tied to the Nova Zuul gate.
+
+Documentation Impact
+====================
+Device profile creation needs to be documented in Cyborg, as noted in
+[#dev-prof]_.
+
+The need for the operator to fold the device profile into the flavor needs to
+be documented.
+
+References
+==========
+
+.. [#cy-nova-place] `Specification for Cyborg Nova Placement
+   interaction <https://review.openstack.org/#/c/603545/>`_
+
+.. [#dev-prof] `Device profiles specification
+   <https://review.openstack.org/602978>`_
+
+.. [#cy-api-impl] `Specification for Cyborg API implementation
+   <https://review.openstack.org/#/c/608624/>`_
+
+.. [#inst-ops] `Specification for instance operations with accelerators
+   <https://review.openstack.org/#/c/605237/>`_
+
+.. [#generic-dev-disc] `Generic device discovery
+   <https://review.openstack.org/#/c/603805/>`_
+
+.. [#kosamara] `Modelling passthrough devices for report to placement
+   <https://review.openstack.org/#/c/591037/>`_
+
+.. [#req-spec-groups] `Store RequestGroup objects in RequestSpec
+   <https://review.openstack.org/#/c/567267/>`_
+
+.. [#requester-id] `Requester_id field in RequestGroup
+   <https://git.openstack.org/cgit/openstack/nova/tree/nova/objects/request_spec.py?h=refs/changes/27/619527/16#n818/>`_
+
+.. [#map-rg-to-rp] `Map request groups to resource providers
+   <https://review.openstack.org/#/c/616239/33/nova/objects/request_spec.py/>`_
+
+.. [#cy-api] `Specification for Cyborg API Version 2
+   <https://review.opendev.org/658263/>`_
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Train
+     - Introduced
+   * - Ussuri
+     - Re-proposed