..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License. http://creativecommons.org/licenses/by/3.0/legalcode

=========================================
N3000 FPGA Bitstream Update Orchestration
=========================================

Storyboard:
https://storyboard.openstack.org/#!/story/2006740

The overall scenario is that we have an administrator operating in a central
cloud, with hundreds or thousands of subclouds being managed from the central
cloud. In each subcloud there will be one or more nodes with FPGA devices.
These devices will need to be programmed with a number of types of bitstreams,
but to ensure that service standards are met they can't all be updated at the
same time. Instead, the admin will create policies which govern which subclouds
are updated when, and the orchestration framework will follow those policies to
update the various subclouds.

Problem description
===================

In a distributed-cloud environment there may be hundreds or thousands of
subclouds, each containing one or more hosts, some of which may have hardware
devices on them (like NICs or FPGAs) which require image updates in order to
properly provide service to the applications which ultimately serve the
end-user.

In order to simplify management of these hardware devices, we wish to support
orchestration of device image updates in a distributed-cloud environment,
starting with the Intel N3000 FPGA device (which is expected to be
commonly used for 5G) but designing the framework in such a way that we could
extend it to deal with other types of device images (other FPGAs, or NIC
firmware, for example) as well.

For the case of the N3000 (and likely other FPGAs) there are a number of
different image types that need to be supported: one to set the
root authentication key, one to update the FPGA core (signed with a signing
key), and one to revoke a signing key. For the case of NICs, you'd typically
have a single image type. In all cases, an image is only valid for
a specific PCI vendor/device tuple.

Since updating device firmware will necessarily result in a service outage, we
need the ability to control which subclouds (which typically would correspond
to geographic areas) can be updated in parallel.

Use Cases
---------

As a cloud admin, I want to push out a hardware device image update to
hardware devices on a single host (possibly for test purposes).

As a cloud admin, I want to push out hardware device image updates to hardware
devices on multiple hosts in a cloud.

As a distributed-cloud admin, I want to push out hardware device image updates
to hardware devices on all hosts on a single subcloud (possibly for test
purposes).

As a distributed-cloud admin, I want to push out hardware device image updates
to hardware devices on multiple hosts on many subclouds. While doing this, I
want to control which hosts and which subclouds can be updated in parallel,
since I want to avoid causing service outages while doing the update.

As a distributed-cloud admin, I want to be able to display whether each
subcloud is using up-to-date device images.

As a distributed-cloud admin, I want to see the status of in-progress device
image updates.

As a distributed-cloud or cloud admin, I want to be able to abort an
orchestrated device image update such that currently-in-progress device writes
will finish but no additional ones will be scheduled.


Proposed change
===============

The overall architecture of the device image orchestration will be modelled
after the existing software-patch handling. In a single-cloud environment we
will support uploading device images, "applying" them (which just means marking
them as something that should get written to the hardware), and then actually
kicking off the write to the hardware.

In a distributed-cloud environment, when using device image orchestration we
will first do the above in the SystemController region, and then use dcmanager
to handle pushing the images down to the subcloud and kicking off the actual
update in the subcloud. The VIM in each subcloud will decide when to update
the device images on each host, and a sysinv agent on each host will handle
writing the actual device images to the hardware.

In a distributed-cloud environment it will also be possible for the admin user
to explicitly issue commands to the sysinv API endpoint for a single subcloud.
This will essentially bypass the orchestration mechanism and behave the same
as in the single-cloud environment.

Hardware Background
-------------------

The initial hardware that we want to support is the Intel N3000 [1]_, an FPGA
that we expect will be used by 5G edge providers. This FPGA is somewhat unusual
in that it takes ~40min to write the functional image to the hardware (service
can continue during this time, then a hardware reset is required to load the
new image). Once the new image is loaded, the device will provide multiple
VFs, which in turn will be exported to Kubernetes as resources, where they will
be consumed by applications running in Kubernetes containers. Because of the
long write times, these devices must be pre-programmed rather than programmed
at Kubernetes pod startup.

Hardware Security Model
-----------------------

By default, the N3000 will accept any valid bitstream that is sent to it
(signed or unsigned). The customer/vendor can generate a public/private root
key pair, then create a *root-entry-hash* bitstream from the public key. If a
*root-entry-hash* bitstream is written to the N3000 it will set the root entry
hash on the device. From that point on, only bitstreams signed by a code
signing key (CSK) which is in turn signed by the private root key will be
accepted by the N3000. Once a *root-entry-hash* bitstream has been written to
the hardware, it cannot be erased or changed without sending the hardware back
to the vendor.

The customer/vendor can generate new *FPGA user* bitstreams. These may be
unsigned or signed with a CSK. Typically each such bitstream would be signed
by a different CSK. Writing a new *user* bitstream will cause the new code to
be loaded on the next bootup of the N3000. Only one *user* bitstream can be
stored in the N3000 at a time.

The customer/vendor can create a *CSK-ID-cancellation* bitstream (generated
from the private root key). When written to the N3000 it will revoke a
previously-used CSK and disallow loading any images signed with it. Multiple
*CSK-ID-cancellation* bitstreams can be processed for each N3000. Most
importantly, StarlingX will not deal directly with CSKs, only bitstreams.
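
The acceptance rules above can be summarized in a small model. This is an
illustrative sketch of the device's behaviour only (it elides the actual
signature-chain verification and is not the OPAE API):

```python
def accepts_bitstream(root_entry_hash, bitstream_csk_id, cancelled_csk_ids):
    """Simplified model of the N3000's bitstream acceptance rules.

    root_entry_hash: hash set by a *root-entry-hash* bitstream, or None
                     if the device has never been provisioned.
    bitstream_csk_id: CSK ID the candidate bitstream is signed with, or
                      None for an unsigned bitstream.
    cancelled_csk_ids: CSK IDs revoked by *CSK-ID-cancellation* bitstreams.
    """
    if root_entry_hash is None:
        # Unprovisioned device: any valid bitstream, signed or unsigned.
        return True
    if bitstream_csk_id is None:
        # Once the root entry hash is set, unsigned bitstreams are rejected.
        return False
    # A signed bitstream is rejected once its CSK has been cancelled.
    # (The real device also verifies the CSK's signature chain up to the
    # root key; that check is elided here.)
    return bitstream_csk_id not in cancelled_csk_ids
```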

Cloud/Subcloud Sysinv FPGA Agent
--------------------------------

The low-level interactions with the physical FPGA devices will be performed by
a new *sysinv FPGA agent* which will reside on each node with a *worker*
subfunction. The agent will communicate bi-directionally with sysinv-conductor
via RPC. The interactions with the N3000 FPGA will be performed using the OPAE
tools [2]_ in the n3000-opae Docker image running in a container. (This will
require the use of privileged containers due to the need to directly access
hardware devices.)

On startup, the existing *sysinv-agent* will try to create a file under
/run/sysinv. If the file did not yet exist, it will send an RPC message to
sysinv-conductor indicating that the host just rebooted. Sysinv-conductor will
then clear any "reboot needed" DB entry for that host if it was set. If there
are no more "pending firmware update" entries in the DB for any host, and if no
host has the "reboot needed" DB entry set, then the "*firmware update in
progress*" alarm will be cleared.
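
A minimal sketch of this reboot-detection and alarm-clearing logic follows.
The file path and function names are illustrative, not the actual sysinv code;
the key point is that /run is a tmpfs, so the marker file disappears on
reboot:

```python
import os

REBOOT_MARKER = "/run/sysinv/agent_started"  # hypothetical path

def host_just_rebooted(marker=REBOOT_MARKER):
    """Return True (and create the marker) only on the first agent
    startup after a reboot; the tmpfs marker vanishes at boot."""
    if os.path.exists(marker):
        return False
    os.makedirs(os.path.dirname(marker) or ".", exist_ok=True)
    open(marker, "w").close()
    return True

def can_clear_update_alarm(pending_updates, reboot_needed_hosts):
    """The firmware-update-in-progress alarm clears only when no device
    still has a pending image and no host still needs a reboot."""
    return not pending_updates and not reboot_needed_hosts
```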

On startup, the existing *sysinv-agent* will do an inventory of the PCI
devices on each worker node. The new *sysinv-fpga-agent* will inventory the
FPGA devices as well, including querying additional details from each FPGA
device as per the *host-device-show* command. The FPGA agent will send an RPC
message to *sysinv-conductor* to update the database with up-to-date FPGA
device information.

If there are problems that need to be dealt with immediately (such as the FPGA
booting the factory image when there should be a functional image) then
*sysinv-conductor* will send an RPC message to *sysinv-fpga-agent* to trigger
a *device-image-update* operation to ensure that the FPGA is up-to-date. This
will also cause an alarm to be raised.

If the FPGA has a valid functional image but it's not the currently-active
functional image, then we will alarm it but not trigger a *device-image-update*
operation. In the future we may wish to extend this to check whether the
functional image was signed with a cancelled CSK-ID and if so then trigger a
*device-image-update* operation due to security risks.

On sysinv-conductor startup it will send out a request to all *sysinv FPGA
agents* to report their hardware status. This is needed to deal with certain
error scenarios.

In certain error scenarios it's possible that the *sysinv FPGA agent* will be
unable to send a message to sysinv-conductor. It will need to handle this
gracefully.

Subcloud Sysinv Operations
--------------------------

At the single-cloud or subcloud level, the commands start out fairly typical.
We plan to extend sysinv to introduce create/list/show/delete commands for the
FPGA images, extend the existing *host device* commands to operate on the FPGA
device, add new commands to *apply* or *remove* a device image, and finally add
new commands to initiate or abort the firmware update.

The concept of *apply* is used because there are different types of bitstreams
and it's possible to have more than one bitstream that needs to be downloaded
to a newly-added FPGA. This will be discussed in more detail in the
activation section below.

system device-image-create
^^^^^^^^^^^^^^^^^^^^^^^^^^

Define a new image, specifying the bitstream file, the bitstream type
(root-key, functional, or key-revocation), an optional
name/version/description, the applicable PCI vendor/device the image is for,
and various bitstream-type-specific metadata such as the bitstream ID (for the
FPGA functional image), the key signature (for the root-key image), the key ID
being revoked, etc. To simplify the dcmanager code, this should allow
specifying the UUID for the image. (Ideally we should be able to issue a GET
in RegionOne and pass the results directly to a PUT to the same location in
the subcloud to create a new image in the subcloud. Alternatively, a POST
could be used but we'd have to add the UUID to the request body.) If not
specified, the system will create a UUID for the image, the bitstream file
will be stored in a replicated filesystem on the controller node, and the
metadata will be stored in the sysinv database.
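
The create-with-optional-UUID behaviour (so that dcmanager can mirror a
RegionOne image into a subcloud verbatim) can be sketched as follows; the
``store`` dict stands in for the sysinv database and the function name is
hypothetical:

```python
import uuid

def device_image_create(store, metadata, image_uuid=None):
    """Create an image record, honouring a caller-supplied UUID.

    store: dict mapping image UUID -> metadata, standing in for the
    sysinv database. When dcmanager mirrors an image from RegionOne
    into a subcloud it passes the original UUID; otherwise a fresh
    UUID is generated.
    """
    image_uuid = image_uuid or str(uuid.uuid4())
    if image_uuid in store:
        raise ValueError("device image %s already exists" % image_uuid)
    store[image_uuid] = dict(metadata, uuid=image_uuid)
    return image_uuid
```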

system device-image-list
^^^^^^^^^^^^^^^^^^^^^^^^

Display high-level image data for all known images. This would include image
type (root-key, functional, key-revocation), UUID, and version.

system device-image-show
^^^^^^^^^^^^^^^^^^^^^^^^

Display detailed image data for a single image (specified via UUID). This
would include UUID, image type, name, description, key ID, bitstream ID,
signing key signature, any activations (with device label) for the image, etc.

system device-image-delete
^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete an image (specified by UUID). If an FPGA functional image is deleted
due to a security issue, it would be wise to also upload and activate a
key-revocation bitstream to prevent the image from being uploaded again either
by accident or maliciously.

system device-image-apply
^^^^^^^^^^^^^^^^^^^^^^^^^

Make the specified image *active*, but do not actually initiate writing to the
hardware. This applies to a specific image, and optionally takes a device
label key/value such that only devices with the specified label would be
updated. Initially only *functional*, *root-key*, and *key-revocation*
bitstreams are supported. Only one *root-key* bitstream can ever be written to
an N3000, so having more than one such bitstream be active doesn't make sense.
Applying a *functional* bitstream will *remove* all other functional bitstreams
for that FPGA PCI vendor/device. There can be multiple *key-revocation*
bitstreams active.
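
These apply semantics can be sketched as follows (a plain-data illustration,
not the sysinv implementation):

```python
def apply_image(active, image):
    """Return the new list of active images after applying `image`.

    Applying a functional image displaces any other active functional
    image for the same PCI vendor/device; multiple key-revocation
    images may be active at once; a root-key image simply joins the
    active set (only one ever makes sense per device type).
    """
    if image["type"] == "functional":
        active = [a for a in active
                  if a["type"] != "functional"
                  or (a["pci_vendor"], a["pci_device"])
                  != (image["pci_vendor"], image["pci_device"])]
    return active + [image]
```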

Note that it would be possible to make multiple images active, then issue a
*host-device-image-update* command to trigger writing them all to the hardware.

When an image has been applied, a "device firmware update in progress" alarm
will be raised, and will stay raised until all affected devices have had their
firmware updated or until the device image is removed. This implies that a
"pending firmware update" DB entry will be created for each affected device for
each applied image to indicate that the image needs to be written to the
device.

system device-image-remove
^^^^^^^^^^^^^^^^^^^^^^^^^^

Deactivate the specified image, optionally allowing a device label to be
specified so that the image is deactivated only for devices with that label.
If you try to deactivate an image which is currently being written to the
hardware, the command will succeed but will not abort the write.

When an image is deleted, all of its activation records will also be deleted.
(The implementation of this operation could probably be left to the end as it
is not critical.)

Removing an image will remove any "pending firmware update" DB entries for that
image. If there are no remaining pending firmware updates, and no "reboot
needed" DB entries for any host, then the "device firmware update in progress"
alarm can be cleared.

system host-device-image-update
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Tell sysinv to update the specified device on the specified host with any
active images which have not yet been written to the hardware. In this
scenario, sysinv-conductor will tell the FPGA agent to write each
active-but-not-written image to the device in turn until they've all been
written. We would want to write the root-key bitstream first, then any
key-revocation bitstreams, then the functional bitstream. If we have
successfully written the functional bitstream, the admin user (or the VIM
in the orchestrated update case) will need to lock/unlock the node to cause
the new functional image to be loaded.

While writing an image to the FPGA, we would want to block the reboot of the
host in question. We will only allow updating device images on unlocked hosts,
and once the device image update starts no host-lock commands will be accepted
unless the *force* option is used. While the FPGA agent is writing the image
to the hardware, it will *stop* the watchdog service from running, since we
don't want an unrelated process to trigger a reboot while we're writing to the
hardware. After the image has been written, the FPGA agent will restart the
watchdog service.

After each image is written the FPGA agent would send an RPC message to
sysinv-conductor to remove the "*pending firmware update*" entry from the DB
and to set a "reboot needed" DB entry for that host.
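
The write ordering and per-image bookkeeping described above might look like
this sketch (the real agent/conductor interaction happens over RPC; names and
data shapes here are illustrative):

```python
# Root key first (so later bitstreams can be verified against it),
# then CSK revocations, then the functional bitstream last.
WRITE_ORDER = {"root-key": 0, "key-revocation": 1, "functional": 2}

def ordered_writes(pending_images):
    """Return the pending images in the order they should be written."""
    return sorted(pending_images, key=lambda img: WRITE_ORDER[img["type"]])

def record_image_written(db, host, device, image_uuid):
    """Bookkeeping after one successful write: the image is no longer
    pending for that device, and the host now needs a lock/unlock to
    load the new functional image."""
    db["pending"].discard((host, device, image_uuid))
    db["reboot_needed"].add(host)
```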

system host-device-image-update-abort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Abort any pending image updates for this host. Any in-progress device image
updates will continue until completion or failure.

(The implementation of this operation could be left towards the end, as it is
not necessary for the success path.)

system host-device-list
^^^^^^^^^^^^^^^^^^^^^^^

Add support to the existing command so the FPGA device displays in the list.
Add a new "needs firmware update" column.

system host-device-show
^^^^^^^^^^^^^^^^^^^^^^^

Extend the existing command to add new optional device-specific fields. For the
N3000 this would include accelerator status, type of booted image
(user/factory), booted image bitstream ID, cancelled CSK IDs, root entry hash,
BMC versions, PCI device ID, onboard NIC devices, etc.

We might want to include device labels (see below) in the output.

system host-device-label-assign
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assign a *key: value* label to a PCI device. This takes as arguments the PCI
device, the host, the key, and the value.

system host-device-label-list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

List all labels for a given PCI device. This takes as arguments the PCI device
and the host, and returns a list of all key/value labels for the device.
(Alternatively this could take the PCI device UUID, but the CLI doesn't expose
that currently.)

system host-device-label-remove
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove a label from a PCI device. This takes as arguments the PCI device, the
host, and the key.

system device-label-list
^^^^^^^^^^^^^^^^^^^^^^^^

List all devices and their labels from all hosts in the system. Devices
without any labels are not included. This is intended for use by dcmanager to
determine whether an image should be created in a given subcloud.

system host lock/swact
^^^^^^^^^^^^^^^^^^^^^^

The *lock* operation would be blocked by default during device image update to
prevent accidentally rebooting while in the middle of updating the FPGA image
(since that would result in a service outage while the FPGA gets updated
again). Since we will only start a device image update on an unlocked host,
this should be sufficient.

If the *force* option is specified for this command, the action will
proceed. (This may mean that the device ends up in a bad state if the host
reboots while a device image update was in progress.)
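
The gating logic is small but worth stating precisely; a sketch with
illustrative names:

```python
def host_lock_allowed(update_in_progress, force):
    """A host-lock is refused while a device image update is in
    progress, unless the *force* option is given (at the risk of
    leaving the device with a half-written image)."""
    return force or not update_in_progress
```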

The manual swact operation will be blocked during a device image update to
reduce the chances that it will interfere with the image update. The image
update code in the rest of the system will try to deal with temporary outages
caused by a swact, but we may need to handle it as a failure if the outage
lasts long enough.


Subcloud VIM Operations
-----------------------

All of these operations would be analogous to the existing sw-manager
patch-strategy and update-strategy operations. We're using *firmware update*
in the CLI to allow it to be potentially more generic in the future, but
initially these would apply to the FPGA image update only.

The VIM will control the overall firmware update strategy for the subcloud. It
will decide whether a firmware update is currently allowed to be kicked off (if
there are alarms raised it might block the firmware update strategy apply,
depending on the strategy), control how many hosts can do a firmware update in
parallel, trigger each host to begin the firmware update, and aggregate the
status of the firmware update on the various hosts.
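
The serial/parallel host control can be sketched as a simple staging function
(a hypothetical helper, not the VIM's actual strategy code):

```python
def plan_stages(hosts_needing_update, max_parallel):
    """Split hosts into stages: hosts within one stage update in
    parallel, and stages run one after another."""
    return [hosts_needing_update[i:i + max_parallel]
            for i in range(0, len(hosts_needing_update), max_parallel)]
```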

When the VIM decides to initiate a firmware update on a given host, it will
issue the HTTP equivalent of the *system host-device-image-update* command to
sysinv on that host to tell that host to write all *applied but not yet
written* device images to the hardware.

sw-manager fw-update-strategy create
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Checks the system state and builds up a sequence of commands needed to bring
the subcloud into alignment with the desired state of the system. This would
take options such as how many hosts can do a firmware update in parallel,
whether to stop on failure, whether outstanding alarms should prevent the
update, etc.

This step will loop over all hosts, querying sysinv to see whether each host
has any devices that need updating, then generate a series of steps to bring
all relevant hosts in the subcloud up-to-date for their device images.

If there are no firmware updates to be applied, the strategy creation will
fail with the reason "no firmware updates found".

sw-manager fw-update-strategy apply
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Execute the firmware update strategy. We probably want an option similar to
the *stage-id* option supported when applying a patching strategy, to apply
the strategy up to a specific stage ID.

Apply the specified firmware update strategy to each host specified in the
strategy (this would typically be all hosts which have devices which need a
firmware update) following the policies in the strategy around serial/parallel
updates. For each affected host, the VIM will use the sysinv REST API to
trigger a *system host-device-image-update* operation and then periodically
check the status of the update operation.

sw-manager fw-update-strategy show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Display the update strategy, optionally with more details (like the current
status of the overall sequence as stored in the VIM database).

sw-manager fw-update-strategy abort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Allow existing in-progress FPGA updates to complete, but do not trigger any
additional nodes to begin FPGA updates. Signal to sysinv to abort the FPGA
update; this will still allow in-progress FPGA updates to complete since we do
not want to end up with a half-written image (which would require a new FPGA
update operation to recover).

(The implementation of this may be left till the end as it is not needed for
the success path.)

sw-manager fw-update-strategy delete
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete the strategy once it is no longer needed.

System Controller DC Manager Operations
---------------------------------------

The DC Manager operations in the system controller are strongly related to the
VIM operations in the subcloud, and most of them are equivalent to the
operations for the existing sw-manager patch-strategy and update-strategy
operations.

The DC manager will control when to trigger a firmware update in a given
subcloud, which subclouds can be updated in parallel, and whether to stop on
failure or not.

The DC manager will also handle creating/deleting device images in each
subcloud as needed to keep the subcloud in sync with the SystemController
region by talking to sysinv-api-proxy in the SystemController region and in
each subcloud. The actual device image files will be stored by the
sysinv-api-proxy in a well-known location where DC manager can access them when
creating device images in the subclouds. DC Manager will only create device
images in the subcloud if there is at least one device in the subcloud which
will be updated with the device image in question (based on any labels
specified via the "*system device-image-apply*" command).

As part of the dcmanager, there will be a periodic audit which scans a number
of subclouds in parallel and checks whether the subcloud has all of the
*applied* device images that it should have (based on the labels the images
were applied against and the device labels in the subcloud), and whether all of
the *applied* device images have been written to the devices that they
should be. If either of these is not true, then the subcloud "*firmware image
sync status*" is considered "*out of sync*". This will result in the subcloud
as a whole being considered "*out of sync*".
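
A minimal sketch of the audit's sync-status check; the set arguments stand in
for the real DB and REST queries:

```python
def firmware_image_sync_status(required_images, subcloud_images, pending_writes):
    """required_images: images the subcloud should have (after label
    matching); subcloud_images: images actually present there;
    pending_writes: (device, image) pairs not yet written.

    The subcloud is in sync only if every required image is present and
    every applied image has been written everywhere it should be."""
    if not required_images <= subcloud_images or pending_writes:
        return "out-of-sync"
    return "in-sync"
```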

When a device image is *applied* in sysinv, dcmanager will be notified and will
set the "*firmware image sync status*" to *unknown* for all subclouds,
since it does not know at this point which subcloud(s) the image needs to be
created/applied/updated in. On the next audit, this sync status will be
updated to "*in sync*" or "*out of sync*" as applicable.


dcmanager subcloud-group create/add-member/remove-member/delete/list/show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These introduce the concept of a "*subcloud group*", which is a way of
grouping subclouds together such that all subclouds in the group can
potentially be upgraded in parallel. A given subcloud can only be a member of
one subcloud group.

This is needed because the customer will likely want to ensure (as much as
possible) that we don't update the functional image (which requires a service
outage) on all subclouds that serve a certain geographic area at once (which
could cause an outage for end-users in that area).

There will be controls over how many subclouds in a group can be updated at
once. Dcmanager will only apply update strategies in one group at a time,
and will update all subclouds in a group before moving on to the next subcloud
group.
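
Group-at-a-time sequencing with bounded parallelism inside each group can be
sketched as follows (an illustrative helper, not the dcmanager code):

```python
def subcloud_update_order(groups, max_parallel_per_group):
    """Yield batches of subclouds: only one group is worked on at a
    time, and within a group at most max_parallel_per_group subclouds
    update concurrently."""
    for group in groups:
        for i in range(0, len(group), max_parallel_per_group):
            yield group[i:i + max_parallel_per_group]
```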


dcmanager fw-update-strategy create
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a new update strategy, with options for the number of subclouds to
update in parallel, whether to stop on failure, etc. (Eventually we may want
to specify a list of which subcloud groups to update, but this will not be
included in the initial version.) This will generate a UUID for the created
strategy, and will generate a step for each subcloud in the specified subcloud
group that dcmanager thinks is out-of-sync. If there are any subclouds
with an "unknown" sync state in the subcloud group then we would disallow
the creation of a firmware update strategy for that group.


dcmanager fw-update-strategy list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

List the firmware update strategies, with the most important bits of
information for each. This should include the overall update strategy status
(i.e. "in progress" if we've asked for it to be applied).

dcmanager fw-update-strategy show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Show the specified firmware update strategy. This would include all the
metadata specified as part of the "create", the overall update strategy status,
as well as the status (as reported by the subcloud VIM) of the firmware update
strategy application for all the subclouds in the specified subcloud group.

dcmanager fw-update-strategy apply
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Execute each step (where each step roughly corresponds to a subcloud) of the
specified firmware update strategy. This would look something like this:

* Query sysinv in RegionOne for active FPGA images using the REST API.
* For each strategy step (i.e. each subcloud):

  * Query the subcloud for device labels using the sysinv REST API.
  * Query the subcloud for FPGA images using the sysinv REST API.
  * Create/update/delete FPGA images in the subcloud as needed to bring it
    into sync with the FPGA images in the SystemController. We don't do this
    via dcorch because we want to ensure the data is up to date when applying
    the update strategy. (This process could take some time on a slow
    subcloud link.)
  * Apply the device image in the subcloud.
  * Create the FPGA update strategy using the VIM REST API.
  * Apply the FPGA update strategy using the VIM REST API.
  * Monitor progress by querying the FPGA update strategy using the VIM REST
    API.
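
The per-subcloud image synchronization step above reduces to a set difference.
A sketch with hypothetical data shapes (the real code talks to the sysinv REST
API):

```python
def image_sync_actions(central_images, subcloud_images, subcloud_labels):
    """Decide which device images to create or delete in a subcloud.

    central_images: {uuid: label} for images applied in the
    SystemController, keyed on the device label each image was applied
    against; subcloud_images: set of UUIDs already present in the
    subcloud; subcloud_labels: device labels reported by the subcloud.

    An image is wanted in the subcloud only if some device there
    carries the label it was applied against.
    """
    wanted = {u for u, label in central_images.items()
              if label in subcloud_labels}
    to_create = wanted - subcloud_images
    to_delete = subcloud_images - wanted
    return to_create, to_delete
```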

dcmanager fw-update-strategy abort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pass the abort down to each subcloud, and do not process any more subclouds.

(Maybe leave this till last as it is not needed for the success path.)

dcmanager fw-update-strategy delete
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete the firmware update strategy. This would also delete the firmware
update strategy in the subcloud using the VIM REST API. It is not valid to
delete an in-progress update strategy.

dcmanager strategy-step list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extend if needed to list the strategy steps and their state for the FPGA
update strategy that is being applied.

dcmanager strategy-step show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extend if needed to show the step for a specific subcloud.
|
||||
|
||||
System Controller Sysinv Operations
-----------------------------------

The sysinv operations in the system controller essentially duplicate the
image-related subset of the sysinv operations in the subcloud. We don't expect
the system controller to have any FPGAs, so the device-image-update,
host-device-list, and host-device-show commands are not relevant. In all cases
the request is intercepted by sysinv-api-proxy in the SystemController region
and forwarded to RegionOne. Unlike normal resources, dcorch will not be used
to synchronize the FPGA image information.

system device-image-create
^^^^^^^^^^^^^^^^^^^^^^^^^^

Define a new image, as described in `Subcloud Sysinv Operations`_. If
successful, the sysinv API proxy will also save the image to
/opt/device-image-vault, which will be a DRBD-replicated filesystem, analogous
to how /opt/patch-vault is used to store patch files for orchestrated patching.
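A minimal sketch of how the proxy might persist an uploaded image into the
vault. The helper, the ``.bit`` filename convention, and the path handling are
assumptions for illustration, not the actual sysinv-api-proxy code.

```python
import os
import tempfile

DEVICE_IMAGE_VAULT = "/opt/device-image-vault"  # DRBD-replicated in practice

def save_image_to_vault(image_uuid, image_data, vault_dir=DEVICE_IMAGE_VAULT):
    """Persist an uploaded bitstream into the image vault.

    Write to a temporary file first and rename it into place so that a
    partially-written image is never visible to readers (e.g. dcmanager).
    The ".bit" suffix is a hypothetical naming convention.
    """
    os.makedirs(vault_dir, exist_ok=True)
    path = os.path.join(vault_dir, "%s.bit" % image_uuid)
    fd, tmp_path = tempfile.mkstemp(dir=vault_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(image_data)
            f.flush()
            os.fsync(f.fileno())  # ensure data is on disk before the rename
        os.rename(tmp_path, path)  # atomic within the same filesystem
    except Exception:
        os.unlink(tmp_path)
        raise
    return path
```

Writing into a temporary file in the same directory keeps the final rename
atomic, which matters because dcmanager reads the vault during orchestration.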

system device-image-list
^^^^^^^^^^^^^^^^^^^^^^^^

Display high-level image data for all known images as per
`Subcloud Sysinv Operations`_.

system device-image-show
^^^^^^^^^^^^^^^^^^^^^^^^

Display detailed image data for a single image as per
`Subcloud Sysinv Operations`_.

system device-image-delete
^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete an image as per `Subcloud Sysinv Operations`_. When deleting a
functional image, remind the user about uploading and activating a
key-revocation bitstream to address security issues. Deleting an active
image will not be allowed.

system device-image-apply
^^^^^^^^^^^^^^^^^^^^^^^^^

Make the specified image *active* as per `Subcloud Sysinv Operations`_.

system device-image-remove
^^^^^^^^^^^^^^^^^^^^^^^^^^

Make the specified image *inactive* as per `Subcloud Sysinv Operations`_.

Fault Handling
--------------

While a device is in the middle of updating its functional image, it's
possible that a fault could occur that would normally result in the host being
rebooted. If we reboot while updating the N3000 functional image, it could
result in a 40-minute outage on host startup while we flash the functional
image again.

Given the above, the desired behavior while a device image update is in
progress is to avoid rebooting on faults (critical process alarm, low memory
alarm, etc.) as long as the fault is not something (like high temperature) that
could actually damage the hardware.

This is less of an issue for AIO-SX since we're already suppressing mtce reboot
actions.

The host watchdog will currently reset the host under certain circumstances.
This is undesirable if we're in the middle of updating device images, so the
sysinv FPGA agent will temporarily shut down the "hostwd" daemon during the
device image update and start it back up again afterward. (Later on, we may
want to modify it to stay running but emit logs instead of actually resetting
the host.)
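The stop-then-restore pattern can be sketched as a context manager. The
service-control callables are injected parameters here because the actual
mechanism the FPGA agent uses to stop "hostwd" is platform-specific; this is
a sketch, not the agent's real code.

```python
from contextlib import contextmanager

@contextmanager
def watchdog_suspended(stop_watchdog, start_watchdog):
    """Run a device image update with the host watchdog ("hostwd") stopped.

    stop_watchdog/start_watchdog are caller-supplied callables (e.g. thin
    wrappers around the platform's service manager).
    """
    stop_watchdog()
    try:
        yield
    finally:
        # Restart the watchdog even if the update raised an exception, so a
        # failed update never leaves the host permanently unprotected.
        start_watchdog()
```

The ``finally`` clause captures the key requirement: the watchdog must come
back regardless of whether the image update succeeds.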

CLI Clients
-----------

We will extend the existing *system*, *sw-manager*, and *dcmanager* clients to
add the new commands and extend the existing commands where applicable.

Specifically, for the case of system host-device-show, the expectation is that
the new FPGA-specific fields will only be returned by the server for FPGA
devices. The client will need to be able to handle the variable set of fields
rather than assuming a constant set of fields.
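The variable-field handling amounts to displaying whatever fields the server
actually returned. In this sketch the field names are illustrative
assumptions, not the authoritative sysinv API schema.

```python
# Sketch of host-device-show client behavior: show FPGA-specific fields only
# when the server returned them (i.e. the device is an FPGA).  Field names
# here are assumptions for illustration.

BASE_FIELDS = ("name", "pciaddr", "pvendor", "pdevice")
FPGA_FIELDS = ("boot_page", "bitstream_id", "root_key", "revoked_key_ids")

def fields_to_display(device):
    """Return the ordered list of fields to print for this device."""
    fields = [f for f in BASE_FIELDS if f in device]
    # FPGA fields are simply absent from non-FPGA device responses.
    fields += [f for f in FPGA_FIELDS if f in device]
    return fields
```

Driving the display from the response rather than a fixed schema means the
same client code handles NICs, FPGAs, and any future device types.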

Web GUI
-------

If we want to allow this to be handled entirely through the GUI, we'd need to
add support for all the system controller operations from sysinv and dcmanager.

This will not be implemented in the initial release.

Alternatives
------------

Given our existing infrastructure, there aren't too many alternatives. We
could extend the existing *sysinv-agent* instead of making a new FPGA-specific
agent, but there's going to be a fair bit of hardware-specific code in the new
agent, so that might not make sense.

The VIM and dcmanager changes closely align with how we already support
software patching and software upgrade, so this enables maximum code re-use.

Sysinv already talks to the hardware and deals with PCI devices, as well as
controlling the lock/unlock/reboot operations, so it's the logical place to
handle the interactions between those operations and the device image updates.

Data model impact
-----------------

The dcmanager DB will have a new "subcloud_group" table which maps subclouds
into groups. Subclouds within a group can be updated in parallel, while
subclouds from different groups cannot.

The sysinv DB will have a new *fpga_devices* table which will include new
fields that are specific to the FPGA devices. Each row will be associated
with a row in the *pci_devices* table.

The sysinv DB *pci_devices* table will get a new "needs firmware update"
column.

The sysinv DB will get a new *device_images* table which stores all necessary
information for each device image.

The dcmanager DB will get a number of new tables (analogous to the ones used
for software patching) which will track the strategy data for the device image
update at the distributed-cloud SystemController level.
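The table relationships above can be sketched with illustrative DDL. The
authoritative schema is defined by the sysinv and dcmanager DB migrations;
every column name below other than the table names stated in the text is an
assumption.

```python
import sqlite3

# Illustrative DDL for the new tables described above.  Column names are
# assumptions; only the table names and relationships come from the spec.
SCHEMA = """
CREATE TABLE pci_devices (
    id INTEGER PRIMARY KEY,
    pciaddr TEXT,
    needs_firmware_update BOOLEAN DEFAULT 0   -- new column
);
CREATE TABLE fpga_devices (
    id INTEGER PRIMARY KEY,
    pci_device_id INTEGER REFERENCES pci_devices(id),  -- row per pci_device
    boot_page TEXT,
    bitstream_id TEXT
);
CREATE TABLE device_images (
    id INTEGER PRIMARY KEY,
    bitstream_type TEXT,   -- e.g. functional, root-key, key-revocation
    local_path TEXT        -- location in /opt/device-image-vault
);
CREATE TABLE subcloud_group (
    id INTEGER PRIMARY KEY,
    name TEXT              -- subclouds in one group update in parallel
);
"""

def create_schema(conn):
    """Create the sketch schema in the given sqlite3 connection."""
    conn.executescript(SCHEMA)
```

The foreign key from *fpga_devices* to *pci_devices* captures the "each row
will be associated with a row in the *pci_devices* table" requirement.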

REST API impact
---------------

TBD

Security impact
---------------

The low-level implementation of the sysinv FPGA agent assumes the use of
privileged containers to handle the actual low-level interaction with the
physical hardware. We currently allow privileged containers, but we may want
to lock things down further in the future. In that case we might need to
install the OPAE tools as part of StarlingX rather than in a container.

This change does not directly deal with sensitive data. It deals with
bitstreams which may represent sensitive data, but the bitstreams have already
been signed before they're provided to StarlingX.

The biggest security impact would be an admin-user impact, since once an N3000
device has had its root key programmed it cannot be changed short of sending
the device back to the factory.

Other end user impact
---------------------

When a device image is being updated, it's very likely that a hardware reset
will be required, either of that specific device or of the whole host. This
will necessarily cause a service outage on the device in question, as well as
for any application containers making use of the device.

Performance Impact
------------------

The new code is not expected to be called frequently. It is expected to be
called more often during the initial phases of a customer network build-out as
FPGA images are reworked to deal with teething issues.

There will be a periodic audit in dcmanager to check whether each subcloud is
up-to-date on its hardware device images. We will generally only trigger
device updates during a maintenance window, so this audit does not need to be
frequent.

The API changes have been designed to minimize the number of calls required
to perform this audit, since they involve communicating between the
SystemController and the subclouds, which may be geographically remote.

While a firmware update is in progress, it will not be possible to lock the
host in question unless the admin forces the operation.
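This interaction can be sketched as a simple semantic check. The state value,
exception class, and host representation are illustrative, not the actual
sysinv implementation.

```python
# Hypothetical sketch of the host-lock semantic check: refuse to lock a host
# whose devices are mid-update unless the operation is forced.

class HostLockDenied(Exception):
    pass

def check_lock_allowed(host, force=False):
    """Raise HostLockDenied if a device image update is in progress."""
    if host.get("device_image_update") == "in-progress" and not force:
        raise HostLockDenied(
            "device image update in progress on %s; force the lock to "
            "override" % host.get("hostname", "<unknown>"))
```

Forcing the lock remains possible because the admin may have to recover a
host even at the cost of a long re-flash on the next boot.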

Other deployer impact
---------------------

None.

Developer impact
----------------

Nothing different than any other development in these areas of the code.

Upgrade impact
--------------

All changes will be made in such a way as to support upgrades from the previous
version.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Chris Friesen

Other contributors:
  Al Bailey
  Eric MacDonald
  Teresa Ho

Repos Impacted
--------------

`<https://opendev.org/starlingx/config>`_

`<https://opendev.org/starlingx/distcloud.git>`_

`<https://opendev.org/starlingx/nfv.git>`_

Work Items
----------

* Sysinv FPGA Agent changes.
* Sysinv-api and Sysinv-conductor changes for triggering device image update.
* Sysinv-api and Sysinv-conductor changes for managing device images.
* VIM work for orchestrating firmware update at the subcloud level.
* dcmanager work for orchestrating firmware update at the SystemController
  level.
* sysinv-api-proxy work for proxying the device image management API to
  RegionOne and saving the device images in the vault for use by dcmanager.


Dependencies
============

None


Testing
=======

Unit tests will be added for new/modified code and will be executed by tox,
which is already supported for dcmanager, sysinv and VIM.

The expectation is that there will be 3rd-party testing by Wind River with
actual hardware to ensure that this works as expected.


Documentation Impact
====================

The Cloud Platform Node Management guide, Cloud Platform Administration
Tutorials, and Cloud Platform User Tutorials will likely need to be updated.

The VIM, dcmanager, and sysinv API documentation will be updated with the new
APIs.

References
==========

.. [1] https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/intel-fpga-pac-n3000/overview.html
.. [2] https://opae.github.io/latest/index.html


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - stx-4.0
     - Introduced