..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License. http://creativecommons.org/licenses/by/3.0/legalcode

=========================================
N3000 FPGA Bitstream Update Orchestration
=========================================

Storyboard:
https://storyboard.openstack.org/#!/story/2006740

The overall scenario is that we have an administrator operating in a central
cloud, with hundreds or thousands of subclouds being managed from the central
cloud. In each subcloud there will be one or more nodes with FPGA devices.
These devices will need to be programmed with a number of types of bitstreams,
but to ensure that service standards are met they can't all be updated at the
same time. Instead, the admin will create policies which govern which subclouds
are updated when, and the orchestration framework will follow those policies to
update the various subclouds.

Problem description
===================

In a distributed-cloud environment there may be hundreds or thousands of
subclouds, each containing one or more hosts, some of which may have hardware
devices on them (like NICs or FPGAs) which require image updates in order to
properly provide service to the applications that ultimately serve the
end-user.

In order to simplify management of these hardware devices, we wish to support
orchestration of device image updates in a distributed-cloud environment,
starting with the Intel N3000 FPGA device (which is expected to be
commonly used for 5G) but designing the framework in such a way that we could
extend it to deal with other types of device images (other FPGAs, or NIC
firmware, for example) as well.

For the case of the N3000 (and likely other FPGAs) there are a number of
different image types that need to be supported: one to set the root
authentication key, one to update the FPGA core (signed with a signing key),
and one to revoke a signing key. For the case of NICs, you would typically
have a single image type. In all cases, an image is only valid for a specific
PCI vendor/device tuple.

Since updating device firmware will necessarily result in a service outage, we
need the ability to control which subclouds (which typically correspond to
geographic areas) can be updated in parallel.

Use Cases
---------

As a cloud admin, I want to push out a hardware device image update to
hardware devices on a single host (possibly for test purposes).

As a cloud admin, I want to push out hardware device image updates to hardware
devices on multiple hosts in a cloud.

As a distributed-cloud admin, I want to push out hardware device image updates
to hardware devices on all hosts in a single subcloud (possibly for test
purposes).

As a distributed-cloud admin, I want to push out hardware device image updates
to hardware devices on multiple hosts in many subclouds. While doing this, I
want to control which hosts and which subclouds can be updated in parallel,
since I want to avoid causing service outages while doing the update.

As a distributed-cloud admin, I want to be able to display whether each
subcloud is using up-to-date device images.

As a distributed-cloud admin, I want to see the status of in-progress device
image updates.

As a distributed-cloud or cloud admin, I want to be able to abort an
orchestrated device image update such that currently-in-progress device writes
will finish but no additional ones will be scheduled.

Proposed change
===============

The overall architecture of the device image orchestration will be modelled
after the existing software-patch handling. In a single-cloud environment we
will support uploading device images, "applying" them (which just means marking
them as something that should get written to the hardware), and then actually
kicking off the write to the hardware.

In a distributed-cloud environment, when using device image orchestration we
will first do the above in the SystemController region, and then use dcmanager
to handle pushing the images down to the subcloud and kicking off the actual
update in the subcloud. The VIM in each subcloud will decide when to update
the device images on each host, and a sysinv agent on each host will handle
writing the actual device images to the hardware.

In a distributed-cloud environment it will also be possible for the admin user
to explicitly issue commands to the sysinv API endpoint for a single subcloud.
This will essentially bypass the orchestration mechanism and behave the same
as in the single-cloud environment.

Hardware Background
-------------------

The initial hardware that we want to support is the Intel N3000 [1]_, an FPGA
that we expect will be used by 5G edge providers. This FPGA is somewhat
unusual in that it takes roughly 40 minutes to write the functional image to
the hardware (service can continue during this time, after which a hardware
reset is required to load the new image). Once the new image is loaded, the
device will provide multiple VFs, which in turn will be exported to Kubernetes
as resources, where they will be consumed by applications running in
Kubernetes containers. Because of the long write times, these devices must be
pre-programmed rather than programmed at Kubernetes pod startup.

Hardware Security Model
-----------------------

By default, the N3000 will accept any valid bitstream that is sent to it
(signed or unsigned). The customer/vendor can generate a public/private root
key pair, then create a *root-entry-hash* bitstream from the public key. If a
*root-entry-hash* bitstream is written to the N3000 it will set the root entry
hash on the device. From that point on, only bitstreams signed by a code
signing key (CSK) which is in turn signed by the private root key will be
accepted by the N3000. Once a *root-entry-hash* bitstream has been written to
the hardware, it cannot be erased or changed without sending the hardware back
to the vendor.

The customer/vendor can generate new *FPGA user* bitstreams. These may be
unsigned or signed with a CSK. Typically each such bitstream would be signed
by a different CSK. Writing a new *user* bitstream will cause the new code to
be loaded on the next bootup of the N3000. Only one *user* bitstream can be
stored in the N3000 at a time.

The customer/vendor can create a *CSK-ID-cancellation* bitstream (generated
from the private root key). When written to the N3000 it will revoke a
previously-used CSK and disallow loading any images signed with it. Multiple
*CSK-ID-cancellation* bitstreams can be processed for each N3000. Most
importantly, StarlingX will not deal directly with CSKs, only bitstreams.

Cloud/Subcloud Sysinv FPGA Agent
--------------------------------

The low-level interactions with the physical FPGA devices will be performed by
a new *sysinv FPGA agent* which will reside on each node with a *worker*
subfunction. The agent will communicate bi-directionally with sysinv-conductor
via RPC. The interactions with the N3000 FPGA will be performed using the OPAE
tools [2]_ in the n3000-opae Docker image running in a container. (This will
require the use of privileged containers due to the need to directly access
hardware devices.)

On startup, the existing *sysinv-agent* will try to create a file under
/run/sysinv. If the file did not already exist, it will send an RPC message to
sysinv-conductor indicating that the host has just rebooted. Sysinv-conductor
will then clear any "reboot needed" DB entry for that host if it was set. If
there are no more "pending firmware update" entries in the DB for any host,
and if no host has the "reboot needed" DB entry set, then the "*firmware
update in progress*" alarm will be cleared.

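The flag-file check works because /run is a tmpfs that is emptied on every
boot, so the file's absence distinguishes the first agent start after a boot
from an ordinary agent restart. A minimal sketch (the flag filename is
hypothetical; the spec only says "a file under /run/sysinv"):

```python
import os

# Hypothetical flag path; the spec only specifies "a file under /run/sysinv".
REBOOT_FLAG = "/run/sysinv/agent_started"

def first_start_since_reboot(flag_path=REBOOT_FLAG):
    """Return True exactly once per boot.

    /run is a tmpfs, so the flag disappears on reboot.  O_CREAT|O_EXCL makes
    the check-and-create atomic even if two processes race on startup.
    """
    os.makedirs(os.path.dirname(flag_path), exist_ok=True)
    try:
        fd = os.open(flag_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False   # agent restart, not a fresh boot
    os.close(fd)
    return True        # would trigger the "host rebooted" RPC to the conductor
```

On a True return the agent would send the "host rebooted" RPC described above.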
On startup, the existing *sysinv-agent* will do an inventory of the PCI
devices on each worker node. The new *sysinv-fpga-agent* will inventory the
FPGA devices as well, including querying additional details from each FPGA
device as per the *host-device-show* command. The FPGA agent will send an RPC
message to *sysinv-conductor* to update the database with up-to-date FPGA
device information.

If there are problems that need to be dealt with immediately (such as the FPGA
booting the factory image when there should be a functional image) then
*sysinv-conductor* will send an RPC message to *sysinv-fpga-agent* to trigger
a *device-image-update* operation to ensure that the FPGA is up-to-date. This
will also cause an alarm to be raised.

If the FPGA has a valid functional image but it is not the currently-active
functional image, then we will raise an alarm but not trigger a
*device-image-update* operation. In the future we may wish to extend this to
check whether the functional image was signed with a cancelled CSK-ID and, if
so, trigger a *device-image-update* operation due to the security risk.

On startup, sysinv-conductor will send out a request to all *sysinv FPGA
agents* to report their hardware status. This is needed to deal with certain
error scenarios.

In certain error scenarios it is possible that the *sysinv FPGA agent* will be
unable to send a message to sysinv-conductor. It will need to handle this
gracefully.

Subcloud Sysinv Operations
--------------------------

At the single-cloud or subcloud level, the commands start out fairly typical.
We plan to extend sysinv to introduce create/list/show/delete commands for the
FPGA images, extend the existing *host device* commands to operate on the FPGA
device, add new commands to *apply* or *remove* a device image, and finally
add new commands to initiate or abort the firmware update.

The concept of *apply* is used because there are different types of bitstreams
and it is possible to have more than one bitstream that needs to be downloaded
to a newly-added FPGA. This will be discussed in more detail in the
activation section below.

system device-image-create
^^^^^^^^^^^^^^^^^^^^^^^^^^

Define a new image, specifying the bitstream file, the bitstream type
(root-key, functional, or key-revocation), an optional
name/version/description, the applicable PCI vendor/device the image is for,
and various bitstream-type-specific metadata such as the bitstream ID (for the
FPGA functional image), the key signature (for the root-key image), the key ID
being revoked, etc. To simplify the dcmanager code, this should allow
specifying the UUID for the image. (Ideally we should be able to issue a GET
in RegionOne and pass the results directly to a PUT to the same location in
the subcloud to create a new image in the subcloud. Alternatively, a POST
could be used, but we would have to add the UUID to the request body.) If not
specified, the system will create a UUID for the image. The bitstream file
will be stored in a replicated filesystem on the controller node, and the
metadata will be stored in the sysinv database.

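As a sketch of the create semantics, the record builder below generates a UUID
only when the caller does not supply one, which is what allows dcmanager to
recreate an image in a subcloud under the same identity it has in the
SystemController region. Field names are illustrative, not the actual sysinv
schema:

```python
import uuid

# Illustrative field names; the real sysinv schema may differ.
BITSTREAM_TYPES = {"root-key", "functional", "key-revocation"}

def build_device_image_record(bitstream_type, pci_vendor, pci_device,
                              image_uuid=None, **metadata):
    """Build the metadata record for a new device image.

    A caller-supplied UUID is kept verbatim; otherwise a new one is created.
    """
    if bitstream_type not in BITSTREAM_TYPES:
        raise ValueError("unknown bitstream type: %s" % bitstream_type)
    return {
        "uuid": image_uuid or str(uuid.uuid4()),
        "bitstream_type": bitstream_type,
        "pci_vendor": pci_vendor,
        "pci_device": pci_device,
        **metadata,
    }
```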
system device-image-list
^^^^^^^^^^^^^^^^^^^^^^^^

Display high-level image data for all known images. This would include the
image type (root-key, functional, key-revocation), UUID, and version.

system device-image-show
^^^^^^^^^^^^^^^^^^^^^^^^

Display detailed image data for a single image (specified via UUID). This
would include the UUID, image type, name, description, key ID, bitstream ID,
signing key signature, any activations (with device label) for the image, etc.

system device-image-delete
^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete an image (specified by UUID). If an FPGA functional image is deleted
due to a security issue, it would be wise to also upload and activate a
key-revocation bitstream to prevent the image from being uploaded again,
either by accident or maliciously.

system device-image-apply
^^^^^^^^^^^^^^^^^^^^^^^^^

Make the specified image *active*, but do not actually initiate writing to the
hardware. This applies to a specific image, and optionally takes a device
label key/value such that only devices with the specified label would be
updated. Initially only *functional*, *root-key*, and *key-revocation*
bitstreams are supported. Only one *root-key* bitstream can ever be written
to an N3000, so having more than one such bitstream active doesn't make sense.
Applying a *functional* bitstream will *remove* all other functional
bitstreams for that FPGA PCI vendor/device. There can be multiple
*key-revocation* bitstreams active.

Note that it would be possible to make multiple images active, then issue a
*host-device-image-update* command to trigger writing them all to the
hardware.

When an image has been applied, a "device firmware update in progress" alarm
will be raised, and will stay raised until all affected devices have had their
firmware updated or until the device image is removed. This implies that a
"pending firmware update" DB entry will be created for each affected device
for each applied image to indicate that the image needs to be written to the
device.

system device-image-remove
^^^^^^^^^^^^^^^^^^^^^^^^^^

Deactivate the specified image, optionally specifying a device label to
deactivate the image only for devices with that label. If you try to
deactivate an image which is currently being written to the hardware, the
removal will succeed but will not abort the write.

When an image is deleted, all of its activation records will also be deleted.
(The implementation of this operation could probably be left to the end, as it
is not critical.)

Removing an image will remove any "pending firmware update" DB entries for
that image. If there are no remaining pending firmware updates, and no
"reboot needed" DB entries for any host, then the "device firmware update in
progress" alarm can be cleared.

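The pending-update bookkeeping described in the apply and remove sections can
be sketched as two small pure functions (DB and alarm interactions elided;
field names are illustrative):

```python
def apply_image(image_uuid, devices, label=None):
    """Return the "pending firmware update" entries an apply would create.

    `devices` is a list of dicts with "uuid" and "labels" keys; when a
    (key, value) label pair is given, only matching devices are affected.
    """
    return [
        {"image": image_uuid, "device": d["uuid"]}
        for d in devices
        if label is None or d["labels"].get(label[0]) == label[1]
    ]

def alarm_should_clear(pending_updates, reboot_needed_hosts):
    # The "device firmware update in progress" alarm clears only once no
    # pending updates remain and no host still needs a reboot.
    return not pending_updates and not reboot_needed_hosts
```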
system host-device-image-update
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Tell sysinv to update the specified device on the specified host with any
active images which have not yet been written to the hardware. In this
scenario, sysinv-conductor will tell the FPGA agent to write each
active-but-not-written image to the device in turn until they have all been
written. We would want to write the root-key bitstream first, then any
key-revocation bitstreams, then the functional bitstream. If we have
successfully written the functional bitstream, the admin user (or the VIM
in the orchestrated update case) will need to lock/unlock the node to cause
the new functional image to be loaded.

While writing an image to the FPGA, we want to block the reboot of the host
in question. We will only allow updating device images on unlocked hosts,
and once the device image update starts, no host-lock commands will be
accepted unless the *force* option is used. While the FPGA agent is writing
the image to the hardware, it will *stop* the watchdog service from running,
since we don't want an unrelated process to trigger a reboot while we're
writing to the hardware. After the image has been written, the FPGA agent
will restart the watchdog service.

After each image is written, the FPGA agent will send an RPC message to
sysinv-conductor to remove the "*pending firmware update*" entry from the DB
and to set a "reboot needed" DB entry for that host.

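The write ordering described above (root-key, then key-revocations, then the
functional bitstream) amounts to a simple sort over the pending images; a
sketch, with illustrative field names:

```python
# Root-key first, key-revocation next, functional last, as described above.
WRITE_ORDER = {"root-key": 0, "key-revocation": 1, "functional": 2}

def order_pending_images(images):
    """Return the active-but-not-written images in safe write order."""
    return sorted(images, key=lambda img: WRITE_ORDER[img["bitstream_type"]])
```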
system host-device-image-update-abort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Abort any pending image updates for this host. Any in-progress device image
updates will continue until completion or failure.

(The implementation of this operation could be left towards the end, as it is
not necessary for the success path.)

system host-device-list
^^^^^^^^^^^^^^^^^^^^^^^

Add support to the existing command so that FPGA devices are displayed in the
list. Add a new "needs firmware update" column.

system host-device-show
^^^^^^^^^^^^^^^^^^^^^^^

Extend the existing command to add new optional device-specific fields. For
the N3000 this would include the accelerator status, type of booted image
(user/factory), booted image bitstream ID, cancelled CSK IDs, root entry
hash, BMC versions, PCI device ID, onboard NIC devices, etc.

We might want to include device labels (see below) in the output.

system host-device-label-assign
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assign a *key: value* label to a PCI device. This takes as arguments the PCI
device, the host, the key, and the value.

system host-device-label-list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

List all labels for a given PCI device. This takes as arguments the PCI
device and the host, and returns a list of all key/value labels for the
device. (Alternatively it could take the PCI device UUID, but the CLI doesn't
expose that currently.)

system host-device-label-remove
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove a label from a PCI device. This takes as arguments the PCI device, the
host, and the key.

system device-label-list
^^^^^^^^^^^^^^^^^^^^^^^^

List all devices and their labels from all hosts in the system. Devices
without any labels are not included. This is intended for use by dcmanager to
determine whether an image should be created in a given subcloud.

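A sketch of the decision dcmanager could make from this output (field names
are illustrative): an image is needed in a subcloud only if some device there
matches the image's PCI vendor/device and, when the image was applied with a
label, also carries that label:

```python
def image_needed_in_subcloud(image, subcloud_devices):
    """Return True if any device in the subcloud would consume the image."""
    for dev in subcloud_devices:
        # The image is only valid for one PCI vendor/device tuple.
        if (dev["pci_vendor"], dev["pci_device"]) != \
                (image["pci_vendor"], image["pci_device"]):
            continue
        label = image.get("applied_label")  # (key, value) or None
        if label is None or dev["labels"].get(label[0]) == label[1]:
            return True
    return False
```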
system host lock/swact
^^^^^^^^^^^^^^^^^^^^^^

The *lock* operation would be blocked by default during a device image update
to prevent accidentally rebooting while in the middle of updating the FPGA
image (since that would result in a service outage while the FPGA gets updated
again). Since we will only start a device image update on an unlocked host,
this should be sufficient.

If the *force* option is specified for this command, the action will
proceed. (This may mean that the device ends up in a bad state if the host
reboots while a device image update was in progress.)

The manual swact operation will be blocked during a device image update to
reduce the chances that it will interfere with the image update. The image
update code in the rest of the system will try to deal with temporary outages
caused by a swact, but we may need to handle it as a failure if the outage
lasts long enough.

Subcloud VIM Operations
-----------------------

All of these operations are analogous to the existing sw-manager
patch-strategy and update-strategy operations. We're using *firmware update*
in the CLI to allow it to be potentially more generic in the future, but
initially these would apply to the FPGA image update only.

The VIM will control the overall firmware update strategy for the subcloud.
It will decide whether a firmware update is currently allowed to be kicked off
(if there are alarms raised it might block the firmware update strategy apply,
depending on the strategy), control how many hosts can do a firmware update in
parallel, trigger each host to begin the firmware update, and aggregate the
status of the firmware update on the various hosts.

When the VIM decides to initiate a firmware update on a given host, it will
issue the HTTP equivalent of the *system host-device-image-update* command to
sysinv on that host to tell that host to write all *applied but not yet
written* device images to the hardware.

sw-manager fw-update-strategy create
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Check the system state and build up a sequence of commands needed to bring
the subcloud into alignment with the desired state of the system. This would
take options such as how many hosts can do a firmware update in parallel,
whether to stop on failure, whether outstanding alarms should prevent the
update, etc.

This step will loop over all hosts, querying sysinv to see whether each host
has any devices that need updating, then generate a series of steps to bring
all relevant hosts in the subcloud up-to-date for their device images.

If there are no firmware updates to be applied, the strategy creation will
fail with the reason "no firmware updates found".

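The step-generation logic can be sketched as follows, with the per-host sysinv
query abstracted into a `needs_update` callable (a deliberate simplification
of the real VIM strategy builder):

```python
def build_fw_update_strategy(hosts, needs_update, max_parallel=1):
    """Return stages (lists of host names), or raise if nothing to do.

    Hosts with devices needing an update are grouped into stages of at most
    `max_parallel` hosts; stages run serially, hosts within a stage run in
    parallel.
    """
    pending = [h for h in hosts if needs_update(h)]
    if not pending:
        raise RuntimeError("no firmware updates found")
    return [pending[i:i + max_parallel]
            for i in range(0, len(pending), max_parallel)]
```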
sw-manager fw-update-strategy apply
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Execute the firmware update strategy. We probably want an option similar to
the *stage-id* supported when applying a patching strategy, to apply the
strategy up to a specific stage ID.

Apply the specified firmware update strategy to each host specified in the
strategy (this would typically be all hosts which have devices that need a
firmware update), following the policies in the strategy around
serial/parallel updates. For each affected host, the VIM will use the sysinv
REST API to trigger a *system host-device-image-update* operation and then
periodically check the status of the update operation.

sw-manager fw-update-strategy show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Display the update strategy, optionally with more details (like the current
status of the overall sequence as stored in the VIM database).

sw-manager fw-update-strategy abort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Allow existing in-progress FPGA updates to complete, but do not trigger any
additional nodes to begin FPGA updates. Signal sysinv to abort the FPGA
update; this will still allow in-progress FPGA updates to complete, since we
do not want to end up with a half-written image (which would require a new
FPGA update operation to recover).

(The implementation of this may be left till the end, as it is not needed for
the success path.)

sw-manager fw-update-strategy delete
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete the strategy once it is no longer needed.

System Controller DC Manager Operations
---------------------------------------

The DC Manager operations in the system controller are strongly related to
the VIM operations in the subcloud, and most of them are equivalent to the
operations for the existing sw-manager patch-strategy and update-strategy
operations.

The DC manager will control when to trigger a firmware update in a given
subcloud, which subclouds can be updated in parallel, and whether or not to
stop on failure.

The DC manager will also handle creating/deleting device images in each
subcloud as needed to keep the subcloud in sync with the SystemController
region, by talking to sysinv-api-proxy in the SystemController region and in
each subcloud. The actual device image files will be stored by the
sysinv-api-proxy in a well-known location where DC manager can access them
when creating device images in the subclouds. DC Manager will only create
device images in the subcloud if there is at least one device in the subcloud
which will be updated with the device image in question (based on any labels
specified via the "*system device-image-apply*" command).

As part of dcmanager, there will be a periodic audit which scans a number of
subclouds in parallel and checks whether the subcloud has all of the
*applied* device images that it should have (based on the labels the images
were applied against and the device labels in the subcloud), and whether all
of the *applied* device images have been written to the devices that they
should be. If either of these is not true, then the subcloud "*firmware
image sync status*" is considered "*out of sync*". This will result in the
subcloud as a whole being considered "*out of sync*".

When a device image is *applied* in sysinv, dcmanager will be notified and
will set the "*firmware image sync status*" to *unknown* for all subclouds,
since it does not know at this point which subcloud(s) the image needs to be
created/applied/updated in. On the next audit, this sync status will be
updated to "*in sync*" or "*out of sync*" as applicable.

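The per-subcloud audit check described above can be sketched as a pure
function over the audit's inputs (data shapes are illustrative): the subcloud
is in sync only if every required image exists there and has been written to
every device it targets:

```python
def firmware_sync_status(required_images, subcloud_images, written):
    """Compute the "firmware image sync status" for one subcloud.

    required_images: dict of image UUID -> set of target device UUIDs
                     (derived from apply labels and subcloud device labels).
    subcloud_images: set of image UUIDs present in the subcloud.
    written: set of (image_uuid, device_uuid) pairs already written.
    """
    for image_uuid, target_devices in required_images.items():
        if image_uuid not in subcloud_images:
            return "out-of-sync"   # image missing from subcloud
        if any((image_uuid, dev) not in written for dev in target_devices):
            return "out-of-sync"   # image not written everywhere it should be
    return "in-sync"
```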
dcmanager subcloud-group create/add-member/remove-member/delete/list/show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These introduce the concept of a "*subcloud group*", which is a way of
grouping subclouds together such that all subclouds in the group can
potentially be upgraded in parallel. A given subcloud can only be a member of
one subcloud group.

This is needed because the customer will likely want to ensure (as much as
possible) that we don't update the functional image (which requires a service
outage) on all subclouds that serve a certain geographic area at once (which
could cause an outage for end-users in that area).

There will be controls over how many subclouds in a group can be updated at
once. Dcmanager will only apply update strategies in one group at a time,
and will update all subclouds in a group before moving on to the next
subcloud group.

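A sketch of this scheduling rule, assuming groups are processed in a fixed
order: groups run serially, and within a group at most `max_parallel`
subclouds update at once:

```python
def schedule_group_updates(groups, max_parallel):
    """Flatten subcloud groups into update batches.

    `groups` is an ordered list of (group_name, [subclouds]).  All batches of
    one group are emitted before the next group starts, and each batch holds
    at most `max_parallel` subclouds.
    """
    batches = []
    for name, subclouds in groups:
        for i in range(0, len(subclouds), max_parallel):
            batches.append((name, subclouds[i:i + max_parallel]))
    return batches
```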
dcmanager fw-update-strategy create
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a new update strategy, with options for the number of subclouds to
update in parallel, whether to stop on failure, etc. (Eventually we may want
to specify a list of which subcloud groups to update, but this will not be
included in the initial version.) This will generate a UUID for the created
strategy, and will generate a step for each subcloud in the specified
subcloud group that dcmanager thinks is out-of-sync. If there are any
subclouds with an "unknown" sync state in the subcloud group, then creation
of a firmware update strategy for that group will be disallowed.

dcmanager fw-update-strategy list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

List the firmware update strategies, with the most important bits of
information for each. This should include the overall update strategy status
(i.e. "in progress" if we've asked for it to be applied).

dcmanager fw-update-strategy show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Show the specified firmware update strategy. This would include all the
metadata specified as part of the "create", the overall update strategy
status, as well as the status (as reported by the subcloud VIM) of the
firmware update strategy application for all the subclouds in the specified
subcloud group.

dcmanager fw-update-strategy apply
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Execute each step (where each step roughly corresponds to a subcloud) of the
specified firmware update strategy. This would look something like this:

* Query sysinv in RegionOne for active FPGA images using the REST API.
* For each strategy step, use the sysinv REST API to:

  * Query the subcloud for device labels.
  * Query the subcloud for FPGA images.
  * Create/update/delete FPGA images in the subcloud as needed to bring it
    into sync with the FPGA images in the SystemController. We don't do this
    via dcorch because we want to ensure the data is up to date when applying
    the update strategy. (This process could take some time on a slow
    subcloud link.)
  * Apply the device image in the subcloud.
  * Create an FPGA update strategy using the VIM REST API.
  * Apply the FPGA update strategy using the VIM REST API.
  * Monitor progress by querying the FPGA update strategy using the VIM REST
    API.

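The steps above can be sketched as a single per-subcloud function, with the
sysinv and VIM REST clients reduced to hypothetical stub interfaces so the
control flow is visible without any network I/O:

```python
def apply_strategy_step(subcloud, central_images, sysinv, vim, poll):
    """Sync images to one subcloud, then drive its VIM update strategy.

    `sysinv` and `vim` are hypothetical client objects with the methods used
    below; `poll` yields the VIM strategy state until it is terminal.
    """
    present = sysinv.list_images(subcloud)
    for image in central_images:
        if image["uuid"] not in present:
            sysinv.create_image(subcloud, image)   # push missing image down
        sysinv.apply_image(subcloud, image["uuid"])
    vim.create_strategy(subcloud)
    vim.apply_strategy(subcloud)
    for state in poll(subcloud):                   # monitor until done
        if state in ("applied", "failed", "aborted"):
            return state
```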
dcmanager fw-update-strategy abort
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pass the abort down to each subcloud, and do not process any more subclouds.

(Maybe leave this till last, as it is not needed for the success path.)

dcmanager fw-update-strategy delete
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete the firmware update strategy. This would also delete the firmware
update strategy in the subcloud using the VIM REST API. It is not valid to
delete an in-progress update strategy.

dcmanager strategy-step list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extend if needed to list the strategy steps and their state for the FPGA
update strategy that is being applied.

dcmanager strategy-step show
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extend if needed to show the step for a specific subcloud.

System Controller Sysinv Operations
|
||||
-----------------------------------
|
||||
The sysinv operations in the system controller essentially duplicate the
|
||||
image-related subset of the sysinv operations in the subcloud. We don't expect
|
||||
the system controller to have any FPGAs, so the device-image-update,
|
||||
host-device-list, and host-device-show commands are not relevant. In all cases
|
||||
the request is intercepted by sysinv-api-proxy in the SystemController region
|
||||
and forwarded to RegionOne. Unlike normal resources, dcorch will not be used
|
||||
to synchronize the FPGA image information.
|
||||
|
||||
system device-image-create
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Define a new image, as described in `Subcloud Sysinv Operations`_. If
|
||||
successful, the sysinv API proxy will also save the image to
|
||||
/opt/device-image-vault, which will be a drbd-replicated filesystem analogous
|
||||
to how /opt/patch-vault is used to store patch files for orchestrated patching.
|
||||
|
||||
system device-image-list
^^^^^^^^^^^^^^^^^^^^^^^^

Display high-level image data for all known images as per
`Subcloud Sysinv Operations`_.


system device-image-show
^^^^^^^^^^^^^^^^^^^^^^^^

Display detailed image data for a single image as per
`Subcloud Sysinv Operations`_.

system device-image-delete
^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete an image as per `Subcloud Sysinv Operations`_. When deleting a
functional image, remind the user about uploading and activating a
key-revocation bitstream if the image is being deleted for security issues.
Deleting an active image will not be allowed.

system device-image-apply
^^^^^^^^^^^^^^^^^^^^^^^^^

Make the specified image *active* as per `Subcloud Sysinv Operations`_.


system device-image-remove
^^^^^^^^^^^^^^^^^^^^^^^^^^

Make the specified image *inactive* as per `Subcloud Sysinv Operations`_.

Fault Handling
--------------

While a device is in the middle of updating its functional image, it's
possible that a fault could occur that would normally result in the host being
rebooted. If we reboot while updating the N3000 functional image, it could
result in a 40-minute outage on host startup while we flash the functional
image again.

Given the above, the desired behavior while a device image update is in
progress is to avoid rebooting on faults (critical process alarm, low memory
alarm, etc.) as long as the fault is not something (like high temperature) that
could actually damage the hardware.

This is less of an issue for AIO-SX since we're already suppressing mtce reboot
actions.

The host watchdog will currently reset the host under certain circumstances.
This is undesirable if we're in the middle of updating device images, so the
sysinv FPGA agent will temporarily shut down the "hostwd" daemon during the
device image update and start it back up again afterwards. (Later on we may
want to modify it to stay running but emit logs instead of actually resetting
the host.)

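The hostwd handling above might be structured as a guard that restarts the
watchdog even if the flash fails. This is only a sketch: the service-control
command shown is a placeholder, not the real mechanism the FPGA agent will
use, and the injectable ``control`` parameter exists purely for illustration:

```python
import subprocess
from contextlib import contextmanager


def _service_ctl(action, name):
    # Placeholder: on a real host this would invoke whatever service
    # management mechanism actually controls hostwd.
    subprocess.check_call(["service", name, action])


@contextmanager
def hostwd_paused(control=_service_ctl):
    """Stop the host watchdog for the duration of a device image update.

    The finally-block guarantees hostwd is restarted even when the
    update raises, so a failed flash never leaves the watchdog down.
    """
    control("stop", "hostwd")
    try:
        yield
    finally:
        control("start", "hostwd")
```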
CLI Clients
-----------

We will extend the existing *system*, *sw-manager*, and *dcmanager* clients to
add the new commands and extend the existing commands where applicable.

Specifically, for the case of system host-device-show the expectation is that
the new FPGA-specific fields will only be returned by the server for FPGA
devices. The client will need to be able to handle the variable set of fields
rather than assuming a constant set of fields.

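A client that handles a variable field set might render whatever the server
returns rather than a fixed schema, along these lines. The field names here
are illustrative only; they are not the actual host-device-show API fields:

```python
# Fields every device is assumed to have, shown first for stable output;
# any extra (e.g. FPGA-specific) fields the server returns follow.
BASE_FIELDS = ["name", "pciaddr", "pvendor", "pdevice"]


def format_device(device):
    """Render a device dict, base fields first, then any extras sorted."""
    extras = sorted(k for k in device if k not in BASE_FIELDS)
    ordered = [f for f in BASE_FIELDS if f in device] + extras
    return "\n".join("%-24s: %s" % (key, device[key]) for key in ordered)
```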
Web GUI
-------

If we want to allow this to be handled entirely through the GUI, we'd need to
add support for all the system controller operations from sysinv and dcmanager.

This will not be implemented in the initial release.

Alternatives
------------

Given our existing infrastructure, there aren't too many alternatives. We
could extend the existing *sysinv-agent* instead of making a new FPGA-specific
agent, but there's going to be a fair bit of hardware-specific code in the new
agent, so that might not make sense.

The VIM and dcmanager changes closely align with how we already support
software patching and software upgrade, so this enables maximum code re-use.

Sysinv already talks to the hardware and deals with PCI devices, as well as
controlling the lock/unlock/reboot operations, so it's the logical place to
handle the interactions between those operations and the device image updates.

Data model impact
-----------------

The dcmanager DB will have a new *subcloud_group* table which maps subclouds
into groups. Subclouds within a group can be updated in parallel, while
subclouds from different groups cannot.

The sysinv DB will have a new *fpga_devices* table which will include new
fields that are specific to the FPGA devices. Each row will be associated
with a row in the *pci_devices* table.

The sysinv DB *pci_devices* table will get a new "needs firmware update"
column.

The sysinv DB will get a new *device_images* table which stores all necessary
information for each device image.

The dcmanager DB will get a number of new tables (analogous to the ones used
for software patching) which will track the strategy data for the device image
update at the distributed-cloud SystemController level.

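The subcloud_group semantics above (parallel within a group, serial across
groups) can be sketched as a small scheduler. This is an illustration of the
intended ordering only; the function and the shape of its inputs are
assumptions, not the actual dcmanager strategy code:

```python
from concurrent.futures import ThreadPoolExecutor


def apply_strategy(subcloud_groups, update_one, max_parallel=10):
    """Update subclouds group by group.

    subcloud_groups is an ordered list of (group_name, [subclouds]).
    Subclouds within a group are submitted concurrently; the next
    group is not started until every update in the current group
    has finished (leaving the with-block waits for the whole batch).
    """
    results = {}
    for _group, subclouds in subcloud_groups:
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            futures = {sc: pool.submit(update_one, sc) for sc in subclouds}
        for sc, fut in futures.items():
            results[sc] = fut.result()
    return results
```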
REST API impact
---------------

TBD

Security impact
---------------

The low-level implementation of the sysinv FPGA agent assumes the use of
privileged containers to handle the actual low-level interaction with the
physical hardware. We currently allow privileged containers, but we may want
to lock things down further in the future. In that case we might need to
install the OPAE tools as part of StarlingX rather than in a container.

This change does not directly deal with sensitive data. It deals with
bitstreams which may represent sensitive data, but the bitstreams have already
been signed before they're provided to StarlingX.

The biggest security impact would be an admin-user impact, since once an N3000
device has had its root key programmed it cannot be changed short of sending
the device back to the factory.

Other end user impact
---------------------

When a device image is being updated, it's very likely that a hardware reset
will be required, either of that specific device or of the whole host. This
will necessarily cause a service outage on the device in question, as well as
for any application containers making use of the device.

Performance Impact
------------------

The new code is not expected to be called frequently. It is expected to be
called more often during the initial phases of a customer network build-out as
FPGA images are reworked to deal with teething issues.

There will be a periodic audit in dcmanager to check whether each subcloud is
up-to-date on its hardware device images. We will generally only trigger
device updates during a maintenance window, so this audit does not need to be
frequent.

The API changes have been designed to minimize the number of calls required
to perform this audit, since they involve communicating between the
SystemController and the subclouds, which may be geographically remote.
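
The call-minimizing audit pattern above amounts to one bulk query per subcloud
rather than one query per device. The sketch below illustrates that shape
only; the fetch function and its payload format are hypothetical, not the
actual dcmanager audit API:

```python
def audit_subclouds(subclouds, fetch_device_summary):
    """Return the subclouds that still have out-of-date device images.

    fetch_device_summary(subcloud) performs a single remote call per
    subcloud and returns a list of entries shaped like
    {"device": ..., "needs_firmware_update": bool}, so the audit cost
    is one round-trip per subcloud regardless of device count.
    """
    stale = []
    for subcloud in subclouds:
        summary = fetch_device_summary(subcloud)
        if any(dev["needs_firmware_update"] for dev in summary):
            stale.append(subcloud)
    return stale
```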

While performing firmware updates, the host in question will not be able to be
locked unless the admin forces the operation.

Other deployer impact
---------------------

None.


Developer impact
----------------

Nothing different than any other development in these areas of the code.


Upgrade impact
--------------

All changes will be made in such a way as to support upgrades from the previous
version.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Chris Friesen

Other contributors:
  Al Bailey
  Eric MacDonald
  Teresa Ho

Repos Impacted
--------------

`<https://opendev.org/starlingx/config>`_

`<https://opendev.org/starlingx/distcloud.git>`_

`<https://opendev.org/starlingx/nfv.git>`_

Work Items
----------

* Sysinv FPGA agent changes.
* Sysinv-api and sysinv-conductor changes for triggering device image updates.
* Sysinv-api and sysinv-conductor changes for managing device images.
* VIM work for orchestrating firmware update at the subcloud level.
* dcmanager work for orchestrating firmware update at the SystemController
  level.
* sysinv-api-proxy work for proxying the device image management API to
  RegionOne and saving the device images in the vault for use by dcmanager.

Dependencies
============

None


Testing
=======

Unit tests will be added for new/modified code and will be executed by tox,
which is already supported for dcmanager, sysinv, and VIM.

The expectation is that there will be 3rd-party testing by Wind River with
actual hardware to ensure that this works as expected.

Documentation Impact
====================

The Cloud Platform Node Management guide, Cloud Platform Administration
Tutorials, and Cloud Platform User Tutorials will likely need to be updated.

The VIM, dcmanager, and sysinv API documentation will be updated with the new
APIs.

References
==========

.. [1] https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/intel-fpga-pac-n3000/overview.html

.. [2] https://opae.github.io/latest/index.html

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - stx-4.0
     - Introduced