From 8df985b8769637ad89ee082cb102cba000cbb4e1 Mon Sep 17 00:00:00 2001
From: John Kung <john.kung@windriver.com>
Date: Thu, 19 Mar 2020 11:38:21 -0400
Subject: [PATCH] StarlingX Platform Upgrades

This story will provide a mechanism to upgrade the platform components on
a running StarlingX system. This is required to allow upgrades between
StarlingX versions.

The platform upgrade components include the Host OS and StarlingX
components (e.g. flock services).

A maintenance release for stx3.x is required to upgrade to stx4.0.

Change-Id: I0dc023e93a5ec08ac975e3594b50729f6c505c8c
Story: 2007403
Task: 39105
Signed-off-by: John Kung <john.kung@windriver.com>
---
 .../starlingx-2007403-platform-upgrades.rst | 735 ++++++++++++++++++
 1 file changed, 735 insertions(+)
 create mode 100644 doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst

diff --git a/doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst b/doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst
new file mode 100644
index 0000000..267be78
--- /dev/null
+++ b/doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst
@@ -0,0 +1,735 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License. http://creativecommons.org/licenses/by/3.0/legalcode


===========================
StarlingX Platform Upgrades
===========================

Storyboard:
https://storyboard.openstack.org/#!/story/2007403

This story will provide a mechanism to upgrade the platform components on
a running StarlingX system. This is required to allow upgrades between
StarlingX versions.

The platform upgrade components include the Host OS and StarlingX
components (e.g. flock services).

A maintenance release for stx3 is required to upgrade to stx4.0.

Problem Description
===================

StarlingX must provide a mechanism to allow migration to a new StarlingX
release.

In order to provide a robust and simple upgrade experience for
users of StarlingX, the upgrade process must be automated as much as
possible and controls must be in place to ensure the steps are followed
in the right order.

Release-over-release compatibility of the platform components is affected
by inter-node messaging between components, configuration migration
requirements, and kubernetes control plane compatibility.

Downtime during an upgrade must be minimized:

* controller upgrade - impact is minimized to the time taken for a
  host-swact
* worker upgrade - impact to applications is minimized to the time it
  takes to migrate applications off a worker node before it is upgraded
* storage upgrade - no loss of storage during an upgrade

Upgrades must be done in-service. The platform and applications must
continue to provide service during the upgrade. This does not apply to
simplex deployments.

Upgrades must be done without any additional hardware.

Background
----------

Three types of StarlingX upgrades will be supported:

* Platform upgrade, which includes the Host OS and StarlingX components
  (e.g. flock services)
* Kubernetes upgrade
* Application upgrade, which includes StarlingX applications (e.g.
  platform-integ-apps, stx-openstack), user applications and kubernetes
  workloads

These three types of upgrades are done independently. For example, the
platform is upgraded to a new release of StarlingX without changing the
kubernetes version. However, there are dependencies which determine the
order in which these upgrades can be done. For example, kubernetes must
be upgraded to a particular version before a platform upgrade can be
done.
Use Cases
---------

* Administrator wants to upgrade to a new StarlingX platform version
  with minimal impact to running applications.
* Administrator wants to abort an upgrade in progress prior to upgrading
  all controllers.
  Note: Downgrade to a previous release version is not supported.


Proposed Process
================

StarlingX will only support upgrades from release N to release N+1.
For example, maintenance release stx3.x can be upgraded to stx4.0,
but not directly to stx5.0.
Changes required for kubernetes configuration compatibility are
delivered in a maintenance release to enable the upgrade from stx3 to
stx4.0.

The administrator must ensure that their deployment has enough
extra capacity (e.g. worker hosts) to allow one (or more) hosts to be
temporarily taken out of service for the upgrade.

For each supported platform version, the versions that can be upgraded
from are tracked by metadata (in
metal/common-bsp/files/upgrades/metadata.xml). The metadata handling is
extended to support multiple from-versions.

A maintenance release will enable stx3 to stx4.0 upgrades, and includes
the configuration updates required to enable compatibility with the
kubernetes control plane during the upgrade.

The following is a summary of the steps the user will take when performing
a platform upgrade. For each step, a summary of the actions the system
will perform is provided.

The software_upgrade table tracks the upgrade state, which includes:

* upgrade-started
* data-migration
* data-migration-complete
* upgrading-controllers
* upgrading-hosts
* activation-requested
* activation-complete
* completing
* completed

When an upgrade is aborted, the following state transitions occur:

* aborting
* abort-completing
* aborting-reinstall

#. **Import release N+1 load**

   ::

      # system load-import

      # system load-list
      +----+----------+------------------+
      | id | state    | software_version |
      +----+----------+------------------+
      | 1  | active   | 19.12            |
      | 2  | imported | 20.06            |
      +----+----------+------------------+

   The fields are:

   * software_version: comes from metadata in the load image.

   * states:

     * active: the current version, version N
     * importing: the image is being uploaded to the load repository
     * error: the load is in an error state
     * deleting: the load is being deleted from the repository
     * imported: the version that can be upgraded to, i.e. version N+1

#. **Perform health checks for upgrade**

   ::

      # system health-query-upgrade

   This will perform health checks to ensure the system is in
   a state ready for upgrade.

   These health checks are also performed as part of upgrade-start.

   These include checks that:

   * the upgrade target load is imported
   * all hosts are provisioned
   * all hosts are running the current load
   * all hosts are unlocked/enabled
   * all hosts have matching configs
   * there are no management-affecting alarms
   * for ceph systems: the storage cluster is healthy
   * the kubernetes nodes are ready
   * the kubernetes control plane pods are ready
   * the kubernetes control plane is at the version and
     configuration required for upgrade. If not, the kubernetes upgrade [1]_
     procedure must be performed first to bring it to the required baseline.
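   As an illustrative example (the exact set of checks and their wording
   may vary by release), a system that is ready for upgrade might report:

   ::

      # system health-query-upgrade
      System Health:
      All hosts are provisioned: [OK]
      All hosts are unlocked/enabled: [OK]
      All hosts have current configurations: [OK]
      No alarms: [OK]
      Ceph Storage Healthy: [OK]
      All kubernetes nodes are ready: [OK]
      All kubernetes control plane pods are ready: [OK]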
#. **Start the upgrade**

   ::

      # system upgrade-start

   This performs semantic checks and the health checks as per the
   'health-query-upgrade' step.

   This will make a copy of the system data (e.g. postgres databases and
   armada, helm, kubernetes, and puppet hiera data) to be used in the
   upgrade.

   Note that the /opt/etcd cluster data may be dynamically updating cluster
   state info until the host-swact, when service management brings down the
   active-standby etcd on the N side and up on the N+1 side.

   Configuration changes are not allowed after this point, until the upgrade
   is completed.

#. **Lock and upgrade controller-1**

   ::

      # system host-upgrade controller-1

   * upgrade state is set to 'data-migration'
   * the upgrade_controller_1 flag is updated so that controller-1 can
     determine whether it is in an upgrade
   * host controller-1 is reinstalled with the N+1 load
   * data and configuration are migrated from release N to release N+1
   * a special release N+1 puppet upgrade manifest is applied, based on
     the hiera data that was migrated from release N. This allows for
     one-time actions similar to what was done on the initial install of
     controller-0 (e.g. configuring rabbit, postgres, keystone).
   * hiera data is generated for release N+1, to be used to apply
     the regular puppet manifest when controller-1 is unlocked
   * replicated (DRBD) filesystems are synced
   * upgrade state is set to 'data-migration-complete'
   * system data is present in both release N and release N+1 versioned
     directories (e.g. /opt/platform/config/<version>,
     /var/lib/postgresql/<version>)

#. **Unlock controller-1**

   This includes generating configuration data for controller-1 which must
   be generated from the active controller.

   * the join_cmd for the kubernetes control plane is generated on the N
     side and stored in the N+1 hiera data

   The N+1 hiera data drives the puppet manifest apply.

#. **Swact to controller-1**

   ::

      # system host-swact controller-0

   * controller-1 becomes active and runs release N+1 while the rest of the
     system is running release N
   * any release N+1 components that do inter-node communications must be
     backwards compatible to ensure that communication with release N
     works correctly
   * /opt/etcd/version_N is backed up and restored to /opt/etcd/version_N+1
     for the target version on host-swact. This must be performed at a time
     when data loss can be avoided. As part of the host-swact startup on
     controller-1 during an upgrade, etcd is copied from the release N etcd
     directory to the release N+1 etcd directory.

#. **Lock and upgrade controller-0**

   ::

      # system host-upgrade controller-0

   * the N+1 load is installed and, on host-unlock, the upgrades manifest
     and the puppet host configuration are applied
   * after controller-0 is upgraded, upgrade state is set to
     'upgrading-hosts'

#. **If applicable, lock and upgrade storage hosts**

   ::

      # system host-upgrade storage-0

   * if provisioned, all storage hosts must be upgraded prior to
     proceeding with workers
   * the N+1 load is installed; up to half of the storage hosts can be done
     in parallel
   * ceph data is synced

#. **Lock and upgrade worker hosts**

   ::

      # system host-upgrade worker-x

   * workloads are migrated from the worker node (triggered by host-lock)
   * the N+1 load is installed
   * worker hosts can be done in parallel, depending upon excess capacity.
     Each worker host will first be locked using the existing "system
     host-lock" CLI (worker hosts can be done in any order). This results in
     services being migrated off the host and applies the NoExecute taint,
     which will evict any pods that can be evicted.
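   As an illustrative check (assuming kubectl access from the active
   controller; the taint key shown is the one StarlingX applies to locked
   hosts), the eviction taint can be confirmed before reinstalling the
   host:

   ::

      # kubectl describe node worker-0 | grep Taints
      Taints:             services=disabled:NoExecute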
#. **Activate the upgrade**

   ::

      # system upgrade-activate

   * Perform any additional configuration which may be required after all
     hosts have been upgraded.

#. **Swact back to controller-0**

   ::

      # system host-swact controller-1

#. **Complete the upgrade**

   ::

      # system upgrade-complete

   * Run post-checks to ensure the upgrade has been completed
   * Remove release N data

**Failure Handling**

* When a failure happens and cannot be resolved without manual intervention,
  the upgrade state will be set to data-migration-failed or activation-failed.
* To recover, the user will need to resolve the issue that caused the upgrade
  step to fail.
* An upgrade-abort is only possible before controller-0 has been upgraded.
  In other cases, the user would need to resolve the issue and reattempt
  the step.

**Health Checks**

* In order to ensure the health and stability of the system, health checks
  are done both before allowing a platform upgrade to start and then as
  each upgrade CLI is run.
* The health checks will include:

  * basic system health (i.e. system health-query)
  * new kubernetes specific checks - for example:

    * verify that all kubernetes control plane pods are running
    * verify that all kubernetes applications are fully applied
    * verify that the kubernetes control plane version and configuration
      are at the baseline required for platform upgrade

**Interactions with container applications**

* The kubernetes join_cmd must be created from the N side running the
  active kubernetes control plane.
* The platform upgrade and the kubernetes upgrade are mutually exclusive:
  a kubernetes upgrade is not allowed when a platform upgrade is in
  progress, and vice-versa.
* Before starting a platform upgrade, we also need to check that the
  kubernetes configuration is at a baseline suitable for upgrade.
  The N+1 load metadata enforces the configuration baseline required on
  the from (N) side.
* If the N+1 release is at a newer kubernetes version, then the
  kubernetes upgrade procedure must be completed first in order to align
  the kubernetes version.
* After a platform upgrade has started, helm-override operations will be
  prevented, as these configuration changes will not be preserved after
  upgrade-start and can also trigger applications to be reapplied.

Alternatives
------------

Update the kubernetes configuration to the N+1 configuration after the
upgrade. However, this would necessitate coordinating the activation of
features such as the control plane address and encryption at rest during
the upgrade, such as during the upgrade-activate step. This would require
N+1 to be backwards compatible with N.

A mechanism is required to upgrade etcd [3]_; keeping the etcd database
versioned will allow an upgrade to a newer etcd version.

**etcd Upgrades**

* host-swact to controller-1 during upgrade. As part of the host-swact
  during an upgrade, the kubernetes etcd is copied from the N side. This
  takes a copy at a time when the etcd data is not allowed to change, as
  etcd would be brought down on controller-0, and prior to service
  management bringing up etcd on controller-1. After the new version of
  etcd runs with the migrated data, it is no longer possible to run the
  old version of etcd against it. Therefore, the release N version of the
  data must be maintained in the event of a host-swact back or
  upgrade-abort prior to the upgrade of controller-0. This is the chosen
  alternative; a sketch follows this list.
* Alternative: migrate etcd at upgrade-start instead of updating etcd on
  host-swact. Configuration changes which affect the cluster state
  information could still occur in this scenario. Kubernetes state changes
  that occur after the snapshot would be lost and have the potential to
  put the kubernetes cluster into a bad state.
* Alternative: /opt/etcd is unversioned so that the N and N+1 sides both
  reference the same directory. This is based on the premise that the
  kubernetes control plane is upgraded independently and does not require
  a versioned directory. However, as noted in the host-swact alternative,
  this would not be compatible with upgrade-abort or host-swact back to
  the N release.
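A minimal sketch of the chosen approach, assuming illustrative versioned
paths (/opt/etcd/19.12 and /opt/etcd/20.06, matching the example versions
above); the actual copy is driven by the service management scripts:

::

   # Illustrative only: with etcd stopped on controller-0 and not yet
   # started on controller-1, copy the release N data into the release
   # N+1 versioned directory. The release N copy is left in place so a
   # swact back or upgrade-abort can still run the old etcd version.
   cp -a /opt/etcd/19.12/. /opt/etcd/20.06/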
Data Model Impact
-----------------

The following tables in the sysinv database are required. The data model
required to support platform upgrades is already present in the stx3.0
data model, and includes the following platform upgrade focused tables.

* loads: represents the load version (e.g. N and N+1), the load state and
  compatible versions

* software_upgrade: represents the software upgrade state, from_load and
  to_load

* host_upgrade: represents the software_load and target_load for each host

REST API Impact
---------------

The v1 load, health and upgrade resources implement the platform upgrade
specific URLs used for the upgrade. The config repo
api-ref-sysinv-v1-config.rst doc is updated accordingly.

The sysinv REST API supports the following upgrade-related methods:

* The existing resource /loads

  * URLS:

    * /v1/loads

  * Request Methods:

    * GET /v1/loads

      * Returns all platform loads known to the system

    * POST /v1/loads/import_load

      * Imports the new load passed into the body of the POST request

* The existing resource /upgrade

  * URLS:

    * /v1/upgrade

* The existing resource /ihosts supports upgrade actions.

  * URLS:

    * /v1/ihosts/{host_id}

  * Request Methods:

    * POST /v1/ihosts/{host_id}/upgrade

      * Upgrades the platform load on the specified host
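For illustration, these resources can be exercised directly (the token,
address and host UUID are placeholders; request bodies and error handling
are omitted, and the system CLI wraps these calls in practice):

::

   # List the loads known to the system (GET /v1/loads)
   curl -H "X-Auth-Token: ${TOKEN}" http://<oam-ip>:6385/v1/loads

   # Upgrade the platform load on a host (POST /v1/ihosts/{host_id}/upgrade)
   curl -X POST -H "X-Auth-Token: ${TOKEN}" \
        http://<oam-ip>:6385/v1/ihosts/${HOST_UUID}/upgrade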
Security Impact
---------------

This story provides a mechanism to upgrade the platform from one version
to another. It does not introduce any additional security impacts beyond
those already present for the initial deployment.

Other End User Impact
---------------------

End users will typically perform upgrades using the sysinv (i.e.
system) CLI. The CLI commands used for the upgrade are as noted in
the `Proposed Process`_ section above.

Performance Impact
------------------

When a platform upgrade is in progress, each host must be taken out of
service in order to install the new load.
The user must ensure that there is enough capacity in the system to handle
the removal from service of one (or more) hosts as the load on each
host is upgraded.

Other Deployer Impact
---------------------

Deployers will now be able to upgrade the StarlingX platform on a running
system.

Developer Impact
----------------

Developers working on the StarlingX components that manage container
applications may need to be aware that certain operations should be
prevented when a platform upgrade is in progress. This is discussed in
the `Proposed Process`_ section above.

Upgrade Impact
--------------

StarlingX platform upgrades are independent from the Kubernetes upgrade [1]_.
However, when StarlingX platform upgrades are supported, checks must be put
in place to ensure that the kubernetes version is not allowed to change due
to a platform upgrade. In effect, the system must already be running the
same version of kubernetes as is packaged in the new platform release. This
will be enforced through semantic checking in the platform upgrade APIs.

The platform upgrade excludes the upgrade of applications. Applications will
need to be compatible with the new version of the platform/kubernetes.
Any upgrade of hosted applications is independent of the platform upgrade.

Simplex Platform Upgrades
=========================

At a high level the simplex upgrade process involves the following steps:

* taking a backup of the platform data
* installing the new StarlingX software
* restoring and migrating the platform data

Simplex Upgrade Process
-----------------------

#. **Import release N+1 load**

   ::

      # system load-import

      # system load-list
      +----+----------+------------------+
      | id | state    | software_version |
      +----+----------+------------------+
      | 1  | active   | 19.12            |
      | 2  | imported | 20.06            |
      +----+----------+------------------+

#. **Start the upgrade**

   ::

      # system upgrade-start

   This performs semantic checks and the health checks as per the
   'health-query-upgrade' command.

   This will make a copy of the system platform data similar to a platform
   backup. The upgrade data will be placed under /opt/backups.

   Any changes made after this point will be lost.

#. **Copy the upgrade data**

   During the upgrade process the rootfs will be wiped, and the upgrade
   data deleted. The upgrade data must be copied from the system to an
   alternate safe location (such as a USB drive or remote server).

#. **Lock and upgrade controller-0**

   ::

      # system host-upgrade controller-0

   This will wipe the rootfs and reboot the host.

#. **Install the new release of StarlingX**

   Install the new release of StarlingX software via network or USB.

#. **Restore the upgrade data**

   ::

      # ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade.yml \
           -e "ansible_become_pass=<pw> admin_password=<pw> upgrade_data_file=<path>"

   The upgrade playbook will migrate the upgrade data to the current release
   and restore it to the system.

   This playbook requires the following parameters (shown above with
   placeholder values):

   * ansible_become_pass
   * admin_password
   * upgrade_data_file

#. **Unlock controller-0**

   ::

      # system host-unlock controller-0

#. **Activate the upgrade**

   ::

      # system upgrade-activate

   Perform any additional configuration which may be required after the
   host is unlocked.

#. **Complete the upgrade**

   ::

      # system upgrade-complete

   Remove data from the previous release.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

* John Kung (john.kung@windriver.com)

Other contributors:

* David Sullivan (david.sullivan@windriver.com)

Repos Impacted
--------------

* config
* update
* integ
* metal
* stx-puppet
* ansible-playbooks

Work Items
----------

Please refer to the Story [2]_ for the complete list of tasks.

The following are prerequisites prior to the upgrade (the metadata
enforcement is illustrated in the sketch after this list):

* update the kubernetes configuration to the features configured on N+1.
  This will be enabled by a software-delivered increment that will enable
  the required configuration baseline:

  * update the kubernetes control_plane_address
  * kubernetes encryption at rest

  Updating the kubernetes version is covered by [1]_ and
  is performed independently.

* This is enforced by the upgrade load N+1 metadata, which specifies the
  from-loads that the upgrade is supported from.

* The etcd directory is unversioned so that it can be referenced by the N
  and N+1 kubernetes control plane.
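As an illustrative sketch only (element names are assumptions; the actual
schema is defined by metal/common-bsp/files/upgrades/metadata.xml), the
load metadata might express the supported from-loads as:

::

   <build>
     <version>20.06</version>
     <supported_upgrades>
       <upgrade>
         <version>19.12</version>
       </upgrade>
     </supported_upgrades>
   </build>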
The following steps in the upgrade require changes:

* load-import
  The metadata handling is extended to support multiple from-versions.

* health-query-upgrade
  Health checks are added to ensure the kubernetes version and
  configuration are at the correct baseline for upgrade.

upgrade-start:

* upgrade-start-pkg-extract
  Update to reference dnf rather than the superseded repoquery tool
* migrate puppet hiera data
* export armada, helm and kubernetes configuration to N+1
* export the databases for N+1

host-upgrade:

* create /etc/platform/.upgrade_controller_1 so that controller-1 via RPC
  can determine that a controller upgrade is required

host-unlock:

* create the join command from the N side for the N+1 side
* run the upgrades playbook for docker. This will push the required docker
  images.

host-swact:

* update to backup /opt/etcd/from_version and restore
  to /opt/etcd/to_version for the target version on host-swact.
  This is performed at a time when data loss can be avoided. During an
  upgrade, after host-swact and before etcd has started on controller-1,
  the etcd data is copied from controller-0. Normally, an etcdctl snapshot
  is required when data is still dynamically changing; however, as service
  management manages etcd in active-standby, and the copy is occurring as
  part of etcd startup, it is possible to use a direct copy.

Ansible:

* upgrade playbook for docker. push_k8s_images.yml is updated to handle
  the platform upgrade case.

Integ:

* Update registry-token-server to continue to support GET for token.
  This is performed as part of Story 2006145, Task 38763
  https://review.opendev.org/#/c/707283/

* Add semantic checks to existing APIs

  * application-apply/remove/etc... - prevent when a platform upgrade is
    in progress
  * helm-override-update/etc... - prevent when a platform upgrade is in
    progress

Miscellaneous:

* Update metadata for upgrade versions
* Remove openstack service and database references in the upgrade code
* Update supported from-version checks

Dependencies
============

None

Testing
=======

Upgrades must be tested in the following StarlingX configurations:

* AIO-DX
* Standard with controller storage
* Standard with dedicated storage
* AIO-SX

The testing can be performed on hardware or in virtual environments.

Documentation Impact
====================

New end user documentation will be required to describe how platform
upgrades should be done.

The config API reference will also need updates.

References
==========

.. [1] Kubernetes Upgrade Story https://storyboard.openstack.org/#!/story/2006781
.. [2] Platform Upgrades Story https://storyboard.openstack.org/#!/story/2007403
.. [3] etcd upgrades https://etcd.io/docs/v3.4.0/upgrades


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - stx-4.0
     - Introduced