From 8df985b8769637ad89ee082cb102cba000cbb4e1 Mon Sep 17 00:00:00 2001
From: John Kung <john.kung@windriver.com>
Date: Thu, 19 Mar 2020 11:38:21 -0400
Subject: [PATCH] StarlingX Platform Upgrades

This story will provide a mechanism to upgrade the platform components on
a running StarlingX system. This is required to allow upgrades between
StarlingX versions.

The platform upgrade components include the Host OS and StarlingX
components (e.g. flock services).

A maintenance release for stx3.x is required to upgrade to stx4.0.

Change-Id: I0dc023e93a5ec08ac975e3594b50729f6c505c8c
Story: 2007403
Task: 39105
Signed-off-by: John Kung <john.kung@windriver.com>
---
 .../starlingx-2007403-platform-upgrades.rst | 735 ++++++++++++++++++
 1 file changed, 735 insertions(+)
 create mode 100644 doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst

diff --git a/doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst b/doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst
new file mode 100644
index 0000000..267be78
--- /dev/null
+++ b/doc/source/specs/stx-4.0/approved/starlingx-2007403-platform-upgrades.rst
@@ -0,0 +1,735 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License. http://creativecommons.org/licenses/by/3.0/legalcode


===========================
StarlingX Platform Upgrades
===========================

Storyboard:
https://storyboard.openstack.org/#!/story/2007403

This story will provide a mechanism to upgrade the platform components on
a running StarlingX system. This is required to allow upgrades between
StarlingX versions.

The platform upgrade components include the Host OS and StarlingX
components (e.g. flock services).

A maintenance release for stx3 is required to upgrade to stx4.0.

Problem Description
===================

StarlingX must provide a mechanism to allow migration to a new StarlingX
release.

In order to provide a robust and simple upgrade experience for
users of StarlingX, the upgrade process must be automated as much as
possible and controls must be in place to ensure the steps are followed
in the right order.

Release-over-release compatibility of the platform components is affected
by inter-node messaging between components, configuration migration
requirements, and kubernetes control plane compatibility.

Downtime during an upgrade must be minimized:

* controller upgrade - impact is minimized to the time taken for a
  host-swact
* worker upgrade - impact to applications is minimized to the time it
  takes to migrate applications off a worker node before it is upgraded
* storage upgrade - no loss of storage during an upgrade

Upgrades must be done in-service. The platform and applications must
continue to provide service during the upgrade. This does not apply to
simplex deployments.

Upgrades must be done without any additional hardware.

Background
----------

Three types of StarlingX upgrades will be supported:

* Platform upgrade, which includes the Host OS and StarlingX components
  (e.g. flock services)
* Kubernetes upgrade
* Application upgrade, which includes StarlingX applications (e.g.
  platform-integ-apps, stx-openstack), user applications and kubernetes
  workloads

These three types of upgrades are done independently. For example, the
platform is upgraded to a new release of StarlingX without changing the
kubernetes version. However, there are dependencies which determine the
order in which these upgrades can be done. For example, kubernetes must
be upgraded to a particular version before a platform upgrade can be
done.
Use Cases
---------

* Administrator wants to upgrade to a new StarlingX platform version
  with minimal impact to running applications.
* Administrator wants to abort an upgrade in progress prior to upgrading
  all controllers.
  Note: Downgrade to a previous release version is not supported.


Proposed Process
================

StarlingX will only support upgrades from release N to release N+1.
For example, maintenance release stx3.x can be upgraded to stx4.0,
but not directly to stx5.0.
Changes required for kubernetes configuration compatibility are
delivered in a maintenance release to enable the upgrade from stx3 to
stx4.0.

The administrator must ensure that their deployment has enough
extra capacity (e.g. worker hosts) to allow one (or more) hosts to be
temporarily taken out of service for the upgrade.

For each supported platform version, the versions that can be upgraded
from are tracked by metadata (in
metal/common-bsp/files/upgrades/metadata.xml). The metadata handling is
extended to support multiple from-versions.

A maintenance release will enable stx3 to stx4.0 upgrades, and includes
the configuration updates required to enable compatibility with the
kubernetes control plane during the upgrade.

The following is a summary of the steps the user will take when performing
a platform upgrade. For each step, a summary of the actions the system
will perform is provided.

The software_upgrade table tracks the upgrade state, which includes:

* upgrade-started
* data-migration
* data-migration-complete
* upgrading-controllers
* upgrading-hosts
* activation-requested
* activation-complete
* completing
* completed

When an upgrade is aborted, the following state transitions occur:

* aborting
* abort-completing
* aborting-reinstall

#. **Import release N+1 load**

   ::

      # system load-import

      # system load-list
      +----+----------+------------------+
      | id | state    | software_version |
      +----+----------+------------------+
      | 1  | active   | 19.12            |
      | 2  | imported | 20.06            |
      +----+----------+------------------+

   The fields are:

   * software_version: comes from metadata in the load image.

   * states:

     * active: the current version, version N
     * importing: the image is being uploaded to the load repository
     * error: the load is in an error state
     * deleting: the load is being deleted from the repository
     * imported: the version that can be upgraded to, i.e. version N+1

#. **Perform health checks for upgrade**

   ::

      # system health-query-upgrade

   This will perform health checks to ensure the system is in
   a state ready for upgrade.

   These health checks are also performed as part of upgrade-start.

   These include checks that:

   * the upgrade target load is imported
   * all hosts are provisioned
   * all hosts are running the current load
   * all hosts are unlocked/enabled
   * all hosts have matching configs
   * there are no management-affecting alarms
   * for ceph systems: the storage cluster is healthy
   * the kubernetes nodes are ready
   * the kubernetes control plane pods are ready
   * the kubernetes control plane is at the version and
     configuration required for upgrade. If not, the kubernetes upgrade [1]_
     procedure must be performed first to bring it to the required baseline.
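   As an illustrative example (the exact set of checks and their wording
   may vary by release), a system that is ready for upgrade might report:

   ::

      # system health-query-upgrade
      System Health:
      All hosts are provisioned: [OK]
      All hosts are unlocked/enabled: [OK]
      All hosts have current configurations: [OK]
      No alarms: [OK]
      Ceph Storage Healthy: [OK]
      All kubernetes nodes are ready: [OK]
      All kubernetes control plane pods are ready: [OK]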
#. **Start the upgrade**

   ::

      # system upgrade-start

   This performs semantic checks and the health checks as per the
   'health-query-upgrade' step.

   This will make a copy of the system data (e.g. postgres databases and
   armada, helm, kubernetes, and puppet hiera data) to be used in the
   upgrade.

   Note that the /opt/etcd cluster data may be dynamically updating cluster
   state info until the host-swact, when service management brings down the
   active-standby etcd on the N side and up on the N+1 side.

   Configuration changes are not allowed after this point, until the upgrade
   is completed.

#. **Lock and upgrade controller-1**

   ::

      # system host-upgrade controller-1

   * upgrade state is set to 'data-migration'
   * the upgrade_controller_1 flag is updated so that controller-1 can
     determine whether it is in an upgrade
   * host controller-1 is reinstalled with the N+1 load
   * data and configuration are migrated from release N to release N+1
   * a special release N+1 puppet upgrade manifest is applied, based on
     the hiera data that was migrated from release N. This allows for
     one-time actions similar to what was done on the initial install of
     controller-0 (e.g. configuring rabbit, postgres, keystone).
   * hiera data is generated for release N+1, to be used to apply
     the regular puppet manifest when controller-1 is unlocked
   * replicated (DRBD) filesystems are synced
   * upgrade state is set to 'data-migration-complete'
   * system data is present in both release N and release N+1 versioned
     directories (e.g. /opt/platform/config/<version>,
     /var/lib/postgresql/<version>)

#. **Unlock controller-1**

   This includes generating configuration data for controller-1 which must
   be generated from the active controller.

   * the join_cmd for the kubernetes control plane is generated on the N
     side and stored in the N+1 hiera data

   The N+1 hiera data drives the puppet manifest apply.

#. **Swact to controller-1**

   ::

      # system host-swact controller-0

   * controller-1 becomes active and runs release N+1 while the rest of the
     system is running release N
   * any release N+1 components that do inter-node communications must be
     backwards compatible to ensure that communication with release N
     works correctly
   * /opt/etcd/version_N is backed up and restored to /opt/etcd/version_N+1
     for the target version on host-swact. This must be performed at a time
     when data loss can be avoided. As part of the host-swact startup on
     controller-1 during an upgrade, etcd is copied from the release N etcd
     directory to the release N+1 etcd directory.

#. **Lock and upgrade controller-0**

   ::

      # system host-upgrade controller-0

   * the N+1 load is installed and, on host-unlock, the upgrades manifest
     and the puppet host configuration are applied
   * after controller-0 is upgraded, upgrade state is set to
     'upgrading-hosts'

#. **If applicable, lock and upgrade storage hosts**

   ::

      # system host-upgrade storage-0

   * if provisioned, all storage hosts must be upgraded prior to
     proceeding with workers
   * the N+1 load is installed; up to half of the storage hosts can be done
     in parallel
   * ceph data is synced

#. **Lock and upgrade worker hosts**

   ::

      # system host-upgrade worker-x

   * workloads are migrated from the worker node (triggered by host-lock)
   * the N+1 load is installed
   * worker hosts can be done in parallel, depending upon excess capacity.
     Each worker host will first be locked using the existing "system
     host-lock" CLI (worker hosts can be done in any order). This results in
     services being migrated off the host and applies the NoExecute taint,
     which will evict any pods that can be evicted.
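   As an illustrative check (assuming kubectl access from the active
   controller; the taint key shown is the one StarlingX applies to locked
   hosts), the eviction taint can be confirmed before reinstalling the
   host:

   ::

      # kubectl describe node worker-0 | grep Taints
      Taints:             services=disabled:NoExecute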
#. **Activate the upgrade**

   ::

      # system upgrade-activate

   * Perform any additional configuration which may be required after all
     hosts have been upgraded.

#. **Swact back to controller-0**

   ::

      # system host-swact controller-1

#. **Complete the upgrade**

   ::

      # system upgrade-complete

   * Run post-checks to ensure the upgrade has been completed
   * Remove release N data

**Failure Handling**

* When a failure happens and cannot be resolved without manual intervention,
  the upgrade state will be set to data-migration-failed or activation-failed.
* To recover, the user will need to resolve the issue that caused the upgrade
  step to fail.
* An upgrade-abort is only possible before controller-0 has been upgraded.
  In other cases, the user would need to resolve the issue and reattempt
  the step.

**Health Checks**

* In order to ensure the health and stability of the system, health checks
  are done both before allowing a platform upgrade to start and then as
  each upgrade CLI is run.
* The health checks will include:

  * basic system health (i.e. system health-query)
  * new kubernetes specific checks - for example:

    * verify that all kubernetes control plane pods are running
    * verify that all kubernetes applications are fully applied
    * verify that the kubernetes control plane version and configuration
      are at the baseline required for platform upgrade

**Interactions with container applications**

* The kubernetes join_cmd must be created from the N side running the
  active kubernetes control plane.
* The platform upgrade and the kubernetes upgrade are mutually exclusive:
  a kubernetes upgrade is not allowed when a platform upgrade is in
  progress, and vice-versa.
* Before starting a platform upgrade, we also need to check that the
  kubernetes configuration is at a baseline suitable for upgrade.
  The N+1 load metadata enforces the configuration baseline required on
  the from (N) side.
* If the N+1 release is at a newer kubernetes version, then the
  kubernetes upgrade procedure must be completed first in order to align
  the kubernetes version.
* After a platform upgrade has started, helm-override operations will be
  prevented, as these configuration changes will not be preserved after
  upgrade-start and can also trigger applications to be reapplied.

Alternatives
------------

Update the kubernetes configuration to the N+1 configuration after the
upgrade. However, this would necessitate coordinating the activation of
features such as the control plane address and encryption at rest during
the upgrade, such as during the upgrade-activate step. This would require
N+1 to be backwards compatible with N.

A mechanism is required to upgrade etcd [3]_; keeping the etcd database
versioned will allow an upgrade to a newer etcd version.

**etcd Upgrades**

* host-swact to controller-1 during upgrade. As part of the host-swact
  during an upgrade, the kubernetes etcd is copied from the N side. This
  takes a copy at a time when the etcd data is not allowed to change, as
  etcd would be brought down on controller-0, and prior to service
  management bringing up etcd on controller-1. After the new version of
  etcd runs with the migrated data, it is no longer possible to run the
  old version of etcd against it. Therefore, the release N version of the
  data must be maintained in the event of a host-swact back or
  upgrade-abort prior to the upgrade of controller-0. This is the chosen
  alternative; a sketch follows this list.
* Alternative: migrate etcd at upgrade-start instead of updating etcd on
  host-swact. Configuration changes which affect the cluster state
  information could still occur in this scenario. Kubernetes state changes
  that occur after the snapshot would be lost and have the potential to
  put the kubernetes cluster into a bad state.
* Alternative: /opt/etcd is unversioned so that the N and N+1 sides both
  reference the same directory. This is based on the premise that the
  kubernetes control plane is upgraded independently and does not require
  a versioned directory. However, as noted in the host-swact alternative,
  this would not be compatible with upgrade-abort or host-swact back to
  the N release.
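A minimal sketch of the chosen approach, assuming illustrative versioned
paths (/opt/etcd/19.12 and /opt/etcd/20.06, matching the example versions
above); the actual copy is driven by the service management scripts:

::

   # Illustrative only: with etcd stopped on controller-0 and not yet
   # started on controller-1, copy the release N data into the release
   # N+1 versioned directory. The release N copy is left in place so a
   # swact back or upgrade-abort can still run the old etcd version.
   cp -a /opt/etcd/19.12/. /opt/etcd/20.06/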
Data Model Impact
-----------------

The following tables in the sysinv database are required. The data model
required to support platform upgrades is already present in the stx3.0
data model, and includes the following platform upgrade focused tables.

* loads: represents the load version (e.g. N and N+1), the load state and
  compatible versions

* software_upgrade: represents the software upgrade state, from_load and
  to_load

* host_upgrade: represents the software_load and target_load for each host

REST API Impact
---------------

The v1 load, health and upgrade resources implement the platform upgrade
specific URLs used for the upgrade. The config repo
api-ref-sysinv-v1-config.rst doc is updated accordingly.

The sysinv REST API supports the following upgrade-related methods:

* The existing resource /loads

  * URLS:

    * /v1/loads

  * Request Methods:

    * GET /v1/loads

      * Returns all platform loads known to the system

    * POST /v1/loads/import_load

      * Imports the new load passed into the body of the POST request

* The existing resource /upgrade

  * URLS:

    * /v1/upgrade

* The existing resource /ihosts supports upgrade actions.

  * URLS:

    * /v1/ihosts/{host_id}

  * Request Methods:

    * POST /v1/ihosts/{host_id}/upgrade

      * Upgrades the platform load on the specified host
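For illustration, these resources can be exercised directly (the token,
address and host UUID are placeholders; request bodies and error handling
are omitted, and the system CLI wraps these calls in practice):

::

   # List the loads known to the system (GET /v1/loads)
   curl -H "X-Auth-Token: ${TOKEN}" http://<oam-ip>:6385/v1/loads

   # Upgrade the platform load on a host (POST /v1/ihosts/{host_id}/upgrade)
   curl -X POST -H "X-Auth-Token: ${TOKEN}" \
        http://<oam-ip>:6385/v1/ihosts/${HOST_UUID}/upgrade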
Security Impact
---------------

This story provides a mechanism to upgrade the platform from one version
to another. It does not introduce any additional security impacts beyond
those already present for the initial deployment.

Other End User Impact
---------------------

End users will typically perform upgrades using the sysinv (i.e.
system) CLI. The CLI commands used for the upgrade are as noted in
the `Proposed Process`_ section above.

Performance Impact
------------------

When a platform upgrade is in progress, each host must be taken out of
service in order to install the new load.
The user must ensure that there is enough capacity in the system to handle
the removal from service of one (or more) hosts as the load on each
host is upgraded.

Other Deployer Impact
---------------------

Deployers will now be able to upgrade the StarlingX platform on a running
system.

Developer Impact
----------------

Developers working on the StarlingX components that manage container
applications may need to be aware that certain operations should be
prevented when a platform upgrade is in progress. This is discussed in
the `Proposed Process`_ section above.

Upgrade Impact
--------------

StarlingX platform upgrades are independent from the Kubernetes upgrade [1]_.
However, when StarlingX platform upgrades are supported, checks must be put
in place to ensure that the kubernetes version is not allowed to change due
to a platform upgrade. In effect, the system must already be running the
same version of kubernetes as is packaged in the new platform release. This
will be enforced through semantic checking in the platform upgrade APIs.

The platform upgrade excludes the upgrade of applications. Applications will
need to be compatible with the new version of the platform/kubernetes.
Any upgrade of hosted applications is independent of the platform upgrade.

Simplex Platform Upgrades
=========================

At a high level the simplex upgrade process involves the following steps:

* taking a backup of the platform data
* installing the new StarlingX software
* restoring and migrating the platform data

Simplex Upgrade Process
-----------------------

#. **Import release N+1 load**

   ::

      # system load-import

      # system load-list
      +----+----------+------------------+
      | id | state    | software_version |
      +----+----------+------------------+
      | 1  | active   | 19.12            |
      | 2  | imported | 20.06            |
      +----+----------+------------------+

#. **Start the upgrade**

   ::

      # system upgrade-start

   This performs semantic checks and the health checks as per the
   'health-query-upgrade' command.

   This will make a copy of the system platform data similar to a platform
   backup. The upgrade data will be placed under /opt/backups.

   Any changes made after this point will be lost.

#. **Copy the upgrade data**

   During the upgrade process the rootfs will be wiped, and the upgrade
   data deleted. The upgrade data must be copied from the system to an
   alternate safe location (such as a USB drive or remote server).

#. **Lock and upgrade controller-0**

   ::

      # system host-upgrade controller-0

   This will wipe the rootfs and reboot the host.

#. **Install the new release of StarlingX**

   Install the new release of StarlingX software via network or USB.

#. **Restore the upgrade data**

   ::

      # ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade.yml \
           -e "ansible_become_pass=<pw> admin_password=<pw> upgrade_data_file=<path>"

   The upgrade playbook will migrate the upgrade data to the current release
   and restore it to the system.

   This playbook requires the following parameters (shown above with
   placeholder values):

   * ansible_become_pass
   * admin_password
   * upgrade_data_file

#. **Unlock controller-0**

   ::

      # system host-unlock controller-0

#. **Activate the upgrade**

   ::

      # system upgrade-activate

   Perform any additional configuration which may be required after the
   host is unlocked.

#. **Complete the upgrade**

   ::

      # system upgrade-complete

   Remove data from the previous release.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

* John Kung (john.kung@windriver.com)

Other contributors:

* David Sullivan (david.sullivan@windriver.com)

Repos Impacted
--------------

* config
* update
* integ
* metal
* stx-puppet
* ansible-playbooks

Work Items
----------

Please refer to the Story [2]_ for the complete list of tasks.

The following are prerequisites prior to the upgrade (the metadata
enforcement is illustrated in the sketch after this list):

* update the kubernetes configuration to the features configured on N+1.
  This will be enabled by a software-delivered increment that will enable
  the required configuration baseline:

  * update the kubernetes control_plane_address
  * kubernetes encryption at rest

  Updating the kubernetes version is covered by [1]_ and
  is performed independently.

* This is enforced by the upgrade load N+1 metadata, which specifies the
  from-loads that the upgrade is supported from.

* The etcd directory is unversioned so that it can be referenced by the N
  and N+1 kubernetes control plane.
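As an illustrative sketch only (element names are assumptions; the actual
schema is defined by metal/common-bsp/files/upgrades/metadata.xml), the
load metadata might express the supported from-loads as:

::

   <build>
     <version>20.06</version>
     <supported_upgrades>
       <upgrade>
         <version>19.12</version>
       </upgrade>
     </supported_upgrades>
   </build>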
The following steps in the upgrade require changes:

* load-import
  The metadata handling is extended to support multiple from-versions.

* health-query-upgrade
  Health checks are added to ensure the kubernetes version and
  configuration are at the correct baseline for upgrade.

upgrade-start:

* upgrade-start-pkg-extract
  Update to reference dnf rather than the superseded repoquery tool
* migrate puppet hiera data
* export armada, helm and kubernetes configuration to N+1
* export the databases for N+1

host-upgrade:

* create /etc/platform/.upgrade_controller_1 so that controller-1 via RPC
  can determine that a controller upgrade is required

host-unlock:

* create the join command from the N side for the N+1 side
* run the upgrades playbook for docker. This will push the required docker
  images.

host-swact:

* update to backup /opt/etcd/from_version and restore
  to /opt/etcd/to_version for the target version on host-swact.
  This is performed at a time when data loss can be avoided. During an
  upgrade, after host-swact and before etcd has started on controller-1,
  the etcd data is copied from controller-0. Normally, an etcdctl snapshot
  is required when data is still dynamically changing; however, as service
  management manages etcd in active-standby, and the copy is occurring as
  part of etcd startup, it is possible to use a direct copy.

Ansible:

* upgrade playbook for docker. push_k8s_images.yml is updated to handle
  the platform upgrade case.

Integ:

* Update registry-token-server to continue to support GET for token.
  This is performed as part of Story 2006145, Task 38763
  https://review.opendev.org/#/c/707283/

* Add semantic checks to existing APIs

  * application-apply/remove/etc... - prevent when a platform upgrade is
    in progress
  * helm-override-update/etc... - prevent when a platform upgrade is in
    progress

Miscellaneous:

* Update metadata for upgrade versions
* Remove openstack service and database references in the upgrade code
* Update supported from-version checks

Dependencies
============

None

Testing
=======

Upgrades must be tested in the following StarlingX configurations:

* AIO-DX
* Standard with controller storage
* Standard with dedicated storage
* AIO-SX

The testing can be performed on hardware or in virtual environments.

Documentation Impact
====================

New end user documentation will be required to describe how platform
upgrades should be done.

The config API reference will also need updates.

References
==========

.. [1] Kubernetes Upgrade Story https://storyboard.openstack.org/#!/story/2006781
.. [2] Platform Upgrades Story https://storyboard.openstack.org/#!/story/2007403
.. [3] etcd upgrades https://etcd.io/docs/v3.4.0/upgrades


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - stx-4.0
     - Introduced