Merge "StarlingX Platform Upgrades"

This commit is contained in:
Zuul 2020-04-06 22:19:23 +00:00 committed by Gerrit Code Review
commit 8fecfa3bf0
1 changed files with 735 additions and 0 deletions

View File

@ -0,0 +1,735 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License. http://creativecommons.org/licenses/by/3.0/legalcode
===========================
StarlingX Platform Upgrades
===========================
Storyboard:
https://storyboard.openstack.org/#!/story/2007403
This story will provide a mechanism to upgrade the platform components on
a running StarlingX system. This is required to allow upgrades between
StarlingX versions.
The platform upgrade components include the Host OS and StarlingX
components (e.g. flock services).
A maintenance release for stx3 is required to upgrade to stx4.0.
Problem Description
===================
StarlingX must provide a mechanism to allow migration to a new StarlingX
release.
In order to provide a robust and simple upgrade experience for
users of StarlingX, the upgrade process must be automated as much as
possible and controls must be in place to ensure the steps are followed
in the right order.
Release-over-release compatibility of the platform components is affected
by inter-node messaging between components, configuration migration
requirements, and kubernetes control plane compatibility.
The downtime over an upgrade must be minimized.
* controller upgrade - impact minimized to the time required for a host-swact
* worker upgrade - impact to applications minimized to the time it takes to
migrate applications off a worker node before it is upgraded
* storage - no loss of storage over an upgrade
Upgrades must be done in-service. The platform and applications must
continue to provide service during the upgrade. This does not apply to
simplex deployments.
Upgrades must be done without any additional hardware.
Background
----------
Three types of StarlingX upgrades will be supported:
* Platform Upgrade, which includes Host OS and StarlingX components
(e.g. flock services)
* Kubernetes Upgrade
* Application upgrade, which includes: StarlingX applications (e.g.
platform-integ-apps, stx-openstack), User applications and kubernetes workloads
These three types of upgrades are done independently. For example, the Platform is
upgraded to a new release of StarlingX without changing the kubernetes version.
However, there are dependencies which determine the order in which these
upgrades can be done. For example, kubernetes must be upgraded to a particular
version before a platform upgrade can be done.
Use Cases
---------
* Administrator wants to upgrade to a new StarlingX platform version
with minimal impact to running applications.
* Administrator wants to abort an upgrade in progress prior to upgrading
all controllers
Note: Downgrade to previous release version is not supported
Proposed Process
================
StarlingX will only support upgrades from release N to release N+1.
For example, maintenance release stx3.x can be upgraded to stx4,
but not directly to stx5.0.
Changes required for kubernetes configuration compatibility are
delivered in a maintenance release to enable upgrade from stx3 to stx4.
The administrator must ensure that their deployment has enough
extra capacity (e.g. worker hosts) to allow one (or more) hosts to be
temporarily taken out of service for the upgrade.
For each supported platform version, the supported from-versions are
tracked in metadata (metal/common-bsp/files/upgrades/metadata.xml).
The metadata handling is extended to support multiple from-versions.
A maintenance release will enable stx3 to stx4 upgrades, and includes
the configuration updates required to enable compatibility with the
kubernetes control plane during the upgrade.
The following is a summary of the steps the user will take when performing
a platform upgrade. For each step, a summary of the actions the system
will perform is provided.
The software_upgrade table tracks the upgrade state, and includes:
* upgrade-started
* data-migration
* data-migration-complete
* upgrading-controllers
* upgrading-hosts
* activation-requested
* activation-complete
* completing
* completed
When an upgrade is aborted the following state transitions occur:
* aborting
* abort-completing
* aborting-reinstall
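For illustration, the normal progression through these states can be viewed
as a simple linear transition map; a minimal Python sketch, using the state
names listed above (the transition helper itself is hypothetical, not the
sysinv implementation):
::
# Hypothetical sketch of the upgrade state progression listed above.
UPGRADE_STATES = [
    "upgrade-started",
    "data-migration",
    "data-migration-complete",
    "upgrading-controllers",
    "upgrading-hosts",
    "activation-requested",
    "activation-complete",
    "completing",
    "completed",
]
ABORT_STATES = ["aborting", "abort-completing", "aborting-reinstall"]
# Map each state to its successor in the normal (non-abort) path.
NEXT_STATE = dict(zip(UPGRADE_STATES, UPGRADE_STATES[1:]))

def advance(state):
    """Return the next state in the normal upgrade progression."""
    if state not in NEXT_STATE:
        raise ValueError("no further transition from %s" % state)
    return NEXT_STATE[state]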
#. **Import release N+1 load**
::
# system load-import <bootimage.iso> <bootimage.sig>
# system load-list
+----+----------+------------------+
| id | state | software_version |
+----+----------+------------------+
| 1 | active | 19.12 |
| 2 | imported | 20.06 |
+----+----------+------------------+
The fields are:
* software_version: comes from metadata in the load image.
* states:
* active: the current version, version N
* importing: image is being uploaded to load repository
* error: error load state
* deleting: load is being deleted from repository
* imported: version that can be upgraded to, i.e. version N+1
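Before upgrade-start, exactly one load is expected to be in the 'active'
state (release N) and one in the 'imported' state (release N+1). A small
illustrative check in Python (not the sysinv implementation; the input is
assumed to be a list of dicts with the fields shown above):
::
# Illustrative only: validate the load list prior to upgrade-start.
def validate_loads(loads):
    """loads: list of dicts with 'state' and 'software_version' keys."""
    states = [load["state"] for load in loads]
    if states.count("active") != 1 or states.count("imported") != 1:
        raise RuntimeError("expected exactly one active and one imported load")
    return True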
#. **Perform Health checks for upgrade**
::
# system health-query-upgrade
This will perform health checks to ensure the system is at
a state ready for upgrade.
These health checks are also performed as part of upgrade-start.
These include checks for:
* upgrade target load is imported
* all hosts provisioned
* all hosts load current
* all hosts unlocked/enabled
* all hosts have matching configs
* no management affecting alarms
* for ceph systems: storage cluster is healthy
* verifies kubernetes nodes are ready
* verifies kubernetes control plane pods are ready
* verifies that the kubernetes control plane is at a version and
configuration required for upgrade. If not, the kubernetes upgrade [1]_
method must be performed in order to bring it to baseline.
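As a rough illustration, these checks could be aggregated as shown below;
a minimal Python sketch in which every check callable is a hypothetical
helper, not the actual sysinv health-check code:
::
# Illustrative only: aggregate upgrade health checks and report failures.
def health_query_upgrade(checks):
    """checks: dict mapping a check name to a zero-argument callable
    that returns True when the corresponding condition holds."""
    failures = [name for name, check in checks.items() if not check()]
    return len(failures) == 0, failures
# Example wiring (all helpers below are assumptions for illustration):
# ok, failed = health_query_upgrade({
#     "target load imported": check_target_load_imported,
#     "all hosts unlocked/enabled": check_hosts_unlocked_enabled,
#     "no mgmt-affecting alarms": check_no_mgmt_affecting_alarms,
#     "k8s nodes ready": check_k8s_nodes_ready,
#     "k8s control plane ready": check_k8s_control_plane_ready,
# })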
#. **Start the upgrade**
::
# system upgrade-start
This performs semantic checks and the health checks as per the
'health-query-upgrade' step.
This will make a copy of the system data (e.g. postgres databases,
armada, helm, kubernetes, and puppet hiera data) to be used for the data
migration in the upgrade.
Note that the /opt/etcd cluster data may still be updating dynamically
until the host-swact, when service management brings down the
active-standby etcd on the N side and brings it up on the N+1 side.
Configuration changes are not allowed after this point, until the upgrade
is completed.
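To make the data-copy step concrete, the sketch below shows one way such a
snapshot could be taken; the commands and paths are assumptions for
illustration only, not the sysinv implementation:
::
# Hypothetical sketch: snapshot system data at upgrade-start.
# All paths below are illustrative assumptions.
import shutil
import subprocess

def snapshot_system_data(from_release, to_release):
    # Dump the postgres databases with the standard pg_dumpall tool.
    dump_file = "/opt/platform-backup/upgrade_%s.sql" % to_release
    with open(dump_file, "w") as out:
        subprocess.check_call(["sudo", "-u", "postgres", "pg_dumpall"],
                              stdout=out)
    # Copy puppet hiera data into the to-release versioned directory
    # (directory layout assumed for illustration).
    shutil.copytree("/opt/platform/puppet/%s/hieradata" % from_release,
                    "/opt/platform/puppet/%s/hieradata" % to_release)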
#. **Lock and upgrade controller-1**
::
# system host-upgrade controller-1
* upgrade state is set to 'data-migration'
* update the upgrade_controller_1 flag so that controller-1 can determine
whether it is in an upgrade
* host controller-1 is reinstalled with N+1 load
* Migrate data and configuration from release N to release N+1
* A special release N+1 puppet upgrade manifest is applied, based on
the hiera data that was migrated from release N. This allows for one-time
actions similar to what was done on the initial install of controller-0
(e.g. configuring rabbit, postgres, keystone).
* Generate hiera data for release N+1, to be used to apply
the regular puppet manifest when controller-1 is unlocked
* sync replicated (DRBD) filesystems
* upgrade state is set to 'data-migration-complete'
* system data is present in both release N and release N+1 versioned
directories (e.g. /opt/platform/config/<release>,
/var/lib/postgresql/<release>)
#. **Unlock controller-1**
This includes generating configuration data for controller-1 which must be
generated from the active controller.
* the join_cmd for the kubernetes control plane is generated on the N side
and stored in the N+1 hiera data
The N+1 hiera data drives the puppet manifest apply.
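For example, the join command could be produced on the active (release N)
control plane with kubeadm and recorded in the release N+1 hiera data; a
sketch, where the hiera file path and key name are assumptions:
::
# Sketch: generate the kubernetes join command on the N side and store it
# in the N+1 hiera data. The hiera key below is an assumed example.
import subprocess
import yaml

def record_join_cmd(hiera_file):
    join_cmd = subprocess.check_output(
        ["kubeadm", "token", "create", "--print-join-command"]
    ).decode().strip()
    try:
        with open(hiera_file) as f:
            hiera = yaml.safe_load(f) or {}
    except FileNotFoundError:
        hiera = {}
    hiera["platform::kubernetes::params::join_cmd"] = join_cmd  # assumed key
    with open(hiera_file, "w") as f:
        yaml.safe_dump(hiera, f)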
#. **Swact to controller-1**
::
# system host-swact controller-0
* controller-1 becomes active and runs release N+1 while rest of the
system is running release N
* Any release N+1 components that do inter-node communications must be
backwards compatible to ensure that communication with release N
works correctly
* update to back up /opt/etcd/version_N and restore
to /opt/etcd/version_N+1 for the target version on host-swact
This must be performed at a time when data loss can be avoided.
As part of the host-swact startup on controller-1, and during an upgrade,
etcd is copied from the release N etcd directory to the release N+1
etcd directory.
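A minimal sketch of that copy, assuming versioned /opt/etcd/<release>
directories as described above (the exact layout is an assumption):
::
# Sketch: copy etcd data from the release N directory to the release N+1
# directory while etcd is stopped during the swact. Paths are illustrative.
import os
import shutil

def migrate_etcd_data(from_release, to_release, base="/opt/etcd"):
    src = os.path.join(base, from_release)
    dst = os.path.join(base, to_release)
    if not os.path.isdir(dst):
        # A direct copy is safe here because service management has
        # brought etcd down before the copy is taken.
        shutil.copytree(src, dst)
    return dst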
#. **Lock and upgrade controller-0**
::
# system host-upgrade controller-0
* install the N+1 load and, on host-unlock, apply the upgrades manifest
and the puppet host configuration
* after controller-0 is upgraded, upgrade state is set to 'upgrading-hosts'
#. **If applicable, Lock and upgrade storage hosts**
::
# system host-upgrade storage-0
* If provisioned, all storage hosts must be upgraded prior to
proceeding with workers
* Install N+1 load; up to half of the storage hosts can be done in parallel
* Ceph data sync
#. **Lock and upgrade worker hosts**
::
# system host-upgrade worker-x
* Migrate workloads from worker node (triggered by host-lock)
* Install N+1 load
* Can be done in parallel, depending upon excess capacity.
Each worker host will first be locked using the existing "system
host-lock" CLI (worker hosts can be done in any order). This results in
services being migrated off the host and applies the NoExecute taint,
which will evict any pods that can be evicted.
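Conceptually this is similar to tainting the node so that evictable pods are
removed before the host is reinstalled; a sketch, where the taint key/value
is an assumed example rather than the exact taint applied by host-lock:
::
# Illustrative only: apply a NoExecute taint to evict pods from a worker
# node before it is upgraded.
import subprocess

def evict_pods_from_node(node_name):
    subprocess.check_call([
        "kubectl", "taint", "nodes", node_name,
        "services=disabled:NoExecute", "--overwrite"])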
#. **Activate the upgrade**
::
# system upgrade-activate
* Perform any additional configuration which may be required after all
hosts have been upgraded.
#. **host-swact to controller-0**
::
# system host-swact controller-1
#. **Complete the upgrade**
::
# system upgrade-complete
* Run post-checks to ensure upgrade has been completed
* Remove release N data
**Failure Handling**
* When a failure happens and cannot be resolved without manual intervention,
the upgrade state will be set to data-migration-failed or activation-failed.
* To recover, the user will need to resolve the issue that caused the upgrade
step to fail.
* An upgrade-abort is only possible before controller-0 has been upgraded.
In other cases, the user would need to resolve the issue and reattempt
the step.
**Health Checks**
* In order to ensure the health and stability of the system we will do
health checks both before allowing a platform upgrade to start and then as
each upgrade CLI is run.
* The health checks will include:
* basic system health (i.e. system health-query)
* new kubernetes specific checks - for example:
* verify that all kubernetes control plane pods are running
* verify that all kubernetes applications are fully applied
* verify that kubernetes control plane version and configuration is at
baseline required for platform upgrade.
**Interactions with container applications**
* The kubernetes join_cmd must be created from the N side running the
active kubernetes control plane.
* The platform upgrade must be performed separately from the kubernetes
upgrade. A kubernetes upgrade is not allowed when a platform upgrade
is in progress, and vice-versa.
* Before starting a platform upgrade, we also need to check that
kubernetes configuration is at a baseline suitable for upgrade.
The N+1 load metadata enforces the configuration baseline required on
the from (N) side.
* If the N+1 version is at a newer kubernetes version, then the
kubernetes upgrade procedure must be completed first in order to align
the kubernetes version.
* After a platform upgrade has started, helm-override operations will be
prevented as these configuration changes will not be preserved after
upgrade-start and can also trigger applications to be reapplied.
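A sketch of such a semantic check, assuming a helper that returns the current
upgrade record (None when no upgrade is in progress); both the helper and the
record fields are assumptions for illustration:
::
# Illustrative semantic check: reject helm-override and application
# operations while a platform upgrade is in progress.
def check_no_upgrade_in_progress(get_current_upgrade):
    upgrade = get_current_upgrade()  # assumed helper, returns None or a dict
    if upgrade is not None:
        raise RuntimeError(
            "Operation rejected: platform upgrade in state '%s' is in "
            "progress" % upgrade["state"])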
Alternatives
------------
Update the kubernetes configuration to the N+1 configuration after the
upgrade. However, this would require coordinating the activation of features
such as the control plane address and encryption at rest during the upgrade
(for example in the upgrade-activate step), and would require N+1 to be
backwards compatible with N.
A mechanism is also required to upgrade etcd [3]_; keeping the etcd database
versioned allows an upgrade to a newer etcd version.
etcd Upgrades
* host-swact to controller-1 during upgrade. As part of the host-swact during
an upgrade, the kubernetes etcd is copied from N side. This would take a
copy at a time when the etcd data is not allowed to change, as etcd
would be brought down on controller-0, and prior to service management
bringing up etcd on controller-1. After the new version of etcd runs with
the migrated etcd, it is no longer possible to run the old version of etcd
against it. Therefore, the release N version of the data must be maintained
in the event of a host-swact back or upgrade-abort prior to
upgrade of controller-0. This is the chosen alternative.
* Alternative: migrate etcd at upgrade-start instead of at host-swact.
Configuration changes which
affect the cluster state information could still occur in this scenario.
Kubernetes state changes that occur after the snapshot would be lost
and have the potential to put the kubernetes cluster into a bad state.
* Alternative: /opt/etcd is unversioned so that the N and N+1 sides both reference
the same directory. This is based on the premise that kubernetes control
plane is upgraded independently and does not require a versioned directory.
However, as noted in the host-swact alternative, this would not be compatible with
upgrade-abort or host-swact back to the N release.
Data Model Impact
-----------------
The following tables in the sysinv database are required. The data model
needed to support platform upgrades is present in the stx3.0 data model
and includes the following platform-upgrade-focused tables.
* loads
represents the load version (e.g. N and N+1), load state, compatible
versions
* software_upgrade
represents the software upgrade state, from_load and to_load
* host_upgrade
represents the software_load and target_load for each host
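A rough sketch of these tables, assuming SQLAlchemy-style models; column
details beyond those listed above are assumptions:
::
# Illustrative model sketch only; column names beyond those described in
# this section are assumptions.
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Load(Base):
    __tablename__ = "loads"
    id = Column(Integer, primary_key=True)
    software_version = Column(String)
    state = Column(String)              # active / importing / imported / ...
    compatible_version = Column(String)

class SoftwareUpgrade(Base):
    __tablename__ = "software_upgrade"
    id = Column(Integer, primary_key=True)
    state = Column(String)              # upgrade-started, data-migration, ...
    from_load = Column(Integer, ForeignKey("loads.id"))
    to_load = Column(Integer, ForeignKey("loads.id"))

class HostUpgrade(Base):
    __tablename__ = "host_upgrade"
    id = Column(Integer, primary_key=True)
    host_id = Column(Integer)           # reference to the host (name assumed)
    software_load = Column(Integer, ForeignKey("loads.id"))
    target_load = Column(Integer, ForeignKey("loads.id"))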
REST API Impact
---------------
The v1 load, health, and upgrade resources implement the platform upgrade
specific URLs used for the upgrade. The config repo api-ref-sysinv-v1-config.rst
doc is updated accordingly.
The sysinv REST API supports the following upgrade-related methods:
* The existing resource /loads
* URLS:
* /v1/loads
* Request Methods:
* GET /v1/loads
* Returns all platform loads known to the system
* POST /v1/loads/import_load
* Imports the new load passed into the body of the POST request
* The existing resource /upgrade
* URLS:
* /v1/upgrade
* The existing resource /ihosts support for upgrade actions.
* URLS:
* /v1/ihosts/<hostid>
* Request Methods:
* POST /v1/ihosts/<hostid>/upgrade
* Upgrades the platform load on the specified host
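For illustration, these endpoints could be exercised directly; a sketch using
Python requests, where the sysinv URL, port, and token handling are
assumptions and only the resource paths come from the list above:
::
# Illustrative only: drive the upgrade-related sysinv endpoints directly.
import requests

SYSINV_URL = "http://controller:6385"   # assumed endpoint and port

def list_loads(token):
    resp = requests.get(SYSINV_URL + "/v1/loads",
                        headers={"X-Auth-Token": token})
    resp.raise_for_status()
    return resp.json()

def upgrade_host(token, host_uuid):
    resp = requests.post(SYSINV_URL + "/v1/ihosts/%s/upgrade" % host_uuid,
                         headers={"X-Auth-Token": token})
    resp.raise_for_status()
    return resp.json()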
Security Impact
---------------
This story is providing a mechanism to upgrade platform from one version
to another. It does not introduce any additional security impacts above
what is already there regarding the initial deployment.
Other End User Impact
---------------------
End users will typically perform upgrades using the sysinv (i.e.
system) CLI. The CLI commands used for the upgrade are as noted in
the `Proposed Process`_ section above.
Performance Impact
------------------
When a platform upgrade is in progress, each host must be taken out of
service in order to install the new load.
The user must ensure that there is enough capacity in the system to handle
the removal from service of one (or more) hosts as the load on each
host is upgraded.
Other Deployer Impact
---------------------
Deployers will now be able to upgrade StarlingX platform on a running system.
Developer Impact
----------------
Developers working on the StarlingX components that manage container
applications may need to be aware that certain operations should be
prevented when a platform upgrade is in progress. This is discussed in
the `Proposed Process`_ section above.
Upgrade Impact
--------------
StarlingX platform upgrades are independent from the Kubernetes upgrade [1]_.
However, when StarlingX platform upgrades are supported, checks must be put
in place to ensure that the kubernetes version is not allowed to change due
to a platform upgrade. In effect, the system must be upgraded to the same
version of kubernetes as is packaged in the new platform release, to ensure
this is the case. This will be enforced through semantic checking in the
platform upgrade APIs.
The platform upgrade excludes the upgrade of applications. Applications will
need to be compatible with the new version of the platform/kubernetes.
Any upgrade of hosted applications is independent of the platform upgrade.
Simplex Platform Upgrades
=========================
At a high level the simplex upgrade process involves the following steps.
* Taking a backup of the platform data.
* Installing the new StarlingX software.
* Restoring and migrating the platform data.
Simplex Upgrade Process
-----------------------
#. **Import release N+1 load**
::
# system load-import <bootimage.iso> <bootimage.sig>
# system load-list
+----+----------+------------------+
| id | state | software_version |
+----+----------+------------------+
| 1 | active | 19.12 |
| 2 | imported | 20.06 |
+----+----------+------------------+
#. **Start the upgrade**
::
# system upgrade-start
This performs semantic checks and the health checks as per the
'health-query-upgrade' command.
This will make a copy of the system platform data similar to a platform
backup. The upgrade data will be placed under /opt/backups.
Any changes made after this point will be lost.
#. **Copy the upgrade data**
During the upgrade process the rootfs will be wiped, and the upgrade data
deleted. The upgrade data must be copied from the system to an alternate safe
location (such as a USB drive or remote server).
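As a simple illustration, the upgrade data could be copied to a remote server
before the host is wiped; the bundle file name pattern and destination are
assumptions:
::
# Illustrative sketch: copy the upgrade data bundle off-box before the
# rootfs is wiped. The glob pattern and destination are assumptions.
import glob
import subprocess

def save_upgrade_data(remote_dest):
    for bundle in glob.glob("/opt/backups/upgrade*"):
        subprocess.check_call(["scp", bundle, remote_dest])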
#. **Lock and upgrade controller-0**
::
# system host-upgrade controller-0
This will wipe the rootfs and reboot the host.
#. **Install the new release of StarlingX**
Install the new release of StarlingX software via network or USB.
#. **Restore the upgrade data**
::
# ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade.yml
The upgrade playbook will migrate the upgrade data to the current release
and restore it to the system.
This playbook requires the following parameters:
* ansible_become_pass
* admin_password
* upgrade_data_file
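For example, these parameters could be supplied as extra-vars when invoking
the playbook; a sketch driving it from Python, where the invocation style is
illustrative and the playbook path and parameter names come from above:
::
# Illustrative only: run the upgrade playbook with the required parameters.
import subprocess

def run_upgrade_playbook(become_pass, admin_password, upgrade_data_file):
    subprocess.check_call([
        "ansible-playbook",
        "/usr/share/ansible/stx-ansible/playbooks/upgrade.yml",
        "-e", "ansible_become_pass=%s" % become_pass,
        "-e", "admin_password=%s" % admin_password,
        "-e", "upgrade_data_file=%s" % upgrade_data_file])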
#. **Unlock controller-0**
::
# system host-unlock controller-0
#. **Activate the upgrade**
::
# system upgrade-activate
Perform any additional configuration which may be required after the host is
unlocked.
#. **Complete the upgrade**
::
# system upgrade-complete
Remove data from the previous release.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
* John Kung (john.kung@windriver.com)
Other contributors:
* David Sullivan (david.sullivan@windriver.com)
Repos Impacted
--------------
* config
* update
* integ
* metal
* stx-puppet
* ansible-playbooks
Work Items
----------
Please refer to the Story [2]_ for the complete list of tasks.
The following are prerequisites prior to the upgrade:
* update the kubernetes configuration to the configured features on N+1.
This will be enabled by a delivered software increment that enables the
required configuration baseline.
* update kubernetes control_plane_address
* kubernetes encryption at rest
Updating the kubernetes version is covered by [1]_ and
is performed independently.
* This is enforced by the upgrade load N+1 metadata, which specifies the
supported upgrade from-load.
* The etcd directory is unversioned so that it can be referenced by the N
and N+1 kubernetes control plane
The following steps in the upgrade require changes:
* load-import
The metadata handling is extended to support multiple from versions.
* health-query-upgrade
Health checks are added to ensure kubernetes version and configuration
are at correct baseline for upgrade
upgrade-start
* upgrade-start-pkg-extract
Update to reference dnf rather than superseded repoquery tool
* migrate puppet hiera data
* Export armada, helm, kubernetes configuration to N+1
* export the databases for N+1
host-upgrade
* create /etc/platform/.upgrade_controller_1 so that controller-1 via RPC
can determine that controller upgrade is required
host-unlock
* Create join command from the N side for the N+1 side
* run the upgrades playbook for docker. This will push the required docker images.
host-swact
* update to back up /opt/etcd/from_version and restore
to /opt/etcd/to_version for the target version on host-swact
This is performed at a time when data loss can be avoided. During an upgrade,
before etcd has started on controller-1, after host-swact, the etcd data is
copied from controller-0. Normally, an etcdctl snapshot is required when data
is still dynamically changing; however, as service management runs etcd in
active-standby and the copy occurs as part of etcd startup, it is
possible to use a direct copy.
Ansible:
* upgrade playbook for docker. push_k8s_images.yml is updated to handle
platform upgrade case.
Integ:
* Update registry-token-server to continue to support GET for token
This is performed as part of Story 2006145, Task 38763
https://review.opendev.org/#/c/707283/
* Add semantic checks to existing APIs
* application-apply/remove/etc... - prevent when platform upgrade in
progress
* helm-override-update/etc... - prevent when platform upgrade in progress
Miscellaneous:
* Update metadata for upgrade versions
* Remove openstack service and database references in upgrade code
* Update supported from version checks
Dependencies
============
None
Testing
=======
Upgrades must be tested in the following StarlingX configurations:
* AIO-DX
* Standard with controller storage
* Standard with dedicated storage
* AIO-SX
The testing can be performed on hardware or virtual environments.
Documentation Impact
====================
New end user documentation will be required to describe how platform
upgrades should be done.
The config API reference will also need updates.
References
==========
.. [1] Kubernetes Upgrade Story https://storyboard.openstack.org/#!/story/2006781
.. [2] Platform Upgrades Story https://storyboard.openstack.org/#!/story/2007403
.. [3] etcd upgrades https://etcd.io/docs/v3.4.0/upgrades
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - stx-4.0
- Introduced