
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode

Fast-forward upgrades

https://blueprints.launchpad.net/tripleo/+spec/fast-forward-upgrades

Fast-forward upgrades are upgrades that move an environment from release N to N+X in a single step, where X is greater than 1 and for fast-forward upgrades is typically 3. This spec outlines how such upgrades can be orchestrated by TripleO between the Newton and Queens OpenStack releases.

Problem Description

OpenStack upgrades are often seen by operators as problematic [1] [2]. Whilst TripleO upgrades have improved greatly over recent cycles, many operators are still reluctant to upgrade with each new release.

This often leads to a situation where environments remain on the release used when first deployed. Eventually this release will come to the end of its supported life (EOL), forcing operators to upgrade to the next supported release. There can also be restrictions imposed on an environment that simply do not allow upgrades to be performed ahead of the EOL of a given release, forcing operators to again wait until the release hits EOL.

While it is possible to then linearly upgrade to a supported release with the cadence of upstream releases, downstream distributions providing long-term support (LTS) releases may not be able to provide the same path once the initially installed release reaches EOL. Operators in such a situation may also want to avoid running multiple lengthy linear upgrades to reach their desired release.

Proposed Change

Overview

TripleO support for fast-forward upgrades will first target N to N+3 upgrades between the Newton and Queens releases:

Newton    Ocata     Pike       Queens
+-----+   +-----+   +-----+    +-----+
|     |   | N+1 |   | N+2 |    |     |
|  N  | ---------------------> | N+3 |
|     |   |     |   |     |    |     |
+-----+   +-----+   +-----+    +-----+

This will give the impression of the Ocata and Pike releases being skipped, with the fast-forward upgrade moving the environment from Newton to Queens. In reality, as OpenStack projects with the supports-upgrade tag are only required to support N to N+1 upgrades [3], the upgrade will still need to move through each release, completing database migrations and a limited set of other tasks.

Caveats

Before outlining the suggested changes to TripleO it is worth highlighting the following caveats for fast-forward upgrades:

  • The control plane is inaccessible for the duration of the upgrade
  • The data plane and active workloads must remain available for the duration of the upgrade.

Prerequisites

Prior to the overcloud fast-forward upgrade starting, the following prerequisite tasks must be completed:

  • Rolling minor update of the overcloud on N

This is a normal TripleO overcloud update [4] and should bring each node in the environment up to the latest supported version of the underlying OS, pulling in the latest packages. Operators can then reboot the nodes as required. The reboot ensures that the latest kernel, openvswitch, QEMU and any other reboot-dependent packages are loaded before proceeding with the upgrade. This can happen well in advance of the overcloud fast-forward upgrade and should remove the need for additional reboots during the upgrade.

  • Upgrade undercloud from N to N+3

The undercloud also needs to be upgraded to N+3 ahead of any overcloud upgrade. Again, this can happen well in advance of the overcloud upgrade. For the time being this is a traditional, linear upgrade through each N to N+1 release until the undercloud reaches the target N+3 Queens release.

  • Container images cached prior to the start of the upgrade

With the introduction of containerised TripleO overclouds in Pike, operators will need to cache the required container images prior to the fast-forward upgrade if they wish to end up with a containerised Queens overcloud.
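For illustration, the cached images are then referenced from a Heat environment file that is passed to the later deployment steps. The fragment below is purely a sketch; the parameter names, registry address and tags are assumptions based on the Pike/Queens container tooling, and the real file would be generated rather than written by hand:

parameter_defaults:
  # Hypothetical values; the actual environment file is generated by the
  # container image preparation tooling rather than written by hand.
  DockerNovaComputeImage: 192.168.24.1:8787/tripleoqueens/centos-binary-nova-compute:current-tripleo
  DockerNovaLibvirtImage: 192.168.24.1:8787/tripleoqueens/centos-binary-nova-libvirt:current-tripleo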

High level flow

At a high level the following actions will be carried out by the fast-forward upgrade to move the overcloud from N to N+3:

  • Stop all OpenStack control and compute services across all roles

This will bring down the OpenStack control plane, leaving infrastructure services such as the databases running, while allowing any workloads to continue running without interruption. For HA environments this will disable the cluster, ensuring that OpenStack services are not restarted.

  • Upgrade a single host from N to N+1 then N+1 to N+2

As alluded to earlier, OpenStack projects currently only support N to N+1 upgrades and so fast-forward upgrades still need to cycle through each release in order to complete data migrations and any other tasks that are required before these migrations can be completed. This part of the upgrade is limited to a single host per role to ensure this is completed as quickly as possible.

  • Optional upgrade and deployment of single canary compute host to N+3

As fast-forward upgrades aim to ensure workloads are online and accessible during the upgrade, we can optionally upgrade all roles hosting control services and a single canary compute host to N+3 to verify that workloads will remain active and accessible during the upgrade.

A canary compute node will be selected at the start of the upgrade and have instances launched on it to validate that both it and the data plane remain active during the upgrade. The upgrade will halt if either become inaccessible with a recovery procedure being provided to move all hosts back to N+1 without further disruption to the active workloads on the untouched compute hosts.

  • Upgrade and deployment of all roles to N+3

If the above optional canary compute host upgrade is not used then the final action in the fast-forward upgrade will be a traditional N to N+1 style upgrade between N+2 and N+3, followed by the deployment of all roles on N+3. This final action is essentially a redeployment of the overcloud onto containers on N+3 (Queens), as previously seen when upgrading TripleO environments from Ocata to Pike.

A python-tripleoclient command and associated Mistral workflow will control whether this final step is applied to all roles in parallel (the default), to all hosts in a given role, or to selected hosts in a given role. The latter is useful if a user wants to control the order in which computes are moved from N+1 to N+3, for example.

Implementation

As with updates [5] and upgrades [6], specific fast-forward upgrade Ansible tasks associated with the first two actions above will be introduced into the tripleo-heat-templates service templates for each service as RoleConfig outputs.

As with upgrade_tasks, each task is associated with a particular step in the process. For fast_forward_upgrade_tasks these steps are split between prep tasks, which apply to all hosts, and bootstrap tasks, which apply only to a single host for a given role.

Prep step tasks will map to the following actions:

  • Step=1: Disable the overall cluster
  • Step=2: Stop OpenStack services
  • Step=3: Update host repositories
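For illustration, prep tasks for a single service could look something like the sketch below; the pacemaker_cluster and service modules and the nova-api service name are assumptions used for the example rather than the exact tasks each template will carry:

fast_forward_upgrade_tasks:
  # Step 1: take the Pacemaker cluster offline on HA deployments (illustrative)
  - name: Stop the pacemaker cluster
    pacemaker_cluster:
      state: offline
    when:
      - step|int == 1
      - release == 'ocata'
  # Step 2: stop an individual OpenStack service (nova-api chosen purely as an example)
  - name: Stop openstack-nova-api
    service:
      name: openstack-nova-api
      state: stopped
    when:
      - step|int == 2
      - release == 'ocata'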

Bootstrap step tasks will map to the following actions:

  • Step=4: Take OpenStack DB backups
  • Step=5: Pre package update commands
  • Step=6: Update required packages
  • Step=7: Post package update commands
  • Step=8: OpenStack service DB sync
  • Step=9: Validation

As with update_tasks each task will use simple when conditionals to identify which step and release(s) it is associated with, ensuring these tasks are executed at the correct point in the upgrade.

For example, a step 2 fast_forward_upgrade_tasks task for Ocata is listed below:

fast_forward_upgrade_tasks:
  - name: Example Ocata step 2 task
    command: /bin/foo bar
    when:
      - step|int == 2
      - release == 'ocata'
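Similarly, a bootstrap-only task such as a step 8 database sync follows the same pattern but will only be executed on the single bootstrap host of the role; the keystone-manage command here is just an illustration of the pattern:

fast_forward_upgrade_tasks:
  # Step 8: run the service's database migrations on the bootstrap host only
  # (Keystone is used purely as an example service)
  - name: Keystone DB sync
    command: keystone-manage db_sync
    when:
      - step|int == 8
      - release == 'pike'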

These tasks will then be collated into role specific Ansible playbooks via the RoleConfig output of the overcloud heat template, with step and release variables being fed in to ensure tasks are executed in the correct order.
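The exact layout of the generated playbooks is defined by the RoleConfig template, but conceptually each role playbook loops the rendered tasks over the relevant releases and steps, roughly along the following lines (the structure and values are hypothetical):

# Hypothetical sketch of a generated per-role playbook; the real RoleConfig
# output may structure the loop differently.
- hosts: Compute
  become: true
  tasks:
    # Run each step of the Ocata tasks in order; a further pass would repeat
    # this for the Pike release.
    - include_tasks: Compute/fast_forward_upgrade_tasks.yaml
      vars:
        release: ocata
        step: "{{ item }}"
      with_sequence: start=1 end=9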

As with major upgrades [7], a new Mistral workflow and tripleoclient command will be introduced to generate and execute the associated Ansible tasks.

openstack overcloud fast-forward-upgrade --templates [..path to latest THT..] \
                           [..original environment arguments..] \
                           [..new container environment arguments..]

Operators will also be able to generate [8], download and review the playbooks ahead of time using the latest version of tripleo-heat-templates with the following commands:

openstack overcloud deploy --templates [..path to latest THT..] \
                           [..original environment arguments..] \
                           [..new container environment arguments..] \
                           -e environments/fast-forward-upgrade.yaml \
                           -e environments/noop-deploy-steps.yaml
openstack overcloud config download

Dev workflow

The existing tripleo-upgrade Ansible role will be used to automate the fast-forward upgrade process for use by developers and CI, including the initial overcloud minor update, undercloud upgrade to N+3 and fast-forward upgrade itself.

Developers working on fast_forward_upgrade_tasks will also be able to deploy minimal overcloud deployments via tripleo-quickstart using release configs also used by CI.

Further, when developing tasks, developers will be able to manually render and run fast_forward_upgrade_tasks as standalone Ansible playbooks, allowing them to run a subset of the tasks against specific nodes using tripleo-ansible-inventory. Examples of how to do this will be documented, hopefully ensuring a smooth development experience for anyone looking to contribute tasks for specific services.
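A rough sketch of such a standalone run, assuming the rendered tasks have been saved locally and that the dynamic inventory groups hosts by role, might look like the following wrapper playbook, executed with ansible-playbook against the tripleo-ansible-inventory output:

# Hypothetical wrapper playbook to exercise a single step of the rendered
# tasks against the Compute role; the file name and values are examples only.
- hosts: Compute
  become: true
  vars:
    release: ocata
    step: 2
  tasks:
    - include_tasks: rendered_fast_forward_upgrade_tasks.yaml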

Alternatives

  • Continue to force operators to upgrade linearly through each major release
  • Parallel cloud migrations.

Security Impact

N/A

Other End User Impact

  • The control plane will be down for the duration of the upgrade
  • The data plane and workloads will remain up.

Performance Impact

N/A

Other Deployer Impact

N/A

Developer Impact

  • Third party service template providers will need to provide fast_forward_upgrade_tasks in their THT service configurations, as sketched below.
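As a rough illustration, such tasks would sit alongside the other role_data entries of a service template, in the same way upgrade_tasks do today; the service shown here is hypothetical:

outputs:
  role_data:
    description: Role data for an example third party service
    value:
      service_name: example_api
      # ... other role_data entries such as config_settings ...
      fast_forward_upgrade_tasks:
        - name: Stop example API service
          service:
            name: openstack-example-api
            state: stopped
          when:
            - step|int == 2
            - release == 'ocata'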

Implementation

Assignee(s)

Primary assignees:

  • lbezdick
  • marios
  • chem

Other contributors:

  • shardy
  • lyarwood

Work Items

  • Introduce fast_forward_upgrades_playbook.yaml to RoleConfig
  • Introduce fast_forward_upgrade_tasks in each service template
  • Introduce a python-tripleoclient command and associated Mistral workflow.

Dependencies

  • TripleO - Ansible upgrade Workflow with UI integration [9]

The new major upgrade workflow being introduced for Pike to Queens upgrades will obviously impact what fast-forward upgrades look like to Queens. At present the high level flow for fast-forward upgrades assumes that we can reuse the current upgrade_tasks between N+2 and N+3 to disable and then potentially remove baremetal services. This is likely to change as the major upgrade workflow is introduced, and so these steps may instead need to be encoded in fast_forward_upgrade_tasks.

Testing

  • Third party CI jobs will need to be created to test Newton to Queens using RDO, given the upstream EOL of stable/newton with the release of Pike.
  • These jobs should cover the initial undercloud upgrade, overcloud upgrade and optional canary compute node checks.
  • An additional third party CI job will be required to verify that a Queens undercloud can correctly manage a Newton overcloud, allowing the separation of the undercloud upgrade and fast-forward upgrade discussed under prerequisites.
  • Finally, minimal overcloud roles should be used to verify the upgrade for certain services. For example, when changes are made to the fast_forward_upgrade_tasks of Nova via changes to docker/services/nova-*.yaml files, a basic overcloud deployment of Keystone, Glance, Swift, Cinder, Neutron and Nova could be used to quickly verify the changes with regard to fast-forward upgrades.

Documentation Impact

  • This will require extensive developer and user documentation to be written, most likely in a new section of the docs specifically detailing the fast-forward upgrade flow.

References


  1. https://etherpad.openstack.org/p/MEX-ops-migrations-upgrades

  2. https://etherpad.openstack.org/p/BOS-forum-skip-level-upgrading

  3. https://governance.openstack.org/tc/reference/tags/assert_supports-upgrade.html

  4. http://tripleo.org/install/post_deployment/package_update.html

  5. https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/README.rst#update-steps

  6. https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/README.rst#upgrade-steps

  7. https://review.openstack.org/#/q/topic:major-upgrade+(status:open+OR+status:merged)

  8. https://review.openstack.org/#/c/495658/

  9. https://specs.openstack.org/openstack/tripleo-specs/specs/queens/tripleo_ansible_upgrades_workflow.html