165 lines
8.2 KiB
ReStructuredText
165 lines
8.2 KiB
ReStructuredText
..
|
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
|
License.
|
|
|
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
|
|
|
========================================
|
|
Upgrade controllers with no API downtime
|
|
========================================
|
|
|
|
https://blueprints.launchpad.net/neutron/+spec/online-upgrades
|
|
|
|
Problem Description
|
|
===================
|
|
|
|
Currently, database migration for a new major neutron release means full
|
|
shutdown of all neutron-server instances before `contract
|
|
<http://docs.openstack.org/developer/neutron/devref/alembic_migrations.html#expand-and-contract-scripts>`_
|
|
alembic migration scripts are applied.
|
|
|
|
When running a popular cloud, it's usually hard or impractical to shut down all
|
|
neutron-server instances before applying all ``contract`` alembic migration
|
|
scripts that may take a while, especially for databases with a lot of data to
|
|
migrate. Such a shutdown requires a significantly more elaborate planning,
|
|
trying to squeeze the upgrade process in a maintenance window with less
|
|
disruption to API users. For clouds with high SLAs, it may be impossible to
|
|
shutdown all Neutron API endpoints at one time, leaving users with no
|
|
Networking API for extended time. Due to dependencies from other services (i.e.
|
|
Nova) on Neutron API availability, upgrades have a greater impact as Neutron is
|
|
a central service in any Openstack installation.
|
|
|
|
This spec describes an approach to allow for non-impacting neutron-server
|
|
upgrades, leaving instances running in a cluster. Instead of full shutdown,
|
|
operators will be able to upgrade the services in rolling mode, upgrading each
|
|
node running the service without disruption for other such nodes. If Networking
|
|
API is served by multiple nodes hidden behind a load balancer, that approach
|
|
should allow for no-downtime upgrade experience. Ideally, users would not
|
|
notice any issues accessing Neutron API services for the entirety of the
|
|
upgrade.
|
|
|
|
.. note::
|
|
|
|
Running mixed major versions of neutron-server in a cloud opens the question
|
|
of how to mitigate slight differences of API behaviour between those
|
|
versions. For example, in Newton, all resources received a new
|
|
``project_id`` field. If we would run mixed Mitaka/Newton versions of
|
|
neutron-server behind a round robin load balancer, then consequent GET calls
|
|
for the same resource would result in different reply payloads, depending on
|
|
which particular neutron-server is hit by an API request.
|
|
|
|
Enforcing consistent API behaviour for mixed version environment, or pinning
|
|
API behaviour to a version that would describe previous major version
|
|
behaviour, is *out of scope* for the proposal. Consequent blueprints may
|
|
clarify best practices, or propose mechanisms for strict API behaviour
|
|
control.
|
|
|
|
Proposed Change
|
|
===============
|
|
|
|
Since neutron-server downtime derives solely from executing unsafe ``contract``
|
|
upgrade migrations while neutron-server is operational, the solution is to make
|
|
those migrations safe for online execution, or eliminate them. This is achieved
|
|
with two major changes:
|
|
|
|
#. Time consuming data migration changes are moved from neutron-db-manage phase
|
|
into neutron-server itself, so that data is migrated while the service is up
|
|
and serving requests, instead of while it's fully shut down. Data migration
|
|
process will be ``lazy``, happening at the time when a resource is touched
|
|
by plugin code. That said, users should not be able to switch to a next
|
|
major version before they complete migration for all remaining resources. We
|
|
will provide a tool to trigger pending migrations (preferrably in chunks)
|
|
that will become part of preparation process for next upgrade. The tool will
|
|
be modeled similar to ``online-data-migrations`` command found in Cinder.
|
|
|
|
#. Remaining schema contraction changes are postponed to the time when:
|
|
|
|
- all the data is migrated from old tables/columns, and
|
|
- no neutron-server instances running in a cluster are able to access
|
|
obsolete tables/columns.
|
|
|
|
.. note::
|
|
|
|
This idea is not new, other projects already got rid of unsafe migration
|
|
scripts that would require offline execution. Among those projects are Nova
|
|
and Cinder.
|
|
|
|
Data migration between multiple tables/columns implemented in neutron-server
|
|
runtime is potentially error prone and requires specific reviewer attention. It
|
|
would be impractical to expect proper attention to those intricacies for any
|
|
patch that needs to read or update a database model. So the first step is to
|
|
isolate the layer that has access to database models behind a special facade.
|
|
The base of the facade is oslo.versionedobjects and corresponding NeutronObject
|
|
framework that is already in tree and is successfully used by several features
|
|
(qos, vlan-aware-vms). The work to switch all the plugin code that accesses
|
|
database using SQLAlchemy models, to object facade, is ongoing and is tracked
|
|
as a `separate blueprint
|
|
<https://blueprints.launchpad.net/neutron/+spec/adopt-oslo-versioned-objects-for-db>`_.
|
|
|
|
Once plugin code is switched to using objects for resource persistence, we can
|
|
implement any needed *data* migration rules in a single place, in a
|
|
corresponding object class, isolating consuming code from all the complexities
|
|
of the migration/conversion process.
|
|
|
|
Even with persistence facade, some work on migration mechanism and techniques
|
|
is expected. For the start, an Ocata feature that would need a schema/data
|
|
migration change will be identified and used as a ``guinea pig`` to explore the
|
|
proposal practicalities.
|
|
|
|
At the time of writing, `port bindings rework
|
|
<https://bugs.launchpad.net/bugs/1580880>`_ is probably the best candidate to
|
|
try the new approach.
|
|
|
|
As for unsafe *schema* changes, if at release X we want to introduce a contract
|
|
migration we can only do so destructively in release X+2 (i.e. after all the
|
|
data used by X and X+1 located in the old schema is migrated), which guarantees
|
|
that whenever the deployer upgrades to X+2 (from X+1), the server instance
|
|
running behind won't hit the data/code being affected by the schema migration.
|
|
At this point, even seemingly unsafe operations like dropping tables or columns
|
|
become safe to execute while neutron-server instances are online. For Ocata,
|
|
there should be no new ``contract`` alembic scripts at all. Those may show up
|
|
again in later releases.
|
|
|
|
To achieve the goal for Newton to Ocata upgrades, we should guarantee that all
|
|
patches will follow decided path during Ocata. This will be achieved by both
|
|
automation as well as social means.
|
|
|
|
For the former, we introduce a `functional test
|
|
<https://review.openstack.org/#/c/400239/>`_ that fails on attempt to execute
|
|
an operation known to be unsafe.
|
|
|
|
We will probably not be able to catch everything programmatically, so we still
|
|
need to make sure core reviewer team is aware of new requirements, and
|
|
proactively track all proposed alembic migrations at least first cycles until
|
|
it becomes a habit of an average Joe the Reviewer to spot and -1 unsafe
|
|
patches.
|
|
|
|
To make sure the new upgrade mode works, a new gating grenade job will be
|
|
implemented that will run multiple neutron-server instances of different major
|
|
versions. Access to Networking API will be implemented using a lightweight load
|
|
balancer (``haproxy``) hiding multiple instances of neutron-server behind it.
|
|
We may also consider moving known consumers of Neutron API (for example, Nova)
|
|
to the 'old' subnode to make sure it can talk to newer as well as older Neutron
|
|
in the same setup. Only two adjacent major versions of neutron-server will be
|
|
used in the grenade job to conform to `assert:supports-upgrade tag requirements
|
|
<https://governance.openstack.org/reference/tags/assert_supports-upgrade.html#requirements>`_.
|
|
|
|
Action items
|
|
------------
|
|
|
|
(The feature assumes completion of `adopt-oslo-versioned-objects-for-db
|
|
blueprint
|
|
<https://blueprints.launchpad.net/neutron/+spec/adopt-oslo-versioned-objects-for-db>`_
|
|
but does not strictly depend on it.)
|
|
|
|
#. Block unsafe contract migrations at the start of Ocata (*Done*).
|
|
#. Explore practicalities of the proposal for data migrations for a new feature.
|
|
#. Add a voting grenade job running different major versions of neutron-server.
|
|
#. Document new upgrade path in ops upgrades guide, with its limitations.
|
|
#. Update devref and wider audience about new requirements.
|
|
|
|
References
|
|
==========
|
|
|
|
None.
|