diff --git a/doc/source/devref/alembic_migrations.rst b/doc/source/devref/alembic_migrations.rst
index fbdba620c28..d00619ca957 100644
--- a/doc/source/devref/alembic_migrations.rst
+++ b/doc/source/devref/alembic_migrations.rst
@@ -20,6 +20,7 @@
       ''''''' Heading 4
       (Avoid deeper levels because they do not render well.)
 
+.. _alembic_migrations:
 
 Alembic Migrations
 ==================
diff --git a/doc/source/devref/effective_neutron.rst b/doc/source/devref/effective_neutron.rst
index cf72dd29586..bbe918189ec 100644
--- a/doc/source/devref/effective_neutron.rst
+++ b/doc/source/devref/effective_neutron.rst
@@ -182,10 +182,7 @@ Backward compatibility
 Document common pitfalls as well as good practices done when extending
 the RPC Interfaces.
 
-* The Neutron upgrade path requires the server to support the previous version of
-  the agent. Any changes to the existing RPC methods must be compatible with the
-  previous version of the agent. Otherwise a version bump is required and the old
-  method must be kept under the previous version RPC endpoint.
+* Make yourself familiar with :ref:`Upgrade review guidelines <upgrade_review_guidelines>`.
 
 
 Scalability issues
diff --git a/doc/source/devref/index.rst b/doc/source/devref/index.rst
index 76b638f126e..13a7d5cc966 100644
--- a/doc/source/devref/index.rst
+++ b/doc/source/devref/index.rst
@@ -69,6 +69,7 @@ Neutron Internals
    oslo-incubator
    callbacks
    dns_order
+   upgrade
 
 Testing
 -------
diff --git a/doc/source/devref/rpc_api.rst b/doc/source/devref/rpc_api.rst
index 951085db8e4..5be9978e003 100644
--- a/doc/source/devref/rpc_api.rst
+++ b/doc/source/devref/rpc_api.rst
@@ -95,6 +95,8 @@ This class implements the server side of the interface. The
 oslo_messaging.Target() defined says that this class currently implements
 version 1.1 of the interface.
 
+.. 
_rpc_versioning:
+
 Versioning
 ----------
 
diff --git a/doc/source/devref/rpc_callbacks.rst b/doc/source/devref/rpc_callbacks.rst
index cc2fae57af3..de79e24a66a 100644
--- a/doc/source/devref/rpc_callbacks.rst
+++ b/doc/source/devref/rpc_callbacks.rst
@@ -21,6 +21,8 @@
       (Avoid deeper levels because they do not render well.)
 
 
+.. _rpc_callbacks:
+
 Neutron Messaging Callback System
 =================================
 
diff --git a/doc/source/devref/upgrade.rst b/doc/source/devref/upgrade.rst
new file mode 100644
index 00000000000..21808917f5b
--- /dev/null
+++ b/doc/source/devref/upgrade.rst
@@ -0,0 +1,250 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+
+      Convention for heading levels in Neutron devref:
+      ======= Heading 0 (reserved for the title in a document)
+      ------- Heading 1
+      ~~~~~~~ Heading 2
+      +++++++ Heading 3
+      ''''''' Heading 4
+      (Avoid deeper levels because they do not render well.)
+
+.. note::
+
+    Much of this document discusses upgrade considerations for the Neutron
+    reference implementation using Neutron's agents. It's expected that each
+    Neutron plugin provides its own documentation that discusses upgrade
+    considerations specific to that choice of backend. For example, OVN does
+    not use Neutron agents, but does have a local controller that runs on each
+    compute node. OVN supports rolling upgrades, but information about how that
+    works should be covered in the documentation for networking-ovn, the OVN
+    Neutron plugin.
+
+Upgrade strategy
+================
+
+There are two general upgrade scenarios supported by Neutron:
+
+#. All services are shut down, code upgraded, then all services are started again.
+#. Services are upgraded gradually, based on operator service windows.
+
+The latter is the preferred way to upgrade an OpenStack cloud, since it allows
+for more granularity and less service downtime. This scenario is usually called
+'rolling upgrade'.
+
+Rolling upgrade
+---------------
+
+Rolling upgrades imply that during some interval of time there will be services
+of different code versions running and interacting in the same cloud. This puts
+multiple constraints on the software:
+
+#. older services should be able to talk with newer services.
+#. older services should not require the database to have an older schema
+   (otherwise newer services that require the newer schema would not work).
+
+`More info on rolling upgrades in OpenStack
+`_.
+
+Those requirements are achieved in Neutron by:
+
+#. keeping backwards compatibility code in the Neutron server to deal with
+   older messaging payloads (when the backend makes use of Neutron agents).
+#. isolating a single service that accesses the database (neutron-server).
+
+To simplify the matter, it's always assumed that the order of service upgrades
+is as follows:
+
+#. first, all neutron-servers are upgraded.
+#. then, if applicable, neutron agents are upgraded.
+
+This approach allows us to avoid backwards compatibility code on the agent side
+and is in line with other OpenStack projects that support rolling upgrades
+(specifically, nova).
+
+Server upgrade
+~~~~~~~~~~~~~~
+
+Neutron-server is the very first component that should be upgraded to the new
+code. It's also the only component that relies on the new database schema being
+present; other components communicate with the cloud through AMQP and hence do
+not depend on a particular database state.
+
+Database upgrades are implemented with alembic migration chains.
+
+Database upgrade is split into two parts:
+
+#. neutron-db-manage upgrade --expand
+#. neutron-db-manage upgrade --contract
+
+Each part represents a separate alembic branch.
+
+:ref:`More info on alembic scripts <alembic_migrations>`.
+
+The former step can be executed while old neutron-server code is running. The
+latter step requires *all* neutron-server instances to be shut down. Once it's
+complete, neutron-servers can be started again.
+
+Agents upgrade
+~~~~~~~~~~~~~~
+
+.. note::
+
+    This section does not apply when the cloud does not use AMQP agents to
+    provide networking services to instances. In that case, other
+    backend-specific upgrade instructions may also apply.
+
+Once neutron-server services are restarted with the new database schema and the
+new code, it's time to upgrade Neutron agents.
+
+Note that in the meantime, neutron-server should be able to serve AMQP messages
+sent by older versions of agents which are part of the cloud.
+
+The recommended order of agent upgrade (per node) is:
+
+#. first, L2 agents (openvswitch, linuxbridge, sr-iov).
+#. then, all other agents (L3, DHCP, Metadata, ...).
+
+The rationale for this order is that the L2 agent is usually
+responsible for wiring ports for other agents to use, so it's better to allow
+it to do its job first and then proceed with other agents that will use the
+already configured ports for their needs.
+
+Each network/compute node can have its own upgrade schedule that is independent
+of other nodes.
+
+AMQP considerations
++++++++++++++++++++
+
+Since it's always assumed that the neutron-server component is upgraded before
+agents, only the former should handle both old and new RPC versions.
+
+The implication is that code handling UnsupportedVersion oslo.messaging
+exceptions does not belong in agent code.
+
+:ref:`More information about RPC versioning <rpc_versioning>`.
+
+Interface signature
+'''''''''''''''''''
+
+An RPC interface is defined by its name, version, and (named) arguments that
+it accepts. 
There are no strict guarantees that arguments will have expected
+types or meaning, as long as they are serializable.
+
+Message content versioning
+''''''''''''''''''''''''''
+
+To provide better compatibility guarantees for rolling upgrades, RPC interfaces
+could also define a specific format for the arguments they accept. In the
+OpenStack world, this is usually implemented using the oslo.versionedobjects
+library, relying on it to define the serialized form of arguments that are
+passed over the AMQP wire.
+
+Note that Neutron has *not* adopted the oslo.versionedobjects library for its
+RPC interfaces yet (except for the QoS feature).
+
+:ref:`More information about RPC callbacks used for QoS <rpc_callbacks>`.
+
+Networking backends
+~~~~~~~~~~~~~~~~~~~
+
+Backend software upgrade should not result in any data plane disruptions. For
+example, the Open vSwitch L2 agent should not reset flows or rewire ports; the
+Neutron L3 agent should not delete namespaces left by an older version of the
+agent; the Neutron DHCP agent should not require immediate DHCP lease renewal.
+
+The same considerations apply to setups that do not rely on agents. For
+example, an OpenDaylight or OVN controller should not break data plane
+connectivity during its upgrade process.
+
+Upgrade testing
+---------------
+
+`Grenade `_ is the OpenStack project
+that is designed to validate upgrade scenarios.
+
+Currently, only the offline (non-rolling) upgrade scenario is validated in the
+Neutron gate. The upgrade scenario consists of the following steps:
+
+#. the 'old' cloud is set up using latest stable release code
+#. all services are stopped
+#. code is updated to the patch under review
+#. new database migration scripts are applied, if needed
+#. all services are started
+#. the 'new' cloud is validated with a subset of tempest tests
+
+The scenario validates that no configuration option names are changed in one
+cycle. More generally, it validates that the 'new' cloud is capable of running
+using the 'old' configuration files. 
It also validates that database migration
+scripts can be executed.
+
+The scenario does *not* validate AMQP versioning compatibility.
+
+Other projects (for example Nova) have so-called 'partial' grenade jobs where
+some services are left running using the old version of code. Such a job would
+be needed in the Neutron gate to validate rolling upgrades for the project.
+Until then, it's up to reviewers to catch compatibility issues in patches on
+review.
+
+Another hole in testing concerns the split migration script branches. It's
+assumed that an 'old' cloud can successfully run after 'expand' migration
+scripts from the 'new' cloud are applied to its database, but this is not
+validated in the gate.
+
+.. _upgrade_review_guidelines:
+
+Review guidelines
+-----------------
+
+There are several upgrade-related gotchas that should be tracked by reviewers.
+
+First things first, general advice to reviewers: make sure new code does not
+violate requirements set by `global OpenStack deprecation policy
+`_.
+
+Now to specifics:
+
+#. Configuration options:
+
+   * options should not be dropped from the tree without waiting for the
+     deprecation period (currently one development cycle long) and without a
+     deprecation message being issued when the deprecated option is used.
+   * option values should not change their meaning between releases.
+
+#. Data plane:
+
+   * agent restart should not result in data plane disruption (no Open vSwitch
+     ports reset; no network namespaces deleted; no device names changed).
+
+#. RPC versioning:
+
+   * no RPC version major number should be bumped before all agents have had a
+     chance to upgrade (meaning, at least one release cycle is needed before
+     compatibility code to handle old clients is stripped from the tree).
+   * no compatibility code should be added to the agent side of AMQP interfaces.
+   * server code should be able to handle all previous versions of agents,
+     unless the major version of an interface is bumped.
+   * no RPC interface arguments should change their meaning or names.
+   * new arguments added to RPC interfaces should not be mandatory. This means
+     that the server should be able to handle old requests that do not specify
+     the new argument. Also, if the argument is not passed, the old behaviour
+     before the addition of the argument should be retained.
+
+#. Database migrations:
+
+   * migration code should be split into two branches (contract, expand) as
+     needed. No code that is unsafe to execute while neutron-server is running
+     should be added to the expand branch.
+   * if possible, contract migrations should be minimized or avoided to reduce
+     the time when API endpoints must be down during database upgrade.
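
Reviewer note: the RPC versioning guideline about non-mandatory arguments can be sketched in plain Python. This is only an illustration under stated assumptions, not Neutron code: the class ``PortRpcEndpoint``, the method ``get_port_details`` and the argument ``include_qos`` are hypothetical names, and no oslo.messaging machinery is involved.

```python
class PortRpcEndpoint:
    """Hypothetical server-side RPC endpoint (illustrative, not Neutron code)."""

    def get_port_details(self, context, port_id, include_qos=None):
        # ``include_qos`` stands in for an argument added in a newer minor
        # version of the interface. It has a default, so a request from an
        # older agent that does not send it is still accepted.
        details = {"port_id": port_id, "admin_state_up": True}
        if include_qos:
            # Only callers that explicitly opt in receive the new field;
            # the reply for older callers keeps its previous shape.
            details["qos_policy_id"] = None
        return details


endpoint = PortRpcEndpoint()

# Call as an older agent would make it: the new argument is absent.
old_reply = endpoint.get_port_details({}, "port-1")

# Call as an upgraded agent would make it: the new argument is present.
new_reply = endpoint.get_port_details({}, "port-1", include_qos=True)
```

The compatibility logic lives entirely on the server side, which matches the assumption that neutron-server is upgraded before the agents.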