From 7c3c3b986347d6f83595dae1a536f0fb38efa6de Mon Sep 17 00:00:00 2001 From: Matt Riedemann Date: Mon, 27 Aug 2018 17:19:30 -0400 Subject: [PATCH] Add contributor guide for upgrade status checks This adds some background, guidelines and structural notes on writing nova-status upgrade checks. This is intentionally written with some potentially redundant information or nova developers as it's also meant to be consumed outside nova as part of the community-wide "upgrade-checkers" goal for Stein [1]. Story: 2003570 [1] https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html Change-Id: I340b25edeab3ac19c5d0bedfc69acd037d57bdd2 --- doc/source/index.rst | 1 + doc/source/install/verify.rst | 2 + doc/source/reference/index.rst | 2 + doc/source/reference/upgrade-checks.rst | 202 ++++++++++++++++++++++++ doc/source/user/upgrade.rst | 1 + 5 files changed, 208 insertions(+) create mode 100644 doc/source/reference/upgrade-checks.rst diff --git a/doc/source/index.rst b/doc/source/index.rst index 5a2dd2d69f6e..d04c53f494ab 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -238,6 +238,7 @@ looking parts of our architecture. These are collected below. reference/stable-api reference/threading reference/update-provider-tree + reference/upgrade-checks reference/vm-states user/index user/aggregates diff --git a/doc/source/install/verify.rst b/doc/source/install/verify.rst index 43abb5d94747..242a79f0ea2f 100644 --- a/doc/source/install/verify.rst +++ b/doc/source/install/verify.rst @@ -98,6 +98,8 @@ Verify operation of the Compute service. #. Check the cells and placement API are working successfully: + .. _verify-install-nova-status: + .. code-block:: console # nova-status upgrade check diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst index b763b3732c19..9e2ec917ea50 100644 --- a/doc/source/reference/index.rst +++ b/doc/source/reference/index.rst @@ -24,6 +24,8 @@ The following is a dive into some of the internals in nova. nova, and considerations when adding notifications. * :doc:`/reference/update-provider-tree`: A detailed explanation of the ``ComputeDriver.update_provider_tree`` method. +* :doc:`/reference/upgrade-checks`: A guide to writing automated upgrade + checks. Debugging ========= diff --git a/doc/source/reference/upgrade-checks.rst b/doc/source/reference/upgrade-checks.rst new file mode 100644 index 000000000000..7110d3b38292 --- /dev/null +++ b/doc/source/reference/upgrade-checks.rst @@ -0,0 +1,202 @@ +============== +Upgrade checks +============== + +Nova provides automated :ref:`upgrade check tooling ` to +assist deployment tools in verifying critical parts of the deployment, +especially when it comes to major changes during upgrades that require operator +intervention. + +This guide covers the background on nova's upgrade check tooling, how it is +used, and what to look for in writing new checks. + +Background +========== + +Nova has historically supported offline database schema migrations +(``nova-manage db sync``) and :ref:`online data migrations ` +during upgrades. + +The ``nova-status upgrade check`` command was introduced in the 15.0.0 Ocata +release to aid in the verification of two major required changes in that +release, namely Placement and Cells v2. + +Integration with the Placement service and deploying Cells v2 was optional +starting in the 14.0.0 Newton release and made required in the Ocata release. +The nova team working on these changes knew that there were required deployment +changes to successfully upgrade to Ocata. In addition, the required deployment +changes were not things that could simply be verified in a database migration +script, e.g. a migration script should not make REST API calls to Placement. + +So ``nova-status upgrade check`` was written to provide an automated +"pre-flight" check to verify that required deployment steps were performed +prior to upgrading to Ocata. + +Reference the `Ocata changes`_ for implementation details. + +.. _Ocata changes: https://review.openstack.org/#/q/topic:bp/resource-providers-scheduler-db-filters+status:merged+file:%255Enova/cmd/status.py + +Guidelines +========== + +* The checks should be able to run within a virtual environment or container. + All that is required is a full configuration file, similar to running other + ``nova-manage`` type administration commands. In the case of nova, this + means having :oslo.config:group:`api_database`, + :oslo.config:group:`placement`, etc sections configured. + +* Candidates for automated upgrade checks are things in a project's upgrade + release notes which can be verified via the database. For example, when + upgrading to Cells v2 in Ocata, one required step was creating + "cell mappings" for ``cell0`` and ``cell1``. This can easily be verified by + checking the contents of the ``cell_mappings`` table in the ``nova_api`` + database. + +* Checks will query the database(s) and potentially REST APIs (depending on the + check) but should not expect to run RPC calls. For example, a check should + not require that the ``nova-compute`` service is running on a particular + host. + +* Checks are typically meant to be run before re-starting and upgrading to new + service code, which is how `grenade uses them`_, but they can also be run + as a :ref:`post-install verify step ` which is + how `openstack-ansible`_ also uses them. + +* Checks must be idempotent so they can be run repeatedly and the results are + always based on the latest data. This allows an operator to run the checks, + fix any issues reported, and then iterate until the status check no longer + reports any issues. + +* Checks which cannot easily, or should not, be run within offline database + migrations are a good candidate for these CLI-driven checks. For example, + ``instances`` records are in the cell database and for each instance there + should be a corresponding ``request_specs`` table entry in the ``nova_api`` + database. A ``nova-manage db online_data_migrations`` routine was added in + the Newton release to back-fill request specs for existing instances, and + `in Rocky`_ an upgrade check was added to make sure all non-deleted instances + have a request spec so compatibility code can be removed in Stein. In older + releases of nova we would have added a `blocker migration`_ as part of the + database schema migrations to make sure the online data migrations had been + completed before the upgrade could proceed. + + .. note:: Usage of ``nova-status upgrade check`` does not preclude the need + for blocker migrations within a given database, but in the case of + request specs the check spans multiple databases and was a better + fit for the nova-status tooling. + +* All checks should have an accompanying upgrade release note. + +.. _grenade uses them: https://github.com/openstack-dev/grenade/blob/dc7f4a4ba/projects/60_nova/upgrade.sh#L96 +.. _openstack-ansible: https://review.openstack.org/#/c/575125/ +.. _in Rocky: https://review.openstack.org/#/c/581813/ +.. _blocker migration: https://review.openstack.org/#/c/289450/ + +Structure +========= + +There is no graph logic for checks, meaning each check is meant to be run +independently of other checks in the same set. For example, a project could +have five checks which run serially but that does not mean the second check +in the set depends on the results of the first check in the set, or the +third check depends on the second, and so on. + +The base framework is fairly simple as can be seen from the `initial change`_. +Each check is registered in the ``_upgrade_checks`` variable and the ``check`` +method executes each check and records the result. The most severe result is +recorded for the final return code. + +There are one of three possible results per check: + +* ``Success``: All upgrade readiness checks passed successfully and there is + nothing to do. +* ``Warning``: At least one check encountered an issue and requires further + investigation. This is considered a warning but the upgrade may be OK. +* ``Failure``: There was an upgrade status check failure that needs to be + investigated. This should be considered something that stops an upgrade. + +The ``UpgradeCheckResult`` object provides for adding details when there +is a warning or failure result which generally should refer to how to resolve +the failure, e.g. maybe ``nova-manage db online_data_migrations`` is +incomplete and needs to be run again. + +Using the `cells v2 check`_ as an example, there are really two checks +involved: + +1. Do the cell0 and cell1 mappings exist? +2. Do host mappings exist in the API database if there are compute node + records in the cell database? + +Failing either check results in a ``Failure`` status for that check and return +code of ``2`` for the overall run. + +The initial `placement check`_ provides an example of a warning response. In +that check, if there are fewer resource providers in Placement than there are +compute nodes in the cell database(s), the deployment may be underutilized +because the ``nova-scheduler`` is using the Placement service to determine +candidate hosts for scheduling. + +Warning results are good for cases where scenarios are known to run through +a rolling upgrade process, e.g. ``nova-compute`` being configured to report +resource provider information into the Placement service. These are things +that should be investigated and completed at some point, but might not cause +any immediate failures. + +The results feed into a standard output for the checks: + +.. code-block:: console + + $ nova-status upgrade check + +----------------------------------------------------+ + | Upgrade Check Results | + +----------------------------------------------------+ + | Check: Cells v2 | + | Result: Success | + | Details: None | + +----------------------------------------------------+ + | Check: Placement API | + | Result: Failure | + | Details: There is no placement-api endpoint in the | + | service catalog. | + +----------------------------------------------------+ + +.. _initial change: https://review.openstack.org/#/c/411517/ +.. _cells v2 check: https://review.openstack.org/#/c/411525/ +.. _placement check: https://review.openstack.org/#/c/413250/ + +Other +===== + +Documentation +------------- + +Each check should be documented in the +:ref:`history section ` of the CLI guide and have a +release note. This is important since the checks can be run in an isolated +environment apart from the actual deployed version of the code and since the +checks should be idempotent, the history / change log is good for knowing +what is being validated. + +Backports +--------- + +Sometimes upgrade checks can be backported to aid in pre-empting bugs on +stable branches. For example, a check was added for `bug 1759316`_ in Rocky +which was also backported to stable/queens in case anyone upgrading from Pike +to Queens would hit the same issue. Backportable checks are generally only +made for latent bugs since someone who has already passed checks and upgraded +to a given stable branch should not start failing after a patch release on that +same branch. For this reason, any check being backported should have a release +note with it. + +.. _bug 1759316: https://bugs.launchpad.net/nova/+bug/1759316 + +Other projects +-------------- + +A community-wide `goal for the Stein release`_ is adding the same type of +``$PROJECT-status upgrade check`` tooling to other projects to ease in +upgrading OpenStack across the board. So while the guidelines in this document +are primarily specific to nova, they should apply generically to other projects +wishing to incorporate the same tooling. + +.. _goal for the Stein release: https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html diff --git a/doc/source/user/upgrade.rst b/doc/source/user/upgrade.rst index 4cb77c9b7ec5..38a32854d3d3 100644 --- a/doc/source/user/upgrade.rst +++ b/doc/source/user/upgrade.rst @@ -184,6 +184,7 @@ code has been upgraded. to list and iterate over cell mapping records, which require a functioning API database schema. +.. _data-migrations: Data Migrations '''''''''''''''''