This adds some background, guidelines and structural notes on writing nova-status upgrade checks. This is intentionally written with some potentially redundant information for nova developers as it's also meant to be consumed outside nova as part of the community-wide "upgrade-checkers" goal for Stein [1]. Story: 2003570 [1] https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html Change-Id: I340b25edeab3ac19c5d0bedfc69acd037d57bdd2
Upgrade checks
Nova provides automated upgrade check tooling <nova-status-checks> to
assist deployment tools in verifying critical parts of the deployment,
especially when it comes to major changes during upgrades that require
operator intervention.
This guide covers the background on nova's upgrade check tooling, how it is used, and what to look for in writing new checks.
Background
Nova has historically supported offline database schema migrations
(nova-manage db sync) and online data migrations <data-migrations> during
upgrades.
The nova-status upgrade check command was introduced in
the 15.0.0 Ocata release to aid in the verification of two major
required changes in that release, namely Placement and Cells v2.
Integration with the Placement service and deploying Cells v2 were optional starting in the 14.0.0 Newton release and became required in the Ocata release. The nova team working on these changes knew that a successful upgrade to Ocata would require deployment changes. In addition, the required deployment changes were not things that could simply be verified in a database migration script, e.g. a migration script should not make REST API calls to Placement.
So nova-status upgrade check was written to provide an
automated "pre-flight" check to verify that required deployment steps
were performed prior to upgrading to Ocata.
See the Ocata changes for implementation details.
Guidelines
- The checks should be able to run within a virtual environment or container. All that is required is a full configuration file, similar to running other nova-manage type administration commands. In the case of nova, this means having api_database, placement, etc. sections configured.
- Candidates for automated upgrade checks are things in a project's upgrade release notes which can be verified via the database. For example, when upgrading to Cells v2 in Ocata, one required step was creating "cell mappings" for cell0 and cell1. This can easily be verified by checking the contents of the cell_mappings table in the nova_api database.
- Checks will query the database(s) and potentially REST APIs (depending on the check) but should not expect to run RPC calls. For example, a check should not require that the nova-compute service is running on a particular host.
- Checks are typically meant to be run before re-starting and upgrading to new service code, which is how grenade uses them, but they can also be run as a post-install verify step <verify-install-nova-status> which is how openstack-ansible also uses them.
- Checks must be idempotent so they can be run repeatedly and the results are always based on the latest data. This allows an operator to run the checks, fix any issues reported, and then iterate until the status check no longer reports any issues.
- Checks which cannot easily, or should not, be run within offline database migrations are a good candidate for these CLI-driven checks. For example, instances records are in the cell database and for each instance there should be a corresponding request_specs table entry in the nova_api database. A nova-manage db online_data_migrations routine was added in the Newton release to back-fill request specs for existing instances, and in Rocky an upgrade check was added to make sure all non-deleted instances have a request spec so compatibility code can be removed in Stein. In older releases of nova we would have added a blocker migration as part of the database schema migrations to make sure the online data migrations had been completed before the upgrade could proceed.

  Note: Usage of nova-status upgrade check does not preclude the need for blocker migrations within a given database, but in the case of request specs the check spans multiple databases and was a better fit for the nova-status tooling.
- All checks should have an accompanying upgrade release note.
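The request spec case from the guidelines above can be sketched as follows. This is a minimal illustration, not nova's implementation: the in-memory SQLite databases, the reduced table schemas and the function name count_instances_missing_request_spec are all assumptions made for the example.

```python
import sqlite3


def count_instances_missing_request_spec(cell_db, api_db):
    """Count non-deleted instances with no matching request spec.

    The function only reads data, so it can be re-run after each fix
    and always reflects the latest state (idempotent).
    """
    # Instance records live in the cell database.
    instance_uuids = [
        row[0] for row in cell_db.execute(
            "SELECT uuid FROM instances WHERE deleted = 0")
    ]
    # Request specs live in the API database.
    spec_uuids = {
        row[0] for row in api_db.execute(
            "SELECT instance_uuid FROM request_specs")
    }
    return sum(1 for uuid in instance_uuids if uuid not in spec_uuids)


# Hypothetical stand-ins for one cell database and the nova_api database.
cell_db = sqlite3.connect(":memory:")
cell_db.execute("CREATE TABLE instances (uuid TEXT, deleted INTEGER)")
cell_db.executemany(
    "INSERT INTO instances VALUES (?, ?)",
    [("inst-1", 0), ("inst-2", 0), ("inst-3", 1)])

api_db = sqlite3.connect(":memory:")
api_db.execute("CREATE TABLE request_specs (instance_uuid TEXT)")
api_db.execute("INSERT INTO request_specs VALUES ('inst-1')")

missing = count_instances_missing_request_spec(cell_db, api_db)
```

Because the two tables sit in different databases, a single schema migration could not perform this comparison, which is why a check like this fits the CLI tooling better than a blocker migration.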
Structure
There is no graph logic for checks, meaning each check is meant to be run independently of other checks in the same set. For example, a project could have five checks which run serially but that does not mean the second check in the set depends on the results of the first check in the set, or the third check depends on the second, and so on.
The base framework is fairly simple as can be seen from the initial change. Each
check is registered in the _upgrade_checks variable and the
check method executes each check and records the result.
The most severe result is recorded for the final return code.
Each check returns one of three possible results:
- Success: All upgrade readiness checks passed successfully and there is nothing to do.
- Warning: At least one check encountered an issue and requires further investigation. This is considered a warning but the upgrade may be OK.
- Failure: There was an upgrade status check failure that needs to be investigated. This should be considered something that stops an upgrade.
The UpgradeCheckResult object can carry details when there is a warning
or failure result. The details should generally explain how to resolve
the problem, e.g. that nova-manage db online_data_migrations is
incomplete and needs to be run again.
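The framework described above can be sketched roughly as follows. Only the names _upgrade_checks, check and UpgradeCheckResult, and the return code of 2 for a failure, come from this guide. The UpgradeCheckCode enum, the values 0 and 1, and the two sample check functions are assumptions of this sketch and may differ from the real code.

```python
import enum


class UpgradeCheckCode(enum.IntEnum):
    """Result codes, ordered so that a higher value is more severe."""
    SUCCESS = 0
    WARNING = 1
    FAILURE = 2


class UpgradeCheckResult:
    """Outcome of one check, with optional details on how to fix it."""

    def __init__(self, code, details=None):
        self.code = code
        self.details = details


def check_always_ok():
    # Hypothetical check that finds nothing to do.
    return UpgradeCheckResult(UpgradeCheckCode.SUCCESS)


def check_data_migrations():
    # Hypothetical check that fails and says how to resolve the failure.
    return UpgradeCheckResult(
        UpgradeCheckCode.FAILURE,
        "Run 'nova-manage db online_data_migrations' again.")


# Each entry pairs a display name with an independent check function.
_upgrade_checks = (
    ("Always OK", check_always_ok),
    ("Data migrations", check_data_migrations),
)


def check(checks=_upgrade_checks):
    """Run every check and return (most severe code, per-check results)."""
    return_code = UpgradeCheckCode.SUCCESS
    results = []
    for name, func in checks:
        result = func()
        results.append((name, result))
        # IntEnum members compare as integers, so max() keeps the worst.
        return_code = max(return_code, result.code)
    return return_code, results


return_code, results = check()
```

Each function is called on its own and never sees another check's result, which matches the "no graph logic" rule: reordering or removing an entry in the tuple does not change what the remaining checks report.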
Using the cells v2 check as an example, there are really two checks involved:
- Do the cell0 and cell1 mappings exist?
- Do host mappings exist in the API database if there are compute node records in the cell database?
Failing either check results in a Failure status for
that check and return code of 2 for the overall run.
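A minimal sketch of those two questions, assuming in-memory SQLite databases as stand-ins for the API and cell databases. The reduced schemas, the lookup of mappings by a name column and the function name check_cells_v2 are simplifications made for illustration, not the real implementation.

```python
import sqlite3


def check_cells_v2(api_db, cell_db):
    """Return a (result, details) pair for the two Cells v2 questions."""
    # 1. Do the cell0 and cell1 mappings exist?
    names = {
        row[0] for row in api_db.execute("SELECT name FROM cell_mappings")
    }
    missing = sorted({"cell0", "cell1"} - names)
    if missing:
        return "Failure", "Missing cell mappings: %s." % ", ".join(missing)

    # 2. If the cell has compute nodes, do host mappings exist for them?
    compute_nodes = cell_db.execute(
        "SELECT COUNT(*) FROM compute_nodes").fetchone()[0]
    host_mappings = api_db.execute(
        "SELECT COUNT(*) FROM host_mappings").fetchone()[0]
    if compute_nodes and not host_mappings:
        return "Failure", "Compute nodes exist but no host mappings do."
    return "Success", None


# Hypothetical API database: both cell mappings, no host mappings yet.
api_db = sqlite3.connect(":memory:")
api_db.execute("CREATE TABLE cell_mappings (name TEXT)")
api_db.execute("CREATE TABLE host_mappings (host TEXT)")
api_db.executemany(
    "INSERT INTO cell_mappings VALUES (?)", [("cell0",), ("cell1",)])

# Hypothetical cell database with a single compute node record.
cell_db = sqlite3.connect(":memory:")
cell_db.execute("CREATE TABLE compute_nodes (host TEXT)")
cell_db.execute("INSERT INTO compute_nodes VALUES ('compute-1')")

first_run = check_cells_v2(api_db, cell_db)

# The operator fixes the reported issue, then simply runs the check again.
api_db.execute("INSERT INTO host_mappings VALUES ('compute-1')")
second_run = check_cells_v2(api_db, cell_db)
```

The second call reports success without any reset step, because the check only reads the current contents of the tables.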
The initial placement check
provides an example of a warning response. In that check, if there are
fewer resource providers in Placement than there are compute nodes in
the cell database(s), the deployment may be underutilized because the
nova-scheduler is using the Placement service to determine
candidate hosts for scheduling.
Warning results suit conditions that are expected to be resolved over
the course of a rolling upgrade, e.g. nova-compute being configured to
report resource provider information into the Placement service. These
are things that should be investigated and completed at some point, but
might not cause any immediate failures.
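The warning case described above can be reduced to a comparison of two counts. In this sketch the counts are passed in as plain integers. In a real check they would come from the Placement REST API and the cell database(s). The function name and the message wording are hypothetical.

```python
def check_resource_providers(num_compute_nodes, num_resource_providers):
    """Return a (result, details) pair comparing Placement to the cells."""
    if num_resource_providers < num_compute_nodes:
        # Not fatal: computes may still be rolling over to report into
        # Placement, but scheduling capacity is reduced until they do.
        details = (
            "There are %d compute nodes but only %d resource providers. "
            "Check that each nova-compute service is configured to "
            "report into the Placement service."
            % (num_compute_nodes, num_resource_providers))
        return "Warning", details
    return "Success", None


partial = check_resource_providers(
    num_compute_nodes=3, num_resource_providers=1)
complete = check_resource_providers(
    num_compute_nodes=3, num_resource_providers=3)
```

Here partial carries a Warning with details for the operator, while complete is a plain Success with no details.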
The results feed into a standard output for the checks:
$ nova-status upgrade check
+----------------------------------------------------+
| Upgrade Check Results                              |
+----------------------------------------------------+
| Check: Cells v2                                    |
| Result: Success                                    |
| Details: None                                      |
+----------------------------------------------------+
| Check: Placement API                               |
| Result: Failure                                    |
| Details: There is no placement-api endpoint in the |
|          service catalog.                          |
+----------------------------------------------------+
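For projects building similar tooling, a table in this style can be produced with the standard library alone. This is only a sketch under stated assumptions: render_results and its width parameter are hypothetical names, and the real command may well use a table-formatting library instead.

```python
import textwrap


def render_results(results, width=50):
    """Render (name, result, details) tuples as a bordered text table."""
    border = "+" + "-" * (width + 2) + "+"

    def row(text):
        # Pad every row to the same width so the right border lines up.
        return "| " + text.ljust(width) + " |"

    lines = [border, row("Upgrade Check Results"), border]
    for name, result, details in results:
        lines.append(row("Check: " + name))
        lines.append(row("Result: " + result))
        # Wrap long details and indent continuation lines under the text.
        wrapped = textwrap.wrap(
            "Details: " + str(details), width=width,
            subsequent_indent=" " * len("Details: "))
        lines.extend(row(line) for line in wrapped)
        lines.append(border)
    return "\n".join(lines)


table = render_results([
    ("Cells v2", "Success", None),
    ("Placement API", "Failure",
     "There is no placement-api endpoint in the service catalog."),
])
print(table)
```

Keeping the rendering separate from the checks means each check only has to return a result and optional details, and the output format stays uniform across checks.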
Other
Documentation
Each check should be documented in the history section <nova-status-checks> of the CLI
guide and have a release note. This is important because the checks can
be run in an isolated environment apart from the actual deployed version
of the code. Since the checks should also be idempotent, the history /
change log is the best record of what is being validated.
Backports
Sometimes upgrade checks can be backported to aid in pre-empting bugs on stable branches. For example, a check was added for bug 1759316 in Rocky which was also backported to stable/queens in case anyone upgrading from Pike to Queens would hit the same issue. Backportable checks are generally only made for latent bugs since someone who has already passed checks and upgraded to a given stable branch should not start failing after a patch release on that same branch. For this reason, any check being backported should have a release note with it.
Other projects
A community-wide goal
for the Stein release is adding the same type of
$PROJECT-status upgrade check tooling to other projects to
make upgrading OpenStack easier across the board. So while the
guidelines in this document are primarily specific to nova, they should
apply generically to other projects wishing to incorporate the same
tooling.