Online schema migration

bp online-schema-migration

Future incompatible changes in sqlalchemy migrations, such as removing or renaming columns and tables, can break rolling upgrades (upgrades during which multiple keystone instances run simultaneously at different versions).

This spec solves one zero-downtime related issue inside keystone. We want it to be a step forward in providing near zero downtime API availability during a rolling upgrade. Please be aware that the underlying persistence layer may not support zero-downtime schema-altering operations, in which case a table may be locked and inaccessible during a database migration.

Support for rolling upgrades is a key theme for the Mitaka cycle [1]. There are also new "assert" tags that could be added to keystone in the future [2].

This spec is a statement of commitment by the keystone reviewers to support rolling upgrades with near zero downtime. The upgrade can be performed by rolling the new release across a cluster of nodes, upgrading each node one-by-one.

Problem Description

Currently, the schema is migrated before the new release is run, so that it is compatible with the new code version. We don't limit developers' choice of permissible schema updates. However, certain operations, like table/column drops and renames, can make the database incompatible with the older release. Data migration may take a very long time when a table contains thousands or millions of entries, making the upgrade time-consuming, which can become unacceptable if it requires that the old release be stopped. Also, going back to the previous release always means restoring from backup.

Keystone was mentioned at the summit in the ops live upgrades session as one of the projects which are closest to being zero-downtime [3]. Even though one of the operators stated that he upgraded keystone with near zero downtime, this is impossible across major releases due to schema changes which cause incompatibilities between versions.

Proposed Change

Queries generated by SQLAlchemy expect the fields defined in its models to be present in the database. To keep the database in sync, schema migration is executed before the new version of the code is run. The key concept of a rolling upgrade is that two versions may run at the same time, so we want to keep structures which are used by the previous version in place. New structures are ignored by the old version, because it still contains the old SQLAlchemy models, while the new code can support new additive structures, like a new column with a default value. Old columns and tables can only be removed when they are no longer used by both the current and the previous release (which may still be running). We will only support keeping compatibility with one previous release, to reduce the complexity of implementing migrations. Upgrades have to be applied incrementally, one release at a time, with only two adjacent versions running simultaneously.
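
As an illustration, an additive change of this kind could look like the following sqlalchemy-migrate script. This is a sketch only: the table and column names are hypothetical, and the changeset extension (which provides create_column()) is assumed to be active, as it is when scripts are run by the migrate framework::

  import sqlalchemy as sql


  def upgrade(migrate_engine):
      meta = sql.MetaData()
      meta.bind = migrate_engine
      # Load the existing table definition from the database.
      user = sql.Table('user', meta, autoload=True)
      # Additive only: a nullable column (or one with a server default)
      # is simply ignored by the old release, which still uses the old
      # models.
      user.create_column(sql.Column('enabled_v2', sql.Boolean,
                                    nullable=True))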

To address the problem of schema incompatibilities between versions, we can ban schema changes which cause those incompatibilities, specifically drops and alters. There is already a solution for this in nova [4].

The test is surrounded by comments whose direct wording signals code reviewers to take into consideration the impact on upgradeability.

The unit test blocks all alters and drops, but also contains a list of migrations where we allow altering and dropping things.

The rules for adding exceptions are very specific:

  1. Migrations which don't cause incompatibilities are allowed, for example dropping an index or constraint.
  2. Migrations removing structures not used in the previous version are allowed (we keep compatibility between releases), e.g.:
    a. the feature is deprecated according to the deprecation policies (release 1),
    b. code supporting the feature is removed in the following release (release 2),
    c. the table can be dropped a release after the code has been removed (i.e. in release 3).
  3. Any other changes which don't pass this test are disallowed.

We start requiring changes to be additive in Mitaka, so the test ignores all migrations before that point.
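
A minimal sketch of such a test, modeled on the nova test referenced above [4]; the class name, migration numbers and the release cutoff below are illustrative assumptions, not keystone's actual code::

  import fixtures

  # First migration of the release from which additive-only rules apply.
  MITAKA_FIRST_MIGRATION = 76        # hypothetical number
  BANNED_OP_EXCEPTIONS = [80, 85]    # hypothetical grandfathered migrations


  class BannedDBSchemaOperations(fixtures.Fixture):
      """Make Table/Column drop() and alter() raise while active."""

      def __init__(self, banned_resources):
          super(BannedDBSchemaOperations, self).__init__()
          self._banned_resources = banned_resources

      @staticmethod
      def _explode(resource, op):
          raise Exception('%s.%s() is not allowed in a database migration'
                          % (resource, op))

      def setUp(self):
          super(BannedDBSchemaOperations, self).setUp()

          def banned(resource, op):
              # Bind per iteration to avoid the late-binding closure
              # pitfall, so the error names the right operation.
              return lambda *a, **kw: self._explode(resource, op)

          for resource in self._banned_resources:
              for op in ('drop', 'alter'):
                  self.useFixture(fixtures.MonkeyPatch(
                      'sqlalchemy.%s.%s' % (resource, op),
                      banned(resource, op)))


  def migrate_up(apply_migration, version):
      """Apply one migration, banning drops/alters where rules require."""
      if (version >= MITAKA_FIRST_MIGRATION
              and version not in BANNED_OP_EXCEPTIONS):
          with BannedDBSchemaOperations(['Table', 'Column']):
              apply_migration(version)
      else:
          apply_migration(version)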

It is important to note that this unit test will not catch all issues which may still be introduced, like adding a column without a default, or adding a foreign key which cannot be handled by the older version. Data interpretation incompatibilities may also be introduced without altering the schema. This is why we are also introducing a grenade test, which will attempt to catch other errors at the CI gate.

Some types of changes, like changing the format in which data is stored, may require live migration of data. Since there are no ready-made solutions, it is up to the developer to decide how this is achieved. The data migration may be started automatically when the new version is introduced, while an old version is still running, and finished with a migration script in the next (third) release. It could also be scripted, with information on how to do it in the release notes. A suggested migration would happen in the following phases:

  a. A new column, with the new format, is created in a schema upgrade script. Because the old version doesn't know about it, the old column is retained and data is written to both columns. The data should be read only from the old column at this point, especially if the row can be updated by the old version, which may still be running (see the sketch after this list).
  b. The next version contains an upgrade script which migrates the rest of the data from the old column into the new one, i.e. any data which wasn't migrated in the course of normally running the service. If the data was read only from the old column, this release should read data from the new column and, as before, write to both (to stay backwards compatible).
  c. In the following release (once the previous release reads data from the new column), the old column can be removed from the SQLAlchemy models and is no longer used.
  d. After it is no longer used, a migration script can remove the old column.
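
A minimal sketch of the dual-write pattern from step a); the column names and the conversion helper are hypothetical::

  def _to_new_format(value):
      # Placeholder for the real format conversion.
      return value.upper()


  def update_row(session, row, value):
      row.old_col = value                  # the old release still reads this
      row.new_col = _to_new_format(value)  # ignored by the old release
      session.add(row)


  def read_row(row):
      # Step a): read only from the old column, because the row may still
      # be updated by the previous release, which knows nothing of new_col.
      return row.old_col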

The first two steps can be squashed into one if logic is written to distinguish whether a row was updated by the old or the new version in step a), or if rows are not updated after insertion.

Step b) can also be implemented in the same release, by providing a configuration option which switches where the data is read from. In cases when a table is no longer used, it can be dropped in the next release.

Before removing columns and tables, a sanity check should be performed to ensure that all data has been migrated.
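
For example, the check could be as simple as counting rows that still hold unmigrated data; a sketch, reusing the hypothetical old_col/new_col pair from above::

  from sqlalchemy import text


  def assert_migration_complete(engine):
      with engine.connect() as conn:
          remaining = conn.execute(text(
              "SELECT COUNT(*) FROM example_table "
              "WHERE old_col IS NOT NULL AND new_col IS NULL")).scalar()
      if remaining:
          raise RuntimeError('%d rows still hold unmigrated data; '
                             'refusing to drop old_col' % remaining)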

Alternatives

Currently, keystone operators run migrations before running the new version. We could instead introduce two-phase migrations like in Neutron [5]. The approach is to organize schema migration scripts into "expand" and "contract" phases that are also linked to major release versions in a backwards compatible way. The "expand" phase is run before the upgrade, and the "contract" phase can be executed after all the services are running the new version.

The last two steps of the suggested migration (c) and d) above) could be squashed into one if two-phase migrations were implemented; the last step would then run in the contract phase of the same release.
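
A sketch of how the two phases might be organized; the function names and table layout are illustrative, loosely following neutron's approach, and keystone does not currently implement this::

  import sqlalchemy as sql


  def expand(migrate_engine):
      # Run before the rolling upgrade starts: additive changes only.
      meta = sql.MetaData()
      meta.bind = migrate_engine
      table = sql.Table('example_table', meta, autoload=True)
      table.create_column(sql.Column('new_col', sql.String(64),
                                     nullable=True))


  def contract(migrate_engine):
      # Run once all services are on the new release: drops are now safe.
      meta = sql.MetaData()
      meta.bind = migrate_engine
      table = sql.Table('example_table', meta, autoload=True)
      table.drop_column('old_col')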

A similar approach to two-phase migrations, taken by nova [6], was deemed experimental and will be removed in Mitaka [7].

Security Impact

The code which performs migrations may become more complicated, since they are done online. We have to ensure that updates are correctly written to both places and read from the correct one whenever we want to maintain compatibility.

Notifications Impact

None

Other End User Impact

None

Performance Impact

The new way of doing schema changes entails that some data migrations will have to be done online (for example, we would have to maintain data in two places before a migration is finished), which could impact performance. On the other hand, performance suffers far more when the service has to be taken down entirely to perform the upgrade.

Other Deployer Impact

Ability to perform online schema migration will have a large and positive impact on deployment.

Developer Impact

Currently, we don't limit developers' choice of permissible schema updates. This change proposes a unit test and a grenade test that will limit the changes that can be made in one release - changes will have to be split across releases. Still, patches could be added to the exceptions list, with a proper release note notifying the operator about the need for, and scope of, downtime.

Implementation

Assignee(s)

Primary assignee:

xek

Work Items

  • Prepare a unit test blocking alters and drops in SQL migrations.
  • Prepare developer documentation with examples.
  • Help review patches which fall over on this test.
  • Add a grenade CI test, with altering requests sent to two keystone instances at different versions.

Dependencies

None

Documentation Impact

Developer documentation with examples will be added.

References


  1. https://etherpad.openstack.org/p/mitaka-crossproject-themes

  2. http://permalink.gmane.org/gmane.comp.cloud.openstack.devel/69083

  3. https://etherpad.openstack.org/p/TYO-ops-upgrades

  4. https://github.com/openstack/nova/blob/stable/liberty/nova/tests/unit/db/test_migrations.py#L224-L225

  5. https://blueprints.launchpad.net/neutron/+spec/online-schema-migrations

  6. https://blueprints.launchpad.net/nova/+spec/online-schema-changes

  7. https://etherpad.openstack.org/p/mitaka-nova-upgrade