Online schema migration

Specification blueprint online-schema-migration Change-Id: I657caeaebfc63f59bd7f380bf571859d4d336d46
2015-11-13 14:38:17 +01:00
parent bea334c2dc
commit 7fcdc8625f
1 changed files with 245 additions and 0 deletions
--- a/specs/mitaka/online-schema-migration.rst
+++ b/specs/mitaka/online-schema-migration.rst
@@ -0,0 +1,245 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=======================
+Online schema migration
+=======================
+
+`bp online-schema-migration <https://blueprints.launchpad.net/keystone/+spec/online-schema-migration>`_
+
+
+Incompatible changes in future SQLAlchemy migrations, such as removing or
+renaming columns and tables, can break rolling upgrades (upgrades in which
+multiple keystone instances run simultaneously at different versions).
+
+This spec solves one zero-downtime issue inside keystone. We want it to be a
+step forward in providing near-zero-downtime API availability during a
+rolling upgrade. Please be aware that the persistence layer in use may not
+support zero-downtime schema-altering operations, in which case a table is
+locked and inaccessible during the database migration.
+
+Support for rolling upgrades is a key theme for the Mitaka cycle [1]_. There
+are also new "assert" tags that could be added to keystone in the future [2]_.
+
+This spec is a statement of commitment by the keystone reviewers to support
+rolling upgrades with near zero downtime. The upgrade can be performed by
+rolling the new release across a cluster of nodes, upgrading each node
+one-by-one.
+
+
+Problem Description
+===================
+
+Currently, the schema is migrated before the new release is started, so that
+it is compatible with the new code version. We don't limit the schema updates
+that developers may make. However, certain operations, like table/column drops
+and renames, can make the database incompatible with the older release. Data
+migration may take very long when a table contains thousands or millions of
+entries, making the upgrade time-consuming, which becomes unacceptable if the
+old release must be stopped for its duration. Also, going back to the previous
+release always means restoring from backup.
+
+Keystone was mentioned at the summit in the ops live upgrades session as one
+of the projects which are the closest to being zero-downtime [3]_. Even though
+one of the operators stated that he upgraded keystone with near zero downtime,
+this is impossible across major releases due to schema changes which cause
+incompatibilities between versions.
+
+
+Proposed Change
+===============
+
+Queries generated by SQLAlchemy expect the fields defined in its models to be
+present in the DB. To keep the database in sync, schema migration is executed
+before the new version of the code is run. The key concept of a rolling upgrade
+is that two versions may run at the same time, so we want to keep structures
+which are used by the previous version in place. New structures are ignored by
+the old version, because it still contains the old SQLAlchemy models, while the
+new code can rely on additive changes, like a new column with a default
+value. Old columns and tables can be removed only when they are no longer used
+by either the current or the previous release (which may still be running). We
+will only support compatibility with one previous release, to reduce the
+complexity of implementing migrations. Upgrades have to be applied
+incrementally, one release at a time, with only two adjacent versions running
+simultaneously.
+
+To address the problem of schema incompatibilities between versions, we can ban
+the schema changes which cause them, specifically drops and alters. Nova
+already has a unit test which enforces such a ban [4]_.
+
+The test is surrounded by comments with direct wording that signals code
+reviewers to take into consideration the impact on upgradeability.
+
+The unit test blocks all alters and drops, but also contains a list of
+migrations where we allow altering and dropping things.
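+
+A minimal sketch of how such a test could ban operations while still honouring
+an exceptions list is shown below. It is not the nova code: the ``Table``
+class is a hypothetical stand-in for the migration library's table object, and
+``EXCEPTIONS``, ``banned_operations`` and ``run_migration`` are names invented
+for this illustration.
+
+.. code-block:: python
+
+    import contextlib
+    from unittest import mock
+
+
+    class BannedSchemaOperation(Exception):
+        """Raised when a migration performs a banned drop or alter."""
+
+
+    class Table:
+        """Hypothetical stand-in for a migration library's table object."""
+
+        def __init__(self, name):
+            self.name = name
+
+        def create(self):
+            return "created %s" % self.name
+
+        def drop(self):
+            return "dropped %s" % self.name
+
+
+    def _banned(operation):
+        """Build a replacement method that always refuses to run."""
+        def explode(*args, **kwargs):
+            raise BannedSchemaOperation(
+                "%s is not allowed; see the rules for adding exceptions"
+                % operation)
+        return explode
+
+
+    # Hypothetical exceptions list: migration versions allowed to drop.
+    EXCEPTIONS = frozenset([75])
+
+
+    @contextlib.contextmanager
+    def banned_operations(version):
+        """Forbid Table.drop while a non-exempt migration runs."""
+        if version in EXCEPTIONS:
+            yield
+            return
+        with mock.patch.object(Table, "drop", _banned("Table.drop")):
+            yield
+
+
+    def run_migration(version, upgrade):
+        """Run one migration's upgrade callable under the ban."""
+        with banned_operations(version):
+            return upgrade()
+
+Under these assumptions, a migration that only creates tables passes, a
+migration listed in ``EXCEPTIONS`` may drop a table, and any other migration
+that calls ``drop`` fails the test with ``BannedSchemaOperation``.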
+
+The rules for adding exceptions are very specific:
+
+1) Migrations which don't cause incompatibilities are allowed, for example
+   dropping an index or constraint.
+2) Migrations removing structures not used in the previous version are allowed
+   (we keep compatibility between releases), for example:
+
+   a) feature is deprecated according to the deprecation policies (release 1),
+   b) code supporting the feature is removed the following release (release 2),
+   c) table can be dropped a release after the code has been removed (i.e. in
+      release 3).
+
+3) Any other changes which don't pass this test are disallowed.
+
+We start requiring things to be additive in Mitaka, so we ignore all migrations
+before that point.
+
+It is important to note that this unit test will not catch every issue which
+may be introduced, like adding a column without a default, or adding a
+foreign key which cannot be handled by the older version. Data interpretation
+incompatibilities may be introduced without altering the schema. This is why
+we are also introducing a grenade test, which will attempt to catch other
+errors at the CI gate.
+
+Some types of changes, like changing the format in which data is stored, may
+require live migration of data. Since there are no ready-made solutions, it is
+up to the developer to decide how this is achieved. The data migration may be
+started automatically when the new version is introduced, while an old version
+is still running, and finished with a migration script in the next (third)
+release. It could also be scripted, with instructions in the release notes. A
+suggested migration would happen in four steps:
+
+a) A new column, with a new format is created in a schema upgrade script.
+   Because the old version doesn't know about it, the old column is retained
+   and data is written to both columns. The data should be read only from the
+   old column at this point, especially if the row can be updated by the old
+   version, which may still be running.
+b) The next version contains an upgrade script which migrates the rest of the
+   data from the old column into the new one, that is, any data which wasn't
+   migrated in the process of normally running the service. If the previous
+   release read data only from the old column, this release should read from
+   the new column and, as before, write to both (to be backwards compatible).
+c) In the following release (provided the previous release reads data from
+   the new column), the old column can be removed from the SQLAlchemy models
+   and is no longer used.
+d) After it is no longer used, a migration script can remove the old column.
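+
+The first two steps can be sketched with the standard library's ``sqlite3``
+module. The ``project`` table, its ``enabled_text`` and ``enabled`` columns and
+all function names are hypothetical, and the backfill treats the old column as
+authoritative, which is an assumption of this sketch rather than a requirement
+of the spec.
+
+.. code-block:: python
+
+    import sqlite3
+
+    conn = sqlite3.connect(":memory:")
+    # Schema as the old release (N-1) knows it; all names are hypothetical.
+    conn.execute(
+        "CREATE TABLE project (id TEXT PRIMARY KEY, enabled_text TEXT)")
+    conn.execute("INSERT INTO project VALUES ('p1', 'true')")
+
+    # Step a), release N: an additive migration; the old column stays.
+    conn.execute("ALTER TABLE project ADD COLUMN enabled INTEGER")
+
+
+    def write_enabled(project_id, enabled):
+        """Releases N and N+1 write both formats."""
+        conn.execute(
+            "UPDATE project SET enabled_text = ?, enabled = ? WHERE id = ?",
+            ("true" if enabled else "false", int(enabled), project_id))
+
+
+    def read_enabled_release_n(project_id):
+        """Release N reads the old column, which N-1 may still update."""
+        row = conn.execute(
+            "SELECT enabled_text FROM project WHERE id = ?",
+            (project_id,)).fetchone()
+        return row[0] == "true"
+
+
+    def backfill():
+        """Step b), release N+1: copy the old column into the new one.
+
+        Every row is rewritten, not only rows where the new column is
+        NULL, because release N-1 may have updated the old column after
+        release N wrote both.
+        """
+        conn.execute(
+            "UPDATE project SET enabled = (enabled_text = 'true')")
+
+
+    def read_enabled_release_n1(project_id):
+        """Release N+1 reads the new column once the backfill has run."""
+        row = conn.execute(
+            "SELECT enabled FROM project WHERE id = ?",
+            (project_id,)).fetchone()
+        return bool(row[0])
+
+Steps c) and d) then only remove code and schema: the model stops referring to
+``enabled_text``, and a later migration drops the column.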
+
+The first two steps can be squashed into one, if logic is written to
+distinguish situations where the columns are updated by the old or the new
+version in step one, or if the row is not updated after inserting.
+
+Step b) can also be implemented in the same release, by providing a
+configuration option which switches the place from which the data is read. When
+a table is no longer used, it can be dropped in the next release.
+
+Before removing columns and tables, a sanity check should be performed,
+ensuring all data was migrated.
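+
+One possible shape for such a sanity check, again using ``sqlite3`` for
+illustration, is to count the rows which have a value in the old column but
+none in the new one and to abort if any remain. The function names and the
+table and column names passed to them are hypothetical.
+
+.. code-block:: python
+
+    import sqlite3
+
+
+    def unmigrated_rows(conn, table, old_column, new_column):
+        """Count rows with a value in the old column but not the new one."""
+        # Identifiers come from the migration script itself, never from
+        # user input, so plain string formatting is acceptable here.
+        query = (
+            "SELECT COUNT(*) FROM %s WHERE %s IS NOT NULL AND %s IS NULL"
+            % (table, old_column, new_column))
+        return conn.execute(query).fetchone()[0]
+
+
+    def check_safe_to_drop(conn, table, old_column, new_column):
+        """Abort the removing migration if any data would be lost."""
+        remaining = unmigrated_rows(conn, table, old_column, new_column)
+        if remaining:
+            raise RuntimeError(
+                "%d rows in %s still need migrating" % (remaining, table))
+
+A migration which removes a column would call ``check_safe_to_drop`` first and
+proceed only if it returns without raising.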
+
+
+Alternatives
+------------
+
+Currently, keystone operators run migrations before running the new version.
+We can introduce two-phase migrations like in Neutron [5]_. The approach is to
+organize schema migration scripts into "expand" and "contract" phases that are
+also linked to major release versions in a backwards compatible way. The
+"expand" phase is run before the upgrade, and the "contract" phase can be
+executed after all the services are running with the new version.
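+
+As a rough illustration of the split, the sketch below sorts a list of schema
+operations into the two phases. The operation names and their classification
+are assumptions made for this example and are not Neutron's actual rules;
+anything unclassified is rejected so that a reviewer has to decide.
+
+.. code-block:: python
+
+    # Hypothetical classification of schema operations.
+    EXPAND_OPS = frozenset(["create_table", "add_column"])
+    CONTRACT_OPS = frozenset(["drop_table", "drop_column"])
+
+
+    def split_phases(operations):
+        """Split (op_name, target) pairs into expand and contract lists."""
+        expand, contract = [], []
+        for op_name, target in operations:
+            if op_name in EXPAND_OPS:
+                expand.append((op_name, target))
+            elif op_name in CONTRACT_OPS:
+                contract.append((op_name, target))
+            else:
+                raise ValueError("unclassified operation: %s" % op_name)
+        return expand, contract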
+
+The last two steps (above) could be squashed into one, if two-phase migrations
+are implemented. In this case, the last step can be run in the contract phase
+of the same release.
+
+A similar approach to two-phase migrations, taken by nova [6]_, was deemed
+experimental and will be removed in Mitaka [7]_.
+
+
+Security Impact
+---------------
+
+The code which performs migrations may become more complicated, since
+migrations are done online. To maintain compatibility, we have to ensure that
+updates are made correctly to both places and that data is read correctly.
+
+
+Notifications Impact
+--------------------
+
+None
+
+
+Other End User Impact
+---------------------
+
+None
+
+
+Performance Impact
+------------------
+
+The new way of doing schema changes means that some data migrations will
+have to be done online (for example, we will have to maintain data in two
+places until the migration is finished), which could impact performance. On the
+other hand, performance is impacted far more when the service is down for
+the upgrade.
+
+
+Other Deployer Impact
+---------------------
+
+Ability to perform online schema migration will have a large and positive
+impact on deployment.
+
+Developer Impact
+----------------
+
+Currently, we don't limit the schema updates that developers may make. This
+change proposes a unit test and a grenade test that will limit the changes
+that can be made in one release: changes will have to be split between
+releases. Still, patches could be added to the exceptions list, with a
+release note notifying the operator about the need for and scope of downtime.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  xek
+
+
+Work Items
+----------
+
+* Prepare a unit test blocking alters and drops in SQL migrations.
+* Prepare developer documentation with examples.
+* Help review patches which fail this test.
+* Add a grenade CI test, with altering requests sent to two keystone instances
+  at different versions.
+
+
+Dependencies
+============
+
+None
+
+
+Documentation Impact
+====================
+
+Developer documentation with examples will be added.
+
+
+References
+==========
+
+.. [1] https://etherpad.openstack.org/p/mitaka-crossproject-themes
+.. [2] http://permalink.gmane.org/gmane.comp.cloud.openstack.devel/69083
+.. [3] https://etherpad.openstack.org/p/TYO-ops-upgrades
+.. [4] https://github.com/openstack/nova/blob/stable/liberty/nova/tests/unit/db/test_migrations.py#L224-L225
+.. [5] https://blueprints.launchpad.net/neutron/+spec/online-schema-migrations
+.. [6] https://blueprints.launchpad.net/nova/+spec/online-schema-changes
+.. [7] https://etherpad.openstack.org/p/mitaka-nova-upgrade