From 1e00e8564b0b2b4ff985c659b31ed7d8063b784d Mon Sep 17 00:00:00 2001
From: Alistair Coles
Date: Wed, 4 Feb 2015 14:48:37 +0000
Subject: [PATCH] Increasing ring partition power

This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.

Change-Id: I22a0e4c12dbfa38760fae58c60fa782e530e4f77
---
 .../increasing_partition_power.rst | 449 ++++++++++++++++++
 1 file changed, 449 insertions(+)
 create mode 100644 specs/in_progress/increasing_partition_power.rst

diff --git a/specs/in_progress/increasing_partition_power.rst b/specs/in_progress/increasing_partition_power.rst
new file mode 100644
index 0000000..d19162c
--- /dev/null
+++ b/specs/in_progress/increasing_partition_power.rst
@@ -0,0 +1,449 @@
::

    This work is licensed under a Creative Commons Attribution 3.0
    Unported License.
    http://creativecommons.org/licenses/by/3.0/legalcode


===============================
Increasing ring partition power
===============================

This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.

Swift operators sometimes pick a ring partition power when deploying swift
and later wish to change the partition power:

 #. The operator chooses a partition power that proves to be too small and
    subsequently constrains their ability to rebalance a growing cluster.
 #. Perhaps more likely, in an attempt to avoid the above problem, operators
    choose a partition power that proves to be unnecessarily large and would
    subsequently like to reduce it.

This proposal directly addresses the first problem by enabling partition power
to be increased. Although it does not directly address the second problem
(i.e. it does not enable ring power reduction), it does indirectly help to
avoid that problem by removing the motivation to choose a large partition
power when first deploying a cluster.

Problem Description
===================

The ring's partition power determines the partition to which a resource
(account, container or object) is mapped. The partition is included in the
path under which the resource is stored in a backend filesystem. Changing the
partition power therefore requires relocating resources to new paths in
backend filesystems.

In a heavily populated cluster such a relocation process could be
time-consuming, so to avoid downtime it is desirable to relocate resources
while the cluster is still operating. However, it is necessary to do so
without (temporary) loss of access to data and without compromising the
performance of processes such as replication and auditing.

Proposed Change
===============

Overview
--------

The proposed solution avoids copying any file contents during a partition
power change. Objects are 'moved' from their current partition to a new
partition, but the current and new partitions are arranged to be on the same
device, so the 'move' is achieved using filesystem links without copying data.

(It may well be that the motivation for increasing partition power is to allow
a rebalancing of the ring. Any rebalancing would occur after the partition
power increase has completed - during a partition power change the ring
balance is not changed.)
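The same-device property follows from how partitions are derived from object
hashes. Here is a minimal sketch of the split (Python 3; swift's real hash
also mixes in the configured hash path prefix/suffix, which is omitted
here)::

    import hashlib

    def partition(name, part_power):
        # top `part_power` bits of a 32-bit hash of the object name
        digest = hashlib.md5(name.encode('utf8')).digest()
        hash32 = int.from_bytes(digest[:4], 'big')
        return hash32 >> (32 - part_power)

    P = 10
    name = 'AUTH_test/container/object'
    old_part = partition(name, P)
    new_part = partition(name, P + 1)

    # Increasing the power by one exposes one more hash bit, so partition X
    # always splits into 2X or 2X+1; if the new ring maps both to the devices
    # that held X, the object is already on the right device and only a new
    # filesystem link is needed.
    assert new_part in (2 * old_part, 2 * old_part + 1)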
To allow the cluster to continue operating during a partition power change (in
particular, to avoid any disruption or incorrect behavior of the replicator
and auditor processes), new partition directories are created in a separate
filesystem branch from the current partition directories. When all new
partition directories have been populated, the ring transitions to using the
new filesystem branch.

During this transition, object servers maintain links to resource files from
both the current and new partition directories. However, as already discussed,
no file content is duplicated or copied. The old partition directories are
eventually deleted.

Detailed description
--------------------

The process of changing a ring's partition power comprises three phases:

1. Preparation - during this phase the current partition directories continue
   to be used, but existing resources are also linked to new partition
   directories in anticipation of the new ring partition power.

2. Switchover - during this phase the ring transitions to using the new
   partition directories; proxy and backend servers roll over to using the
   new ring partition power.

3. Cleanup - once all servers are using the new ring partition power,
   resource files in old partition directories are removed.

For simplicity, we describe the details of each phase in terms of an object
ring, but note that the same process can be applied to account and container
rings and servers.

Preparation phase
^^^^^^^^^^^^^^^^^

During the preparation phase two new attributes are set in the ring file:

 * the ring's `epoch`: if not already set, a new `epoch` attribute is added to
   the ring. The ring epoch is used to determine the parent directory for
   partition directories. Similar to the way in which a ring's policy index is
   appended to the `objects` directory name, the epoch will be prefixed to the
   `objects` directory name. For simplicity, the ring epoch will be a
   monotonically increasing integer starting at 0. A 'legacy' ring having no
   epoch attribute will be treated as having epoch 0.

 * the `next_part_power` attribute indicates the partition power that will be
   used in the next epoch of the ring. The `next_part_power` attribute is used
   during the preparation phase to determine the partition directory in which
   an object should be stored in the next epoch of the ring.

At this point in time no other changes are made to the ring file: the current
partition power and the mapping of partitions to devices are unchanged.

The updated ring file is distributed to all servers. During this preparation
phase, proxy servers will continue to use the current ring partition mapping
to determine the backend URL for objects. Object servers, along with
replicator and auditor processes, also continue to use the current ring
parameters. However, during PUT and DELETE operations object servers will
create additional links to object files in the object's future partition
directory, in preparation for an eventual switchover to the ring's next
epoch. This does not require any additional copying or writing of object
contents.
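For illustration, the extra link might be created by a helper such as the
following sketch (hypothetical function and argument names, not swift's
actual diskfile code; the future partition directory layout it assumes is
derived as described next)::

    import os

    def link_into_future_partition(current_path, device_path, next_epoch,
                                   future_part, suffix, hsh):
        # After writing `current_path` during a PUT (or writing a tombstone
        # during a DELETE), also hard-link it under the future partition
        # directory; the link shares the inode, so no content is copied.
        future_dir = os.path.join(
            device_path,                    # e.g. /srv/node/sdb1
            '%d-objects' % next_epoch,      # epoch-prefixed datadir
            str(future_part), suffix, hsh)
        os.makedirs(future_dir, exist_ok=True)
        future_path = os.path.join(future_dir,
                                   os.path.basename(current_path))
        try:
            os.link(current_path, future_path)
        except FileExistsError:
            pass  # already linked by the relinker or an earlier request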
The filesystem path for future partition directories is determined as follows.
In general, the path to an object file on an object server's filesystem has
the form::

    <dev>/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>/<ts>.<ext>

where:

 * `epoch` is the ring's epoch, if non-zero
 * `policy` is the object container's policy index, if non-zero
 * `dev` is the device to which `partition` is mapped by the ring file
 * `partition` is the object's partition,
   calculated using `partition = F(hash) >> (32 - P)`,
   where `P` is the ring partition power
 * `suffix` is the last three characters of `hash`
 * `hash` is a hash of the object name
 * `ts` is the object timestamp
 * `ext` is the filename extension (`data`, `meta` or `ts`)

Given `next_part_power` and `epoch` in the ring file, it is possible to
calculate::

    future_partition = F(hash) >> (32 - next_part_power)
    next_epoch = epoch + 1

The future partition directory is then::

    <dev>/<next_epoch>-objects[-<policy>]/<future_partition>/<suffix>/<hash>/<ts>.<ext>

For example, consider a ring in its first epoch, with current partition power
P, containing an object currently in partition X, where 0 <= X < 2**P. If the
partition power increases by 1 (doubling the number of partitions), the
object's future partition will be either 2X or 2X+1 in the ring's next epoch.
During a DELETE an additional filesystem link will be created at one of::

    <dev>/1-objects/<2X>/<suffix>/<hash>/<ts>.ts
    <dev>/1-objects/<2X+1>/<suffix>/<hash>/<ts>.ts

Once object servers are known to be using the updated ring file, a new
relinker process is started. The relinker prepares an object server's
filesystem for a partition power change by crawling the filesystem and linking
existing objects to future partition directories. The relinker determines each
object's future partition directory in the same way as described above for the
object server.

The relinker does not remove links from current partition directories. Once
the relinker has successfully completed, every existing object should be
linked from both a current partition directory and a future partition
directory. Any subsequent object PUTs or DELETEs will be reflected in both the
current and future partition directories as described above.

To avoid newly created objects being 'lost', it is important that an object
server is using the updated ring file before the relinker process starts, in
order to guarantee that either the object server or the relinker creates
future partition links for every object. This may require object servers to be
restarted prior to the relinker process being started, or to otherwise report
that they have reloaded the ring file.

The relinker will report successful completion in a file
`/var/cache/swift/relinker.recon` that can be queried via (modified) recon
middleware.

Once the relinker process has successfully completed on all object servers,
the partition power change process may move on to the switchover phase.

Switchover phase
^^^^^^^^^^^^^^^^

To begin the switchover to using the next partition power, the ring file is
updated once more:

 * the current partition power is stored as `previous_part_power`
 * the current partition power is set to `next_part_power`
 * `next_part_power` is set to None
 * the ring's `epoch` is incremented
 * the mapping of partitions to devices is re-created so that partitions 2X
   and 2X+1 map to the same devices to which partition X was mapped in the
   previous epoch. This is a simple transformation (see the sketch below).
   Since no object content is moved between devices, the actual ring balance
   remains unchanged.

The updated ring file is then distributed to all proxy and object servers.
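The re-created partition mapping is a mechanical doubling of each assignment
row. A sketch, assuming the ring's replica-to-partition-to-device table is a
list holding one array of device ids per replica, indexed by partition (as in
swift's RingData)::

    from array import array

    def double_partitions(replica2part2dev, part_power):
        # Build the next epoch's table so that partitions 2X and 2X+1 map
        # to the same device that held partition X; no object content needs
        # to move because every object stays on its current device.
        new_table = []
        for part2dev in replica2part2dev:
            # repeat each device entry twice: new[2X] = new[2X+1] = old[X]
            doubled = array('H', (dev for dev in part2dev for _ in (0, 1)))
            new_table.append(doubled)
        return new_table, part_power + 1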
Since ring file distribution and loading is not instantaneous, there is a
window of time during which a proxy server may direct object requests to
either an old partition or a current partition (note that the partitions
previously referred to as 'future' are now referred to as 'current'). Object
servers will therefore create additional filesystem links during PUT and
DELETE requests, pointing from old partition directories to files in the
current partition directories. The paths to the old partition directories are
determined in the same way as future partition directories were determined
during the preparation phase, but now using the `previous_part_power` and
decrementing the current ring `epoch`.

This means that if one proxy PUTs an object using a current partition, and
another proxy subsequently attempts to GET the object using the old partition,
the object will be found, since both current and old partitions map to the
same device. Similarly, if one proxy PUTs an object using the old partition
and another proxy then GETs the object using the current partition, the object
will be found in the current partition on the object server.

The object auditor and replicator processes are restarted to force reloading
of the ring file, after which they operate using the current ring parameters.

Cleanup phase
^^^^^^^^^^^^^

The cleanup phase may start once all servers are known to be using the updated
ring file. Once again, this may require servers to be restarted or to report
that they have reloaded the ring file during switchover.

A final update is made to the ring file: the `previous_part_power` attribute
is set to `None` and the ring file is once again distributed. Once object
servers have reloaded the updated ring file they will cease to create object
file links in old partition directories.

At this point the old partition directories may be deleted - there is no need
to create tombstone files when deleting objects in the old partitions since
these partition directories are no longer used by any swift process.

A cleanup process will crawl the filesystem and delete any partition
directories that are not part of the current epoch or a future epoch. This
cleanup process should repeat periodically in case any devices that were
offline during the partition power change come back online - the old epoch
partition directories discovered on those devices may be deleted. Normal
replication may cause current epoch partition directories to be created on a
resurrected disk.

(The cleanup function could be added to an existing process such as the
auditor.)
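A sketch of such a cleanup crawl, assuming epoch-prefixed datadir names of the
form `<epoch>-objects[-<policy>]` as described above (the directory parsing
here is illustrative, not swift's actual code)::

    import os
    import re
    import shutil

    # matches e.g. 'objects', '1-objects' and '2-objects-1'
    DATADIR_RE = re.compile(r'^(?:(?P<epoch>\d+)-)?objects(?:-\d+)?$')

    def cleanup_old_epochs(devices_root, current_epoch):
        # Remove datadirs belonging to epochs older than the current one.
        # No swift process uses these links any more, so whole partition
        # directories can be removed without writing tombstones.
        for dev in os.listdir(devices_root):
            dev_path = os.path.join(devices_root, dev)
            for entry in os.listdir(dev_path):
                match = DATADIR_RE.match(entry)
                if not match:
                    continue
                epoch = int(match.group('epoch') or 0)
                if epoch < current_epoch:
                    shutil.rmtree(os.path.join(dev_path, entry))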
Other considerations
--------------------

swift-dispersion-[populate|report]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The swift-dispersion-[populate|report] tools will need to be made epoch-aware.
After increasing the partition power, swift-dispersion-populate may need to be
run to achieve the desired coverage. (Although initially the device coverage
will remain unchanged, the percentage of partitions covered will have been
reduced by the factor by which the number of partitions has increased.)

Auditing
^^^^^^^^

During preparation and switchover, the auditor may find a corrupt object. The
quarantine directory is not in the epoch partition directory filesystem
branch, so a quarantined object will not be lost when old partitions are
deleted.

The quarantining of an object in a current partition directory will not remove
the object from a future partition, so after switchover the auditor will
discover the object again, and quarantine it again. The diskfile quarantine
renamer could optionally be made 'relinker'-aware and unlink duplicate object
references when quarantining an object.


Alternatives
------------

Prior work
^^^^^^^^^^

The swift_ring_tool_ enables ring power increases while swift services are
disabled. It takes a similar approach to this proposal in that the ring
mapping is changed so that every resource remains on the same device when
moved to its new partition. However, new partitions are created in the same
filesystem branch as existing partitions (hence the need for services to be
suspended during the relocation).

.. _swift_ring_tool: https://github.com/enovance/swift-ring-tool/

Previous proposals have been made to upstream swift:

https://bugs.launchpad.net/swift/+bug/933803 suggests a 'same-device'
partition re-mapping, as does this proposal, but did not provide for
relocation of resources to new partition directories.

https://review.openstack.org/#/c/21888/ suggests maintaining a partition power
per device (so that only new devices use the increased partition power) but
appears to have been abandoned due to complexities with replication.


Create future partitions in existing `objects[-policy]` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The duplication of filesystem entries for objects and the creation of
(potentially duplicate) partitions during the preparation phase could have
undesirable effects on other backend processes if they are not isolated in
another filesystem branch.

For example, the object replicator is likely to discover newly created future
partition directories that appear to be 'misplaced'. The replicator would
attempt to sync these to their primary nodes (according to the old ring
mapping), which is unnecessary. Worse, the replicator might then delete the
future partitions from their current nodes, undoing the work of the relinker
process.

If the replicator were to adopt the future ring mappings from the outset of
the preparation phase then the same problems arise with respect to current
partitions that now appear to be misplaced. Furthermore, the replication
process is likely to race with the relinker process on remote nodes to
populate future partitions: if relocation proceeds faster on node A than on
node B then the replicator may start to sync objects from A to B, which is
again unnecessary and expensive.

The auditor will also be impacted as it will discover objects in the future
partition directories and audit them, being unable to distinguish them as
duplicates of the objects still stored in the current partitions.

These issues could of course be avoided by disabling replication and auditing
during the preparation phase, but instead we propose to make the future ring
partition naming mutually exclusive from the current ring partition naming,
and simply restrict the replicator and auditor to only process partitions that
are in the current ring partition set. In other words, we isolate these
processes from the future partition directories that are being created by the
relinker.


Use mutually exclusive future partitions in existing `objects` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current algorithm for calculating the partition for an object is to
calculate a 32 bit hash of the object and then use its P most significant
bits, resulting in partitions in the range {0, 2**P - 1}, i.e.::

    part = H(object name) >> (32 - P)

A ring with partition power P+1 will re-use all the partition numbers of a
ring with partition power P.

To eliminate overlap of future ring partitions with current ring partitions we
could change the partition number algorithm to add an offset to each partition
number when a ring's partition power is increased::

    offset = 2**P
    part = (H(object name) >> (32 - P)) + offset

This is backwards compatible: if `offset` is not defined in a ring file then
it is set to zero.

To ensure that partition numbers remain < 2**32, this change would reduce the
maximum partition power from 32 to 31.
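A sketch of the offset scheme for a single power increase (with `hash32`
standing in for the 32-bit object hash; which power the offset is derived
from follows the formulas above)::

    def part_with_offset(hash32, part_power, offset):
        # partitions for this ring occupy [offset, offset + 2**part_power)
        return (hash32 >> (32 - part_power)) + offset

    P = 10
    hash32 = 0xDEADBEEF

    # current epoch: power P, offset 0 (backwards compatible with rings
    # that define no offset)
    old_part = part_with_offset(hash32, P, 0)

    # next epoch: power P + 1, offset 2**P, so the future partition numbers
    # [2**P, 2**P + 2**(P+1)) cannot collide with the current ones [0, 2**P)
    new_part = part_with_offset(hash32, P + 1, 2 ** P)
    assert new_part >= 2 ** P > old_part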
Proxy servers start to use the new ring at outset of relocation phase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This would mean that GETs to backends would use the new ring's partitions in
object URLs. Objects may not yet have been relocated to their new partition
directory, and the object servers would therefore need to fall back to looking
in the old ring partition for the object. PUTs and DELETEs to the new
partition would need to be made conditional upon a newer object timestamp not
existing in the old location. This is more complicated than the proposed
method.

Enable partition power reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ring power reduction is not easily achieved with the approach presented in
this proposal because there is no guarantee that partitions in the current
epoch that will be merged into partitions in the next epoch are located on the
same device. File contents are therefore likely to need copying between
devices during a preparation phase.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  alistair.coles@hp.com

Work Items
----------

 #. modify ring classes to support new attributes
 #. modify ringbuilder to manage new attributes
 #. modify backend servers to duplicate links to files in future epoch
    partition directories
 #. make backend servers and the relinker report their status in a way that
    recon can report, e.g. servers report when a new ring epoch has been
    loaded, and the relinker reports when all relinking has been completed
 #. make recon support reporting these states
 #. modify code that assumes the storage directory is objects[-policy_index]
    to be aware of the epoch prefix
 #. make swift-dispersion-populate and swift-dispersion-report epoch-aware
 #. implement the relinker daemon
 #. document the process

Repositories
------------

No new git repositories will be created.

Servers
-------

No new servers are created.

DNS Entries
-----------

No DNS entries will need to be created or updated.

Documentation
-------------

The process will be documented in the administrator's guide. Additions will be
made to the ring-builder documents.

Security
--------

No security issues are foreseen.

Testing
-------

Unit tests will be added for changes to the ring-builder, ring classes and
object server.

Probe tests will be needed to verify the process of increasing ring power.

Functional tests will be unchanged.


Dependencies
============

None