From 1e00e8564b0b2b4ff985c659b31ed7d8063b784d Mon Sep 17 00:00:00 2001
From: Alistair Coles
Date: Wed, 4 Feb 2015 14:48:37 +0000
Subject: [PATCH] Increasing ring partition power

This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.

Change-Id: I22a0e4c12dbfa38760fae58c60fa782e530e4f77
---
 .../increasing_partition_power.rst | 449 ++++++++++++++++++
 1 file changed, 449 insertions(+)
 create mode 100644 specs/in_progress/increasing_partition_power.rst

diff --git a/specs/in_progress/increasing_partition_power.rst b/specs/in_progress/increasing_partition_power.rst
new file mode 100644
index 0000000..d19162c
--- /dev/null
+++ b/specs/in_progress/increasing_partition_power.rst
@@ -0,0 +1,449 @@
::

    This work is licensed under a Creative Commons Attribution 3.0
    Unported License.
    http://creativecommons.org/licenses/by/3.0/legalcode


===============================
Increasing ring partition power
===============================

This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.

Swift operators sometimes pick a ring partition power when deploying swift
and later wish to change the partition power:

 #. The operator chooses a partition power that proves to be too small and
    subsequently constrains their ability to rebalance a growing cluster.
 #. Perhaps more likely, in an attempt to avoid the above problem, operators
    choose a partition power that proves to be unnecessarily large and would
    subsequently like to reduce it.

This proposal directly addresses the first problem by enabling partition power
to be increased. Although it does not directly address the second problem
(i.e. it does not enable ring power reduction), it does indirectly help to
avoid that problem by removing the motivation to choose a large partition
power when first deploying a cluster.

Problem Description
===================

The ring's partition power determines the partition to which a resource
(account, container or object) is mapped. The partition is included in the
path under which the resource is stored in a backend filesystem. Changing the
partition power therefore requires relocating resources to new paths in
backend filesystems.

In a heavily populated cluster such a relocation process could be
time-consuming, so to avoid downtime it is desirable to relocate resources
while the cluster is still operating. However, it is necessary to do so
without (temporary) loss of access to data and without compromising the
performance of processes such as replication and auditing.

Proposed Change
===============

Overview
--------

The proposed solution avoids copying any file contents during a partition
power change. Objects are 'moved' from their current partition to a new
partition, but the current and new partitions are arranged to be on the same
device, so the 'move' is achieved using filesystem links without copying data.

(It may well be that the motivation for increasing partition power is to allow
a rebalancing of the ring. Any rebalancing would occur after the partition
power increase has completed - during a partition power change the ring
balance is not changed.)
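The same-device property follows from how partitions are derived from object
hashes. Here is a minimal sketch of the split (Python 3; swift's real hash
also mixes in the configured hash path prefix/suffix, which is omitted
here)::

    import hashlib

    def partition(name, part_power):
        # top `part_power` bits of a 32-bit hash of the object name
        digest = hashlib.md5(name.encode('utf8')).digest()
        hash32 = int.from_bytes(digest[:4], 'big')
        return hash32 >> (32 - part_power)

    P = 10
    name = 'AUTH_test/container/object'
    old_part = partition(name, P)
    new_part = partition(name, P + 1)

    # Increasing the power by one exposes one more hash bit, so partition X
    # always splits into 2X or 2X+1; if the new ring maps both to the devices
    # that held X, the object is already on the right device and only a new
    # filesystem link is needed.
    assert new_part in (2 * old_part, 2 * old_part + 1)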
To allow the cluster to continue operating during a partition power change (in
particular, to avoid any disruption or incorrect behavior of the replicator
and auditor processes), new partition directories are created in a separate
filesystem branch from the current partition directories. When all new
partition directories have been populated, the ring transitions to using the
new filesystem branch.

During this transition, object servers maintain links to resource files from
both the current and new partition directories. However, as already discussed,
no file content is duplicated or copied. The old partition directories are
eventually deleted.

Detailed description
--------------------

The process of changing a ring's partition power comprises three phases:

1. Preparation - during this phase the current partition directories continue
   to be used, but existing resources are also linked to new partition
   directories in anticipation of the new ring partition power.

2. Switchover - during this phase the ring transitions to using the new
   partition directories; proxy and backend servers roll over to using the
   new ring partition power.

3. Cleanup - once all servers are using the new ring partition power,
   resource files in old partition directories are removed.

For simplicity, we describe the details of each phase in terms of an object
ring, but note that the same process can be applied to account and container
rings and servers.

Preparation phase
^^^^^^^^^^^^^^^^^

During the preparation phase two new attributes are set in the ring file:

 * the ring's `epoch`: if not already set, a new `epoch` attribute is added to
   the ring. The ring epoch is used to determine the parent directory for
   partition directories. Similar to the way in which a ring's policy index is
   appended to the `objects` directory name, the epoch will be prefixed to the
   `objects` directory name. For simplicity, the ring epoch will be a
   monotonically increasing integer starting at 0. A 'legacy' ring having no
   epoch attribute will be treated as having epoch 0.

 * the `next_part_power` attribute indicates the partition power that will be
   used in the next epoch of the ring. The `next_part_power` attribute is used
   during the preparation phase to determine the partition directory in which
   an object should be stored in the next epoch of the ring.

At this point in time no other changes are made to the ring file: the current
partition power and the mapping of partitions to devices are unchanged.

The updated ring file is distributed to all servers. During this preparation
phase, proxy servers will continue to use the current ring partition mapping
to determine the backend URL for objects. Object servers, along with
replicator and auditor processes, also continue to use the current ring
parameters. However, during PUT and DELETE operations object servers will
create additional links to object files in the object's future partition
directory, in preparation for an eventual switchover to the ring's next
epoch. This does not require any additional copying or writing of object
contents.
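For illustration, the extra link might be created by a helper such as the
following sketch (hypothetical function and argument names, not swift's
actual diskfile code; the future partition directory layout it assumes is
derived as described next)::

    import os

    def link_into_future_partition(current_path, device_path, next_epoch,
                                   future_part, suffix, hsh):
        # After writing `current_path` during a PUT (or writing a tombstone
        # during a DELETE), also hard-link it under the future partition
        # directory; the link shares the inode, so no content is copied.
        future_dir = os.path.join(
            device_path,                    # e.g. /srv/node/sdb1
            '%d-objects' % next_epoch,      # epoch-prefixed datadir
            str(future_part), suffix, hsh)
        os.makedirs(future_dir, exist_ok=True)
        future_path = os.path.join(future_dir,
                                   os.path.basename(current_path))
        try:
            os.link(current_path, future_path)
        except FileExistsError:
            pass  # already linked by the relinker or an earlier request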
The filesystem path for future partition directories is determined as follows.
In general, the path to an object file on an object server's filesystem has
the form::

    <dev>/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>/<ts>.<ext>

where:

 * `epoch` is the ring's epoch, if non-zero
 * `policy` is the object container's policy index, if non-zero
 * `dev` is the device to which `partition` is mapped by the ring file
 * `partition` is the object's partition,
   calculated using `partition = F(hash) >> (32 - P)`,
   where `P` is the ring partition power
 * `suffix` is the last three characters of `hash`
 * `hash` is a hash of the object name
 * `ts` is the object timestamp
 * `ext` is the filename extension (`data`, `meta` or `ts`)

Given `next_part_power` and `epoch` in the ring file, it is possible to
calculate::

    future_partition = F(hash) >> (32 - next_part_power)
    next_epoch = epoch + 1

The future partition directory is then::

    <dev>/<next_epoch>-objects[-<policy>]/<future_partition>/<suffix>/<hash>/<ts>.<ext>

For example, consider a ring in its first epoch, with current partition power
P, containing an object currently in partition X, where 0 <= X < 2**P. If the
partition power increases by 1 (doubling the number of partitions), the
object's future partition will be either 2X or 2X+1 in the ring's next epoch.
During a DELETE an additional filesystem link will be created at one of::

    <dev>/1-objects/<2X>/<suffix>/<hash>/<ts>.ts
    <dev>/1-objects/<2X+1>/<suffix>/<hash>/<ts>.ts

Once object servers are known to be using the updated ring file, a new
relinker process is started. The relinker prepares an object server's
filesystem for a partition power change by crawling the filesystem and linking
existing objects to future partition directories. The relinker determines each
object's future partition directory in the same way as described above for the
object server.

The relinker does not remove links from current partition directories. Once
the relinker has successfully completed, every existing object should be
linked from both a current partition directory and a future partition
directory. Any subsequent object PUTs or DELETEs will be reflected in both the
current and future partition directories as described above.

To avoid newly created objects being 'lost', it is important that an object
server is using the updated ring file before the relinker process starts, in
order to guarantee that either the object server or the relinker creates
future partition links for every object. This may require object servers to be
restarted prior to the relinker process being started, or to otherwise report
that they have reloaded the ring file.

The relinker will report successful completion in a file
`/var/cache/swift/relinker.recon` that can be queried via (modified) recon
middleware.

Once the relinker process has successfully completed on all object servers,
the partition power change process may move on to the switchover phase.

Switchover phase
^^^^^^^^^^^^^^^^

To begin the switchover to using the next partition power, the ring file is
updated once more:

 * the current partition power is stored as `previous_part_power`
 * the current partition power is set to `next_part_power`
 * `next_part_power` is set to None
 * the ring's `epoch` is incremented
 * the mapping of partitions to devices is re-created so that partitions 2X
   and 2X+1 map to the same devices to which partition X was mapped in the
   previous epoch. This is a simple transformation (see the sketch below).
   Since no object content is moved between devices, the actual ring balance
   remains unchanged.

The updated ring file is then distributed to all proxy and object servers.
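The re-created partition mapping is a mechanical doubling of each assignment
row. A sketch, assuming the ring's replica-to-partition-to-device table is a
list holding one array of device ids per replica, indexed by partition (as in
swift's RingData)::

    from array import array

    def double_partitions(replica2part2dev, part_power):
        # Build the next epoch's table so that partitions 2X and 2X+1 map
        # to the same device that held partition X; no object content needs
        # to move because every object stays on its current device.
        new_table = []
        for part2dev in replica2part2dev:
            # repeat each device entry twice: new[2X] = new[2X+1] = old[X]
            doubled = array('H', (dev for dev in part2dev for _ in (0, 1)))
            new_table.append(doubled)
        return new_table, part_power + 1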
Since ring file distribution and loading is not instantaneous, there is a
window of time during which a proxy server may direct object requests to
either an old partition or a current partition (note that the partitions
previously referred to as 'future' are now referred to as 'current'). Object
servers will therefore create additional filesystem links during PUT and
DELETE requests, pointing from old partition directories to files in the
current partition directories. The paths to the old partition directories are
determined in the same way as future partition directories were determined
during the preparation phase, but now using the `previous_part_power` and
decrementing the current ring `epoch`.

This means that if one proxy PUTs an object using a current partition, and
another proxy subsequently attempts to GET the object using the old partition,
the object will be found, since both current and old partitions map to the
same device. Similarly, if one proxy PUTs an object using the old partition
and another proxy then GETs the object using the current partition, the object
will be found in the current partition on the object server.

The object auditor and replicator processes are restarted to force reloading
of the ring file, after which they operate using the current ring parameters.

Cleanup phase
^^^^^^^^^^^^^

The cleanup phase may start once all servers are known to be using the updated
ring file. Once again, this may require servers to be restarted or to report
that they have reloaded the ring file during switchover.

A final update is made to the ring file: the `previous_part_power` attribute
is set to `None` and the ring file is once again distributed. Once object
servers have reloaded the updated ring file they will cease to create object
file links in old partition directories.

At this point the old partition directories may be deleted - there is no need
to create tombstone files when deleting objects in the old partitions since
these partition directories are no longer used by any swift process.

A cleanup process will crawl the filesystem and delete any partition
directories that are not part of the current epoch or a future epoch. This
cleanup process should repeat periodically in case any devices that were
offline during the partition power change come back online - the old epoch
partition directories discovered on those devices may be deleted. Normal
replication may cause current epoch partition directories to be created on a
resurrected disk.

(The cleanup function could be added to an existing process such as the
auditor.)
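A sketch of such a cleanup crawl, assuming epoch-prefixed datadir names of the
form `<epoch>-objects[-<policy>]` as described above (the directory parsing
here is illustrative, not swift's actual code)::

    import os
    import re
    import shutil

    # matches e.g. 'objects', '1-objects' and '2-objects-1'
    DATADIR_RE = re.compile(r'^(?:(?P<epoch>\d+)-)?objects(?:-\d+)?$')

    def cleanup_old_epochs(devices_root, current_epoch):
        # Remove datadirs belonging to epochs older than the current one.
        # No swift process uses these links any more, so whole partition
        # directories can be removed without writing tombstones.
        for dev in os.listdir(devices_root):
            dev_path = os.path.join(devices_root, dev)
            for entry in os.listdir(dev_path):
                match = DATADIR_RE.match(entry)
                if not match:
                    continue
                epoch = int(match.group('epoch') or 0)
                if epoch < current_epoch:
                    shutil.rmtree(os.path.join(dev_path, entry))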
Other considerations
--------------------

swift-dispersion-[populate|report]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The swift-dispersion-[populate|report] tools will need to be made epoch-aware.
After increasing the partition power, swift-dispersion-populate may need to be
run to achieve the desired coverage. (Although initially the device coverage
will remain unchanged, the percentage of partitions covered will have been
reduced by the factor by which the number of partitions has increased.)

Auditing
^^^^^^^^

During preparation and switchover, the auditor may find a corrupt object. The
quarantine directory is not in the epoch partition directory filesystem
branch, so a quarantined object will not be lost when old partitions are
deleted.

The quarantining of an object in a current partition directory will not remove
the object from a future partition, so after switchover the auditor will
discover the object again, and quarantine it again. The diskfile quarantine
renamer could optionally be made 'relinker'-aware and unlink duplicate object
references when quarantining an object.


Alternatives
------------

Prior work
^^^^^^^^^^

The swift_ring_tool_ enables ring power increases while swift services are
disabled. It takes a similar approach to this proposal in that the ring
mapping is changed so that every resource remains on the same device when
moved to its new partition. However, new partitions are created in the same
filesystem branch as existing partitions (hence the need for services to be
suspended during the relocation).

.. _swift_ring_tool: https://github.com/enovance/swift-ring-tool/

Previous proposals have been made to upstream swift:

https://bugs.launchpad.net/swift/+bug/933803 suggests a 'same-device'
partition re-mapping, as does this proposal, but did not provide for
relocation of resources to new partition directories.

https://review.openstack.org/#/c/21888/ suggests maintaining a partition power
per device (so that only new devices use the increased partition power) but
appears to have been abandoned due to complexities with replication.


Create future partitions in existing `objects[-policy]` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The duplication of filesystem entries for objects and the creation of
(potentially duplicate) partitions during the preparation phase could have
undesirable effects on other backend processes if they are not isolated in
another filesystem branch.

For example, the object replicator is likely to discover newly created future
partition directories that appear to be 'misplaced'. The replicator would
attempt to sync these to their primary nodes (according to the old ring
mapping), which is unnecessary. Worse, the replicator might then delete the
future partitions from their current nodes, undoing the work of the relinker
process.

If the replicator were to adopt the future ring mappings from the outset of
the preparation phase then the same problems arise with respect to current
partitions that now appear to be misplaced. Furthermore, the replication
process is likely to race with the relinker process on remote nodes to
populate future partitions: if relocation proceeds faster on node A than on
node B then the replicator may start to sync objects from A to B, which is
again unnecessary and expensive.

The auditor will also be impacted as it will discover objects in the future
partition directories and audit them, being unable to distinguish them as
duplicates of the objects still stored in the current partitions.

These issues could of course be avoided by disabling replication and auditing
during the preparation phase, but instead we propose to make the future ring
partition naming mutually exclusive from the current ring partition naming,
and simply restrict the replicator and auditor to only process partitions that
are in the current ring partition set. In other words, we isolate these
processes from the future partition directories that are being created by the
relinker.


Use mutually exclusive future partitions in existing `objects` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current algorithm for calculating the partition for an object is to
calculate a 32 bit hash of the object and then use its P most significant
bits, resulting in partitions in the range {0, 2**P - 1}, i.e.::

    part = H(object name) >> (32 - P)

A ring with partition power P+1 will re-use all the partition numbers of a
ring with partition power P.

To eliminate overlap of future ring partitions with current ring partitions we
could change the partition number algorithm to add an offset to each partition
number when a ring's partition power is increased::

    offset = 2**P
    part = (H(object name) >> (32 - P)) + offset

This is backwards compatible: if `offset` is not defined in a ring file then
it is set to zero.

To ensure that partition numbers remain < 2**32, this change would reduce the
maximum partition power from 32 to 31.
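A sketch of the offset scheme for a single power increase (with `hash32`
standing in for the 32-bit object hash; which power the offset is derived
from follows the formulas above)::

    def part_with_offset(hash32, part_power, offset):
        # partitions for this ring occupy [offset, offset + 2**part_power)
        return (hash32 >> (32 - part_power)) + offset

    P = 10
    hash32 = 0xDEADBEEF

    # current epoch: power P, offset 0 (backwards compatible with rings
    # that define no offset)
    old_part = part_with_offset(hash32, P, 0)

    # next epoch: power P + 1, offset 2**P, so the future partition numbers
    # [2**P, 2**P + 2**(P+1)) cannot collide with the current ones [0, 2**P)
    new_part = part_with_offset(hash32, P + 1, 2 ** P)
    assert new_part >= 2 ** P > old_part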
Proxy servers start to use the new ring at outset of relocation phase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This would mean that GETs to backends would use the new ring's partitions in
object URLs. Objects may not yet have been relocated to their new partition
directory, and the object servers would therefore need to fall back to looking
in the old ring partition for the object. PUTs and DELETEs to the new
partition would need to be made conditional upon a newer object timestamp not
existing in the old location. This is more complicated than the proposed
method.

Enable partition power reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ring power reduction is not easily achieved with the approach presented in
this proposal because there is no guarantee that partitions in the current
epoch that will be merged into partitions in the next epoch are located on the
same device. File contents are therefore likely to need copying between
devices during a preparation phase.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  alistair.coles@hp.com

Work Items
----------

 #. modify ring classes to support new attributes
 #. modify ringbuilder to manage new attributes
 #. modify backend servers to duplicate links to files in future epoch
    partition directories
 #. make backend servers and the relinker report their status in a way that
    recon can report, e.g. servers report when a new ring epoch has been
    loaded, and the relinker reports when all relinking has been completed
 #. make recon support reporting these states
 #. modify code that assumes the storage directory is objects[-policy_index]
    to be aware of the epoch prefix
 #. make swift-dispersion-populate and swift-dispersion-report epoch-aware
 #. implement the relinker daemon
 #. document the process

Repositories
------------

No new git repositories will be created.

Servers
-------

No new servers are created.

DNS Entries
-----------

No DNS entries will need to be created or updated.

Documentation
-------------

The process will be documented in the administrator's guide. Additions will be
made to the ring-builder documents.

Security
--------

No security issues are foreseen.

Testing
-------

Unit tests will be added for changes to the ring-builder, ring classes and
object server.

Probe tests will be needed to verify the process of increasing ring power.

Functional tests will be unchanged.


Dependencies
============

None