Merge "Increasing ring partition power"

This commit is contained in:
Jenkins 2015-05-28 06:24:48 +00:00 committed by Gerrit Code Review
commit 77ff4dc4d0
1 changed files with 449 additions and 0 deletions

View File

@ -0,0 +1,449 @@
::

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.

  http://creativecommons.org/licenses/by/3.0/legalcode
===============================
Increasing ring partition power
===============================
This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.
Swift operators sometimes pick a ring partition power when deploying swift
and later wish to change the partition power:
#. The operator chooses a partition power that proves to be too small and
subsequently constrains their ability to rebalance a growing cluster.
#. Perhaps more likely, in an attempt to avoid the above problem, operators
choose a partition power that proves to be unnecessarily large and would
subsequently like to reduce it.
This proposal directly addresses the first problem by enabling partition power
to be increased. Although it does not directly address the second problem
(i.e. it does not enable ring power reduction), it does indirectly help to
avoid that problem by removing the motivation to choose large partition power
when first deploying a cluster.
Problem Description
===================
The ring power determines the partition to which a resource (account, container
or object) is mapped. The partition is included in the path under which the
resource is stored in a backend filesystem. Changing the partition power
therefore requires relocating resources to new paths in backend filesystems.
In a heavily populated cluster a relocation process could be time-consuming,
so to avoid downtime it is desirable to relocate resources while the cluster
is still operating. However, it is necessary to do so without (temporary) loss
of access to data and without compromising the performance of processes such
as replication and auditing.
Proposed Change
===============
Overview
--------
The proposed solution avoids copying any file contents during a partition power
change. Objects are 'moved' from their current partition to a new partition,
but the current and new partitions are arranged to be on the same device, so
the 'move' is achieved using filesystem links without copying data.
(It may well be that the motivation for increasing partition power is to allow
a rebalancing of the ring. Any rebalancing would occur after the partition
power increase has completed - during partition power changes the ring balance
is not changed.)
To allow the cluster to continue operating during a partition power change (in
particular, to avoid any disruption or incorrect behavior of the replicator and
auditor processes), new partition directories are created in a separate
filesystem branch from the current partition directories. When all new
partition directories have been populated, the ring transitions to using the
new filesystem branch.
During this transition, object servers maintain links to resource files from
both the current and new partition directories. However, as already discussed,
no file content is duplicated or copied. The old partition directories are
eventually deleted.
Detailed description
--------------------
The process of changing a ring's partition power comprises three phases:
1. Preparation - during this phase the current partition directories continue
to be used but existing resources are also linked to new partition
directories in anticipation of the new ring partition power.
2. Switchover - during this phase the ring transitions to using the new
partition directories; proxy and backend servers rollover to using the new
ring partition power.
3. Cleanup - once all servers are using the new ring partition power,
resource files in old partition directories are removed.
For simplicity, we describe the details of each phase in terms of an object
ring but note that the same process can be applied to account and container
rings and servers.
Preparation phase
^^^^^^^^^^^^^^^^^
During the preparation phase two new attributes are set in the ring file:
* the ring's `epoch`: if not already set, a new `epoch` attribute is added to
the ring. The ring epoch is used to determine the parent directory for
partition directories. Similar to the way in which a ring's policy index is
appended to the `objects` directory name, the epoch will be prefixed to the
`objects` directory name. For simplicity, the ring epoch will be a
monotonically increasing integer starting at 0. A 'legacy' ring having no
epoch attribute will be treated as having epoch 0.
* the `next_part_power` attribute indicates the partition power that will be
used in the next epoch of the ring. The `next_part_power` attribute is used
during the preparation phase to determine the partition directory in which
an object should be stored in the next epoch of the ring.
At this point in time no other changes are made to the ring file:
the current part power and the mapping of partitions to devices are unchanged.
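For illustration, the ring metadata during the preparation phase might then
carry (the attribute names are from this proposal; the serialization shown is
hypothetical)::

  # hypothetical ring metadata during the preparation phase
  part_power = 14        # current power; partition mapping unchanged
  next_part_power = 15   # power to be used in the ring's next epoch
  epoch = 0              # a legacy ring with no epoch is treated as epoch 0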
The updated ring file is distributed to all servers. During this preparation
phase, proxy servers will continue to use the current ring partition mapping to
determine the backend url for objects. Object servers, along with replicator
and auditor processes, also continue to use the current ring
parameters. However, during PUT and DELETE operations object servers will
create additional links to object files in the object's future partition
directory in preparation for an eventual switchover to the ring's next
epoch. This does not require any additional copying or writing of object
contents.
The filesystem path for future partition directories is determined as follows.
In general, the path to an object file on an object server's filesystem has the
form::

  dev/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>/<ts>.<ext>
where:
* `epoch` is the ring's epoch, if non-zero
* `policy` is the object container's policy index, if non-zero
* `dev` is the device to which `partition` is mapped by the ring file
* `partition` is the object's partition,
calculated using `partition = F(hash) >> (32 - P)`,
where `P` is the ring partition power
* `suffix` is the last three digits of `hash`
* `hash` is a hash of the object name
* `ts` is the object timestamp
* `ext` is the filename extension (`data`, `meta` or `ts`)
Given `next_part_power` and `epoch` in the ring file, it is possible to
calculate::

  future_partition = F(hash) >> (32 - next_part_power)
  next_epoch = epoch + 1

The future partition directory is then::

  dev/<next_epoch>-objects[-<policy>]/<next_partition>/<suffix>/<hash>/<ts>.<ext>
For example, consider a ring in its first epoch, with current partition power
P, containing an object currently in partition X, where 0 <= X < 2**P. If the
partition power is increased by 1, the number of partitions doubles and the
object's future partition will be either 2X or 2X+1 in the ring's next epoch.
During a DELETE an additional filesystem link will be created at one of::

  dev/1-objects/<2X>/<suffix>/<hash>/<ts>.ts
  dev/1-objects/<2X+1>/<suffix>/<hash>/<ts>.ts
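These calculations can be sketched in Python (a minimal sketch; the helper
names are illustrative rather than swift's actual API, and `F` is taken to be
the first four bytes of the md5 path hash, as in swift's ring)::

  import struct
  from binascii import unhexlify

  def partition_for(hsh, part_power):
      # F(hash): first four bytes of the md5 hex digest read as a
      # big-endian 32-bit integer, keeping the top `part_power` bits
      return struct.unpack_from('>I', unhexlify(hsh))[0] >> (32 - part_power)

  def object_dir(dev, epoch, policy, part, hsh):
      # dev/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>
      objects = 'objects' if not policy else 'objects-%d' % policy
      if epoch:
          objects = '%d-%s' % (epoch, objects)
      return '/'.join([dev, objects, str(part), hsh[-3:], hsh])

  hsh = 'd41d8cd98f00b204e9800998ecf8427e'        # example path hash
  current = object_dir('sdb1', 0, 0, partition_for(hsh, 14), hsh)
  future = object_dir('sdb1', 1, 0, partition_for(hsh, 15), hsh)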
Once object servers are known to be using the updated ring file a new relinker
process is started. The relinker prepares an object server's filesystem for a
partition power change by crawling the filesystem and linking existing objects
to future partition directories. The relinker determines each object's future
partition directory in the same way as described above for the object server.
The relinker does not remove links from current partition directories. Once the
relinker has successfully completed, every existing object should be linked
from both a current partition directory and a future partition directory. Any
subsequent object PUTs or DELETEs will be reflected in both the current and
future partition directory as described above.
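The relinker's core loop might look as follows, reusing `partition_for` from
the sketch above (illustrative only; a real implementation would also handle
policies, errors and progress reporting)::

  import os

  def relink_device(dev_root, epoch, next_epoch, next_part_power):
      # current branch: dev/[<epoch>-]objects; future: dev/<next_epoch>-objects
      cur = os.path.join(dev_root,
                         'objects' if not epoch else '%d-objects' % epoch)
      fut = os.path.join(dev_root, '%d-objects' % next_epoch)
      for part in os.listdir(cur):
          part_dir = os.path.join(cur, part)
          for suffix in os.listdir(part_dir):
              suffix_dir = os.path.join(part_dir, suffix)
              if not os.path.isdir(suffix_dir):
                  continue          # skip non-suffix entries
              for hsh in os.listdir(suffix_dir):
                  hash_dir = os.path.join(suffix_dir, hsh)
                  future_part = partition_for(hsh, next_part_power)
                  target = os.path.join(fut, str(future_part), suffix, hsh)
                  os.makedirs(target, exist_ok=True)
                  for fname in os.listdir(hash_dir):
                      try:
                          # hard link only; no object data is copied
                          os.link(os.path.join(hash_dir, fname),
                                  os.path.join(target, fname))
                      except FileExistsError:
                          # object server already created this link
                          pass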
To avoid newly created objects being 'lost', it is important that an object
server is using the updated ring file before the relinker process starts in
order to guarantee that either the object server or the relinker create future
partition links for every object. This may require object servers to be
restarted prior to the relinker process being started, or to otherwise report
that they have reloaded the ring file.
The relinker will report successful completion in a file
`/var/cache/swift/relinker.recon` that can be queried via (modified) recon
middleware.
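The recon file content might, for example, look like this (the keys shown are
illustrative; the proposal does not fix a format)::

  {
      "relinking_complete": true,
      "next_part_power": 15
  }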
Once the relinker process has successfully completed on all object servers, the
partition power change process may move on to the switchover phase.
Switchover phase
^^^^^^^^^^^^^^^^
To begin the switchover to using the next partition power, the ring file is
updated once more:
* the current partition power is stored as `previous_part_power`
* the current partition power is set to `next_part_power`
* `next_part_power` is set to None
* the ring's `epoch` is incremented
* the mapping of partitions to devices is re-created so that partitions 2X and
2X+1 map to the same devices to which partition X was mapped in the previous
epoch. This is a simple transformation. Since no object content is moved
between devices the actual ring balance remains unchanged.
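This remapping amounts to doubling each row of the ring's
replica-to-partition-to-device table; a sketch, assuming a power increase of
one and a structure mirroring swift's ring data::

  from array import array

  def double_partitions(replica2part2dev):
      # each row maps partition -> device id for one replica; emitting
      # every device id twice makes partitions 2X and 2X+1 in the new
      # epoch map to the device that partition X mapped to before
      doubled = []
      for row in replica2part2dev:
          new_row = array('H')
          for dev_id in row:
              new_row.append(dev_id)    # partition 2X
              new_row.append(dev_id)    # partition 2X + 1
          doubled.append(new_row)
      return doubled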
The updated ring file is then distributed to all proxy and object servers.
Since ring file distribution and loading is not instantaneous, there is a
window of time during which a proxy server may direct object requests to either
an old partition or a current partition (note that the partitions previously
referred to as 'future' are now referred to as 'current'). Object servers will
therefore create additional filesystem links during PUT and DELETE requests,
pointing from old partition directories to files in the current partition
directories. The paths to the old partition directories are determined in the
same way as future partition directories were determined during the preparation
phase, but now using the `previous_part_power` and decrementing the current
ring `epoch`.
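Continuing the earlier sketch, an object server in this window would create
the extra link under the old location as follows (values assume a power
increase from 14 to 15 at the first epoch change)::

  previous_part_power, epoch = 14, 1
  old_part = partition_for(hsh, previous_part_power)
  old_dir = object_dir('sdb1', epoch - 1, 0, old_part, hsh)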
This means that if one proxy PUTs an object using a current partition, then
another proxy subsequently attempts to GET the object using the old partition,
the object will be found, since both current and old partitions map to the same
device. Similarly if one proxy PUTs an object using the old partition and
another proxy then GETs the object using the current partition, the object will
be found in the current partition on the object server.
The object auditor and replicator processes are restarted to force a reload of
the ring file, after which they operate using the current ring parameters.
Cleanup phase
^^^^^^^^^^^^^
The cleanup phase may start once all servers are known to be using the updated
ring file. Once again, this may require servers to be restarted or to report
that they have reloaded the ring file during switchover.
A final update is made to the ring file: the `previous_part_power` attribute
is set to `None` and the ring file is once again distributed. Once object
servers have reloaded the updated ring file they will cease to create object
file links in old partition directories.
At this point the old partition directories may be deleted - there is no need
to create tombstone files when deleting objects in the old partitions since
these partition directories are no longer used by any swift process.
A cleanup process will crawl the filesystem and delete any partition
directories that are not part of the current epoch or a future epoch. This
cleanup process should repeat periodically in case any devices that were
offline during the partition power change come back online - the old epoch
partition directories discovered on those devices may be deleted. Normal
replication may cause current epoch partition directories to be created on a
resurrected disk.
(The cleanup function could be added to an existing process such as the
auditor).
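A sketch of such a cleanup crawl (illustrative; a real implementation would
take the current epoch from the ring and handle errors)::

  import os
  import shutil

  def cleanup_device(dev_root, current_epoch):
      # delete any [<epoch>-]objects[-<policy>] branch from an older epoch
      for entry in os.listdir(dev_root):
          head, sep, rest = entry.partition('-')
          if sep and head.isdigit():
              epoch, branch = int(head), rest
          else:
              epoch, branch = 0, entry     # legacy branch, epoch 0
          if branch.startswith('objects') and epoch < current_epoch:
              shutil.rmtree(os.path.join(dev_root, entry))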
Other considerations
--------------------
swift-dispersion-[populate|report]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The swift-dispersion-[populate|report] tools will need to be made epoch-aware.
After increasing partition power, swift-dispersion-populate may need to be
run to achieve the desired coverage. (Although initially the device coverage
will remain unchanged, the percentage of partitions covered will have been
reduced by the same factor as the partition count increased.)
Auditing
^^^^^^^^
During preparation and switchover, the auditor may find a corrupt object. The
quarantine directory is not in the epoch partition directory filesystem branch,
so a quarantined object will not be lost when old partitions are deleted.
The quarantining of an object in a current partition directory will not remove
the object from a future partition, so after switchover the auditor will
discover the object again, and quarantine it again. The diskfile quarantine
renamer could optionally be made 'relinker' aware and unlink duplicate object
references when quarantining an object.
Alternatives
------------
Prior work
^^^^^^^^^^
The swift_ring_tool_ enables ring power increases while swift services are
disabled. It takes a similar approach to this proposal in that the ring
mapping is changed so that every resource remains on the same device when
moved to its new partition. However, new partitions are created in the same
filesystem branch as existing partitions, hence the need for services to be
suspended during the relocation.
.. _swift_ring_tool: https://github.com/enovance/swift-ring-tool/
Previous proposals have been made to upstream swift:
https://bugs.launchpad.net/swift/+bug/933803 suggests a 'same-device'
partition re-mapping, as does this proposal, but did not provide for
relocation of resources to new partition directories.
https://review.openstack.org/#/c/21888/ suggests maintaining a partition power
per device (so only new devices use the increased partition power) but appears
to have been abandoned due to complexities with replication.
Create future partitions in existing `objects[-policy]` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The duplication of filesystem entries for objects and creation of (potentially
duplicate) partitions during the preparation phase could have undesirable
effects on other backend processes if they are not isolated in another
filesystem branch.
For example, the object replicator is likely to discover newly created future
partition directories that appear to be 'misplaced'. The replicator will
attempt to sync these to their primary nodes (according to the old ring
mapping) which is unnecessary. Worse, the replicator might then delete the
future partitions from their current nodes, undoing the work of the relinker
process.
If the replicator were to adopt the future ring mappings from the outset of the
preparation phase then the same problems arise with respect to current
partitions that now appear to be misplaced. Furthermore, the replication
process is likely to race with the relinker process on remote nodes to
populate future partitions: if relocation proceeds faster on node A than B then
the replicator may start to sync objects from A to B, which is again
unnecessary and expensive.
The auditor will also be impacted as it will discover objects in the future
partition directories and audit them, being unable to distinguish them as
duplicates of the object still stored in the current partition.
These issues could of course be avoided by disabling replication and auditing
during the preparation phase, but instead we propose to make the future ring
partition naming be mutually exclusive from current ring partition naming, and
simply restrict the replicator and auditor to only process partitions that are
in the current ring partition set. In other words we isolate these processes
from the future partition directories that are being created by the relinker.
Use mutually exclusive future partitions in existing `objects` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current algorithm for calculating the partition for an object is to
calculate a 32 bit hash of the object and then use its P most significant bits,
resulting in partitions in the range [0, 2**P - 1], i.e.::

  part = H(object name) >> (32 - P)
A ring with partition power P+1 will re-use all the partition numbers of a ring
with partition power P.
To eliminate overlap of future ring partitions with current ring partitions we
could change the partition number algorithm to add an offset to each partition
number when a ring's partition power is increased::

  offset = 2**P
  part = (H(object name) >> (32 - P)) + offset
This is backwards compatible: if `offset` is not defined in a ring file then it
is set to zero.
To ensure that partition numbers remain < 2**32, this change will reduce the
maximum partition power from 32 to 31.
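Reading `P` as the ring's own (post-increase) partition power, each epoch's
partition range is then disjoint from every earlier one; a quick check of
that property::

  def part_range(P, legacy=False):
      # a legacy ring has no offset; later epochs start at 2**P
      offset = 0 if legacy else 2 ** P
      return set(range(offset, offset + 2 ** P))

  # legacy ring at power 14, then increased to 15, then to 16
  assert not part_range(14, legacy=True) & part_range(15)
  assert not part_range(15) & part_range(16)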
Proxy servers start to use the new ring at outset of relocation phase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This would mean that GETs to backends would use the new ring's partitions in
object URLs. Objects may not yet have been relocated to their new partition
directory and the object servers would therefore need to fall back to looking
in the old ring partition for the object. PUTs and DELETEs to the new partition
would need to be made conditional upon a newer object timestamp not existing in
the old location. This is more complicated than the proposed method.
Enable partition power reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ring power reduction is not easily achieved with the approach presented in this
proposal because there is no guarantee that partitions in the current epoch
that will be merged into partitions in the next epoch are located on the same
device. File contents are therefore likely to need copying between devices
during a preparation phase.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
alistair.coles@hp.com
Work Items
----------
#. modify ring classes to support new attributes
#. modify ringbuilder to manage new attributes
#. modify backend servers to duplicate links to files in future epoch partition
directories
#. make backend servers and the relinker report their status in a way that
   recon can expose, e.g. servers report when a new ring epoch has been loaded
   and the relinker reports when all relinking has completed.
#. make recon support reporting these states
#. modify code that assumes storage-directory is objects[-policy_index] to
be aware of epoch prefix
#. make swift-dispersion-populate and swift-dispersion-report epoch-aware
#. implement relinker daemon
#. document process
Repositories
------------
No new git repositories will be created.
Servers
-------
No new servers are created.
DNS Entries
-----------
No DNS entries will need to be created or updated.
Documentation
-------------
The process will be documented in the administrator's guide. Additions will
be made to the ring-builder documents.
Security
--------
No security issues are foreseen.
Testing
-------
Unit tests will be added for changes to ring-builder, ring classes and
object server.
Probe tests will be needed to verify the process of increasing ring power.
Functional tests will be unchanged.
Dependencies
============
None