::

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.

  http://creativecommons.org/licenses/by/3.0/legalcode

===============================
Increasing ring partition power
===============================
This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.

Swift operators sometimes pick a ring partition power when deploying swift
and later wish to change the partition power:

#. The operator chooses a partition power that proves to be too small and
   subsequently constrains their ability to rebalance a growing cluster.

#. Perhaps more likely, in an attempt to avoid the above problem, operators
   choose a partition power that proves to be unnecessarily large and would
   subsequently like to reduce it.

This proposal directly addresses the first problem by enabling partition power
to be increased. Although it does not directly address the second problem
(i.e. it does not enable ring power reduction), it does indirectly help to
avoid that problem by removing the motivation to choose a large partition power
when first deploying a cluster.
Problem Description
===================

The ring power determines the partition to which a resource (account, container
or object) is mapped. The partition is included in the path under which the
resource is stored in a backend filesystem. Changing the partition power
therefore requires relocating resources to new paths in backend filesystems.

In a heavily populated cluster a relocation process could be time-consuming and
so to avoid down-time it is desirable to relocate resources while the cluster
is still operating. However, it is necessary to do so without (temporary) loss
of access to data and without compromising the performance of processes such as
replication and auditing.
Proposed Change
===============

Overview
--------

The proposed solution avoids copying any file contents during a partition power
change. Objects are 'moved' from their current partition to a new partition,
but the current and new partitions are arranged to be on the same device, so
the 'move' is achieved using filesystem links without copying data.

(It may well be that the motivation for increasing partition power is to allow
a rebalancing of the ring. Any rebalancing would occur after the partition
power increase has completed - during partition power changes the ring balance
is not changed.)

To allow the cluster to continue operating during a partition power change (in
particular, to avoid any disruption or incorrect behavior of the replicator and
auditor processes), new partition directories are created in a separate
filesystem branch from the current partition directories. When all new
partition directories have been populated, the ring transitions to using the
new filesystem branch.

During this transition, object servers maintain links to resource files from
both the current and new partition directories. However, as already discussed,
no file content is duplicated or copied. The old partition directories are
eventually deleted.
Detailed description
--------------------

The process of changing a ring's partition power comprises three phases:

1. Preparation - during this phase the current partition directories continue
   to be used but existing resources are also linked to new partition
   directories in anticipation of the new ring partition power.

2. Switchover - during this phase the ring transitions to using the new
   partition directories; proxy and backend servers rollover to using the new
   ring partition power.

3. Cleanup - once all servers are using the new ring partition power,
   resource files in old partition directories are removed.

For simplicity, we describe the details of each phase in terms of an object
ring but note that the same process can be applied to account and container
rings and servers.
Preparation phase
^^^^^^^^^^^^^^^^^

During the preparation phase two new attributes are set in the ring file:

* the ring's `epoch`: if not already set, a new `epoch` attribute is added to
  the ring. The ring epoch is used to determine the parent directory for
  partition directories. Similar to the way in which a ring's policy index is
  appended to the `objects` directory name, the epoch will be prefixed to the
  `objects` directory name. For simplicity, the ring epoch will be a
  monotonically increasing integer starting at 0. A 'legacy' ring having no
  epoch attribute will be treated as having epoch 0.

* the `next_part_power` attribute indicates the partition power that will be
  used in the next epoch of the ring. The `next_part_power` attribute is used
  during the preparation phase to determine the partition directory in which
  an object should be stored in the next epoch of the ring.

At this point in time no other changes are made to the ring file:
the current part power and the mapping of partitions to devices are unchanged.
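The preparation-phase ring update can be sketched as follows. This is
illustrative only: the builder object and its `part_power`, `epoch` and
`next_part_power` attributes follow this proposal's naming, not any existing
swift API.

```python
# Illustrative sketch of the preparation-phase ring update; the attribute
# names follow this proposal and are not an existing swift API.
def prepare_part_power_increase(ring):
    # A 'legacy' ring with no epoch attribute is treated as epoch 0.
    if getattr(ring, 'epoch', None) is None:
        ring.epoch = 0
    # Record the partition power for the ring's next epoch; the current
    # part power and partition-to-device mapping are left unchanged.
    ring.next_part_power = ring.part_power + 1
    return ring
```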
The updated ring file is distributed to all servers. During this preparation
phase, proxy servers will continue to use the current ring partition mapping to
determine the backend url for objects. Object servers, along with replicator
and auditor processes, also continue to use the current ring
parameters. However, during PUT and DELETE operations object servers will
create additional links to object files in the object's future partition
directory in preparation for an eventual switchover to the ring's next
epoch. This does not require any additional copying or writing of object
contents.
The filesystem path for future partition directories is determined as follows.
In general, the path to an object file on an object server's filesystem has the
form::

    dev/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>/<ts>.<ext>

where:

* `epoch` is the ring's epoch, if non-zero
* `policy` is the object container's policy index, if non-zero
* `dev` is the device to which `partition` is mapped by the ring file
* `partition` is the object's partition,
  calculated using `partition = F(hash) >> (32 - P)`,
  where `P` is the ring partition power
* `suffix` is the last three characters of `hash`
* `hash` is a hash of the object name
* `ts` is the object timestamp
* `ext` is the filename extension (`data`, `meta` or `ts`)

Given `next_part_power` and `epoch` in the ring file, it is possible to
calculate::

    future_partition = F(hash) >> (32 - next_part_power)
    next_epoch = epoch + 1

The future partition directory is then::

    dev/<next_epoch>-objects[-<policy>]/<future_partition>/<suffix>/<hash>/<ts>.<ext>

For example, consider a ring in its first epoch, with current partition power
P, containing an object currently in partition X, where 0 <= X < 2**P. If the
partition power is increased by 1 (doubling the number of partitions), the
object's future partition will be either 2X or 2X+1 in the ring's next epoch.
During a DELETE an additional filesystem link will be created at one of::

    dev/1-objects/<2X>/<suffix>/<hash>/<ts>.ts
    dev/1-objects/<2X+1>/<suffix>/<hash>/<ts>.ts
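The relationship between current and future partitions can be sketched in
Python. Taking `F(hash)` to be the top 32 bits of an md5 of the object path is
an illustrative assumption; swift's real hashing also mixes in a cluster-wide
hash path suffix.

```python
import hashlib


def partitions_for(obj_path, part_power, next_part_power):
    # F(hash): top 32 bits of an md5 of the object path (illustrative only;
    # swift's real hashing also involves a cluster-wide hash path suffix).
    name_hash = hashlib.md5(obj_path.encode()).hexdigest()
    top32 = int(name_hash[:8], 16)
    current = top32 >> (32 - part_power)
    future = top32 >> (32 - next_part_power)
    return current, future


# When the partition power increases by 1, partition X always becomes
# partition 2X or 2X+1 in the next epoch.
cur, fut = partitions_for('/a/c/o', 10, 11)
assert fut in (2 * cur, 2 * cur + 1)
```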
Once object servers are known to be using the updated ring file, a new relinker
process is started. The relinker prepares an object server's filesystem for a
partition power change by crawling the filesystem and linking existing objects
to future partition directories. The relinker determines each object's future
partition directory in the same way as described above for the object server.
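The relinker's core loop might look like the sketch below, assuming the
epoch-prefixed directory layout described earlier; the function name and its
handling of policies and errors are simplified assumptions rather than a
definitive implementation.

```python
import os


def relink_device(dev_path, epoch, next_part_power):
    # Crawl one device's current-epoch objects tree and hard-link every
    # object file into its future partition directory (no data is copied).
    src_root = os.path.join(
        dev_path, 'objects' if epoch == 0 else '%d-objects' % epoch)
    dst_root = os.path.join(dev_path, '%d-objects' % (epoch + 1))
    for part in os.listdir(src_root):
        part_dir = os.path.join(src_root, part)
        for suffix in os.listdir(part_dir):
            suffix_dir = os.path.join(part_dir, suffix)
            for hsh in os.listdir(suffix_dir):
                hash_dir = os.path.join(suffix_dir, hsh)
                # the top 32 bits of the hash determine the partition
                future_part = int(hsh[:8], 16) >> (32 - next_part_power)
                new_hash_dir = os.path.join(
                    dst_root, str(future_part), suffix, hsh)
                os.makedirs(new_hash_dir, exist_ok=True)
                for fname in os.listdir(hash_dir):
                    target = os.path.join(new_hash_dir, fname)
                    if not os.path.exists(target):
                        os.link(os.path.join(hash_dir, fname), target)
```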
The relinker does not remove links from current partition directories. Once the
relinker has successfully completed, every existing object should be linked
from both a current partition directory and a future partition directory. Any
subsequent object PUTs or DELETEs will be reflected in both the current and
future partition directory as described above.

To avoid newly created objects being 'lost', it is important that an object
server is using the updated ring file before the relinker process starts, in
order to guarantee that either the object server or the relinker creates future
partition links for every object. This may require object servers to be
restarted prior to the relinker process being started, or to otherwise report
that they have reloaded the ring file.

The relinker will report successful completion in a file
`/var/cache/swift/relinker.recon` that can be queried via (modified) recon
middleware.

Once the relinker process has successfully completed on all object servers, the
partition power change process may move on to the switchover phase.
Switchover phase
^^^^^^^^^^^^^^^^

To begin the switchover to using the next partition power, the ring file is
updated once more:

* the current partition power is stored as `previous_part_power`
* the current partition power is set to `next_part_power`
* `next_part_power` is set to None
* the ring's `epoch` is incremented
* the mapping of partitions to devices is re-created so that partitions 2X and
  2X+1 map to the same devices to which partition X was mapped in the previous
  epoch. This is a simple transformation. Since no object content is moved
  between devices the actual ring balance remains unchanged.
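The mapping transformation in the last bullet can be illustrated with a short
sketch, assuming the ring's assignments are stored as one array of device ids
per replica (in the style of swift's replica-to-partition-to-device tables);
the function name is hypothetical.

```python
def double_partitions(replica2part2dev):
    # Old partition X becomes new partitions 2X and 2X+1, assigned to the
    # same device, so doubling each entry preserves data placement.
    return [[dev for dev in row for _ in (0, 1)]
            for row in replica2part2dev]
```

For example, `double_partitions([[3, 7]])` yields `[[3, 3, 7, 7]]`: old
partition 0 (on device 3) becomes new partitions 0 and 1, still on device 3.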
The updated ring file is then distributed to all proxy and object servers.
Since ring file distribution and loading is not instantaneous, there is a
window of time during which a proxy server may direct object requests to either
an old partition or a current partition (note that the partitions previously
referred to as 'future' are now referred to as 'current'). Object servers will
therefore create additional filesystem links during PUT and DELETE requests,
pointing from old partition directories to files in the current partition
directories. The paths to the old partition directories are determined in the
same way as future partition directories were determined during the preparation
phase, but now using the `previous_part_power` and decrementing the current
ring `epoch`.

This means that if one proxy PUTs an object using a current partition, and
another proxy subsequently attempts to GET the object using the old partition,
the object will be found, since both current and old partitions map to the same
device. Similarly, if one proxy PUTs an object using the old partition and
another proxy then GETs the object using the current partition, the object will
be found in the current partition on the object server.

The object auditor and replicator processes are restarted to force reloading of
the ring file, and then commence to operate using the current ring parameters.
Cleanup phase
^^^^^^^^^^^^^

The cleanup phase may start once all servers are known to be using the updated
ring file. Once again, this may require servers to be restarted or to report
that they have reloaded the ring file during switchover.

A final update is made to the ring file: the `previous_part_power`
attribute is set to `None` and the ring file is once again distributed. Once
object servers have reloaded the updated ring file they will cease to create
object file links in old partition directories.

At this point the old partition directories may be deleted - there is no need
to create tombstone files when deleting objects in the old partitions since
these partition directories are no longer used by any swift process.

A cleanup process will crawl the filesystem and delete any partition
directories that are not part of the current epoch or a future epoch. This
cleanup process should repeat periodically in case any devices that were
offline during the partition power change come back online - the old epoch
partition directories discovered on those devices may be deleted. Normal
replication may cause current epoch partition directories to be created on a
resurrected disk.
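Such a cleanup crawl might be sketched as follows; the directory-name pattern
assumes the epoch-prefixed naming described above, and the function name is
hypothetical.

```python
import os
import re
import shutil


def cleanup_device(dev_path, current_epoch):
    # Remove objects directories from epochs older than the current one;
    # current and future epoch directories are left alone.
    for entry in os.listdir(dev_path):
        m = re.match(r'^(?:(\d+)-)?objects(?:-\d+)?$', entry)
        if not m:
            continue  # not an objects[-policy] directory
        epoch = int(m.group(1) or 0)
        if epoch < current_epoch:
            shutil.rmtree(os.path.join(dev_path, entry))
```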
(The cleanup function could be added to an existing process such as the
auditor.)
Other considerations
--------------------

swift-dispersion-[populate|report]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The swift-dispersion-[populate|report] tools will need to be made epoch-aware.
After increasing partition power, swift-dispersion-populate may need to be
run to achieve the desired coverage. (Although initially the device coverage
will remain unchanged, the percentage of partitions covered will have reduced
by whatever factor the partition power has increased.)
Auditing
^^^^^^^^

During preparation and switchover, the auditor may find a corrupt object. The
quarantine directory is not in the epoch partition directory filesystem branch,
so a quarantined object will not be lost when old partitions are deleted.

The quarantining of an object in a current partition directory will not remove
the object from a future partition, so after switchover the auditor will
discover the object again, and quarantine it again. The diskfile quarantine
renamer could optionally be made 'relinker' aware and unlink duplicate object
references when quarantining an object.
Alternatives
------------

Prior work
^^^^^^^^^^

The swift_ring_tool_ enables ring power increases while swift services are
disabled. It takes a similar approach to this proposal in that the ring
mapping is changed so that every resource remains on the same device when
moved to its new partition. However, new partitions are created in the
same filesystem branch as existing partitions (hence the need for services
to be suspended during the relocation).

.. _swift_ring_tool: https://github.com/enovance/swift-ring-tool/

Previous proposals have been made to upstream swift:

https://bugs.launchpad.net/swift/+bug/933803 suggests a 'same-device'
partition re-mapping, as does this proposal, but did not provide for
relocation of resources to new partition directories.

https://review.openstack.org/#/c/21888/ suggests maintaining a partition power
per device (so only new devices use the increased partition power) but appears
to have been abandoned due to complexities with replication.
Create future partitions in existing `objects[-policy]` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The duplication of filesystem entries for objects and creation of (potentially
duplicate) partitions during the preparation phase could have undesirable
effects on other backend processes if they are not isolated in another
filesystem branch.

For example, the object replicator is likely to discover newly created future
partition directories that appear to be 'misplaced'. The replicator will
attempt to sync these to their primary nodes (according to the old ring
mapping), which is unnecessary. Worse, the replicator might then delete the
future partitions from their current nodes, undoing the work of the relinker
process.

If the replicator were to adopt the future ring mappings from the outset of the
preparation phase then the same problems arise with respect to current
partitions that now appear to be misplaced. Furthermore, the replication
process is likely to race with the relinker process on remote nodes to
populate future partitions: if relocation proceeds faster on node A than B then
the replicator may start to sync objects from A to B, which is again
unnecessary and expensive.

The auditor will also be impacted as it will discover objects in the future
partition directories and audit them, being unable to distinguish them as
duplicates of the object still stored in the current partition.

These issues could of course be avoided by disabling replication and auditing
during the preparation phase, but instead we propose to make the future ring
partition naming be mutually exclusive from current ring partition naming, and
simply restrict the replicator and auditor to only process partitions that are
in the current ring partition set. In other words, we isolate these processes
from the future partition directories that are being created by the relinker.
Use mutually exclusive future partitions in existing `objects` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current algorithm for calculating the partition for an object is to
calculate a 32 bit hash of the object and then use its P most significant bits,
resulting in partitions in the range [0, 2**P - 1], i.e.::

    part = H(object name) >> (32 - P)

A ring with partition power P+1 will re-use all the partition numbers of a ring
with partition power P.

To eliminate overlap of future ring partitions with current ring partitions we
could change the partition number algorithm to add an offset to each partition
number when a ring's partition power is increased::

    offset = 2**P
    part = (H(object name) >> (32 - P)) + offset

This is backwards compatible: if `offset` is not defined in a ring file then it
is set to zero.

To ensure that partition numbers remain < 2**32, this change will reduce the
maximum partition power from 32 to 31.
Proxy servers start to use the new ring at outset of relocation phase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This would mean that GETs to backends would use the new ring's partitions in
object urls. Objects may not yet have been relocated to their new partition
directory and the object servers would therefore need to fall back to looking
in the old ring partition for the object. PUTs and DELETEs to the new partition
would need to be made conditional upon a newer object timestamp not existing in
the old location. This is more complicated than the proposed method.

Enable partition power reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ring power reduction is not easily achieved with the approach presented in this
proposal because there is no guarantee that partitions in the current epoch
that will be merged into partitions in the next epoch are located on the same
device. File contents are therefore likely to need copying between devices
during a preparation phase.
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  alistair.coles@hp.com

Work Items
----------

#. modify ring classes to support new attributes
#. modify ringbuilder to manage new attributes
#. modify backend servers to duplicate links to files in future epoch partition
   directories
#. make backend servers and relinker report their status in a way that recon
   can report, e.g. servers report when a new ring epoch has been loaded, the
   relinker reports when all relinking has been completed
#. make recon support reporting these states
#. modify code that assumes storage-directory is objects[-policy_index] to
   be aware of epoch prefix
#. make swift-dispersion-populate and swift-dispersion-report epoch-aware
#. implement relinker daemon
#. document process
Repositories
------------

No new git repositories will be created.

Servers
-------

No new servers are created.

DNS Entries
-----------

No DNS entries will need to be created or updated.

Documentation
-------------

The process will be documented in the administrator's guide. Additions will be
made to the ring-builder documents.

Security
--------

No security issues are foreseen.

Testing
-------

Unit tests will be added for changes to ring-builder, ring classes and
object server.

Probe tests will be needed to verify the process of increasing ring power.

Functional tests will be unchanged.

Dependencies
============

None