=========
The Rings
=========

The rings determine where data should reside in the cluster. There is a
separate ring for account databases, container databases, and individual
object storage policies but each ring works in the same way. These rings are
externally managed. The server processes themselves do not modify the
rings; they are instead given new rings modified by other tools.

The ring uses a configurable number of bits from the MD5 hash of an item's path
as a partition index that designates the device(s) on which that item should
be stored. The number of bits kept from the hash is known as the partition
power, and 2 to the partition power indicates the partition count. Partitioning
the full MD5 hash ring allows the cluster components to process resources in
batches. This ends up either more efficient or at least less complex than
working with each item separately or the entire cluster all at once.
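
As a quick illustration of the relationship described above (values are
hypothetical, not taken from Swift's code)::

    part_power = 16                      # chosen when the ring is created
    partition_count = 2 ** part_power    # 65536 partitions in this example
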

Another configurable value is the replica count, which indicates how many
devices to assign for each partition in the ring. By having multiple devices
responsible for each partition, the cluster can recover from drive or network
failures.

Devices are added to the ring to describe the capacity available for
partition replica assignments. Devices are placed into failure domains
consisting of region, zone, and server. Regions can be used to describe
geographical systems characterized by lower bandwidth or higher latency between
machines in different regions. Many rings will consist of only a single
region. Zones can be used to group devices based on physical locations, power
separations, network separations, or any other attribute that would lessen
multiple replicas being unavailable at the same time.

Devices are given a weight which describes the relative storage capacity
contributed by the device in comparison to other devices.

When building a ring, replicas for each partition will be assigned to devices
according to the devices' weights. Additionally, each replica of a partition
will preferentially be assigned to a device whose failure domain does not
already have a replica for that partition. Only a single replica of a
partition may be assigned to each device - you must have at least as many
devices as replicas.

.. _ring_builder:

------------
Ring Builder
------------

The rings are built and managed manually by a utility called the ring-builder.
The ring-builder assigns partitions to devices and writes an optimized
structure to a gzipped, serialized file on disk for shipping out to the
servers. The server processes check the modification time of the file
occasionally and reload their in-memory copies of the ring structure as needed.
Because of how the ring-builder manages changes to the ring, using a slightly
older ring usually just means that for a subset of the partitions the device
for one of the replicas will be incorrect, which can be easily worked around.
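
As a rough sketch of how a server process consumes such a file (argument
names here are illustrative and may differ between Swift versions), a ring
can be loaded with the ``Ring`` class and queried for an item's devices::

    from swift.common.ring import Ring

    # Loads /etc/swift/object.ring.gz and rechecks its mtime periodically.
    object_ring = Ring('/etc/swift', ring_name='object')
    partition, nodes = object_ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')
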

The ring-builder also keeps a separate builder file which includes the ring
information as well as additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files. One
option is to copy the builder files out to every server while copying the ring
files themselves. Another is to upload the builder files into the cluster
itself. Complete loss of a builder file will mean creating a new ring from
scratch; nearly all partitions will end up assigned to different devices, and
therefore nearly all data stored will have to be replicated to new locations.
So, recovery from a builder file loss is possible, but data will definitely be
unreachable for an extended time.

-------------------
Ring Data Structure
-------------------

The ring data structure consists of three top level fields: a list of devices
in the cluster, a list of lists of device ids indicating partition to device
assignments, and an integer indicating the number of bits to shift an MD5 hash
to calculate the partition for the hash.

***************
List of Devices
***************

The list of devices is known internally to the Ring class as ``devs``. Each
item in the list of devices is a dictionary with the following keys:

====== ======= ==============================================================
id     integer The index into the list of devices.
zone   integer The zone in which the device resides.
region integer The region in which the zone resides.
weight float   The relative weight of the device in comparison to other
               devices. This usually corresponds directly to the amount of
               disk space the device has compared to other devices. For
               instance a device with 1 terabyte of space might have a weight
               of 100.0 and another device with 2 terabytes of space might
               have a weight of 200.0. This weight can also be used to bring
               back into balance a device that has ended up with more or less
               data than desired over time. A good average weight of 100.0
               allows flexibility in lowering the weight later if necessary.
ip     string  The IP address or hostname of the server containing the device.
port   int     The TCP port on which the server process listens to serve
               requests for the device.
device string  The on-disk name of the device on the server.
               For example: ``sdb1``
meta   string  A general-use field for storing additional information for the
               device. This information isn't used directly by the server
               processes, but can be useful in debugging. For example, the
               date and time of installation and hardware manufacturer could
               be stored here.
====== ======= ==============================================================
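
For illustration only, a single entry in ``devs`` might look like the
following (the values are made up, and additional keys such as
``replication_ip`` may also be present depending on the ring version)::

    {'id': 1, 'region': 1, 'zone': 2, 'weight': 100.0,
     'ip': '203.0.113.12', 'port': 6200, 'device': 'sdb1',
     'meta': 'installed 2017-05-16'}
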

.. note::
    The list of devices may contain holes, or indexes set to ``None``, for
    devices that have been removed from the cluster. However, device ids are
    reused in order to avoid potentially running out of device id slots when
    there are available slots (from prior removal of devices). A consequence
    of this device id reuse is that the device id (integer value) does not
    necessarily correspond with the chronology of when the device was added
    to the ring. Also, some devices may be temporarily disabled by setting
    their weight to ``0.0``. To obtain a list of active devices (for uptime
    polling, for example) the Python code would look like::

        devices = list(self._iter_devs())

*************************
Partition Assignment List
*************************

The partition assignment list is known internally to the Ring class as
``_replica2part2dev_id``. This is a list of ``array('H')``\s, one for each
replica. Each ``array('H')`` has a length equal to the partition count for the
ring. Each integer in the ``array('H')`` is an index into the above list of
devices.

So, to create a list of device dictionaries assigned to a partition, the Python
code would look like::

    devices = [self.devs[part2dev_id[partition]]
               for part2dev_id in self._replica2part2dev_id]

``array('H')`` is used for memory conservation as there may be millions of
partitions.
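
For instance (an illustration, not code from Swift), one replica row for a
ring with partition power 16 is a compact array of 65536 unsigned 16-bit
device ids rather than a list of Python ints::

    from array import array

    row = array('H', [0] * 2 ** 16)   # one device id slot per partition
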

*********************
Partition Shift Value
*********************

The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash of an item's path to
calculate the partition on which the data for that item should reside. Only the
top four bytes of the hash are used in this process. For example, to compute
the partition for the path ``/account/container/object``, the Python code might
look like::

    from hashlib import md5
    import struct

    objhash = md5('/account/container/object'.encode('utf8')).digest()
    partition = struct.unpack_from('>I', objhash)[0] >> self._part_shift

For a ring generated with partition power ``P``, the partition shift value is
``32 - P``.
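
As a self-contained illustration of the same calculation (hypothetical
values, not Swift code)::

    import struct
    from hashlib import md5

    part_power = 16
    part_shift = 32 - part_power
    digest = md5(b'/AUTH_test/photos/cat.jpg').digest()
    partition = struct.unpack_from('>I', digest)[0] >> part_shift
    assert 0 <= partition < 2 ** part_power
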

*******************
Fractional Replicas
*******************

A ring is not restricted to having an integer number of replicas. In order to
support the gradual changing of replica counts, the ring is able to have a real
number of replicas.

When the number of replicas is not an integer, the last element of
``_replica2part2dev_id`` will have a length that is less than the partition
count for the ring. This means that some partitions will have more replicas
than others. For example, if a ring has ``3.25`` replicas, then 25% of its
partitions will have four replicas, while the remaining 75% will have just
three.
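
Sketching the resulting structure (illustrative only): with a partition power
of 16 and ``3.25`` replicas, the first three rows of ``_replica2part2dev_id``
cover every partition while the last row covers only a quarter of them::

    part_count = 2 ** 16
    row_lengths = [part_count, part_count, part_count, part_count // 4]
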

.. _ring_dispersion:

**********
Dispersion
**********

With each rebalance, the ring builder calculates a dispersion metric. This is
the percentage of partitions in the ring that have too many replicas within a
particular failure domain.

For example, if you have three servers in a cluster but two replicas for a
partition get placed onto the same server, that partition will count towards
the dispersion metric.

A lower dispersion value is better, and the value can be used to find the
proper value for "overload".
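
A much-simplified sketch of the idea behind the metric (not Swift's actual
calculation, which works tier by tier over the whole topology)::

    from collections import Counter

    def dispersion_pct(replica2part2dev, dev2server, max_per_server=1):
        """Percent of partitions with too many replicas on one server."""
        part_count = len(replica2part2dev[0])
        bad = 0
        for part in range(part_count):
            servers = Counter(dev2server[row[part]]
                              for row in replica2part2dev
                              if part < len(row))
            if any(n > max_per_server for n in servers.values()):
                bad += 1
        return 100.0 * bad / part_count
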

.. _ring_overload:

********
Overload
********

The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it can't do both, the overload
factor determines what happens. Each device may take some extra
fraction of its desired partitions to allow for replica dispersion;
once that extra fraction is exhausted, replicas will be placed closer
together than is optimal for durability.
|
Add notion of overload to swift-ring-builder
The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.
Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.
Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.
If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.
This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.
For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
Overload only has an effect when replica-dispersion and device weights
come into conflict.
The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
DocImpact
Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
2014-12-17 13:48:42 -08:00
|
|
|
|
|
|
|
|
|
Essentially, the overload factor lets the operator trade off replica
|
2017-05-16 14:33:31 -07:00
|
|
|
|
dispersion (durability) against device balance (uniform disk usage).
|
Add notion of overload to swift-ring-builder
The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.
Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.
Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.
If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.
This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.
For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
Overload only has an effect when replica-dispersion and device weights
come into conflict.
The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
DocImpact
Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
2014-12-17 13:48:42 -08:00
|
|
|
|
|
2017-05-16 14:33:31 -07:00
|
|
|
|
The default overload factor is ``0``, so device weights will be strictly
|
Add notion of overload to swift-ring-builder
The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.
Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.
Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.
If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.
This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.
For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
Overload only has an effect when replica-dispersion and device weights
come into conflict.
The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
DocImpact
Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
2014-12-17 13:48:42 -08:00
|
|
|
|
followed.
|
|
|
|
|
|
2017-05-16 14:33:31 -07:00
|
|
|
|
With an overload factor of ``0.1``, each device will accept 10% more
|
Add notion of overload to swift-ring-builder
The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.
Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.
Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.
If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.
This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.
For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
Overload only has an effect when replica-dispersion and device weights
come into conflict.
The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
DocImpact
Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
2014-12-17 13:48:42 -08:00
|
|
|
|
partitions than it otherwise would, but only if needed to maintain
|
2017-05-16 14:33:31 -07:00
|
|
|
|
dispersion.
|
Add notion of overload to swift-ring-builder
The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.
Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.
Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.
If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.
This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.
For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
Overload only has an effect when replica-dispersion and device weights
come into conflict.
The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
DocImpact
Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
2014-12-17 13:48:42 -08:00
|
|
|
|
|
|
|
|
|
Example: Consider a 3-node cluster of machines with equal-size disks;
|
|
|
|
|
let node A have 12 disks, node B have 12 disks, and node C have only
|
2017-05-16 14:33:31 -07:00
|
|
|
|
11 disks. Let the ring have an overload factor of ``0.1`` (10%).
|
Add notion of overload to swift-ring-builder
The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.
Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.
Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.
If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.
This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.
For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
Overload only has an effect when replica-dispersion and device weights
come into conflict.
The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
DocImpact
Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
2014-12-17 13:48:42 -08:00
|
|
|
|
|
|
|
|
|
Without the overload, some partitions would end up with replicas only
|
|
|
|
|
on nodes A and B. However, with the overload, every device is willing
|
|
|
|
|
to accept up to 10% more partitions for the sake of dispersion. The
|
|
|
|
|
missing disk in C means there is one disk's worth of partitions that
|
|
|
|
|
would like to spread across the remaining 11 disks, which gives each
|
|
|
|
|
disk in C an extra 9.09% load. Since this is less than the 10%
|
|
|
|
|
overload, there is one replica of each partition on each node.

However, this does mean that the disks in node C will have more data
on them than the disks in nodes A and B. If 80% full is the warning
threshold for the cluster, node C's disks will reach 80% full while A
and B's disks are only about 73.3% full.
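
The arithmetic behind this example (illustrative only)::

    # Each node must hold one full replica, i.e. twelve disks' worth of data.
    extra_per_c_disk = 12 / 11 - 1      # ~0.0909, i.e. 9.09% extra load
    assert extra_per_c_disk < 0.1       # fits within the 10% overload

    # When node C's fuller disks hit the 80% warning threshold:
    a_and_b_fill = 0.80 / (12 / 11)     # ~0.733, i.e. about 73.3% full
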

-------------------------------
Partition & Replica Terminology
-------------------------------

All descriptions of consistent hashing describe the process of breaking the
keyspace up into multiple ranges (vnodes, buckets, etc.) - many more than the
number of "nodes" to which keys in the keyspace must be assigned. Swift calls
these ranges `partitions` - they are partitions of the total keyspace.

Each partition will have multiple replicas. Every replica of each partition
must be assigned to a device in the ring. When describing a specific replica
of a partition (like when it's assigned a device), it is described as a
`part-replica` in that it is a specific `replica` of the specific `partition`.
A single device will likely be assigned different replicas from many
partitions, but it may not be assigned multiple replicas of a single partition.

The total number of partitions in a ring is calculated as ``2 **
<part-power>``. The total number of part-replicas in a ring is calculated as
``<replica-count> * 2 ** <part-power>``.

When considering a device's `weight` it is useful to describe the number of
part-replicas it would like to be assigned. A single device, regardless of
weight, will never hold more than ``2 ** <part-power>`` part-replicas because
it cannot have more than one replica of any partition assigned. The number of
part-replicas a device can take by weight is calculated as its `parts-wanted`.
The true number of part-replicas assigned to a device can be compared to its
parts-wanted similarly to a calculation of percentage error - this deviation in
the observed result from the idealized target is called a device's `balance`.

When considering a device's `failure domain` it is useful to describe the
number of part-replicas it would like to be assigned. The number of
part-replicas wanted in a failure domain of a tier is the sum of the
part-replicas wanted in the failure domains of its sub-tier. However, when the
total number of part-replicas in a failure domain exceeds or is equal to ``2 **
<part-power>``, it is no longer sufficient to consider only the number of total
part-replicas; rather, we must consider the fraction of each replica's
partitions. Consider for example a ring with 3 replicas and 3 servers: while
dispersion requires that each server hold only ⅓ of the total part-replicas,
placement is additionally constrained to require ``1.0`` replica of *each*
partition per server. It would not be sufficient to satisfy dispersion if two
devices on one of the servers each held a replica of a single partition, while
another server held none. By considering a decimal fraction of one replica's
worth of partitions in a failure domain we can derive the total part-replicas
wanted in a failure domain (``1.0 * 2 ** <part-power>``). Additionally we infer
more about `which` part-replicas must go in the failure domain. Consider a ring
with three replicas and two zones, each with two servers (four servers total).
The three replicas' worth of partitions will be assigned into two failure
domains at the zone tier. Each zone must hold more than one replica of some
partitions. We represent this improper fraction of a replica's worth of
partitions in decimal form as ``1.5`` (``3.0 / 2``). This tells us not only the
*number* of total partitions (``1.5 * 2 ** <part-power>``) but also that *each*
partition must have `at least` one replica in this failure domain (in fact
``0.5`` of the partitions will have 2 replicas). Within each zone the two
servers will hold ``0.75`` of a replica's worth of partitions - this is equal
both to "the fraction of a replica's worth of partitions assigned to each zone
(``1.5``) divided evenly among the number of failure domains in its sub-tier (2
servers in each zone, i.e. ``1.5 / 2``)" but *also* "the total number of
replicas (``3.0``) divided evenly among the total number of failure domains in
the server tier (2 servers × 2 zones = 4, i.e. ``3.0 / 4``)". It is useful to
consider that each server in this ring will hold only ``0.75`` of a replica's
worth of partitions, which tells us that any server should have `at most` one
replica of a given partition assigned. In the interests of brevity, some
variable names will often refer to the concept representing the fraction of a
replica's worth of partitions in decimal form as *replicanths* - this is meant
to invoke connotations similar to ordinal numbers as applied to fractions, but
generalized to a replica instead of a four\*th* or a fif\*th*. The "n" was
probably thrown in because of Blade Runner.
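
The replicanth arithmetic from the example above, written out (illustrative
only)::

    replicas = 3.0
    zones = 2
    servers_per_zone = 2

    per_zone = replicas / zones                  # 1.5 replicanths per zone
    per_server = per_zone / servers_per_zone     # 0.75 replicanths per server
    assert per_server == replicas / (zones * servers_per_zone)
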

-----------------
Building the Ring
-----------------

First the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on weight.

Then the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on dispersion.

Then the ring builder calculates the maximum deviation on a single device
between its weighted replicanths and wanted replicanths.

Next we interpolate between the two replicanth values (weighted & wanted) at
each tier using the specified overload (up to the maximum required overload).
It's a linear interpolation, similar to solving for a point on a line between
two points - we calculate the slope across the max required overload and then
calculate the intersection of the line with the desired overload. This
becomes the target.
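
A rough sketch of that interpolation for a single tier (not the actual
RingBuilder code)::

    def target_replicanths(weighted, wanted, overload, max_overload):
        """Slide from weight-based toward dispersion-based placement."""
        if max_overload <= 0:
            return wanted
        factor = min(overload, max_overload) / max_overload
        return weighted + (wanted - weighted) * factor
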

From the target we calculate the minimum and maximum number of replicas any
partition may have in a tier. This becomes the `replica-plan`.

Finally, we calculate the number of partitions that should ideally be assigned
to each device based on the replica-plan.

On initial balance (i.e., the first time partitions are placed to generate a
ring) we must assign each replica of each partition to the device that desires
the most partitions, excluding any devices that already have their maximum
number of replicas of that partition assigned to some parent tier of that
device's failure domain.

When building a new ring based on an old ring, the desired number of
partitions each device wants is recalculated from the current replica-plan.
Next the partitions to be reassigned are gathered up. Any removed devices have
all their assigned partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new devices) can be spread out
for better durability are unassigned and added to the gathered list. Any
devices that have more partitions than they now desire have random partitions
unassigned from them and added to the gathered list. Lastly, the gathered
partitions are then reassigned to devices using a similar method as in the
initial assignment described above.

Whenever a partition has a replica reassigned, the time of the reassignment is
recorded. This is taken into account when gathering partitions to reassign so
that no partition is moved twice in a configurable amount of time. This
configurable amount of time is known internally to the RingBuilder class as
``min_part_hours``. This restriction is ignored for replicas of partitions on
devices that have been removed, as device removal should only happen on device
failure and there's no choice but to make a reassignment.
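
Conceptually (a sketch, not the RingBuilder implementation), the gathering
step skips partitions whose replicas moved too recently::

    import time

    def may_move(part, last_moved, min_part_hours, dev_removed=False):
        if dev_removed:
            return True   # no choice when the device is gone
        return time.time() - last_moved[part] >= min_part_hours * 3600
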

The above processes don't always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more balanced
ring, the rebalance process is repeated up to a fixed number of times, until
the replica-plan is fulfilled or cannot be fulfilled (indicating we probably
can't get perfect balance due to too many partitions recently moved).

.. _composite_rings:

---------------
Composite Rings
---------------

.. automodule:: swift.common.ring.composite_builder

---------------------
Ring Builder Analyzer
---------------------

.. automodule:: swift.cli.ring_builder_analyzer

-------
History
-------

The ring code went through many iterations before arriving at what it is now
and while it has largely been stable, the algorithm has seen a few tweaks and
even some fundamental changes as new ideas emerged. This section describes the
previous ideas that were attempted and explains why they were discarded.

A "live ring" option was considered where each server could maintain its own
copy of the ring and the servers would use a gossip protocol to communicate the
changes they made. This was discarded as too complex and error prone to code
correctly in the project timespan available. One bug could easily gossip bad
data out to the entire cluster and be difficult to recover from. Having an
externally managed ring simplifies the process, allows full validation of data
before it's shipped out to the servers, and guarantees each server is using a
ring from the same timeline. It also means that the servers themselves aren't
spending a lot of resources maintaining rings.

A couple of "ring server" options were considered. One was where all ring
lookups would be done by calling a service on a separate server or set of
servers, but this was discarded due to the latency involved. Another was much
like the current process but where servers could submit change requests to the
ring server to have a new ring built and shipped back out to the servers. This
was discarded due to project time constraints and because ring changes are
currently infrequent enough that manual control was sufficient. However, lack
of quick automatic ring changes did mean that other components of the system
had to be coded to handle devices being unavailable for a period of hours until
someone could manually update the ring.

The current ring process has each replica of a partition independently assigned
to a device. A version of the ring that used a third of the memory was tried,
where the first replica of a partition was directly assigned and the other two
were determined by "walking" the ring until finding additional devices in other
zones. This was discarded due to the loss of control over how many replicas for
a given partition moved at once. Keeping each replica independent allows for
moving only one partition replica within a given time window (except due to
device failures). Using the additional memory was deemed a good trade-off for
moving data around the cluster much less often.

Another ring design was tried where the partition to device assignments weren't
stored in a big list in memory but instead each device was assigned a set of
hashes, or anchors. The partition would be determined from the data item's hash
and the nearest device anchors would determine where the replicas should be
stored. However, to get reasonable distribution of data each device had to have
a lot of anchors and walking through those anchors to find replicas started to
add up. In the end, the memory savings wasn't that great and more processing
power was used, so the idea was discarded.

A completely non-partitioned ring was also tried but discarded as the
partitioning helps many other components of the system, especially replication.
Replication can be attempted and retried in a partition batch with the other
replicas rather than each data item independently attempted and retried. Hashes
of directory structures can be calculated and compared with other replicas to
reduce directory walking and network traffic.

Partitioning and independently assigning partition replicas also allowed for
the best-balanced cluster. The best of the other strategies tended to give
±10% variance on device balance with devices of equal weight and ±15% with
devices of varying weights. The current strategy allows us to get ±3% and ±8%
respectively.

Various hashing algorithms were tried. SHA offers better security, but the ring
doesn't need to be cryptographically secure and SHA is slower. Murmur was much
faster, but MD5 was built-in and hash computation is a small percentage of the
overall request handling time. In all, once it was decided the servers wouldn't
be maintaining the rings themselves anyway and only doing hash lookups, MD5 was
chosen for its general availability, good distribution, and adequate speed.

The placement algorithm has seen a number of behavioral changes for
unbalanceable rings. The ring builder wants to keep replicas as far apart as
possible while still respecting device weights. In most cases, the ring
builder can achieve both, but sometimes they conflict. At first, the behavior
was to keep the replicas far apart and ignore device weight, but that made it
impossible to gradually go from one region to two, or from two to three. Then
it was changed to favor device weight over dispersion, but that wasn't so good
for rings that were close to balanceable, like 3 machines with 60TB, 60TB, and
57TB of disk space; operators were expecting one replica per machine, but
didn't always get it. After that, overload was added to the ring builder so
that operators could choose a balance between dispersion and device weights.
In time the overload concept was improved and made more accurate.

For more background on consistent hashing rings, please see
:doc:`ring_background`.