97334b6859
This patch is necessary because of I53e999fc91336871e1c32c70745f7d7cf2e256cf. The following unicode characters will be removed: * “...” * ‘...’ * ― and — Change-Id: If11a2d4ebd98b53f9f0d077b319983735f2e4b6b
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-ringbuilder">
    <title>Ring-builder</title>
    <para>Use the swift-ring-builder utility to build and manage rings. This
        utility assigns partitions to devices and writes an optimized
        Python structure to a gzipped, serialized file on disk for
        transmission to the servers. The server processes occasionally
        check the modification time of the file and reload in-memory
        copies of the ring structure as needed. Because of the way the
        ring-builder manages changes to the ring, a server that uses a
        slightly older version of the ring has one of the three replicas
        wrong for a subset of the partitions. You can work around
        this issue.</para>
    <para>The ring-builder also keeps its own builder file with the
        ring information and additional data required to build future
        rings. It is very important to keep multiple backup copies of
        these builder files. One option is to copy the builder files
        out to every server while copying the ring files themselves.
        Another is to upload the builder files into the cluster
        itself. If you lose the builder file, you have to create a new
        ring from scratch. Nearly all partitions would then be assigned
        to different devices and, therefore, nearly all of the stored
        data would have to be replicated to new locations. Recovery
        from the loss of a builder file is therefore possible, but data
        would be unreachable for an extended time.</para>
    <section xml:id="section_ring-data-structure">
        <title>Ring data structure</title>
        <para>The ring data structure consists of three top-level
            fields: a list of devices in the cluster, a list of lists
            of device IDs indicating partition to device assignments,
            and an integer indicating the number of bits to shift an
            MD5 hash to calculate the partition for the hash.</para>
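        <para>As a rough illustration only, the three fields could be
            pictured as the following Python structure. The dictionary
            keys, device attributes, and sizes are assumptions made for
            this sketch and are not taken from the on-disk format.</para>

```python
from array import array

# Hypothetical two-replica ring with 4 partitions (partition power of 2).
ring = {
    # List of device dictionaries; the keys shown here are illustrative.
    "devs": [
        {"id": 0, "zone": 1, "ip": "10.0.0.1", "device": "sdb"},
        {"id": 1, "zone": 2, "ip": "10.0.0.2", "device": "sdb"},
    ],
    # One array('H') per replica; each entry is an index into "devs".
    "replica2part2dev_id": [
        array("H", [0, 1, 0, 1]),
        array("H", [1, 0, 1, 0]),
    ],
    # Number of bits to shift the hash: 32 minus the partition power.
    "part_shift": 32 - 2,
}

partition_count = len(ring["replica2part2dev_id"][0])
print(partition_count)  # 4
```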
    </section>
    <section xml:id="section_partition-assignment">
        <title>Partition assignment list</title>
        <para>The partition assignment list is a list of
            <literal>array('H')</literal> arrays of device IDs. The
            outermost list contains an
            <literal>array('H')</literal> for each replica. Each
            <literal>array('H')</literal> has a length equal to
            the partition count for the ring. Each integer in the
            <literal>array('H')</literal> is an index into the
            list of devices described above. The partition list is
            known internally to the Ring class as
            <literal>_replica2part2dev_id</literal>.</para>
        <para>So, to create a list of device dictionaries assigned to
            a partition, the Python code would look like:
            <programlisting>devices = [self.devs[part2dev_id[partition]] for
           part2dev_id in self._replica2part2dev_id]</programlisting></para>
        <para>That code is a little simplistic because it does not
            account for the removal of duplicate devices. If a ring
            has more replicas than devices, a partition will have more
            than one replica on a device.</para>
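        <para>A minimal sketch of a lookup that also removes duplicate
            devices might look like the following. The sample data and
            the <literal>devices_for_partition</literal> helper are
            hypothetical and are not part of the Ring class.</para>

```python
from array import array

devs = [{"id": 0, "device": "sdb"}, {"id": 1, "device": "sdc"}]
# Three replicas but only two devices, so some device must repeat.
replica2part2dev_id = [
    array("H", [0, 1]),
    array("H", [1, 0]),
    array("H", [0, 0]),
]


def devices_for_partition(partition):
    """Return the distinct devices that hold replicas of a partition."""
    seen_ids = set()
    devices = []
    for part2dev_id in replica2part2dev_id:
        dev_id = part2dev_id[partition]
        if dev_id not in seen_ids:  # skip devices that are already listed
            seen_ids.add(dev_id)
            devices.append(devs[dev_id])
    return devices


print([d["id"] for d in devices_for_partition(0)])  # [0, 1]
```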
        <para>The ring uses <literal>array('H')</literal> to conserve
            memory because there can be millions of
            partitions.</para>
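        <para>To get a feel for the memory involved, the following
            sketch builds one replica row for a ring with a partition
            power of 20 and reports its size. The sizes printed depend
            on the platform; an unsigned short is at least 2 bytes and
            is 2 bytes on common platforms.</para>

```python
from array import array

partitions = 2 ** 20  # 1,048,576 partitions (partition power of 20)

# One replica row in which every partition points at device 0.
row = array("H", [0]) * partitions

# Bytes per entry, and bytes for the whole row of device indexes.
print(row.itemsize, row.itemsize * len(row))
```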
    </section>
    <section xml:id="section_fractional-replicas">
        <title>Replica counts</title>
        <para>To support the gradual change in replica counts, a ring
            can have a real number of replicas and is not restricted
            to an integer number of replicas.</para>
        <para>A fractional replica count is for the whole ring and not
            for individual partitions. It indicates the average number
            of replicas for each partition. For example, a replica
            count of 3.2 means that 20 percent of partitions have four
            replicas and 80 percent have three replicas.</para>
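        <para>The arithmetic behind that example can be sketched as
            follows. The <literal>replica_distribution</literal> helper
            and its rounding behavior are assumptions made for
            illustration; the ring builder may round differently.</para>

```python
def replica_distribution(replicas, partition_count):
    """Split partitions into those with int(replicas) copies and one more."""
    whole = int(replicas)
    # Partitions that carry one extra replica; rounding is an assumption.
    extra = int(round((replicas - whole) * partition_count))
    return {whole: partition_count - extra, whole + 1: extra}


print(replica_distribution(3.2, 1000))  # {3: 800, 4: 200}
```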
        <para>The replica count is adjustable.</para>
        <para>Example:</para>
        <screen><prompt>$</prompt> <userinput>swift-ring-builder account.builder set_replicas 4</userinput>
<prompt>$</prompt> <userinput>swift-ring-builder account.builder rebalance</userinput></screen>
        <para>You must rebalance the replica ring in globally
            distributed clusters. Operators of these clusters
            generally want an equal number of replicas and regions.
            Therefore, when an operator adds or removes a region, the
            operator adds or removes a replica. Removing unneeded
            replicas saves on the cost of disks.</para>
        <para>You can gradually increase the replica count at a rate
            that does not adversely affect cluster performance.</para>
        <para>For example:</para>
        <screen><prompt>$</prompt> <userinput>swift-ring-builder object.builder set_replicas 3.01</userinput>
<prompt>$</prompt> <userinput>swift-ring-builder object.builder rebalance</userinput>
<computeroutput>&lt;distribute rings and wait&gt;...</computeroutput>

<prompt>$</prompt> <userinput>swift-ring-builder object.builder set_replicas 3.02</userinput>
<prompt>$</prompt> <userinput>swift-ring-builder object.builder rebalance</userinput>
<computeroutput>&lt;distribute rings and wait&gt;...</computeroutput></screen>
        <para>Changes take effect only after the ring is rebalanced.
            Therefore, if you intend to change from 3 replicas to 3.01
            but you accidentally type <literal>2.01</literal>, no data
            is lost, because you can correct the value before you
            rebalance.</para>
        <para>Additionally, <command>swift-ring-builder
            <replaceable>X.builder</replaceable>
            create</command> can now take a decimal argument for
            the number of replicas.</para>
    </section>
    <section xml:id="section_partition-shift-value">
        <title>Partition shift value</title>
        <para>The partition shift value is known internally to the
            Ring class as <literal>_part_shift</literal>. This value
            is used to shift an MD5 hash to calculate the partition
            where the data for that hash should reside. Only the top
            four bytes of the hash are used in this process. For
            example, to compute the partition for the
            <literal>/account/container/object</literal> path, the
            Python code might look like the following code:
            <programlisting>from hashlib import md5
from struct import unpack_from

# Shift the top four bytes of the MD5 digest by the partition shift value
partition = unpack_from('>I',
    md5(b'/account/container/object').digest())[0] >> self._part_shift</programlisting></para>
        <para>For a ring generated with a <literal>part_power</literal>
            of P, the partition shift value is
            <literal>32 - P</literal>.</para>
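        <para>A self-contained variant of that calculation, which
            derives the shift from the partition power, could look like
            this. The <literal>get_partition</literal> function is a
            hypothetical helper for illustration, and the sketch ignores
            any additional input that a real cluster might mix into the
            hash.</para>

```python
from hashlib import md5
from struct import unpack_from


def get_partition(path, part_power):
    """Map a path to a partition using the top four bytes of its MD5 hash."""
    part_shift = 32 - part_power
    # '>I' reads the first four digest bytes as a big-endian unsigned integer.
    top_four_bytes = unpack_from(">I", md5(path.encode("utf-8")).digest())[0]
    return top_four_bytes >> part_shift


partition = get_partition("/account/container/object", part_power=20)
print(partition)
```

        <para>With a partition power of 20, the shift is 12 bits, so
            the result always falls in the range 0 to 1,048,575.</para>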
    </section>
    <section xml:id="section_build-ring">
        <title>Build the ring</title>
        <para>The ring builder process includes these high-level
            steps:</para>
        <orderedlist>
            <listitem>
                <para>The utility calculates the number of partitions to
                    assign to each device based on the weight of the
                    device. For example, for a partition power
                    of 20, the ring has 1,048,576 partitions. One
                    thousand devices of equal weight each want
                    1,048.576 partitions. The devices are sorted by
                    the number of partitions they desire and kept in
                    order throughout the initialization
                    process.</para>
                <note>
                    <para>Each device is also assigned a random
                        tiebreaker value that is used when two devices
                        desire the same number of partitions. This
                        tiebreaker is not stored on disk anywhere, and
                        so two different rings created with the same
                        parameters will have different partition
                        assignments. For repeatable partition
                        assignments,
                        <literal>RingBuilder.rebalance()</literal>
                        takes an optional seed value that seeds the
                        Python pseudo-random number generator.</para>
                </note>
            </listitem>
            <listitem>
                <para>The ring builder assigns each partition replica
                    to the device that desires the most partitions at
                    that point while keeping it as far away as
                    possible from other replicas. The ring builder
                    prefers to assign a replica to a device in a
                    region that does not already have a replica. If no
                    such region is available, the ring builder
                    searches for a device in a different zone, or on a
                    different server. If it does not find one, it
                    looks for a device with no replicas. Finally, if
                    all options are exhausted, the ring builder
                    assigns the replica to the device that has the
                    fewest replicas already assigned.</para>
                <note>
                    <para>The ring builder assigns multiple replicas
                        to one device only if the ring has fewer
                        devices than it has replicas.</para>
                </note>
            </listitem>
            <listitem>
                <para>When building a new ring from an old ring, the
                    ring builder recalculates the number of
                    partitions that each device wants.</para>
            </listitem>
            <listitem>
                <para>The ring builder unassigns partitions and
                    gathers these partitions for reassignment, as
                    follows: <itemizedlist>
                        <listitem>
                            <para>The ring builder unassigns any
                                assigned partitions from any removed
                                devices and adds these partitions to
                                the gathered list.</para>
                        </listitem>
                        <listitem>
                            <para>The ring builder unassigns any
                                partition replicas that can be spread
                                out for better durability and adds
                                these partitions to the gathered
                                list.</para>
                        </listitem>
                        <listitem>
                            <para>The ring builder unassigns random
                                partitions from any devices that have
                                more partitions than they need and
                                adds these partitions to the gathered
                                list.</para>
                        </listitem>
                    </itemizedlist></para>
            </listitem>
            <listitem>
                <para>The ring builder reassigns the gathered
                    partitions to devices by using a method similar to
                    the one described previously.</para>
            </listitem>
            <listitem>
                <para>When the ring builder reassigns a replica of a
                    partition, the ring builder records the time of
                    the reassignment. The ring builder uses this value
                    when it gathers partitions for reassignment so
                    that no partition is moved twice in a configurable
                    amount of time. This configurable amount of time
                    is known internally to the RingBuilder class as
                    <literal>min_part_hours</literal>. The ring
                    builder ignores this restriction for replicas of
                    partitions on removed devices because removal of a
                    device happens on device failure only, and
                    reassignment is the only choice.</para>
            </listitem>
        </orderedlist>
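        <para>Two of the steps above can be sketched in a few lines of
            Python: the weight-based calculation of how many partitions
            each device wants, and the
            <literal>min_part_hours</literal> check. Both functions and
            the device names are hypothetical simplifications written
            for this example; they are not the RingBuilder
            implementation, and the first one follows the simple
            example in the text by not multiplying by the replica
            count.</para>

```python
def desired_partitions(weights, part_power):
    """Return the fractional number of partitions each device wants."""
    partition_count = 2 ** part_power
    total_weight = sum(weights.values())
    return {
        dev: partition_count * weight / total_weight
        for dev, weight in weights.items()
    }


def can_move(hours_since_last_move, min_part_hours, device_removed=False):
    """Allow a move after min_part_hours, or at once for a removed device."""
    return device_removed or hours_since_last_move >= min_part_hours


# One thousand hypothetical devices of equal weight, partition power of 20.
weights = {"d%d" % i: 100.0 for i in range(1000)}
wanted = desired_partitions(weights, part_power=20)
print(wanted["d0"])  # 1048.576

print(can_move(0.5, min_part_hours=1))                       # False
print(can_move(0.5, min_part_hours=1, device_removed=True))  # True
```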
        <para>These steps do not always perfectly rebalance a ring
            because partitions are gathered for reassignment at
            random. To help reach a more balanced ring, the
            rebalance process is repeated until the ring is nearly
            perfectly balanced (less than 1 percent off) or until the
            balance does not improve by at least 1 percent, which
            indicates that a perfect balance is probably not
            achievable because of wildly imbalanced zones or too many
            recently moved partitions.</para>
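        <para>The stopping rule described above could be expressed
            roughly as follows. The
            <literal>rebalance_once</literal> callable, the
            <literal>max_passes</literal> limit, and the simulated
            balance values are all assumptions made for this sketch;
            balance is expressed as percent off from perfect, where 0 is
            perfect.</para>

```python
def rebalance_until_stable(rebalance_once, max_passes=100):
    """Repeat rebalance passes until balance is near perfect or stalls."""
    last_balance = None
    balance = None
    passes = 0
    for passes in range(1, max_passes + 1):
        balance = rebalance_once()
        first_pass = last_balance is None
        # Improvement of at least 1 percent compared with the previous pass.
        still_improving = first_pass or last_balance - balance >= 1
        if balance >= 1 and still_improving:
            last_balance = balance
            continue  # not yet near perfect, and still improving
        break
    return balance, passes


# Simulated balances reported by successive passes (made-up numbers).
simulated = iter([12.0, 5.0, 4.5, 4.4])
result = rebalance_until_stable(lambda: next(simulated))
print(result)  # (4.5, 3)
```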
    </section>
</section>