============
Ring-builder
============

Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a subset of the partitions will be incorrect because of the
way the ring-builder manages changes to the ring. You can easily work
around this issue.

The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.

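One way the copy-to-every-server option might look, assuming the rings
live in the conventional ``/etc/swift`` directory and ``swift-node-1``
stands in for each storage server (both are placeholders, not values
prescribed by this guide):

.. code-block:: console

   $ scp /etc/swift/*.builder /etc/swift/*.ring.gz swift-node-1:/etc/swift/
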
Ring data structure
~~~~~~~~~~~~~~~~~~~

The ring data structure consists of three top-level fields: a list of
devices in the cluster, a list of lists of device ids indicating
partition-to-device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.

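The three fields can be pictured with a small, made-up example. This is
only an illustrative sketch of the in-memory layout, not the serialized
on-disk format, and the device entries show only a few of the keys a
real ring carries:

.. code-block:: python

   from array import array

   # List of devices in the cluster (only a few keys shown).
   devs = [
       {'id': 0, 'zone': 1, 'weight': 100, 'ip': '10.0.0.1', 'device': 'sdb'},
       {'id': 1, 'zone': 2, 'weight': 100, 'ip': '10.0.0.2', 'device': 'sdb'},
       {'id': 2, 'zone': 3, 'weight': 100, 'ip': '10.0.0.3', 'device': 'sdb'},
   ]

   # One array('H') per replica; entry N maps partition N to a device id.
   # Here: 2 replicas, 4 partitions.
   replica2part2dev_id = [
       array('H', [0, 1, 2, 0]),
       array('H', [1, 2, 0, 1]),
   ]

   # Number of bits to shift an MD5 hash to find a partition:
   # 32 - part_power (part_power is 2 for these 4 partitions).
   part_shift = 30
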
Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a list of ``array('H')`` of device ids. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.

So, to create a list of device dictionaries assigned to a partition, the
Python code would look like:

.. code-block:: python

   devices = [self.devs[part2dev_id[partition]]
              for part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.

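A slightly fuller version that drops duplicate devices, shown only as a
sketch of how a caller could deduplicate by device id rather than as the
Ring class's actual method, might look like:

.. code-block:: python

   # Keep at most one entry per device id for the given partition.
   seen_ids = set()
   devices = []
   for part2dev_id in self._replica2part2dev_id:
       dev = self.devs[part2dev_id[partition]]
       if dev['id'] not in seen_ids:
           seen_ids.add(dev['id'])
           devices.append(dev)
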
``array('H')`` is used for memory conservation as there may be millions
of partitions.

Overload
~~~~~~~~

The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it cannot do both, the overload
factor determines what happens. Each device takes an extra
fraction of its desired partitions to allow for replica dispersion;
after that extra fraction is exhausted, replicas are placed closer
together than optimal.

The overload factor lets the operator trade off replica
dispersion (durability) against data dispersion (uniform disk usage).

The default overload factor is 0, so device weights are strictly
followed.

With an overload factor of 0.1, each device accepts up to 10% more
partitions than it otherwise would, but only if doing so is needed to
maintain partition dispersion.

For example, consider a 3-node cluster of machines with equal-size disks;
node A has 12 disks, node B has 12 disks, and node C has
11 disks. The ring has an overload factor of 0.1 (10%).

Without the overload, some partitions would end up with replicas only
on nodes A and B. However, with the overload, every device can accept
up to 10% more partitions for the sake of dispersion. The
missing disk in C means there is one disk's worth of partitions
to spread across the remaining 11 disks, which gives each
disk in C an extra 9.09% load. Since this is less than the 10%
overload, there is one replica of each partition on each node.

However, this does mean that the disks in node C have more data
than the disks in nodes A and B. If 80% full is the warning
threshold for the cluster, node C's disks reach 80% full while A
and B's disks are only about 73.3% full.

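The arithmetic behind those figures, using the disk counts and the 0.1
overload factor from the example above, can be written out as a quick
sketch:

.. code-block:: python

   # Node C's 11 disks hold the same total as node A's or B's 12 disks,
   # so each disk in C carries 12/11 of what a disk in A or B carries.
   extra_load = 12.0 / 11.0 - 1          # ~0.0909, the extra 9.09%
   assert extra_load < 0.1               # fits within the 0.1 overload

   c_disk_fill = 0.80                    # node C disks at the threshold
   ab_disk_fill = c_disk_fill * 11.0 / 12.0
   print(round(ab_disk_fill, 3))         # 0.733, about 73.3% full
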
Replica counts
~~~~~~~~~~~~~~

To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.

A fractional replica count is for the whole ring and not for individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.

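As a concrete illustration, for a ring with a partition power of 20 (a
value used again in the build steps below) and a replica count of 3.2,
the split works out as follows; this is plain arithmetic, not
ring-builder code:

.. code-block:: python

   replica_count = 3.2
   partition_count = 2 ** 20                 # part_power of 20

   # Partitions that carry a fourth replica versus only three.
   four_replicas = int(partition_count * (replica_count - 3))   # 209,715
   three_replicas = partition_count - four_replicas             # 838,861
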
The replica count is adjustable. For example:

.. code-block:: console

   $ swift-ring-builder account.builder set_replicas 4
   $ swift-ring-builder account.builder rebalance

You must rebalance the ring after changing the replica count. Changing
the count is common in globally distributed clusters: operators of these
clusters generally want an equal number of replicas and regions, so when
an operator adds or removes a region, the operator adds or removes a
replica. Removing unneeded replicas also saves on the cost of disks.

You can gradually increase the replica count at a rate that does not
adversely affect cluster performance. For example:

.. code-block:: console

   $ swift-ring-builder object.builder set_replicas 3.01
   $ swift-ring-builder object.builder rebalance
   <distribute rings and wait>...

   $ swift-ring-builder object.builder set_replicas 3.02
   $ swift-ring-builder object.builder rebalance
   <distribute rings and wait>...

Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost.

Additionally, the :command:`swift-ring-builder X.builder create` command can
now take a decimal argument for the number of replicas.

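For instance, a new object ring with a fractional replica count might be
created like this; the part power of 20 and the ``min_part_hours`` value
of 1 are only example arguments:

.. code-block:: console

   $ swift-ring-builder object.builder create 20 3.2 1
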
Partition shift value
~~~~~~~~~~~~~~~~~~~~~

The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the ``/account/container/object`` path using Python:

.. code-block:: python

   from hashlib import md5
   from struct import unpack_from

   partition = unpack_from('>I',
       md5('/account/container/object').digest())[0] >> self._part_shift

For a ring generated with part\_power P, the partition shift value is
``32 - P``.

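The relationship between the partition power, the shift value, and the
resulting partition number can be sketched as a standalone example (the
path and the part power of 20 are arbitrary illustrations):

.. code-block:: python

   from hashlib import md5
   from struct import unpack_from

   part_power = 20
   part_shift = 32 - part_power          # the ring's _part_shift
   partition_count = 2 ** part_power     # 1,048,576 partitions

   # Shifting the top four bytes of the hash keeps the top part_power
   # bits, so the partition always falls in [0, partition_count).
   path = b'/account/container/object'
   partition = unpack_from('>I', md5(path).digest())[0] >> part_shift
   assert 0 <= partition < partition_count
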
Build the ring
~~~~~~~~~~~~~~

The ring builder process includes these high-level steps:

#. The utility calculates the number of partitions to assign to each
   device based on the weight of the device. For example, for a
   partition power of 20, the ring has 1,048,576 partitions. One
   thousand devices of equal weight each want 1,048.576 partitions. The
   devices are sorted by the number of partitions they desire and kept
   in order throughout the initialization process.

   .. note::

      Each device is also assigned a random tiebreaker value that is
      used when two devices desire the same number of partitions. This
      tiebreaker is not stored on disk anywhere, and so two different
      rings created with the same parameters will have different
      partition assignments. For repeatable partition assignments,
      ``RingBuilder.rebalance()`` takes an optional seed value that
      seeds the Python pseudo-random number generator (see the example
      at the end of this section).

#. The ring builder assigns each partition replica to the device that
   requires the most partitions at that point while keeping it as far
   away as possible from other replicas. The ring builder prefers to assign a
   replica to a device in a region that does not already have a replica.
   If no such region is available, the ring builder searches for a
   device in a different zone, or on a different server. If it does not
   find one, it looks for a device with no replicas. Finally, if all
   options are exhausted, the ring builder assigns the replica to the
   device that has the fewest replicas already assigned.

   .. note::

      The ring builder assigns multiple replicas to one device only if
      the ring has fewer devices than it has replicas.

#. When building a new ring from an old ring, the ring builder
   recalculates the desired number of partitions that each device wants.

#. The ring builder unassigns partitions and gathers these partitions
   for reassignment, as follows:

   - The ring builder unassigns any assigned partitions from any
     removed devices and adds these partitions to the gathered list.
   - The ring builder unassigns any partition replicas that can be
     spread out for better durability and adds these partitions to the
     gathered list.
   - The ring builder unassigns random partitions from any devices that
     have more partitions than they need and adds these partitions to
     the gathered list.

#. The ring builder reassigns the gathered partitions to devices by
   using a similar method to the one described previously.

#. When the ring builder reassigns a replica of a partition, it records
   the time of the reassignment. The ring builder uses
   this value when it gathers partitions for reassignment so that no
   partition is moved twice in a configurable amount of time. The
   RingBuilder class knows this configurable amount of time as
   ``min_part_hours``. The ring builder ignores this restriction for
   replicas of partitions on removed devices because removal of a device
   happens on device failure only, and reassignment is the only choice.

These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until it is nearly
perfect (less than 1 percent off) or until the balance stops improving
by at least 1 percent (indicating that a perfect balance probably cannot
be reached because of wildly imbalanced zones or too many partitions
moved recently).

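As mentioned in the note on the first step, ``RingBuilder.rebalance()``
accepts a seed for repeatable partition assignments. A minimal sketch,
assuming a part power of 20, 3 replicas, a ``min_part_hours`` of 1, and
three made-up devices (the IPs, port, and device names are placeholders):

.. code-block:: python

   from swift.common.ring import RingBuilder

   builder = RingBuilder(20, 3, 1)   # part_power, replicas, min_part_hours
   for dev_id in range(3):
       builder.add_dev({'id': dev_id, 'region': 1, 'zone': dev_id,
                        'weight': 100, 'ip': '10.0.0.%d' % (dev_id + 1),
                        'port': 6200, 'device': 'sdb'})

   # The same seed with the same devices yields the same assignments.
   builder.rebalance(seed=42)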