==============================
Modifying Ring Partition Power
==============================

The ring partition power determines the on-disk location of data files and is
selected when creating a new ring. In normal operation, it is a fixed value.
This is because a different partition power results in a different on-disk
location for all data files.

However, increasing the partition power by 1 can be done by choosing locations
that are on the same disk. As a result, we can create hard links for both the
new and old locations, avoiding data movement without impacting availability.
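
A minimal sketch (with made-up paths, not Swift's actual on-disk layout)
illustrates why this is cheap: a hard link is just a second directory entry
pointing at the same inode, so no object data is copied, and the old name can
later be removed without touching the data::

    import os
    import tempfile

    # Hypothetical paths for illustration only.
    tmp = tempfile.mkdtemp()
    old_dir = os.path.join(tmp, 'objects', '16003', 'a38', 'somehash')
    new_dir = os.path.join(tmp, 'objects', '32007', 'a38', 'somehash')
    os.makedirs(old_dir)
    os.makedirs(new_dir)

    old_path = os.path.join(old_dir, '1467658208.57179.data')
    new_path = os.path.join(new_dir, '1467658208.57179.data')
    with open(old_path, 'wb') as f:
        f.write(b'object payload')

    # A hard link adds a second name; no data is copied.
    os.link(old_path, new_path)
    assert os.stat(old_path).st_ino == os.stat(new_path).st_ino
    assert os.stat(new_path).st_nlink == 2

    # Removing the old name later (cleanup) leaves the data untouched.
    os.unlink(old_path)
    assert os.stat(new_path).st_nlink == 1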

To enable a partition power change without interrupting user access, object
servers need to be aware of it in advance. Therefore a partition power change
needs to be done in multiple steps.

.. note::

    Do not increase the partition power on account and container rings.
    Increasing the partition power is *only* supported for object rings.
    Trying to increase the part_power for account and container rings *will*
    result in unavailability, maybe even data loss.

-------
Caveats
-------

Before increasing the partition power, consider the following caveats and
possible drawbacks:

* Almost all diskfiles in the cluster need to be relinked and then cleaned up,
  and all partition directories need to be rehashed. This imposes significant
  I/O load on object servers, which may impact client requests. Consider using
  cgroups, ``ionice``, or even just the built-in ``--files-per-second``
  rate-limiting to reduce client impact.

* Object replicators and reconstructors will skip affected policies during the
  partition power increase. Replicators are not aware of hard links and would
  simply copy the content; in the worst case this would result in heavy data
  movement and all data being stored twice.

* Because each object will now be hard linked from two locations,
  many more inodes will be used temporarily; expect around twice the amount.
  You need to check the free inode count *before* increasing the partition
  power (see the sketch at the end of this section). Even after the increase
  is complete and the extra hard links are cleaned up, expect increased inode
  usage, since there will be twice as many partition and suffix directories.

* Also, object auditors might read each object twice before cleanup removes the
  second hard link.

* The additional inodes require more memory to cache them, so your object
  servers should have plenty of available memory to avoid running out of inode
  cache. Setting ``vfs_cache_pressure`` to 1 might help with that.

* All nodes in the cluster *must* run Swift version 2.13.0 or later.

Due to these caveats, you should only increase the partition power if it is
really needed, i.e. if the number of partitions per disk is extremely low and
the data is distributed unevenly across disks.
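
Free inodes can be checked per disk with ``df -i`` or, as in the rough sketch
below, with Python's ``os.statvfs()``. The mount point is an assumption, and
the 50% rule of thumb merely reflects the roughly twofold temporary growth
described above::

    import os

    mount = '/srv/node/sda'   # hypothetical mount point of one object disk

    st = os.statvfs(mount)
    if not st.f_files:
        raise SystemExit('%s does not report inode counts' % mount)
    used = st.f_files - st.f_ffree
    pct = 100.0 * used / st.f_files
    print('%s: %d of %d inodes used (%.1f%%)' % (mount, used, st.f_files, pct))

    # Relinking roughly doubles object inode usage until cleanup finishes, so
    # staying well below 50% before starting is a reasonable rule of thumb.
    if pct >= 50:
        print('WARNING: relinking may exhaust the inode table on %s' % mount)
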
-----------------------------------
1. Prepare partition power increase
-----------------------------------

The swift-ring-builder is used to prepare the ring for an upcoming partition
power increase. It will store a new variable ``next_part_power`` with the
current partition power + 1. Object servers recognize this, and hard links to
the new location will be created (or deleted) on every PUT or DELETE. This
will make it possible to access newly written objects using the future
partition power::

    swift-ring-builder <builder-file> prepare_increase_partition_power
    swift-ring-builder <builder-file> write_ring

Now you need to copy the updated .ring.gz to all nodes. Already existing data
needs to be relinked too; therefore an operator has to run a relinker command
on all object servers in this phase::

    swift-object-relinker relink

.. note::

    Start relinking after *all* the servers re-read the modified ring files,
    which normally happens within 15 seconds after writing a modified ring.
    Also, make sure the modified rings are pushed to all nodes running object
    services (replicators, reconstructors and reconcilers); they have to skip
    the policy during relinking.

.. note::

    The relinking command must run as the same user as the daemon processes
    (usually swift). It will create files and directories that must be
    manipulable by the daemon processes (server, auditor, replicator, ...).
    If necessary, the ``--user`` option may be used to drop privileges.

Relinking might take some time; while there is no data copied or actually
moved, the tool still needs to walk the whole file system and create new hard
links as required.
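
Conceptually, the relinker walks each ``objects`` tree, derives the new
partition from the hash directory name (the partition is just the top bits of
that hash, see `Background`_ below) and hard-links every file into the new
partition directory. The following is only a simplified sketch under those
assumptions; the real relinker additionally handles multiple policies and
devices, partition locking, rehashing, rate limiting and error reporting::

    import errno
    import os

    def relink_device(objects_dir, next_part_power):
        # Simplified walk of <objects_dir>/<partition>/<suffix>/<hash>/<file>.
        for part in os.listdir(objects_dir):
            part_dir = os.path.join(objects_dir, part)
            if not part.isdigit() or not os.path.isdir(part_dir):
                continue
            for suffix in os.listdir(part_dir):
                suffix_dir = os.path.join(part_dir, suffix)
                if not os.path.isdir(suffix_dir):
                    continue
                for hash_ in os.listdir(suffix_dir):
                    hash_dir = os.path.join(suffix_dir, hash_)
                    # The partition is the top bits of the on-disk hash, so
                    # the new partition follows from the directory name.
                    new_part = int(hash_[:8], 16) >> (32 - next_part_power)
                    new_hash_dir = os.path.join(
                        objects_dir, str(new_part), suffix, hash_)
                    os.makedirs(new_hash_dir, exist_ok=True)
                    for filename in os.listdir(hash_dir):
                        try:
                            os.link(os.path.join(hash_dir, filename),
                                    os.path.join(new_hash_dir, filename))
                        except OSError as err:
                            if err.errno != errno.EEXIST:
                                raise
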
---------------------------
2. Increase partition power
---------------------------

Now that all existing data can be found using the new location, it's time to
actually increase the partition power itself::

    swift-ring-builder <builder-file> increase_partition_power
    swift-ring-builder <builder-file> write_ring

Now you need to copy the updated .ring.gz again to all nodes. Object servers
are now using the new, increased partition power and no longer create
additional hard links.

.. note::

    The object servers will create additional hard links for each modified or
    new object, and this requires more inodes.

.. note::

    If you decide you don't want to increase the partition power, you should
    instead cancel the increase. It is not possible to revert this operation
    once started. To abort the partition power increase, execute the following
    commands, copy the updated .ring.gz files to all nodes and continue with
    `3. Cleanup`_ afterwards::

        swift-ring-builder <builder-file> cancel_increase_partition_power
        swift-ring-builder <builder-file> write_ring

----------
3. Cleanup
----------

Existing hard links in the old locations need to be removed, and a cleanup
tool is provided to do this. Run the following command on each storage node::

    swift-object-relinker cleanup

.. note::

    The cleanup must be finished within your object servers' ``reclaim_age``
    period (which is one week by default). Otherwise objects that have been
    overwritten between step #1 and step #2 and deleted afterwards can't be
    cleaned up anymore. You may want to increase your ``reclaim_age`` before
    or during relinking.
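
For each object file the cleanup boils down to verifying that the data is
still reachable through its new-partition-power location and only then
removing the old directory entry. A minimal, hypothetical helper is sketched
below (the real cleanup also rehashes partitions and removes directories that
become empty)::

    import os

    def cleanup_old_link(old_path, new_path):
        # Only drop the old name if the very same inode is reachable via the
        # new location; the object data itself is never touched.
        if os.path.isfile(new_path) and os.path.samefile(old_path, new_path):
            os.unlink(old_path)
            return True
        return False
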
Afterwards it is required to update the rings one last time to inform servers
that all steps to increase the partition power are done, and replicators
should resume their job::

    swift-ring-builder <builder-file> finish_increase_partition_power
    swift-ring-builder <builder-file> write_ring

Now you need to copy the updated .ring.gz again to all nodes.

----------
Background
----------

An existing object that is currently located on partition X will be placed
either on partition 2*X or 2*X+1 after the partition power is increased. The
reason for this is the ``Ring.get_part()`` method, which does a bitwise shift
to the right.
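
A simplified sketch of that computation (the real ``Ring.get_part()`` hashes
the salted account/container/object path; the salts are omitted here) shows
why exactly one extra bit survives the smaller shift, mapping partition X to
either 2*X or 2*X+1::

    import hashlib
    import struct

    def get_part(part_power, hash_path):
        # Take the top 32 bits of the MD5 digest and shift away everything
        # below the partition power -- a simplified Ring.get_part().
        digest = hashlib.md5(hash_path).digest()
        return struct.unpack_from('>I', digest)[0] >> (32 - part_power)

    name = b'/AUTH_test/container/object'
    old_part = get_part(14, name)
    new_part = get_part(15, name)
    assert new_part in (2 * old_part, 2 * old_part + 1)
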
To avoid actual data movement to different disks or even nodes, the allocation
of partitions to nodes needs to be changed. The allocation is pairwise due to
the above-mentioned new partition scheme. Therefore devices are allocated like
this, with the partition being the index and the value being the device id::

    old        new
    part  dev  part  dev
    ----  ---  ----  ---
       0    0     0    0
                  1    0
       1    3     2    3
                  3    3
       2    7     4    7
                  5    7
       3    5     6    5
                  7    5
       4    2     8    2
                  9    2
       5    1    10    1
                 11    1

There is a helper method to compute the new path, and the following example
shows the mapping between old and new location::

    >>> from swift.common.utils import replace_partition_in_path
    >>> old='objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
    >>> replace_partition_in_path('', '/sda/' + old, 14)
    'objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
    >>> replace_partition_in_path('', '/sda/' + old, 15)
    'objects/32007/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'

Using the original partition power (14) it returned the same path; however,
after an increase to 15 it returns the new path, and the new partition is
2*X+1 in this case (2 * 16003 + 1 = 32007).