Currently, the rsync module to which the replicators send data is static. This
prevents administrators from tuning the rsync configuration to their current
deployment or needs.
As an example, the sample rsyncd configuration encourages setting a connection
limit for the account, container and object modules. This protects devices
from excessive parallel connections, which would otherwise hurt performance.
On a server with many devices, it is tempting to increase this limit
proportionally, but nothing guarantees that the connections will be evenly
distributed. In the worst case, a single device can receive all the
connections, which severely impacts performance.
This commit adds a new option named 'rsync_module' to the *-replicator sections
of the *-server configuration file. This configuration variable can be
interpolated with device attributes like ip, port, device, zone, ... by using
the {NAME} format. e.g.:
rsync_module = {replication_ip}::object_{device}
With this configuration, an administrator can solve the connection distribution
problem by creating one module per device in the rsyncd configuration.
The default values are backward compatible:
{replication_ip}::account
{replication_ip}::container
{replication_ip}::object
Option vm_test_mode is deprecated by this commit, but backward compatibility is
maintained. The option is only effective when rsync_module is not set. In that
case, {replication_port} is appended to the default value of rsync_module.
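A minimal sketch of how such a template could be expanded from a device dict
(illustrative only; the function and variable names below are assumptions, not
the actual replicator code):

def expand_rsync_module(template, device):
    # {NAME} placeholders are filled from the device's attributes
    return template.format(**device)

device = {'replication_ip': '10.0.0.12', 'replication_port': 6200,
          'device': 'sdb1', 'zone': 1}
print(expand_rsync_module('{replication_ip}::object_{device}', device))
# -> 10.0.0.12::object_sdb1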
Change-Id: Iad91df50dadbe96c921181797799b4444323ce2e
The 'print' function is compatible with both Python 2.x and 3.x.
Link: https://www.python.org/dev/peps/pep-3105/
Python 2.6 has a __future__ import that removes print as language syntax,
letting you use the functional form instead
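For example, the following works the same under Python 2.6+ and Python 3:

from __future__ import print_function
import sys

# 'print' is now a function, so keyword arguments like 'file' work on Python 2 too
print('replication pass complete', file=sys.stderr)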
Change-Id: I94e1bc6bd83ad6b05695c7ebdf7cbfd8f6d9f9af
The assert_() method is deprecated and can be safely replaced by assertTrue().
This patch makes sure that running the tests does not create undesired
warnings.
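For example, a test written with the non-deprecated form looks like this
(illustrative test only, not one from the patch):

from unittest import TestCase

class TestExample(TestCase):
    def test_something(self):
        # assertTrue() replaces the deprecated assert_() alias and does not
        # emit a DeprecationWarning
        self.assertTrue(1 + 1 == 2)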
Change-Id: I0602ba39ef93263386644ee68088d5f65fcb4a71
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs. The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring. In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket. The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk. The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring. The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
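A rough sketch of the per-port pre-fork pattern described above (this is only
the general shape, not the swift-object-server implementation):

import os
import socket

def serve(sock):
    # a real worker would drop privileges here, then loop serving requests
    while True:
        conn, _ = sock.accept()
        conn.close()

def fork_per_port(ports, servers_per_port):
    children = []
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(('0.0.0.0', port))
        sock.listen(128)
        for _ in range(servers_per_port):
            pid = os.fork()
            if pid == 0:
                serve(sock)       # child: handles only this one port
                os._exit(0)
            children.append(pid)  # parent: track children so dead ones can be restarted
    return children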
Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.
The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True. This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.
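Usage would look roughly like this (the import path is my assumption of where
RingData lives; the header_only kwarg is as described above):

from swift.common.ring import RingData

# Loads the serialized ring metadata but skips the large replica2part2dev_id table
ring_data = RingData.load('/etc/swift/object.ring.gz', header_only=True)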
A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses). The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.
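Assumed usage (the constructor arguments and method name here are from memory
and may not match the merged code exactly):

from swift.common.storage_policy import BindPortsCache

cache = BindPortsCache('/etc/swift', '1.2.3.4')   # swift_dir, this node's bind_ip
ports = cache.all_bind_ports_for_node()           # set of device ports across all rings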
This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server". The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks. Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None. Then,
object-replicator and object-reconstructor both set self.bind_port to
None if server_per_port is enabled. Bonus improvement for IPv6
addresses in is_local_device().
This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/
Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").
However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.
This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.
In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.
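Roughly, the behavior described is (the argument handling and exact return
values are assumptions for illustration):

from swift.common.utils import whataremyips

whataremyips()             # unspecified bind_ip: all of the server's addresses
whataremyips('0.0.0.0')    # explicit bind-to-everything: same as above
whataremyips('127.0.0.2')  # a specific bind_ip: just ['127.0.0.2']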
Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.
For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md
Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md
If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md
DocImpact
Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.
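For example:

it = iter([1, 2, 3])
value = next(it)   # works on Python 2 and 3, unlike it.next()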
Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
The renamer() method now does an fsync on the containing directory of the
target path, and also on the parent dirs of newly created directories, by
default. This can be explicitly turned off in cases where it is not
necessary (for example, quarantines).
The following article explains why this is necessary:
http://lwn.net/Articles/457667/
Although it may seem like the right thing to do, this change does come
with a performance penalty. However, no configurable option is provided to
turn it off.
Also, lock_path() inside invalidate_hash() was always creating part of the
object path in the filesystem, and those directories were never fsync'd. This
has been fixed.
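A minimal sketch of the durable-rename pattern described above (not Swift's
renamer() itself):

import os

def durable_rename(old, new):
    os.rename(old, new)
    # fsync the directory containing 'new' so the rename itself survives a crash
    dirfd = os.open(os.path.dirname(new) or '.', os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)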
Change-Id: Id8e02f84f48370edda7fb0c46e030db3b53a71e3
Signed-off-by: Prashanth Pai <ppai@redhat.com>
From rsync's man page:
-z, --compress
With this option, rsync compresses the file data as it is sent to the
destination machine, which reduces the amount of data being transmitted --
something that is useful over a slow connection.
A configurable option has been added to allow rsync to compress, but only
if the remote node is in a different region than the local one.
NOTE: Objects that are already compressed (for example: .tar.gz, .mp3)
might slow down the syncing process.
On wire compression can also be extended to ssync later in a different
change if required. In case of ssync, we could explore faster
compression libraries like lz4. rsync uses zlib which is slow but offers
higher compression ratio.
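Schematically, the decision described above looks like this (the option and
argument names are illustrative, not the actual replicator code):

def build_rsync_args(local_region, remote_region, rsync_compress):
    args = ['rsync', '--recursive', '--whole-file']
    if rsync_compress and remote_region != local_region:
        # only pay the CPU cost of -z/--compress on inter-region (WAN) links
        args.append('--compress')
    return args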
Change-Id: Ic9b9cbff9b5e68bef8257b522cc352fc3544db3c
Signed-off-by: Prashanth Pai <ppai@redhat.com>
This removes the test_dispatch test from test_db_replicator,
which had been commented out for a while.
Change-Id: Ia28fa923a65ad7d85804cbf6f7acef244741bab1
Closes-Bug: #1408502
As I understand it, db replication starts with a preflight sync request
to the remote container server, whose response will include the last
synced row_id that it has on file for the sending node's database id.
If the difference in the last sync point returned is more than 50% of
the local sending db's rows, it'll fall back to sending the whole db
over rsync and let the remote end merge items locally - but generally
there's just a few rows missing and they're shipped over the wire as
json and stuffed into some rather normal looking merge_items calls.
The one thing that's a bit different with these remote merge_items calls
(compared to your average run of the mill eat a bunch of entries out of
a .pending file) is the source kwarg. When this optional kwarg comes
into merge_items it's the remote sending db's uuid, and after we eat all
the rows it sent us we update our local incoming_sync table for that
uuid so that next time when it makes its pre-flight sync request we can
tell it where it left off.
Now normally the sending db is going to push out its rows up from the
returned sync_point in 1000 item diffs, up to 10 batches total (per_diff
and max_diffs options) - 10K rows. If that goes well then everything is
in sync up to at least the point it started, and the sending db will
*also* ship over *its* incoming_sync rows to merge_syncs on the remote
end. Since the sending db is in sync with these other dbs up to those
points, so is the remote db now, by way of the transitive property. Also
note, through some weird artifact that I'm not entirely convinced isn't
an unrelated and possibly benign bug, the incoming_sync table on the
sending db will often also happen to include its own uuid - maybe it
got pushed back to it from another node?
Anyway, that seemed to work well enough until a sending db got diff
capped (i.e. sent its 10K rows and wasn't finished). When this happened,
the final merge_syncs call never got sent because the remote end is
definitely *not* up to date with the other databases that the sending db
is - it's not even up-to-date with the sending db yet! But the hope is
certainly that on the next pass it'll be able to finish sending the
remaining items. But since the remote end is who decides what the last
successfully synced row with this local sending db was - it's super
important that the incoming_sync table is getting updated in merge_items
when that source kwarg is there.
I observed this simple and straightforward process wasn't working well
in one case - which is weird considering it didn't have much in the way
of tests. After I had the test and started looking into it, it seemed
maybe the source kwarg handling got over-indented a bit in the bulk
insert merge_items refactor. I think this is correct - maybe we could send
someone up to the mountain temple to seek out gholt?
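A schematic illustration of the kind of indentation bug described above (this
is not Swift's merge_items(); the bookkeeping is simplified to a dict):

incoming_sync = {}   # remote db uuid -> last row id merged from it

def merge_items(rows, source=None):
    merged = []
    for row_id, data in rows:
        merged.append((row_id, data))        # stand-in for the real row merge
        # the bug would be doing the incoming_sync update here, nested inside
        # the loop, instead of once after all rows are merged
    if source:
        incoming_sync[source] = rows[-1][0]  # correct: record once, at the end
    return merged

merge_items([(1, 'a'), (2, 'b')], source='remote-uuid')
print(incoming_sync)   # {'remote-uuid': 2}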
Change-Id: I4137388a97925814748ecc36b3ab5f1ac3309659
The common db replicator's code path for reclaiming deleted dbs beyond the
reclaim age was not covered by unittests, and an AttributeError snuck in. In
writing the test that would cover the common code both for accounts and
containers, I discovered another KeyError with the container conditional for
validating the container's fully reported status.
This fixes both those issues and adds additional tests for the cleanup of
empty account/container partition and suffix directories.
Change-Id: I2a1bfaefebd05b01231bf71dd908fcc49adb4c36
Because we iterate over these directories on a replication run,
and they are not (previously) cleaned up, the time to start the
replication increases incrementally for each stale directory
lying around. Thousands of directories across dozens of disks
on a single machine can make for non-trivial startup times.
Plus it just seems like good housekeeping.
Closes-Bug: #1396152
Change-Id: Iab607b03b7f011e87b799d1f9af7ab3b4ff30019
The account/container replicator checks connection generation and timeout
for the HTTP REPLICATE request in _repl_to_node, but it doesn't really check
the connection, only the construction of the ReplConnection class.
This patch removes that invalid check.
Change-Id: Ie6b4062123d998e69c15638b741e7d1ba8a08b62
Closes-Bug: #1359018
After a container database is replicated, a _post_replicate_hook will enqueue
misplaced objects for the container-reconciler into the .misplaced_objects
containers. Items to be reconciled are "batch loaded" into the reconciler
queue at the end of a container replication cycle by leveraging container
replication itself.
DocImpact
Implements: blueprint storage-policies
Change-Id: I3627efcdea75403586dffee46537a60add08bfda
Keep status_changed_at in container databases current with status changes that
occur as a result of container creation, deletion, or re-creation.
Merge container put/delete/created timestamps when handling replicate
responses from remote servers in addition to during the handling of the
REPLICATE request.
When storage policies are configured on a cluster send status_changed_at,
object_count and storage_policy_index as part of container replication sync
args.
Use status_changed_at during replication to determine the oldest active
container and merge storage_policy_index.
DocImpact
Implements: blueprint storage-policies
Change-Id: Ib9a0dd42c271145e641437dc04d0ebea1e11fc47
FakeLogger gets better log level handling
Parameterize the logger on some daemons which were previously
unparameterized, and try to use the interface in tests.
FakeRing uses more real code
The existing FakeRing mock's implementation bit me with a pretty subtle
character encoding issue by bypassing the hash_path code that is normally
part of get_part_nodes. This change tries to exercise more of the real
ring code paths when it makes sense and provide a better Fake for use in
testing.
Add write_fake_ring helper to test.unit for when you need a real ring.
DocImpact
Implements: blueprint storage-policies
Change-Id: Id2e3740b1dd569050f4e083617e7dd6a4249027e
It simply makes sense that the definition of DATADIR belongs to
backends. After all, some of them may not even have any.
Coincidentally, a few unnecessary imports are dropped.
By the way, on the object server side, diskfile.py provides DATADIR
in the same way already.
Change-Id: I60bfd522c77c4a0ee13697a2e31141777c7e2398
Adds 20 unit tests to increase the coverage of db_replicator.py
from 71% to 90%
Change-Id: Ia63cb8f2049fb3182bbf7af695087bfe15cede54
Closes-Bug: #948179
This patch adds a test for ReplicatorRpc.complete_rsync()
and completes extract_device() coverage.
test_extract_device:
    tests the case where the parameter is invalid
test_complete_rsync_with_bad_input:
    ensures that invalid parameters return a 404 error
test_complete_rsync:
    validates the returned code in case of success
Change-Id: I59e0d26a1efe59d8beff1e81c2a7edc6de0872e9
This reverts commit 7760f41c3ce436cb23b4b8425db3749a3da33d32
Change-Id: I95e57a2563784a8cd5e995cc826afeac0eadbe62
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Place all the methods related to on-disk layout and / or configuration
into a new common module that can be shared by the various modules
using the same on-disk layout.
Change-Id: I27ffd4665d5115ffdde649c48a4d18e12017e6a9
Signed-off-by: Peter Portante <peter.portante@redhat.com>
* Create a class for testing the _repl_to_node and replicate_object functions
  to prevent code duplication, by adding all preparation into the setUp function.
* Move the existing test functions which test _repl_to_node and
  replicate_object into the created classes.
* Add tests for replicate_object and _repl_to_node functions.
Change-Id: I75ac7c6f0230e71bfb24328e44c33734b520b4cd
See Bug 1187200 for a full description of the problem.
Part 1:
X-Delete-At-Container added to X-Delete-At-* info
This fixes the bug by passing the expiring-objects-account's
container name onward to the backend object servers. This is in case
the object servers' expiring_objects_container_divisor happens to be
different than the proxy server's; we want to make sure the host,
partition, and device match up with the container name. Different
container names would be fine, but not with mismatched host,
partition, and device info.
Part 2:
The db_replicator now double checks the disk path's partition against
the partition the ring gives back. If they don't match, it logs the
problem but continues to replicate the database to where it should be
and, on success to all proper nodes, removes the local out of place
database.
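Schematically, the double check looks something like this (not the actual
db_replicator code; the helper names are illustrative):

import os

def partition_from_path(db_path):
    # .../containers/<partition>/<suffix>/<hash>/<hash>.db
    return db_path.split(os.sep)[-4]

def check_placement(ring, account, container, db_path):
    ring_part = ring.get_part(account, container)
    disk_part = partition_from_path(db_path)
    if str(ring_part) != disk_part:
        # log the mismatch but keep replicating to the proper nodes; the out of
        # place copy is removed only after all proper nodes succeed
        print('db %s is in partition %s but belongs in %s'
              % (db_path, disk_part, ring_part))
    return ring_part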
Bug 1187200
Change-Id: Id0873a3f2198ce285fe0b0c777738eff38bc2438
The get_repl_missing_table attribute in the FakeBroker class was changed in
the test_replicate_object_quarantine function and not set back. That's
why subsequent test cases took unexpected values from FakeBroker.
fixes bug 1180354
Change-Id: Iba55255771e6483832c7782fcbe331e20e818f4e
Support separate replication ip address:
- Added a new function in utils. This function provides the ability
to select a separate IP address for the replication service.
- The db replicator and object replicators were changed;
the replication process now uses the new function.
Replication network parameters:
- Support for the replication network fields (replication_ip, replication_port)
was added to the device dictionary in the swift-ring-builder script
(an example device entry is sketched after this list).
- Changes were made to support the new fields in the search, show and set_info
functions.
Implementation of replication servers:
- Separate replication servers use the same code as normal replication
servers, but with replication_server parameter = True. When using a
separate replication network, the non-replication servers set
replication_server = False. When there is no separate replication
network (the default case), replication_server is not included in the config.
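For illustration, a ring device entry carrying the new fields might look
roughly like this (the values are made up):

device = {
    'id': 0, 'region': 1, 'zone': 1, 'weight': 100.0,
    'ip': '10.0.0.10', 'port': 6001, 'device': 'sdb1',
    'replication_ip': '10.1.0.10',   # replicators talk over this network
    'replication_port': 6011,
}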
DocImpact
Change-Id: Ie9af5bdcdf9241c355e36053ca4adfe49dc35bd0
Implements: blueprint dedicated-replication-network
roundrobin_datadirs was returning any .db file at any depth in the
accounts/containers structure. Since xfs corruption can cause such
files to appear in odd places at times (only happened on one drive of
ours so far, but still...), I've refactored this function to only
return .db files at the proper depth.
Change-Id: Id06ef6584941f8a572e286f69dfa3d96fe451355
When a db is reclaimed it removes the hash dir the db files are in,
but it does not try to remove the parent suffix dir though it might
be empty now. This eventually leads to a bunch of empty suffix dirs
lying around. This patch fixes that by attempting to remove the
parent suffix dir after a hash dir reclamation.
Here's a quick script to see how bad a given drive might be:
import os, os.path, sys

if len(sys.argv) != 2:
    sys.exit('%s <mount-point>' % sys.argv[0])

in_use = 0
empty = 0
containers = os.path.join(sys.argv[1], 'containers')
for p in os.listdir(containers):
    partition = os.path.join(containers, p)
    for s in os.listdir(partition):
        suffix = os.path.join(partition, s)
        if os.listdir(suffix):
            in_use += 1
        else:
            empty += 1
print in_use, 'in use,', empty, 'empty,', '%.02f%%' % (
    100.0 * empty / (in_use + empty)), 'empty'
And here's a quick script to clean up a drive:
NOTE THAT I HAVEN'T ACTUALLY RUN THIS ON A LIVE NODE YET!
import errno, os, os.path, sys

if len(sys.argv) != 2:
    sys.exit('%s <mount-point>' % sys.argv[0])

containers = os.path.join(sys.argv[1], 'containers')
for p in os.listdir(containers):
    partition = os.path.join(containers, p)
    for s in os.listdir(partition):
        suffix = os.path.join(partition, s)
        try:
            os.rmdir(suffix)
        except OSError, err:
            if err.errno not in (errno.ENOENT, errno.ENOTEMPTY):
                print err
Change-Id: I2e6463a4cd40597fc236ebe3e73b4b31347f2309
To tell when replication for a device has finished, it's important to
know when the replicator is removing objects. This was previously
handled for the object-replicator
(object-replicator.partition.delete.count.<device> and
object-replicator.partition.update.count.<device> metrics) but not the
account and container replicators.
This patch extends the existing DB removal count metrics to make them
per-device. The new metrics are:
account-replicator.removes.<device>
container-replicator.removes.<device>
There's also a bonus refactoring and increased test coverage of the DB
replicator code.
Change-Id: I2067317d4a5f8ad2a496834147954bdcdfc541c1