...which helps us better differentiate in log messages between a drive
that's not mounted and one that's not a dir. We were already doing that
a bit in diskfile.py, and it seems like a useful distinction; let's do
it more.
While we're at it, remove some log translations.
Related-Change: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Related-Change: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: Ife0d34f9482adb4524d1ab1fe6c335c6b287c2fd
Partial-Bug: 1674543
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.
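The state progression in the workflow above can be pictured with a toy
model. This is only a sketch: the state names mirror the ones in this
message, but visit() and its cleave_batch_size parameter are hypothetical
names used for illustration, not the sharder's API.

```python
# Hypothetical, simplified model of the shard range states named above.
FOUND, CREATED, CLEAVED = 'found', 'created', 'cleaved'


def visit(shard_ranges, cleave_batch_size=2):
    """Advance a list of shard range states by one sharder visit.

    shard_ranges is a list of state strings in namespace order.
    """
    # Every FOUND range gets a shard container and becomes CREATED.
    states = [CREATED if s == FOUND else s for s in shard_ranges]
    # At most cleave_batch_size CREATED ranges are cleaved per visit.
    cleaved = 0
    for i, state in enumerate(states):
        if state == CREATED and cleaved < cleave_batch_size:
            states[i] = CLEAVED
            cleaved += 1
    return states
```

With three FOUND ranges and the default batch size of 2, the model needs
two visits before every range is CLEAVED.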
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
With this patch the ContainerBroker gains several new features:
1. A shard_ranges table to persist ShardRange data, along with
methods to merge ShardRange instances into that table, to access
them, and to remove expired shard ranges.
2. The ability to create a fresh db file to replace the existing db
file. Fresh db files are named using the hash of the container path
plus an epoch which is a serialized Timestamp value, in the form:
<hash>_<epoch>.db
During sharding both the fresh and retiring db files co-exist on
disk. The ContainerBroker is now able to choose the newest on disk db
file when instantiated. It also provides a method (get_brokers()) to
gain access to a broker instance for either on disk file.
3. Methods to access the current state of the on disk db files i.e.
UNSHARDED (old file only), SHARDING (fresh and retiring files), or
SHARDED (fresh file only with shard ranges).
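As a rough illustration of the file naming and db states described
above, the sketch below classifies a db directory purely from its file
names. It assumes the retiring file carries no epoch and it ignores the
"with shard ranges" condition for SHARDED; parse_db_filename() and
db_state() are hypothetical helpers, not ContainerBroker methods.

```python
import os

UNSHARDED, SHARDING, SHARDED = 'unsharded', 'sharding', 'sharded'


def parse_db_filename(filename):
    """Split '<hash>_<epoch>.db' or '<hash>.db' into (hash, epoch or None)."""
    name, _ext = os.path.splitext(os.path.basename(filename))
    hash_, _sep, epoch = name.partition('_')
    return hash_, epoch or None


def db_state(filenames):
    """Classify a db dir by which kinds of db file are present."""
    epochs = [parse_db_filename(f)[1] for f in filenames]
    has_old = any(e is None for e in epochs)
    has_fresh = any(e is not None for e in epochs)
    if has_old and has_fresh:
        return SHARDING
    return SHARDED if has_fresh else UNSHARDED
```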
Container replication is also modified:
1. shard ranges are replicated between container db peers. Unlike
objects, shard ranges are both pushed and pulled during a REPLICATE
event.
2. If a container db is capable of being sharded (i.e. it has a set of
shard ranges) then it will no longer attempt to replicate objects to
its peers. Object record durability is achieved by sharding rather than
peer to peer replication.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Ie4d2816259e6c25c346976e181fb9d350f947190
...in preparation for the container sharding feature.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I4455677abb114a645cff93cd41b394d227e805de
We've already got it in the response, may as well apply it now rather
than wait for the other end to get around to running its replicators.
Change-Id: Ie36a6dd075beda04b9726dfa2bba9ffed025c9ef
Similar to the object replicator and reconstructor, these arguments
are comma-separated lists of device names and partitions,
respectively, on which the account or container replicator will
operate. Other devices and partitions are ignored.
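A minimal sketch of this kind of filtering, assuming that an empty list
means "no restriction". The helper names parse_override() and
should_process() are made up for the example and are not the
replicator's own functions.

```python
def parse_override(value):
    """Turn a comma-separated option value into a set of names."""
    return {item.strip() for item in value.split(',') if item.strip()}


def should_process(device, partition, devices='', partitions=''):
    """Return True if (device, partition) passes both filters.

    An empty filter is treated as "no restriction" (an assumption).
    """
    wanted_devices = parse_override(devices)
    wanted_partitions = parse_override(partitions)
    if wanted_devices and device not in wanted_devices:
        return False
    if wanted_partitions and str(partition) not in wanted_partitions:
        return False
    return True
```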
Change-Id: Ic108f5c38f700ac4c7bcf8315bf4c55306951361
The object reconstructor has a handoffs-only mode that is very useful
when a cluster requires rapid rebalancing, like when disks are nearing
fullness. This mode's goal is to remove handoff partitions from disks
without spending effort on primary partitions. The object replicator
has a similar mode, though it varies in some details.
This commit adds a handoffs-only mode to the account and container
replicators.
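The idea of a handoffs-only pass can be sketched as a filter over
partitions. This is an illustration under assumptions:
partitions_to_replicate() and the get_primary_ids callable are
hypothetical, and the real replicators may decide this differently.

```python
def partitions_to_replicate(partitions, local_dev_id, get_primary_ids,
                            handoffs_only=False):
    """Yield the partitions this device should work on.

    get_primary_ids(partition) is a hypothetical callable returning the
    ids of the devices that are primaries for that partition.
    """
    for partition in partitions:
        is_primary = local_dev_id in get_primary_ids(partition)
        if handoffs_only and is_primary:
            # Primary partitions are left for a normal replication pass.
            continue
        yield partition
```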
Change-Id: I588b151ee65ae49d204bd6bf58555504c15edf9f
Closes-Bug: 1668399
If a cluster operator has some tooling that makes directories in
/srv/node/<disk>/accounts, then the account replicator will treat
those directories as partition dirs and may remove empty
subdirectories contained therein. This wastes time and confuses the
operator.
This commit makes DB replicators skip partition directories whose
names don't look like positive integers. This doesn't completely avoid
the problem since an operator can still use an all-digit name, but it
will skip directories like "tmp21945".
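One way to express "looks like a positive integer" is an all-digits
check on the directory name. The sketch below is an approximation of
that idea, not the patch itself; partition_dirs() is a hypothetical
helper.

```python
import re

# Names made up only of ASCII digits are treated as partition dirs.
ALL_DIGITS = re.compile(r'[0-9]+')


def partition_dirs(names):
    """Keep only the names that consist entirely of digits."""
    return [name for name in names if ALL_DIGITS.fullmatch(name)]
```

As the message notes, an operator-created directory with an all-digit
name would still pass this check.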
Change-Id: I8d6682915a555f537fc0ce8c39c3d52c99ff3056
We added check_drive to the account/container servers to unify how all
the storage wsgi servers treat device dirs/mounts. This pushes that
unification down into the consistency engine.
Drive-by:
* use FakeLogger less
* clean up some repetition in probe utility for device re-"mounting"
Related-Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Use assertIsNone() instead of assertEqual(None, ...): it states the
intent directly and checks identity with None rather than equality.
Change-Id: Ic52c319e3e55135df834fdf857982e1721bc44bb
Following OpenStack Style Guidelines:
[1] http://docs.openstack.org/developer/hacking/#unit-tests-and-assertraises
[H203] Unit test assertions tend to give better messages for more specific
assertions. As a result, assertIsNone(...) is preferred over
assertEqual(None, ...) and assertIs(..., None)
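A small, self-contained example of the preferred form. The test class
and the dict lookup are invented for illustration only.

```python
import unittest


class LookupTest(unittest.TestCase):
    def test_missing_key_returns_none(self):
        result = {}.get('missing')
        # Preferred over assertEqual(None, result) under H203.
        self.assertIsNone(result)
```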
Change-Id: If4db8872c4f5705c1fff017c4891626e9ce4d1e4
Previously, 'rsync_then_merge' merged remote objects into the rsync'ed
local db, but did not merge remote metadata with the local metadata.
The account/container replicator sometimes uses rsync for db sync when
the record history in the db files differs greatly between the 'local'
and 'remote' servers. If the replicator rsyncs the local db to the
remote while the local metadata is older, the older metadata gets
distributed, so some metadata values can go missing or revert to older
values.
This patch fixes the problem by merging 'remote' metadata into the
rsync'ed local db file.
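A sketch of what a timestamp-aware metadata merge can look like. It
assumes metadata is held as {key: (value, timestamp)} and that the
newest timestamp wins; merge_metadata() is a hypothetical helper, not
the broker's method.

```python
def merge_metadata(local, remote):
    """Merge two {key: (value, timestamp)} dicts, newest value wins."""
    merged = dict(local)
    for key, (value, timestamp) in remote.items():
        if key not in merged or timestamp > merged[key][1]:
            merged[key] = (value, timestamp)
    return merged
```

With a merge like this, rsyncing an older local db over a remote no
longer discards metadata that only the remote had, or that the remote
had updated more recently.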
Closes-Bug: #1570118
Change-Id: Icdf0a936fc456c5462471938cbc365bd012b05d4
When a drive holding a container or account database is unmounted, the
replicator pushes the database to a handoff location. But the
handoff in turn finds the replica with the unmounted drive and
pushes the database to the *next* handoff, until all handoffs have
a replica; all container/account servers then hold replicas for
all unmounted drives.
This patch solves:
- Consumption of the handoff iterator, which resulted in
replication to the next handoff and then the next.
- A StopIteration exception that ended the loop over available
handoffs early when no more nodes existed as db replication
candidates.
Regression was introduced in 2.4.0 with rsync compression.
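The StopIteration hazard is a general one when a shared iterator feeds
a fallback loop. The sketch below shows the pattern, not the patch:
replicate_with_handoffs() and try_node are hypothetical names.

```python
def replicate_with_handoffs(primaries, handoffs, try_node):
    """Try each primary; on failure fall back to the next unused handoff.

    try_node(node) returns True on success. Returns the nodes tried.
    """
    more_nodes = iter(handoffs)
    tried = []
    for node in primaries:
        while node is not None:
            tried.append(node)
            if try_node(node):
                break
            # next() with a default keeps StopIteration from escaping
            # and ending the outer loop when handoffs run out.
            node = next(more_nodes, None)
    return tried
```

Note that the handoff iterator is shared across primaries on purpose,
so each handoff is used at most once per call.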
Co-Author: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Closes-Bug: 1675500
Remove some cruft and cleanup tests from related change.
Related-Change: I721fa5fe9a7ae22eead8d5141f93e116847ca058
Change-Id: Id3addaca9057569a535e4df1c4209a0ddad84d20
If a db gets quarantined we may fail to cleanup an empty suffix dir or
a hash dir.
Change-Id: I721fa5fe9a7ae22eead8d5141f93e116847ca058
Closes-Bug: #1583719
Replace reload() builtin function with six.moves.reload_module() to
make the code compatible with Python 2 and Python 3.
Change-Id: I7572d613fef700b392d412501facc3bd5ee72a66
If one uses only a single replica and a database file is placed on a
wrong partition, it will be removed instead of replicated to the correct
partition.
There are two reasons for this:
1. The list of nodes is empty when there is only a single replica
2. all(responses) is True even if there are no responses at all, which
is always the case when there is no node to replicate to.
This patch fixes this by adding a special case if used with only one
replica to the node selection loop and ensures that the list of
responses is not empty. Also adds a test that fails on current master
and passes with this change.
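The second reason comes from Python itself: all() of an empty sequence
is True. A minimal sketch of the guard, with safe_to_delete() as a
hypothetical name for the check:

```python
def safe_to_delete(responses):
    """Treat replication as successful only if someone answered."""
    return bool(responses) and all(responses)
```

Without the emptiness check, a misplaced db with nowhere to replicate
to would look fully replicated and be removed.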
Closes-Bug: 1568591
Change-Id: I028ea8c1928e8c9a401db31fb266ff82606f8371
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202;
so they do not conflict with X-Windows or other services.
Updated SAIO docs.
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
Example:
* Different port in config and in ring file.
* Running daemon on server not in ring file.
In both cases the replication daemon is running but nothing is replicated.
The error log message makes it clear when a local device can't be identified.
Closes-Bug: 1508228
Change-Id: I99351b7d9946f250b7750df91c13d09352a145ce
a1c32702, 736cf54a, and 38787d0f remove uses of `simplejson` from
various parts of Swift in favor of the standard library `json`
module (introduced in Python 2.6). This commit performs the remaining
`simplejson` to `json` replacements, removes two comments highlighting
quirks of simplejson with respect to Unicode, and removes the references
to it in setup documentation and requirements.txt.
There were a lot of places where we were importing json from
swift.common.utils, which is less intuitive than a direct `import json`,
so that replacement is made as well.
(And in two more tiny drive-bys, we add some pretty-indenting to an XML
fragment and use `super` rather than naming a base class explicitly.)
Change-Id: I769e88dda7f76ce15cf7ce930dc1874d24f9498a
The current rule inside the db_replicator is to rsync+merge
containers during replication if the rowids differ by more
than 50%:
# if the difference in rowids between the two differs by
# more than 50%, rsync then do a remote merge.
if rinfo['max_row'] / float(info['max_row']) < 0.5:
This means that smaller containers, which have only a few rows and
differ by a small number, still rsync+merge rather than copying rows.
This change adds a new condition: the difference in the rowids must
be greater than the configured per_diff, otherwise usync will be used:
# if the difference in rowids between the two differs by
# more than 50% and the difference is greater than per_diff,
# rsync then do a remote merge.
# NOTE: difference > per_diff stops us from dropping to rsync
# on smaller containers, who have only a few rows to sync.
if rinfo['max_row'] / float(info['max_row']) < 0.5 and \
info['max_row'] - rinfo['max_row'] > self.per_diff:
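To see the effect of the extra condition, here is the same test wrapped
in a standalone function with a couple of worked cases. The function
name and the guard for an empty local db are additions for the example;
the default per_diff of 1000 is an assumption.

```python
def should_rsync_then_merge(local_max_row, remote_max_row, per_diff=1000):
    """Return True when the whole db should be rsync'ed and merged."""
    if local_max_row <= 0:
        # Nothing to send; also avoids dividing by zero (example only).
        return False
    remote_far_behind = remote_max_row / float(local_max_row) < 0.5
    gap_is_large = local_max_row - remote_max_row > per_diff
    return remote_far_behind and gap_is_large
```

A container with 10 local rows and 2 remote rows is 80% behind, but the
gap of 8 rows is below per_diff, so rows are copied instead. With 10000
local rows and 2000 remote rows both conditions hold.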
Change-Id: I9e779f71bf37714919a525404565dd075762b0d4
Closes-bug: #1019712
assertEquals is deprecated in py3, replacing it.
Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
Currently, the rsync module to which the replicators send data is static. This
prevents administrators from setting the rsync configuration according to their
deployment or needs.
As an example, the rsyncd configuration example encourages setting a connection
limit for the account, container and object modules. The limit protects
devices from excessive parallel connections, which would hurt
performance.
On a server with many devices, it is tempting to increase this number
proportionally, but nothing guarantees that the connections will be evenly
distributed. In the worst case, a single device can receive all the
connections, which severely impacts performance.
This commit adds a new option named 'rsync_module' to the *-replicator sections
of the *-server configuration file. This configuration variable can be
interpolated with device attributes like ip, port, device, zone, ... by using
the format {NAME}. eg:
rsync_module = {replication_ip}::object_{device}
With this configuration, an administrator can solve the connection
distribution problem by creating one module per device in the rsyncd
configuration.
The default values are backward compatible:
{replication_ip}::account
{replication_ip}::container
{replication_ip}::object
Option vm_test_mode is deprecated by this commit, but backward compatibility is
maintained. The option is only effective when rsync_module is not set. In that
case, {replication_port} is appended to the default value of rsync_module.
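The {NAME} substitution maps naturally onto Python's str.format. The
sketch below shows the idea with a made-up device dict;
expand_rsync_module() is a hypothetical helper and the attribute values
are invented.

```python
def expand_rsync_module(template, device):
    """Fill {NAME} placeholders in template from the device's attributes."""
    return template.format(**device)


# Invented device attributes, for illustration only.
example_device = {
    'replication_ip': '10.0.0.1',
    'replication_port': 6200,
    'device': 'sdb1',
    'zone': 1,
}
per_device = expand_rsync_module('{replication_ip}::object_{device}',
                                 example_device)
default = expand_rsync_module('{replication_ip}::object', example_device)
```

Here per_device is '10.0.0.1::object_sdb1' and default is
'10.0.0.1::object'.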
Change-Id: Iad91df50dadbe96c921181797799b4444323ce2e
The 'print' function is compatible with Python 2.x and 3.x.
Link: https://www.python.org/dev/peps/pep-3105/
Python 2.6 has a __future__ import that removes print as language syntax,
letting you use the functional form instead.
Change-Id: I94e1bc6bd83ad6b05695c7ebdf7cbfd8f6d9f9af
The assert_() method is deprecated and can be safely replaced by assertTrue().
This patch makes sure that running the tests does not create undesired
warnings.
Change-Id: I0602ba39ef93263386644ee68088d5f65fcb4a71
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs. The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring. In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket. The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk. The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring. The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.
The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True. This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.
A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses). The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.
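The mtime-tracking idea can be sketched independently of rings. This is
not BindPortsCache; MtimeCache and its loader argument are hypothetical
names showing only the "reload when the file changed" part.

```python
import os


class MtimeCache(object):
    """Reload a file's parsed contents only when its mtime changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader  # callable taking the path
        self.mtime = None
        self.value = None
        self.loads = 0  # how many times the file was actually read

    def get(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:
            self.value = self.loader(self.path)
            self.mtime = mtime
            self.loads += 1
        return self.value
```

Repeated get() calls cost one stat each and only re-run the loader
after the file's mtime has changed.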
This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server". The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks. Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None. Then,
object-replicator and object-reconstructor both set self.bind_port to
None if servers_per_port is enabled. Bonus improvement for IPv6
addresses in is_local_device().
This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/
Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").
However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.
This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.
In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.
Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.
For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md
Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md
If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md
DocImpact
Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.
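A short example of the portable spelling. The Countdown class is
invented for illustration; the next = __next__ alias is only needed if
the class must also work on Python 2.

```python
class Countdown(object):
    """Iterator yielding start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1

    next = __next__  # Python 2 name for the same method


countdown = Countdown(2)
first = next(countdown)  # builtin next() works on both versions
```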
Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
The renamer() method now, by default, does an fsync on the containing
directory of the target path and also on the parent dirs of newly
created directories. This can be explicitly turned off in cases where
it is not necessary (for example, quarantines).
The following article explains why this is necessary:
http://lwn.net/Articles/457667/
Although it may seem like the right thing to do, this change does come
at a performance penalty. However, no configurable option is provided to
turn it off.
Also, lock_path() inside invalidate_hash() was always creating part of
the object path in the filesystem, and those directories were never
fsync'd. This has been fixed.
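The core of the directory fsync can be sketched in a few lines. This is
a simplified, POSIX-only illustration: durable_rename() and fsync_dir()
are hypothetical names, and unlike renamer() the sketch does not create
missing parent directories.

```python
import os


def fsync_dir(dirpath):
    """fsync a directory so that changes to its entries reach disk."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(old, new, fsync=True):
    """Rename old to new, then fsync the directory containing new."""
    os.rename(old, new)
    if fsync:
        fsync_dir(os.path.dirname(os.path.abspath(new)))
```

The fsync flag mirrors the "can be explicitly turned off" behaviour
described above.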
Change-Id: Id8e02f84f48370edda7fb0c46e030db3b53a71e3
Signed-off-by: Prashanth Pai <ppai@redhat.com>
From rsync's man page:
-z, --compress
With this option, rsync compresses the file data as it is sent to the
destination machine, which reduces the amount of data being transmitted --
something that is useful over a slow connection.
A configurable option has been added to allow rsync to compress, but only
if the remote node is in a different region than the local one.
NOTE: Objects that are already compressed (for example: .tar.gz, .mp3)
might slow down the syncing process.
On-wire compression can also be extended to ssync later in a different
change if required. In the case of ssync, we could explore faster
compression libraries like lz4. rsync uses zlib, which is slow but offers
a higher compression ratio.
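The region check can be sketched as a small argument builder. The
function name, its parameters and the base argument list are
assumptions for illustration; only the --compress flag comes from the
rsync man page quoted above.

```python
def build_rsync_args(local_region, remote_region, compress_enabled):
    """Return rsync arguments, compressing only across regions."""
    args = ['rsync', '--whole-file']
    if compress_enabled and local_region != remote_region:
        # Cross-region links are assumed to be the slow ones.
        args.append('--compress')
    return args
```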
Change-Id: Ic9b9cbff9b5e68bef8257b522cc352fc3544db3c
Signed-off-by: Prashanth Pai <ppai@redhat.com>
This removes the test_dispatch test from test_db_replicator,
which has been commented out for a while.
Change-Id: Ia28fa923a65ad7d85804cbf6f7acef244741bab1
Closes-Bug: #1408502
As I understand it db replication starts with a preflight sync request
to the remote container server whose response will include the last
synced row_id that it has on file for the sending node's database id.
If the difference in the last sync point returned is more than 50% of
the local sending db's rows, it'll fall back to sending the whole db
over rsync and let the remote end merge items locally - but generally
there's just a few rows missing and they're shipped over the wire as
json and stuffed into some rather normal looking merge_items calls.
The one thing that's a bit different with these remote merge_items calls
(compared to your average run of the mill eat a bunch of entries out of
a .pending file) is the source kwarg. When this optional kwarg comes
into merge_items it's the remote sending db's uuid, and after we eat all
the rows it sent us we update our local incoming_sync table for that
uuid so that next time when it makes its pre-flight sync request we can
tell it where it left off.
Now normally the sending db is going to push out its rows up from the
returned sync_point in 1000 item diffs, up to 10 batches total (per_diff
and max_diffs options) - 10K rows. If that goes well then everything is
in sync up to at least the point it started, and the sending db will
*also* ship over *its* incoming_sync rows to merge_syncs on the remote
end. Since the sending db is in sync with these other dbs up to those
points so is the remote db now by way of the transitive property. Also
note through some weird artifact that I'm not entirely convinced isn't
an unrelated and possibly benign bug the incoming_sync table on the
sending db will often also happen to include its own uuid - maybe it
got pushed back to it from another node?
Anyway, that seemed to work well enough until a sending db got diff
capped (i.e. sent its 10K rows and wasn't finished). When this happened
the final merge_syncs call never gets sent because the remote end is
definitely *not* up to date with the other databases that the sending db
is - it's not even up-to-date with the sending db yet! But the hope is
certainly that on the next pass it'll be able to finish sending the
remaining items. But since the remote end is who decides what the last
successfully synced row with this local sending db was - it's super
important that the incoming_sync table is getting updated in merge_items
when that source kwarg is there.
I observed this simple and straightforward process wasn't working well
in one case - which is weird considering it didn't have much in the way
of tests. After I had the test and started looking into it, it seemed maybe
the source kwarg handling got over-indented a bit in the bulk insert
merge_items refactor. I think this is correct - maybe we could send
someone up to the mountain temple to seek out gholt?
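The diff-capped push described above can be modelled with plain lists.
This is a sketch under assumptions: push_rows() and its send callable
are hypothetical, rows are represented by their ids only, and the
per_diff and max_diffs defaults follow the numbers quoted in this
message.

```python
def push_rows(row_ids, sync_point, send, per_diff=1000, max_diffs=10):
    """Send rows newer than sync_point in batches.

    row_ids is a sorted list of local row ids. Returns a tuple of
    (new_sync_point, complete); complete is False when diff capped.
    """
    pending = [row_id for row_id in row_ids if row_id > sync_point]
    diffs = 0
    while pending and diffs < max_diffs:
        batch, pending = pending[:per_diff], pending[per_diff:]
        send(batch)
        # The receiver should record this point for the sender's uuid.
        sync_point = batch[-1]
        diffs += 1
    return sync_point, not pending
```

When complete is False the final merge_syncs call is skipped, so the
next pass can only resume from the right place if the receiving side
recorded the sync point for each batch.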
Change-Id: I4137388a97925814748ecc36b3ab5f1ac3309659
The common db replicator's code path for reclaiming deleted dbs beyond the
reclaim age was not covered by unittests, and an AttributeError snuck in. In
writing the test that would cover the common code both for accounts and
containers I discovered another KeyError with the container conditional for
validating the container's fully reported status.
This fixes both those issues and adds additional tests for the cleanup of
empty account and container partition and suffix directories.
Change-Id: I2a1bfaefebd05b01231bf71dd908fcc49adb4c36
Because we iterate over these directories on a replication run,
and they are not (previously) cleaned up, the time to start the
replication increases incrementally for each stale directory
lying around. Thousands of directories across dozens of disks
on a single machine can make for non-trivial startup times.
Plus it just seems like good housekeeping.
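A sketch of this kind of housekeeping, assuming a
partition/suffix/hash directory layout. cleanup_empty_dirs() and
remove_if_empty() are hypothetical helpers, not the replicator's code.

```python
import errno
import os


def remove_if_empty(path):
    """Remove path if it is an empty directory; report whether we did."""
    try:
        os.rmdir(path)
        return True
    except OSError as err:
        if err.errno in (errno.ENOTEMPTY, errno.EEXIST, errno.ENOENT):
            return False
        raise


def cleanup_empty_dirs(hash_dir):
    """Remove hash_dir, then its suffix and partition dirs, while empty."""
    suffix_dir = os.path.dirname(hash_dir)
    partition_dir = os.path.dirname(suffix_dir)
    removed = []
    for path in (hash_dir, suffix_dir, partition_dir):
        if not remove_if_empty(path):
            break  # a non-empty dir means its parents are non-empty too
        removed.append(path)
    return removed
```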
Closes-Bug: #1396152
Change-Id: Iab607b03b7f011e87b799d1f9af7ab3b4ff30019
The account/container replicator checks connection creation and timeout
for the HTTP REPLICATE request in _repl_to_node, but the check doesn't
really cover the connection, only the construction of the ReplConnection
class.
This patch removes that invalid check.
Change-Id: Ie6b4062123d998e69c15638b741e7d1ba8a08b62
Closes-Bug: #1359018
After a container database is replicated, a _post_replicate_hook will enqueue
misplaced objects for the container-reconciler into the .misplaced_objects
containers. Items to be reconciled are "batch loaded" into the reconciler
queue at the end of a container replication cycle by leveraging container
replication itself.
DocImpact
Implements: blueprint storage-policies
Change-Id: I3627efcdea75403586dffee46537a60add08bfda
Keep status_changed_at in container databases current with status changes that
occur as a result of container creation, deletion, or re-creation.
Merge container put/delete/created timestamps when handling replicate
responses from remote servers in addition to during the handling of the
REPLICATE request.
When storage policies are configured on a cluster, send status_changed_at,
object_count and storage_policy_index as part of container replication sync
args.
Use status_changed_at during replication to determine the oldest active
container and merge storage_policy_index.
DocImpact
Implements: blueprint storage-policies
Change-Id: Ib9a0dd42c271145e641437dc04d0ebea1e11fc47
FakeLogger gets better log level handling
Parameterize logger on some daemons which were previously
unparameterized and try and use the interface in tests.
FakeRing use more real code
The existing FakeRing mock's implementation bit me with a pretty subtle
character encoding issue by bypassing the hash_path code that is normally
part of get_part_nodes. This change tries to exercise more of the real
ring code paths when it makes sense and provides a better Fake for use in
testing.
Add write_fake_ring helper to test.unit for when you need a real ring.
DocImpact
Implements: blueprint storage-policies
Change-Id: Id2e3740b1dd569050f4e083617e7dd6a4249027e
It simply makes sense that the definition of DATADIR belongs to
backends. After all, some of them may not even have any.
Coincidentally, a few unnecessary imports are dropped.
By the way, on the object server side, diskfile.py provides DATADIR
in the same way already.
Change-Id: I60bfd522c77c4a0ee13697a2e31141777c7e2398