Currently, when the replicator or auditor hits ENODATA corruption, we
bail. This patch allows them to skip the damaged entry and continue, in
an effort to make progress on the work that needs to be done.
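For illustration, a minimal sketch of the skip-and-continue behavior
(not the actual Swift change; the xattr name and helper are
hypothetical):

    import errno
    import os


    def audit_paths(paths):
        """Yield paths whose metadata xattr is readable, skipping ENODATA."""
        for path in paths:
            try:
                os.getxattr(path, 'user.swift.metadata')
            except OSError as err:
                if err.errno == errno.ENODATA:
                    # Previously we'd re-raise and bail out of the whole
                    # run; now we skip the corrupted entry and keep going.
                    continue
                raise
            yield path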
Change-Id: Id4c6e31332a93eb64a7660cec2a42ef686a84d22
The new ring_ip option will be used by services when finding their own
devices in rings, defaulting to the bind_ip.
Notably, this allows services to be containerized while servers_per_port
is enabled:
* For the object-server, the ring_ip should be set to the host IP and
will be used to discover which ports need binding. Sockets will still
be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
host will publish ports 1:1.
* For the replicator and reconstructor, the ring_ip will be used to
discover which devices should be replicated. While bind_ip could
previously be used for this, it would have required a separate config
from the object-server.
Also rename the object daemon's bind_ip attribute to ring_ip so that
it's more obvious wherever we're using the IP for ring lookups instead
of socket binding.
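A rough sketch of the fallback described above (the exact Swift
plumbing may differ; `conf` here is just a plain options dict):

    def resolve_ring_ip(conf):
        """Address used for ring lookups; falls back to bind_ip."""
        bind_ip = conf.get('bind_ip', '0.0.0.0')
        return conf.get('ring_ip', bind_ip)


    # Containerized object-server: bind everywhere, but use the host's
    # address to discover which ring devices/ports belong to this node.
    conf = {'bind_ip': '0.0.0.0', 'ring_ip': '192.0.2.10'}
    assert resolve_ring_ip(conf) == '192.0.2.10'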
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
Nothing ever looked at ObjectReplicator.my_replication_ips, and only
one test depended on the ring-loading side effect buried down in
replicate(). So we can just fix the test and rip it out.
Change-Id: I82447bf9b2883c16b6e32f92179fc496f4e86dea
- Drop log level for successful rsyncs to debug; ops don't usually care.
- Add an option to skip "send" lines entirely -- in a large cluster,
during a meaningful expansion, there's too much information getting
logged; it's just wasting disk space.
Note that we already have similar filtering for directory creation;
that's been present since the initial commit of Swift code.
Drive-by: make it a little more clear that more than one suffix was
likely replicated when logging about success.
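As a sketch of the log filtering described above (not the replicator's
actual code; the option name and the rsync itemize prefix are
assumptions):

    import logging

    logger = logging.getLogger('object-replicator')


    def log_rsync_output(lines, log_rsync_transfers=True):
        for line in lines:
            if not log_rsync_transfers and line.startswith('<f'):
                continue  # drop per-file "send" lines entirely
            logger.debug(line)  # successful rsync output now logs at debug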
Change-Id: I02ba67e77e3378b2c2c8c682d5d230d31cd1bfa9
We only need the invalidation post-rsync, since rsync was changing data
on disk behind Swift's back. Move the REPLICATE call down into the
rsync() helper function and drop it from the reconstructor entirely.
Change-Id: I576901344f1f3abb33b52b36fde0b25b43e54c8a
Closes-Bug: #1818709
This only applies to post-sync REPLICATE calls, none of which actually
look at the response anyway.
Change-Id: I1de62140e7eb9a23152bb9fdb1fa0934e827bfda
This is to fix invalid assert statements like:
self.assertTrue('sync_point2: 5', lines.pop().strip())
self.assertTrue('sync_point1: 5', lines.pop().strip())
self.assertTrue('bytes: 1100', lines.pop().strip())
self.assertTrue('deletes: 2', lines.pop().strip())
self.assertTrue('puts: 3', lines.pop().strip())
self.assertTrue('1', jobs_to_delete[0]['partition'])
in which assertEqual should be used.
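A small self-contained illustration of why the two-argument assertTrue
form is a no-op check (the second argument is just the failure message)
and what the corrected assertion looks like:

    import unittest


    class Example(unittest.TestCase):
        def test_compare(self):
            line = ' puts: 3 '
            # self.assertTrue('puts: 3', line.strip()) always passes,
            # because the first argument is a non-empty (truthy) string.
            self.assertEqual('puts: 3', line.strip())  # actually compares


    if __name__ == '__main__':
        unittest.main()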
Change-Id: Ide5af2ae68fae0e5d6eb5c233a24388bb9942144
An earlier revision of the related change had a bug that this test would
have discovered.
Related-Change-Id: I85dcaf65b9f19ac4659fa5937f9b0b0e804fc54e
Change-Id: Iedd85ec65a11de189ce73e650d674aee0dc7e402
The object-replicator lets the user specify target partition numbers
when running a replication job, and that input is stored as a list of
integers. But the build_replication_jobs method uses os.listdir to
retrieve partitions as a list of strings, so the comparison with the
target partition numbers fails.
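A minimal sketch of the mismatch and the fix (variable names are
illustrative only):

    override_partitions = [1, 5, 9]   # parsed from the command line as ints
    listed = ['1', '5', '7']          # what os.listdir() returns: strings

    # The bug: an int never equals a str, so nothing is ever selected.
    assert [p for p in listed if p in override_partitions] == []

    # The fix: compare in a common type.
    wanted = [str(p) for p in override_partitions]
    assert [p for p in listed if p in wanted] == ['1', '5']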
Change-Id: I85dcaf65b9f19ac4659fa5937f9b0b0e804fc54e
Closes-Bug: #1822731
Previously, we'd sometimes shove strings into HASH_PATH_PREFIX or
HASH_PATH_SUFFIX, which would blow up on py3. Now, always use bytes.
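For example, md5 only accepts bytes on py3, so keeping the suffix as
bytes avoids the blow-up (a simplified illustration, not Swift's actual
hash_path code):

    from hashlib import md5

    HASH_PATH_SUFFIX = b'changeme'  # always bytes now, never str

    h = md5()
    h.update(b'/account/container/object')
    h.update(HASH_PATH_SUFFIX)  # a str argument here raises TypeError on py3
    print(h.hexdigest())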
Change-Id: Icab9981e8920da505c2395eb040f8261f2da6d2e
There's already a nice way to reap a process that has exited without
blocking on it, and it doesn't require timeouts.
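Presumably this refers to Popen.poll(), which reaps the child if it has
already exited and returns immediately otherwise; a tiny sketch:

    import subprocess

    proc = subprocess.Popen(['true'])
    # ... later, without blocking or sleeping:
    if proc.poll() is not None:
        print('child exited with %s and has been reaped' % proc.returncode)
    else:
        print('child still running; check again on the next pass')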
Change-Id: Ie327fecb6a3055ff146a94e1171ec0ec68d7179f
Related-Change: If6dc7b003e18ab4e8a5ed687c965025ebd417dfa
Verify the log line truncation method is actually called.
Related-Change: If063a12cac74b67078b6db1c4f489160a2a69de1
Change-Id: I8dcd0eac1396b251a2c2a31e167598bc1e48c463
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
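A trivial sketch of the cap described above (names are illustrative):

    def effective_worker_count(replicator_workers, local_device_count):
        """At most one worker per disk, regardless of the configured value."""
        return min(replicator_workers, local_device_count)


    assert effective_worker_count(99999999, 12) == 12
    assert effective_worker_count(4, 12) == 4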
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
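A sketch of the prefix format described above (the actual formatting
code in the replicator may differ slightly):

    import os


    def worker_prefix(worker_index, worker_count):
        return '[worker %d/%d pid=%d] ' % (
            worker_index, worker_count, os.getpid())


    print(worker_prefix(1, 2) + '154/154 (100.00%) partitions replicated '
          'in 1.02s (150.87/sec, 0s remaining)')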
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
Sometimes, an rsync process just won't die. You can send SIGKILL, but
it isn't very effective. This is sometimes seen due to attempted I/O
on a failing disk; with some disks, an rsync process won't die until
Linux finishes the current I/O operation (whether success or failure),
but the disk can't succeed and will retry forever instead of
failing. The net effect is an unkillable rsync process.
The replicator was dealing with this by sending SIGKILL to any rsync
that ran too long, then calling waitpid() in a loop[1] until the rsync
died so it could reap the child process. This worked pretty well
unless it met an unkillable rsync; in that case, one greenthread would
end up blocked for a very long time. Since the replicator's main loop
works by (a) gathering all replication jobs, (b) performing them in
parallel with some limited concurrency, then (c) waiting for all jobs
to complete, an unkillable rsync would block the entire replicator.
There was an attempt to address this by adding a lockup detector: if
the replicator failed to complete any replication cycle in N seconds
[2], all greenthreads except the main one would be terminated and the
replication cycle restarted. It works okay, but only handles total
failure. If you have 20 greenthreads working and 19 of them are
blocked on unkillable rsyncs, then as long as the 20th greenthread
manages to replicate at least one partition every N seconds, the
replicator will just keep limping along.
This commit removes the lockup detector. Instead, when a replicator
greenthread happens upon an rsync that doesn't die promptly after
receiving SIGKILL, the process handle is sent to a background
greenthread; that background greenthread simply waits for those rsync
processes to finally die and reaps them. This lets the replicator make
better progress in the presence of unkillable rsyncs.
[1] It's a call to subprocess.Popen.wait(); the looping and sleeping
happens in eventlet.
[2] The default is 1800 seconds = 30 minutes, but the value is
configurable.
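A simplified sketch of the hand-off (using eventlet, as the replicator
does; the names and the grace period here are illustrative, not the
actual implementation):

    import eventlet
    from eventlet.green import subprocess

    reaper_pool = eventlet.GreenPool()


    def kill_rsync(proc, grace=1.0):
        """Send SIGKILL; if rsync still won't die, reap it in the background."""
        proc.kill()
        try:
            with eventlet.Timeout(grace):
                proc.wait()  # the usual case: returns almost immediately
        except eventlet.Timeout:
            # Don't block this worker greenthread on an unkillable rsync;
            # a background greenthread waits for it and reaps it eventually.
            reaper_pool.spawn_n(proc.wait)


    # e.g. kill_rsync(subprocess.Popen(['rsync', '--version']))
    kill_rsync(subprocess.Popen(['sleep', '60']))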
Change-Id: If6dc7b003e18ab4e8a5ed687c965025ebd417dfa
test.unit.obj.test_replicator.TestObjectReplicator.test_replicate_lockup_detector
is failing a lot in the gate. Let's disable it for now so that other
patches can continue to land.
Change-Id: I1790ebcbc0c8d075c2786aebca4e8ccf7547b178
Also, add some guards against a NameError in particularly-bad races.
Change-Id: If90662b6996e25bde74e0a202301b52a1d266e92
Related-Change: Ifd14ce82de1f7ebb636d6131849e0fadb113a701
The replicator on master doesn't propagate the kill signal to the rsync
subprocess running in the coroutine. Because of that behavior, the
lockup detector leaves a lot of rsync processes behind even though it
tries to reset the run. This patch makes the replicator kill the rsync
processes when the lockup detector kills the eventlet threads.
Change-Id: Ifd14ce82de1f7ebb636d6131849e0fadb113a701
Otherwise, swift-in-the-small can fill up logs with
object-replicator: Error syncing partition:
Traceback (most recent call last):
File ".../swift/obj/replicator.py", line 419, in update
node = next(nodes)
StopIteration
...which simultaneously sounds worse than it is and isn't helpful in
diagnosing/debugging the issue.
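A minimal sketch of quieting that noise (illustrative only): use next()
with a default instead of letting StopIteration escape as a traceback:

    def next_node(nodes):
        node = next(nodes, None)
        if node is None:
            # Ran out of candidates; a short log line beats a traceback.
            return None
        return node


    assert next_node(iter([])) is None
    assert next_node(iter([{'ip': '127.0.0.1'}])) == {'ip': '127.0.0.1'}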
Change-Id: I2f5bb12f3704880df1750229425f64f419ff9aef
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
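The general shape of the fix is to store a checksum alongside the
serialized metadata and verify it on read; a hedged sketch of the idea
(not Swift's actual on-disk format):

    import hashlib
    import pickle


    def pack_metadata(metadata):
        blob = pickle.dumps(metadata, protocol=2)
        return blob, hashlib.md5(blob).hexdigest()


    def unpack_metadata(blob, checksum):
        if hashlib.md5(blob).hexdigest() != checksum:
            # A flipped bit in the xattrs is now detectable; quarantine
            # instead of cheerfully unpickling (or decrypting) junk.
            raise ValueError('metadata checksum mismatch')
        return pickle.loads(blob)


    blob, cksum = pack_metadata({'ETag': 'd41d8cd98f00b204e9800998ecf8427e'})
    assert unpack_metadata(blob, cksum)['ETag'].startswith('d41d8cd')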
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
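Under the assumption that the helper works roughly like the following
(a sketch, not the actual implementation), it just probes TMPDIR and
raises SkipTest when xattrs can't be written:

    import errno
    import os
    import tempfile
    import unittest


    def skip_if_no_xattrs():
        with tempfile.NamedTemporaryFile() as tf:
            try:
                # Needs room for a large value, per the 4k requirement.
                os.setxattr(tf.name, b'user.swift.test', b'x' * 4096)
            except OSError as err:
                if err.errno in (errno.ENOTSUP, errno.ENOSPC, errno.E2BIG):
                    raise unittest.SkipTest(
                        'xattrs not (sufficiently) supported in %s'
                        % tempfile.gettempdir())
                raise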
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
We added check_drive to the account/container servers to unify how all
the storage WSGI servers treat device dirs/mounts. This pushes that
unification down into the consistency engine.
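For reference, a hedged sketch of the kind of helper being unified (the
real one lives in swift.common.constraints; its exact signature and
return contract may differ):

    import os


    def check_drive(devices_root, device, mount_check):
        """Return the device path if it is usable, else None."""
        path = os.path.join(devices_root, device)
        if mount_check:
            return path if os.path.ismount(path) else None
        return path if os.path.isdir(path) else None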
Drive-by:
* use FakeLogger less
* clean up some repetition in the probe utility for device re-"mounting"
Related-Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Insufficient arguments are passed to create MockProcess instances
resulting in StopIteration errors being raised during the repeated
replicator run_once cycles added in [1]. The test passes because
the replicator just logs these exceptions, but the logger noise is
distracting when running the test [2].
[1] Related-Change: Ib5c9dd17e40150450ec57a728ae8652fbc730af6
[2] nosetests ./test/unit/obj/test_replicator.py:\
TestObjectReplicator.test_run_once -s
Change-Id: I36208e93c81744068a3454577a30d0c5a8d9cb9b
This patch adds methods to increase the partition power of an existing
object ring, using a 3-step process, without downtime for users. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will now create a
hard-link in the new location to the original DiskFile object. Already
existing data will be relinked to the new locations by a new tool,
using hardlinks.
2. The actual partition power itself will be increased. Servers will
now use the new partition power for both reads and writes. Hard links
in the old object locations that are no longer required now have to be
removed by the relinker tool; the relinker reads the next_part_power
setting to find object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
This mostly implements the spec in [1]; however, it does not use an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespaces to avoid
conflicts with auditors and replicators, as well as to allow aborting
such an operation and simply removing the new tree. This would require
heavy changes to the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should simply be run, and afterwards
the partition power can be increased. This shouldn't take much time
(it's only walking the filesystem and hardlinking), so the impact
should be low. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can even be aborted manually and restarted
later.
Auditors do not quarantine objects written to a path with a different
partition power and therefore keep working as before (though in the
worst case they read each object twice before the no-longer-needed hard
links are removed).
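To see why hardlinking is enough, note that only the partition (the top
bits of the path hash) changes when the power grows, so the same file
simply gains a second name under the new partition directory; a
simplified sketch (Swift's real hashing also mixes in a cluster-wide
prefix/suffix):

    import hashlib


    def partition_for(path, part_power):
        """Top part_power bits of the path hash (salt omitted for brevity)."""
        digest = hashlib.md5(path.encode('utf8')).digest()
        return int.from_bytes(digest[:4], 'big') >> (32 - part_power)


    old = partition_for('/AUTH_test/c/o', 10)
    new = partition_for('/AUTH_test/c/o', 11)
    assert new in (old * 2, old * 2 + 1)  # the old partition splits in place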
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
Some public functions in the diskfile manager expect or return full
file paths, which implies a filesystem-based diskfile implementation.
To make it easier to plug in alternate diskfile implementations, this
patch changes those functions to take more generic arguments.
This commit changes DiskFileManager _get_hashes() arguments from:
- partition_path, recalculate=None, do_listdir=False
to:
- device, partition, policy, recalculate=None, do_listdir=False
Callers are modified accordingly, in diskfile.py, reconstructor.py,
and replicator.py
Change-Id: I8e2d7075572e466ae2fa5ebef5e31d87eed90fec
random.randint includes both endpoints, so random.randint(0, 9), which
is assigned in the replicator, yields values in [0, 9]. Hence the
assertion for replication_cycle should be *less than or equal to* 9,
and the replication_cycle should be taken mod 10.
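A tiny illustration of the corrected expectations:

    import random

    replication_cycle = random.randint(0, 9)
    assert 0 <= replication_cycle <= 9   # 9 is included, so not "< 9"
    replication_cycle = (replication_cycle + 1) % 10
    assert 0 <= replication_cycle <= 9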
Change-Id: I81da375a4864256e8f3b473d4399402f83fc6aeb
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.
As a drive-by, I also clean up the reclaim_age plumbing from get_hashes
to cleanup_ondisk_files, since the latter is a method on the Manager
and has access to the configured reclaim_age. This fixes a bug where
finalize_put
wouldn't use the [DEFAULT]/object-server configured reclaim_age - which
is normally benign but leads to weird behavior on DELETE requests with
really small reclaim_age.
There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
the aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age and that
method could have just as reasonably been implemented on the Manager.
UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup. All object
services must use the same reclaim_age. If you require a non-default
reclaim age it should be set in the [DEFAULT] section. If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.
If you specify a reclaim_age value in any object related config you
should move it to *only* the [DEFAULT] section before you upgrade. If
you configure a reclaim_age less than your consistency window, you are
likely to be eaten by a Grue.
Closes-Bug: #1626296
Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
This patch fixes the object-reconstructor to calculate device_count as
the total number of local devices across all policies. Previously Swift
counted it per policy, so reconstruction_device_count (the number of
devices Swift actually needs to reconstruct) ended up being the sum of
the per-policy counts.
With this patch, Swift gathers all local devices for all policies
first, and then collects parts for each device as it does currently.
This way, we can see the status of the remaining job/disk percentage
via the stats_line output.
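A hedged sketch of the counting change (the real reconstructor collects
device dicts from the rings; plain sets are used here for brevity):

    def local_device_count(devices_per_policy):
        """Count the union of local devices across all policies."""
        all_devices = set()
        for devices in devices_per_policy.values():
            all_devices.update(devices)
        return len(all_devices)


    per_policy = {0: {'sda', 'sdb'}, 1: {'sda', 'sdb'}}
    assert local_device_count(per_policy) == 2  # previously summed to 4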
To enable this change, this patch also touches the object replicator
to get a DiskFileManager via the DiskFileRouter class so that
DiskFileManager instances are policy specific. Currently the same
replication policy DiskFileManager class is always used, but this
change future proofs the replicator for possible other DiskFileManager
implementations.
The change also gives the ObjectReplicator a _df_router variable,
making it consistent with the ObjectReconstructor, and allowing a
common way for ssync.Sender to access DiskFileManager instances via
its daemon's _df_router instance.
Also, remove the use of FakeReplicator from the ssync test suite. It
was not necessary and risked masking divergence between ssync and the
replicator and reconstructor daemon implementations.
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Closes-Bug: #1488608
Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068
Right now the do_listdir option is set on every 10th replication run.
Due to the randomness of the job listing, this might update a given
partition much less often than expected; for example, with 1000
partitions per replicator, only on every ~70th run.
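One deterministic alternative (a sketch of the direction of the fix,
not necessarily the exact expression used): key do_listdir off the
partition and the cycle counter, so each partition gets a full listdir
once every ten cycles regardless of job ordering:

    def do_listdir(partition, replication_cycle):
        return (int(partition) + replication_cycle) % 10 == 0


    # Every partition hits exactly one listdir cycle out of ten.
    assert all(
        sum(do_listdir(p, c) for c in range(10)) == 1 for p in range(1000))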
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Christian Schwede <cschwede@redhat.com>
Related-Bug: #1634967
Closes-Bug: 1644807
Change-Id: Ib5c9dd17e40150450ec57a728ae8652fbc730af6