This is to fix invalid assert statements like:
self.assertTrue('sync_point2: 5', lines.pop().strip())
self.assertTrue('sync_point1: 5', lines.pop().strip())
self.assertTrue('bytes: 1100', lines.pop().strip())
self.assertTrue('deletes: 2', lines.pop().strip())
self.assertTrue('puts: 3', lines.pop().strip())
self.assertTrue('1', jobs_to_delete[0]['partition'])
in which assertEqual should be used.
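A minimal sketch of why those assertions were invalid: assertTrue treats its second argument as the failure message, so any truthy first argument passes. The test case name and helper below are hypothetical and exist only for illustration.

```python
import unittest


class AssertDemo(unittest.TestCase):  # hypothetical test case
    def test_assert_true_always_passes(self):
        # The second argument is only the failure message, so this
        # passes even though the two strings differ.
        self.assertTrue('puts: 3', 'puts: 4')

    def test_assert_equal_compares(self):
        # assertEqual really compares the two values and fails here.
        with self.assertRaises(AssertionError):
            self.assertEqual('puts: 3', 'puts: 4')


def run_demo():
    """Run both demo tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AssertDemo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```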
Change-Id: Ide5af2ae68fae0e5d6eb5c233a24388bb9942144
An earlier revision of the related change had a bug that this test would
have discovered.
Related-Change-Id: I85dcaf65b9f19ac4659fa5937f9b0b0e804fc54e
Change-Id: Iedd85ec65a11de189ce73e650d674aee0dc7e402
The object replicator lets the user specify target partition numbers
when running a replication job, and that input is stored as a list of
integers. However, the build_replication_jobs method uses os.listdir to
retrieve partitions as a list of strings, so the comparison against the
target partition numbers always failed.
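The mismatch can be sketched as follows. The helper name and its arguments are hypothetical; the point is only that directory names are strings and must be converted before comparing them with integer partition numbers.

```python
def filter_partitions(listed, override_partitions):
    """Keep only the listed partition dirs selected by the override list.

    listed: names as os.listdir() would return them (strings)
    override_partitions: integers parsed from user input
    """
    if not override_partitions:
        return list(listed)
    # Convert each directory name to int before the membership test;
    # '1' in [1, 2] is always False.
    return [p for p in listed
            if p.isdigit() and int(p) in override_partitions]
```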
Change-Id: I85dcaf65b9f19ac4659fa5937f9b0b0e804fc54e
Closes-Bug: #1822731
Previously, we'd sometimes shove strings into HASH_PATH_PREFIX or
HASH_PATH_SUFFIX, which would blow up on py3. Now, always use bytes.
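A rough sketch of the idea, not the actual Swift code: normalize the prefix, suffix and path to bytes before hashing, since py3 refuses to concatenate str and bytes. The values and the to_bytes helper are assumptions made for illustration.

```python
import hashlib

# Hypothetical values; a real deployment reads these from configuration.
HASH_PATH_PREFIX = b'example-prefix'
HASH_PATH_SUFFIX = b'example-suffix'


def to_bytes(value):
    """Encode text as UTF-8; pass bytes through unchanged."""
    return value.encode('utf-8') if isinstance(value, str) else value


def hash_path(path, prefix=HASH_PATH_PREFIX, suffix=HASH_PATH_SUFFIX):
    """Hash a path with the prefix and suffix, always working on bytes."""
    data = to_bytes(prefix) + to_bytes(path) + to_bytes(suffix)
    return hashlib.md5(data).hexdigest()
```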
Change-Id: Icab9981e8920da505c2395eb040f8261f2da6d2e
There's already a nice way to reap a process if it's exited while not
waiting, and it doesn't require timeouts.
Change-Id: Ie327fecb6a3055ff146a94e1171ec0ec68d7179f
Related-Change: If6dc7b003e18ab4e8a5ed687c965025ebd417dfa
Verify the log line truncation method is actually called.
Related-Change: If063a12cac74b67078b6db1c4f489160a2a69de1
Change-Id: I8dcd0eac1396b251a2c2a31e167598bc1e48c463
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
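The worker cap described above amounts to a simple minimum. This is a sketch with a hypothetical function name; treating a non-positive setting as "no worker processes" is an assumption based on the wording above.

```python
def worker_count(replicator_workers, num_disks):
    """Return how many worker processes would be spawned."""
    if replicator_workers <= 0:
        # Assumption: a non-positive setting means no worker processes.
        return 0
    # At most one worker per disk, and never more than the setting.
    return min(replicator_workers, num_disks)
```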
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
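One way to get such a prefix with the standard library is a logging.LoggerAdapter. This is only a sketch; the class and function names are hypothetical and the replicator's own logger wrapper may work differently.

```python
import logging
import os


class PrefixAdapter(logging.LoggerAdapter):
    """Prepend a fixed prefix to every message sent through the adapter."""

    def process(self, msg, kwargs):
        return self.extra['prefix'] + msg, kwargs


def worker_logger(logger, index, total, pid=None):
    """Wrap logger so messages start with "[worker M/N pid=P] "."""
    pid = os.getpid() if pid is None else pid
    prefix = '[worker %d/%d pid=%d] ' % (index, total, pid)
    return PrefixAdapter(logger, {'prefix': prefix})
```

Anything written straight to stdout or stderr bypasses the adapter, which matches the limitation noted above.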
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
Sometimes, an rsync process just won't die. You can send SIGKILL, but
it isn't very effective. This is sometimes seen due to attempted I/O
on a failing disk; with some disks, an rsync process won't die until
Linux finishes the current I/O operation (whether success or failure),
but the disk can't succeed and will retry forever instead of
failing. The net effect is an unkillable rsync process.
The replicator was dealing with this by sending SIGKILL to any rsync
that ran too long, then calling waitpid() in a loop[1] until the rsync
died so it could reap the child process. This worked pretty well
unless it met an unkillable rsync; in that case, one greenthread would
end up blocked for a very long time. Since the replicator's main loop
works by (a) gathering all replication jobs, (b) performing them in
parallel with some limited concurrency, then (c) waiting for all jobs
to complete, an unkillable rsync would block the entire replicator.
There was an attempt to address this by adding a lockup detector: if
the replicator failed to complete any replication cycle in N seconds
[2], all greenthreads except the main one would be terminated and the
replication cycle restarted. It works okay, but only handles total
failure. If you have 20 greenthreads working and 19 of them are
blocked on unkillable rsyncs, then as long as the 20th greenthread
manages to replicate at least one partition every N seconds, the
replicator will just keep limping along.
This commit removes the lockup detector. Instead, when a replicator
greenthread happens upon an rsync that doesn't die promptly after
receiving SIGKILL, the process handle is sent to a background
greenthread; that background greenthread simply waits for those rsync
processes to finally die and reaps them. This lets the replicator make
better progress in the presence of unkillable rsyncs.
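The hand-off can be sketched like this. It uses ordinary threads and a queue rather than eventlet greenthreads, and the class name, function name and grace period are all assumptions for illustration.

```python
import queue
import subprocess
import threading


class BackgroundReaper:
    """Wait for handed-off processes in a background thread."""

    def __init__(self):
        self._queue = queue.Queue()
        self.reaped = []
        thread = threading.Thread(target=self._run, daemon=True)
        thread.start()

    def add(self, proc):
        self._queue.put(proc)

    def _run(self):
        while True:
            proc = self._queue.get()
            proc.wait()  # may block for a long time; only this thread waits
            self.reaped.append(proc.pid)
            self._queue.task_done()


def kill_and_reap(proc, reaper, grace=1.0):
    """Kill proc; if it does not die promptly, hand it to the reaper.

    Returns True if the process was reaped here, False if handed off.
    """
    proc.kill()
    try:
        proc.wait(timeout=grace)
        return True
    except subprocess.TimeoutExpired:
        reaper.add(proc)
        return False
```

The caller returns after at most the grace period either way, so a single stuck process no longer stalls the job that spawned it.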
[1] It's a call to subprocess.Popen.wait(); the looping and sleeping
happens in eventlet.
[2] The default is 1800 seconds = 30 minutes, but the value is
configurable.
Change-Id: If6dc7b003e18ab4e8a5ed687c965025ebd417dfa
test.unit.obj.test_replicator.TestObjectReplicator.test_replicate_lockup_detector
is failing a lot in the gate. Let's disable it for now so that other
patches can continue to land.
Change-Id: I1790ebcbc0c8d075c2786aebca4e8ccf7547b178
Also, add some guards against a NameError in particularly-bad races.
Change-Id: If90662b6996e25bde74e0a202301b52a1d266e92
Related-Change: Ifd14ce82de1f7ebb636d6131849e0fadb113a701
The replicator in master doesn't propagate the kill signal to the
subprocess in the coroutine. Because of this, the lockup detector
leaves a lot of rsync processes behind even though it tries to reset
the process.
This patch makes the replicator kill its rsync processes when the
lockup detector kills the eventlet threads.
Change-Id: Ifd14ce82de1f7ebb636d6131849e0fadb113a701
Otherwise, swift-in-the-small can fill up logs with
object-replicator: Error syncing partition:
Traceback (most recent call last):
File ".../swift/obj/replicator.py", line 419, in update
node = next(nodes)
StopIteration
...which simultaneously sounds worse than it is and isn't helpful in
diagnosing/debugging the issue.
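A sketch of one way to avoid the bare StopIteration, not necessarily how this change does it: give next() a default and stop quietly when the node iterator runs dry. The function and its arguments are hypothetical.

```python
def sync_partition(nodes, attempt):
    """Try nodes until one sync succeeds; return the number of attempts."""
    nodes = iter(nodes)
    attempts = 0
    while True:
        node = next(nodes, None)  # None instead of raising StopIteration
        if node is None:
            break  # ran out of nodes: stop without a traceback
        attempts += 1
        if attempt(node):
            break
    return attempts
```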
Change-Id: I2f5bb12f3704880df1750229425f64f419ff9aef
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
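The kind of check this calls for can be sketched with a checksum stored next to the pickled metadata. A plain dict stands in for the xattr store, and the key names and exception class are assumptions, not the names Swift uses.

```python
import hashlib
import pickle


class MetadataCorrupt(Exception):
    """Raised when stored metadata does not match its checksum."""


def write_metadata(store, metadata):
    """Store pickled metadata together with a checksum of the pickle."""
    blob = pickle.dumps(metadata, protocol=2)
    store['metadata'] = blob
    store['metadata_checksum'] = hashlib.md5(blob).hexdigest()


def read_metadata(store):
    """Return the metadata, refusing blobs that fail the checksum."""
    blob = store['metadata']
    if hashlib.md5(blob).hexdigest() != store['metadata_checksum']:
        raise MetadataCorrupt('metadata checksum mismatch')
    return pickle.loads(blob)
```

Without the checksum, a corrupted blob that still unpickles would be returned as if it were valid.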
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (eg tmpfs or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
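A hedged sketch of such a skip helper, assuming Linux's os.setxattr and a 4k probe value. The function names are hypothetical stand-ins, not the helper that ships with the tests.

```python
import os
import tempfile
import unittest


def xattrs_supported(dirpath=None, size=4096):
    """Return True if files under dirpath accept `size` bytes of xattrs."""
    if not hasattr(os, 'setxattr'):  # os.setxattr only exists on Linux
        return False
    dirpath = dirpath or tempfile.gettempdir()  # honors the TMPDIR envvar
    with tempfile.NamedTemporaryFile(dir=dirpath) as f:
        try:
            os.setxattr(f.name, 'user.test', b'x' * size)
        except OSError:
            return False
    return True


def skip_if_no_xattrs_sketch():
    """Raise SkipTest when the temp dir cannot hold the xattrs."""
    if not xattrs_supported():
        raise unittest.SkipTest(
            'No xattr support in %s' % tempfile.gettempdir())
```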
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
We added check_drive to the account/container servers to unify how all
the storage wsgi servers treat device dirs/mounts. This pushes that
unification down into the consistency engine.
Drive-by:
* use FakeLogger less
* clean up some repetition in probe utility for device re-"mounting"
Related-Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Insufficient arguments are passed to create MockProcess instances
resulting in StopIteration errors being raised during the repeated
replicator run_once cycles added in [1]. The test passes because
the replicator just logs these exceptions, but the logger noise is
distracting when running the test [2].
[1] Related-Change: Ib5c9dd17e40150450ec57a728ae8652fbc730af6
[2] nosetests ./test/unit/obj/test_replicator.py:\
TestObjectReplicator.test_run_once -s
Change-Id: I36208e93c81744068a3454577a30d0c5a8d9cb9b
This patch adds methods to increase the partition power of an existing
object ring without downtime for the users using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will now create a
hard-link in the new location to the original DiskFile object. Already
existing data will be relinked using a new tool in the new locations
using hardlinks.
2. The actual partition power itself will be increased. Servers will now
use the new partition power to read from and write to. No longer
required hard links in the old object location have to be removed now by
the relinker tool; the relinker tool reads the next_part_power setting
to find object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
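Why no data has to move between devices can be seen from how a partition is derived from the top bits of a hash. The sketch below ignores the hash path prefix and suffix, and the function name is an assumption; with one more bit of partition power, every object in old partition P lands in new partition 2P or 2P+1.

```python
import hashlib
import struct


def get_partition(path, part_power):
    """Partition = the top `part_power` bits of the MD5 of the path."""
    digest = hashlib.md5(path.encode('utf-8')).digest()
    return struct.unpack_from('>I', digest)[0] >> (32 - part_power)


def new_partitions_for(old_partition):
    """The two partitions an old partition splits into at power + 1."""
    return (2 * old_partition, 2 * old_partition + 1)
```

Because the new partition is determined entirely by the old one plus one extra hash bit, a hard link on the same device is enough to relocate an object.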
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should be simply run and afterwards
the partition power will be increased. This shouldn't take that much
time (it's only walking the filesystem and hardlinking); impact should
be low therefore. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can be even aborted manually and restarted later.
Auditors are not quarantining objects written to a path with a different
partition power and therefore working as before (though they are reading
each object twice in the worst case before the no longer needed hard
links are removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
Some public functions in the diskfile manager expect or return full
file paths. It implies a filesystem diskfile implementation.
To make it easier to plug alternate diskfile implementations, patch
functions to take more generic arguments.
This commit changes DiskFileManager _get_hashes() arguments from:
- partition_path, recalculate=None, do_listdir=False
to:
- device, partition, policy, recalculate=None, do_listdir=False
Callers are modified accordingly, in diskfile.py, reconstructor.py,
and replicator.py
Change-Id: I8e2d7075572e466ae2fa5ebef5e31d87eed90fec
random.randint includes both endpoints, so the value of
random.randint(0, 9) that is assigned in the replicator
falls in the range [0, 9]. Hence, the assertion for replication_cycle should be
*less than or equal to* 9, and replication_cycle should be taken modulo 10.
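A short sketch of the two properties being asserted; the helper names are hypothetical.

```python
import random


def initial_cycle(rng=random):
    """Pick a starting cycle; randint includes both 0 and 9."""
    return rng.randint(0, 9)


def next_cycle(cycle):
    """Advance the cycle, wrapping around so it stays in [0, 9]."""
    return (cycle + 1) % 10
```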
Change-Id: I81da375a4864256e8f3b473d4399402f83fc6aeb
The reclaim_age is a DiskFile option, it doesn't make sense for two
different object services or nodes to use different values.
As a drive-by, I also cleaned up the reclaim_age plumbing from get_hashes to
cleanup_ondisk_files since it's a method on the Manager and has access
to the configured reclaim_age. This fixes a bug where finalize_put
wouldn't use the [DEFAULT]/object-server configured reclaim_age - which
is normally benign but leads to weird behavior on DELETE requests with
really small reclaim_age.
There's a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
the aborted PUTs that failed to cleanup their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age and that
method could have just as reasonably been implemented on the Manager.
UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup. All object
services must use the same reclaim_age. If you require a non-default
reclaim age it should be set in the [DEFAULT] section. If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.
If you specify a reclaim_age value in any object related config you
should move it to *only* the [DEFAULT] section before you upgrade. If
you configure a reclaim_age less than your consistency window you are
likely to be eaten by a Grue.
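The effect of setting the option only in [DEFAULT] can be illustrated with the standard configparser, where every section inherits [DEFAULT] values. The sample value (two weeks, in seconds) and the helper are assumptions; Swift's own config loading is not shown here.

```python
import configparser

# Hypothetical config text; 1209600 seconds is two weeks.
SAMPLE_CONF = """
[DEFAULT]
reclaim_age = 1209600

[app:object-server]

[object-replicator]

[object-reconstructor]
"""


def reclaim_ages(conf_text):
    """Return the reclaim_age seen by each section of the config."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return {section: parser.getint(section, 'reclaim_age')
            for section in parser.sections()}
```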
Closes-Bug: #1626296
Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
This patch fixes the object-reconstructor to calculate device_count
as the total number of local devices in all policies. Previously,
Swift counted device_count separately for each policy, whereas
reconstruction_device_count, the number of devices Swift actually
needs to reconstruct, was counted as the sum over all policies.
With this patch, Swift first gathers all local devices for all policies
and then collects parts for each device as it does currently.
As a result, the stats_line output shows meaningful percentages for
the remaining jobs and disks.
To enable this change, this patch also touches the object replicator
to get a DiskFileManager via the DiskFileRouter class so that
DiskFileManager instances are policy specific. Currently the same
replication policy DiskFileManager class is always used, but this
change future proofs the replicator for possible other DiskFileManager
implementations.
The change also gives the ObjectReplicator a _df_router variable,
making it consistent with the ObjectReconstructor, and allowing a
common way for ssync.Sender to access DiskFileManager instances via
its daemon's _df_router instance.
Also, remove the use of FakeReplicator from the ssync test suite. It
was not necessary and risked masking divergence between ssync and the
replicator and reconstructor daemon implementations.
Co-Author: Alistair Coles <alistair.coles@hpe.com>
Closes-Bug: #1488608
Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068
Until now, the do_listdir option was set on every 10th replication run.
Due to the randomness of the job listing this might update a given
partition much less often than expected, for example with 1000
partitions per replicator only every ~70th run.
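One way to make the listing predictable is to stagger it by partition number, so that each partition is listed exactly once in any ten consecutive cycles. This is a sketch of that idea with a hypothetical function name, not necessarily the exact rule the change uses.

```python
def do_listdir(partition, replication_cycle):
    """Return True when this partition is due for a full listdir."""
    return (partition + replication_cycle) % 10 == 0
```

With this rule the order in which jobs happen to be listed no longer matters, because the decision depends only on the partition number and the cycle counter.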
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Christian Schwede <cschwede@redhat.com>
Related-Bug: #1634967
Closes-Bug: 1644807
Change-Id: Ib5c9dd17e40150450ec57a728ae8652fbc730af6
Ignore `auditor_status_*.json` files while collecting jobs so that the
replicator doesn't treat these paths as partitions; doing so raised an
exception that increased the failure count in the replicator report.
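A sketch of such a filter using fnmatch; the helper name is hypothetical and the real check may be written differently.

```python
import fnmatch


def partition_dirs(entries):
    """Drop auditor status files from a directory listing."""
    return [entry for entry in entries
            if not fnmatch.fnmatchcase(entry, 'auditor_status_*.json')]
```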
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Change-Id: Ib15a0987288d9ee32432c1998aefe638ca3b223b
Closes-Bug: #1583305
The object replicator can log some junk about the cluster ip instead
of the replication ip in some specific error log lines that can make
you think either you're crazy or your rings are crazy.
... in this case it was just the logging was crazy - so fix that.
Change-Id: Ie5cbb2d1b30feb2529c17fc3d72af7df1aa3ffdd
Before this commit, when a local device was not found
during an object-replication run, the policy was not mentioned in the
error log. But the policy is of interest, for example when
searching for the cause of the error.
Change-Id: Icb9f9f1d4aec5c4a70dd8abdf5483d4816720418
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202;
so they do not conflict with X-Windows or other services.
Updated SAIO docs.
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
In situations where rsync may inadvertently be unable to clean up its
temporary files we shouldn't spread them around the cluster.
By asking our rsync subexec to --exclude patterns that match its own
convention for temporary naming we'll only ever transfer real replicated
artifacts and never temporary artifacts which should always be ignored
until they are fully transferred.
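A sketch of what such an exclude pattern could look like, assuming rsync names its temporary files ".<name>.<six random characters>". The constant and helpers are hypothetical, and fnmatch is used only to illustrate the match; rsync applies its own pattern rules.

```python
import fnmatch

# Assumed convention: a leading dot, the file name, a dot, six characters.
RSYNC_TEMP_PATTERN = '.*.' + '[0-9a-zA-Z]' * 6


def exclude_arg():
    """Build the command-line argument passed to rsync."""
    return '--exclude=' + RSYNC_TEMP_PATTERN


def looks_like_rsync_temp(name):
    """Return True if name matches the assumed temp-file convention."""
    return fnmatch.fnmatchcase(name, RSYNC_TEMP_PATTERN)
```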
Cleanup of stale rsync droppings should be performed by the auditor and
will be addressed in a separate change related to lp bug #1554005.
Closes-Bug: #1553995
Change-Id: Ibe598b339af024d05e4d89c34d696e972d8189ff
Based on experience using handoffs_first and feedback from other
operators it has become clear that handoffs_first is only used during
periods of problematic cluster behavior (e.g. full disks) when
replication attempts are failing to quickly drain off the partitions
from the nodes which they have been rebalanced from.
In order to focus on the most important work (getting handoff partitions
off the node) handoffs_first mode will abort the current replication
sweep before attempting any primary suffix syncing if any of the handoff
partitions were not removed for any reason - and start over with
replication of handoffs jobs as the highest priority.
Note that handoffs_first being enabled will emit a warning on start up,
even if no handoff jobs fail, because of the negative impact it can have
during normal operations by dog piling on a node that was temporarily
unavailable.
Change-Id: Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea
Example:
* Different port in config and in ring file.
* Running daemon on server not in ring file.
In both cases replication daemon is running but nothing is replicated.
The error log makes it possible to tell that a local device can't be identified.
Closes-Bug: 1508228
Change-Id: I99351b7d9946f250b7750df91c13d09352a145ce
assertEquals is deprecated in py3, so replace it with assertEqual.
Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
This patch fixes the exception (AttributeError: 'list' object has no
attribute 'intersection') raised when the replicator tries to sync data
from a handoff to the primary partition in more than one remote region.
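The error comes from calling a set method on a list. A minimal sketch of the fix pattern, with hypothetical names and data shapes: convert each region's candidates to a set before intersecting.

```python
def common_candidates(per_region_candidates):
    """Intersect the candidate lists reported by each remote region."""
    common = None
    for candidates in per_region_candidates:
        candidates = set(candidates)  # lists have no .intersection()
        if common is None:
            common = candidates
        else:
            common = common.intersection(candidates)
    return common or set()
```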
Change-Id: I565c45dda8c99d36e24dbf1145f2d2527d593ac0
Closes-Bug: 1503152