205 Commits

Tim Burke
69b8165cd8 obj: _finalize_durable may succeed even when data file is missing
Ordinarily, an ENOENT in _finalize_durable should mean something's gone
off the rails -- we expected to be able to mark data durable, but
couldn't!

If there are concurrent writers, though, it might actually be OK:

   Client A writes .data
   Client B writes .data
   Client B finalizes .data *and cleans up on-disk files*
   Client A tries to finalize but .data is gone

Previously, the above would cause the object server to 500, and if
enough of them did this, the client may see a 503. Now, call it good so
clients get 201s.
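A minimal sketch of the tolerated race, using hypothetical names (Swift's actual `_finalize_durable` lives on the EC diskfile writer):

```python
import errno
import os


def finalize_durable(data_file_path):
    # Hypothetical sketch: fsync the .data file to mark it durable,
    # but treat ENOENT as success, since a concurrent writer may have
    # already finalized the same object and cleaned up on-disk files.
    try:
        fd = os.open(data_file_path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    except OSError as err:
        if err.errno == errno.ENOENT:
            return  # another writer won the race; call it good (201)
        raise
```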

Change-Id: I4e322a7be23870a62aaa6acee8435598a056c544
Closes-Bug: #1719860
2020-04-24 15:39:37 -07:00
Tim Burke
668242c422 pep8: Turn on E305
Change-Id: Ia968ec7375ab346a2155769a46e74ce694a57fc2
2020-04-03 21:22:38 +02:00
Andreas Jaeger
96b56519bf Update hacking for Python3
The repo supports both Python 2 and 3 now, so update hacking to
version 2.0, which supports Python 2 and 3. Note that the latest
hacking release, 3.0, only supports Python 3.

Fix problems found.

Remove hacking and friends from lower-constraints, they are not needed
for installation.

Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
2020-04-03 21:21:07 +02:00
Tim Burke
55cdb55075 Lower the log level for falling back to mkstemp
We recently started trying to use O_TMPFILE without the kernel version
check; we gracefully degrade when it isn't available, and we don't give
operators any way to avoid the O_TMPFILE attempt -- so logging at
warning level seems way too noisy.

Change-Id: I8f42127fd6eb930282161930c65e8614f3e3175f
Related-Change: I3599e2ab257bcd99467aee83b747939afac639d8
2020-01-22 11:00:18 -08:00
Zuul
4316c53e4a Merge "Remove invalid dict entries from hashes.pkl" 2019-06-17 20:38:45 +00:00
Christian Schwede
c9e78d15e1 Remove invalid dict entries from hashes.pkl
If the data in a hashes.pkl is corrupted but still de-serializes without
errors, it will mess up replication and never gets fixed. This happens,
for example, if one of the keys is a NULL byte.

This patch checks that the dict keys in hashes.pkl are valid strings and
invalidates the pickle if they are not.
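The check can be sketched with a hypothetical helper (real suffix keys are three hex digits):

```python
def invalid_suffix_keys(hashes):
    # Hypothetical helper: return any dict keys that cannot be valid
    # suffix dir names, e.g. a NULL byte that survived de-serialization.
    valid_chars = set('0123456789abcdef')
    return [key for key in hashes
            if not (isinstance(key, str)
                    and len(key) == 3
                    and set(key) <= valid_chars)]
```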

Closes-Bug: 1830881
Change-Id: I84b062d062ff49935feed0aee3e1963bb72eb5ea
2019-05-31 11:07:42 +02:00
Tim Burke
40bd9f16a3 py3: port diskfile
Change-Id: Ie73f9df460f9014014168d6e60767e66b24d30b0
2019-05-22 13:25:54 -07:00
Tim Burke
8868e2cd50 Clean up test_write_read_metadata
I'm pretty sure I botched the intent of the test when I touched it in
https://github.com/openstack/swift/commit/36c4297

As a side-effect, clean up a warning on py2:

    UnicodeWarning: Unicode equal comparison failed to convert
    both arguments to Unicode - interpreting them as being unequal

Change-Id: I4da8264f85e688bc5109016c197378ca6dfe06a5
2019-05-22 13:25:28 -07:00
Alexandre Lécuyer
6008b0a83f Fix quarantine when object path is not a directory
In the diskfile module, there are a couple of cases where we would
quarantine the suffix dir if a single object within is not a directory.

Change the code so that we quarantine only the object.

Change-Id: I35eb16f568c00d973eb089aedd0e5a5619c25a2f
2019-05-21 15:10:56 -07:00
Clay Gerrard
585bf40cc0 Simplify empty suffix handling
We really only need one way to clean up empty suffix dirs, and that's
normally during suffix hashing, which only happens when invalid
suffixes get rehashed.

When we iterate a suffix tree using yield hashes, we may discover an
expired or otherwise reapable hashdir - when this happens we will now
simply invalidate the suffix so that the next rehash can clean it up.

This simplification removes a mis-behavior in the handling between the
normal suffix rehashing cleanup and what was implemented in ssync.

Change-Id: I5629de9f2e9b2331ed3f455d253efc69d030df72
Related-Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1816501
2019-03-18 15:09:54 -05:00
Zuul
32fee1dc13 Merge "Change how O_TMPFILE support is detected" 2019-02-01 00:58:39 +00:00
Thiago da Silva
0668731839 Change how O_TMPFILE support is detected
Previously, O_TMPFILE support was detected by checking the kernel
version, as the support was officially introduced in XFS in 3.15.
The problem is that RHEL has backported the support to at least
RHEL 7.6, but the kernel version is not updated.

This patch changes how O_TMPFILE support is detected: by actually
attempting to open a file with the O_TMPFILE flag. The result is
cached in DiskFileManager so that the check only happens once while
the process is running.
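The probe approach can be sketched as follows (illustrative, not Swift's exact code):

```python
import os


def o_tmpfile_supported(dirpath):
    # Probe by attempting the open rather than trusting the kernel
    # version, since e.g. RHEL 7.6 backports O_TMPFILE support
    # without bumping the version number.
    o_tmpfile = getattr(os, 'O_TMPFILE', 0)
    if not o_tmpfile:
        return False  # platform doesn't even define the flag
    try:
        fd = os.open(dirpath, os.O_WRONLY | o_tmpfile)
    except OSError:
        return False  # filesystem (or kernel) lacks support
    os.close(fd)
    return True
```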

Change-Id: I3599e2ab257bcd99467aee83b747939afac639d8
2019-01-31 18:35:39 +00:00
Zuul
b608c9cda7 Merge "Treat all invalid frag indexes the same" 2019-01-26 13:29:34 +00:00
Zuul
b9d2c08e8d Merge "Fix SSYNC concurrency on partition" 2018-12-11 23:34:54 +00:00
Tim Burke
43412c73de Treat all invalid frag indexes the same
Note that we also want to prevent any IndexErrors that might arise from
pickled data already on disk.

Change-Id: If56190134edddb4b778e41a2cdd360008fbfb0e4
Closes-Bug: 1772378
2018-12-10 10:48:12 -08:00
Romain LE DISEZ
014d46f9a7 Fix SSYNC concurrency on partition
Commit e199192caefef068b5bf57da8b878e0bc82e3453 introduced the ability
to run multiple SSYNC requests on a single device. It missed a safeguard
ensuring that only one SSYNC request can run on a given partition.

This commit updates replication_lock to lock the device N times, then
lock once the partition related to a SSYNC request.

Change-Id: Id053ed7dd355d414d7920dda79a968a1c6677c14
2018-12-04 14:47:26 +01:00
chenxiangui
662fd818cf Fix typo
Fix the typo 'a' to 'an' in test_diskfile.py

Change-Id: I26332766e76b4f0b728e0e7a41fa75afae2517a8
2018-11-09 20:28:51 +08:00
Tim Burke
3420921a33 Clean up HASH_PATH_* patching
Previously, we'd sometimes shove strings into HASH_PATH_PREFIX or
HASH_PATH_SUFFIX, which would blow up on py3. Now, always use bytes.

Change-Id: Icab9981e8920da505c2395eb040f8261f2da6d2e
2018-11-01 20:52:33 +00:00
Alexandre Lécuyer
d306345ddd Remove empty directories after a revert job
Currently, the reconstructor will not remove empty object and suffix
directories after processing a revert job. This will only happen during
its next run.

This patch will attempt to remove these empty directories immediately,
while we have the inodes cached.

Change-Id: I5dfc145b919b70ab7dae34fb124c8a25ba77222f
2018-10-26 09:29:14 +02:00
Zuul
5cc4a72c76 Merge "Configure diskfile per storage policy" 2018-09-27 00:19:32 +00:00
Clay Gerrard
52ecbf9539 Add a chunks_finished to BaseDiskFileWriter
BaseDiskFileWriter will track md5 and expose upload_size, etag via the
chunks_finished method.

The BaseDiskFileReader already tracks the md5/etag via _iter_etag, for
parity we add a _chunks_etag to the BaseDiskFileReader.

Instead of returning the upload_size and hexdigest every call to write,
we return the tuple from chunks_finished.

Change-Id: I26c58719cff5fde941d0248c250a0204e0379ae5
2018-09-13 12:28:57 -05:00
Tim Burke
ce257b3d15 DiskFile(Writer) refactor cleanups
Change-Id: I5b0bcc6028dbe6248e0e09baf2cbb72deb011c80
2018-09-12 19:09:12 +00:00
Clay Gerrard
33c7650753 Add writer method to DiskFile
DiskFile already exposes a reader method that creates the DiskFileReader
instance. Add a writer method for parity.

DiskFile currently only provides a context manager create - that will
open and close the DiskFileWriter.  Add explicit open and close methods
to support more flexibility in how callers manage life-cycle on their
DiskFileWriter instances.

Diskfile confusingly manages some state for DiskFileWriter (e.g. fd,
tmppath, use_linkat).  Encapsulate the DiskFileWriter state to improve
readability and reduce coupling (e.g. put_succeeded).

Change-Id: If18e0041680470a9c57a08e9ea9327acba8593df
2018-09-12 10:37:48 -05:00
Romain LE DISEZ
673fda7620 Configure diskfile per storage policy
With this commit, each storage policy can define the diskfile to use to
access objects. Selection of the diskfile is done in swift.conf.

Example:
    [storage-policy:0]
    name = gold
    policy_type = replication
    default = yes
    diskfile = egg:swift#replication.fs

The diskfile configuration item accepts the same format as middleware
declarations: [[scheme:]egg_name#]entry_point
The egg_name is optional and defaults to "swift". The scheme is optional
and defaults to the only valid value, "egg". The upstream entry points
are "replication.fs" and "erasure_coding.fs".
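A sketch of parsing that format (hypothetical parser, shown for illustration only):

```python
def parse_storage_policy_diskfile(spec):
    # Parse [[scheme:]egg_name#]entry_point with the defaults the
    # commit message describes: scheme "egg", egg_name "swift".
    scheme, egg_name, entry_point = 'egg', 'swift', spec
    if '#' in spec:
        egg_name, entry_point = spec.rsplit('#', 1)
        if ':' in egg_name:
            scheme, egg_name = egg_name.split(':', 1)
    if scheme != 'egg':
        raise ValueError('unsupported scheme: %s' % scheme)
    return scheme, egg_name, entry_point
```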

Co-Authored-By: Alexandre Lécuyer <alexandre.lecuyer@corp.ovh.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I070c21bc1eaf1c71ac0652cec9e813cadcc14851
2018-08-24 02:29:13 +00:00
Zuul
3de21d945b Merge "Remove empty part dirs during ssync replication" 2018-06-23 02:19:18 +00:00
Tim Burke
f192f51d37 Have check_drive raise ValueError on errors
...which helps us differentiate between a drive that's not mounted vs.
not a dir better in log messages. We were already doing that a bit in
diskfile.py, and it seems like a useful distinction; let's do it more.

While we're at it, remove some log translations.

Related-Change: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Related-Change: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: Ife0d34f9482adb4524d1ab1fe6c335c6b287c2fd
Partial-Bug: 1674543
2018-06-20 17:15:07 -07:00
Zuul
ba5c13e9dd Merge "Add assertion to test_consolidate_hashes_raises_exception" 2018-06-05 18:03:48 +00:00
Samuel Merritt
a19548b3e6 Remove empty part dirs during ssync replication
When we're pushing data to a remote node using ssync, we end up
walking the entire partition's directory tree. We were already
removing reclaimable (i.e. old) tombstones and non-durable EC data
files plus their containing hash dirs, but we were leaving the suffix
dirs around for future removal, and we weren't cleaning up partition
dirs at all. Now we remove as much of the directory structure as we
can, even up to the partition dir, as soon as we observe that it's
empty.
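The cleanup can be sketched as walking up from the reclaimed hash dir and removing each newly empty parent (hypothetical helper, not Swift's actual code):

```python
import errno
import os


def remove_empty_parents(hashdir, stop_at):
    # Remove hashdir, then its suffix and partition parents, as long
    # as each is empty; stop at the device dir (stop_at) or at the
    # first directory that still has contents.
    path = hashdir
    while path != stop_at:
        try:
            os.rmdir(path)
        except OSError as err:
            if err.errno in (errno.ENOTEMPTY, errno.EEXIST):
                break  # still in use; leave it for a later pass
            raise
        path = os.path.dirname(path)
```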

Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1706321
2018-05-01 17:18:22 -07:00
Alexandre Lécuyer
8d7e245eb5 Change object_audit_location_generator() to yield for a single policy.
auditor.py currently relies on POLICY[0] object_audit_location_generator()
to yield an AuditLocation for all policies on the object-server.

The changes in this patch are :
  - object_audit_location_generator() yields AuditLocation only for the
  requested policy
  - audit_all_objects() calls object_audit_location_generator() for each
    policy

Closes-Bug: 1534212
Change-Id: Ida92ba0a5e1e486a4f7132c6539460b38c34ec87
2018-03-22 03:02:35 +00:00
Tim Burke
36c42974d6 py3: Port more CLI tools
Bring under test

 - test/unit/cli/test_dispersion_report.py
 - test/unit/cli/test_info.py and
 - test/unit/cli/test_relinker.py

I've verified that swift-*-info (at least) behave reasonably under
py3, even swift-object-info when there's non-utf8 metadata on the
data/meta file.

Change-Id: Ifed4b8059337c395e56f5e9f8d939c34fe4ff8dd
2018-02-28 21:10:01 +00:00
Tim Burke
396380f340 Better handle missing files in _construct_from_data_file
There's a window between when we list the files on disk and when we actually
try to open the .data file where another process could delete it. That should
raise a DiskFileNotExist error, not an IOError.

Change-Id: I1b43fef35949cb6f71997874e4e6b7646eeec8fe
Closes-Bug: 1712662
2017-11-13 16:04:49 -08:00
Zuul
27398c3573 Merge "No longer import nose" 2017-11-08 23:43:07 +00:00
Zuul
18d3ba4e6e Merge "Cleanup lock_path and replication_concurrency_per_device tests" 2017-11-07 19:33:01 +00:00
Steve Kowalik
5a06e3da3b No longer import nose
Since Python 2.7, unittest in the standard library has included multiple
facilities for skipping tests, via decorators as well as an exception.
Switch to those directly, rather than importing nose.

Change-Id: I4009033473ea24f0d0faed3670db844f40051f30
2017-11-07 15:39:25 +11:00
Samuel Merritt
728b4ba140 Add checksum to object extended attributes
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.

This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.

Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
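The core idea, as a sketch with illustrative names (Swift's real implementation stores pickled metadata in xattrs):

```python
import hashlib
import pickle


def serialize_metadata(metadata):
    # Append a checksum so bit-flipped xattr bytes are detected
    # instead of being cheerfully unpickled into garbage.
    blob = pickle.dumps(metadata, protocol=2)
    return blob + b'|' + hashlib.md5(blob).hexdigest().encode('ascii')


def deserialize_metadata(stored):
    # The hex checksum contains no b'|', so the last one is the
    # separator even if the pickle payload happens to contain it.
    blob, checksum = stored.rsplit(b'|', 1)
    if hashlib.md5(blob).hexdigest().encode('ascii') != checksum:
        raise ValueError('metadata checksum mismatch; quarantine')
    return pickle.loads(blob)
```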

As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.

The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?

Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.

Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (eg tmpfs or more than 4k of xattrs on ext4).

References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.

THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS

With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.

So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.

`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.

The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.

Co-Authored-By: John Dickinson <me@not.mn>

Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
2017-11-03 13:30:05 -04:00
Clay Gerrard
feee399840 Use check_drive consistently
We added check_drive to the account/container servers to unify how all
the storage wsgi servers treat device dirs/mounts.  This pushes that
unification down into the consistency engine.

Drive-by:
 * use FakeLogger less
 * clean up some repetition in the probe utility for device re-"mounting"

Related-Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
2017-11-01 16:33:40 +00:00
Alistair Coles
e0244dd30d Cleanup lock_path and replication_concurrency_per_device tests
use assertRaises where applicable; use the with_tempdir
decorator

Related-Change: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
Change-Id: Ie83584fc22a5ec6e0a568e39c90caf577da29497
2017-10-24 16:17:48 +01:00
Romain LE DISEZ
e199192cae Replace replication_one_per_device by custom count
This commit replaces boolean replication_one_per_device by an integer
replication_concurrency_per_device. The new configuration parameter is
passed to utils.lock_path(), which now accepts as an argument a limit
for the number of locks that can be acquired for a specific path.

Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
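The per-path lock pool can be sketched like this (simplified; the real utils.lock_path is a context manager with a timeout):

```python
import fcntl
import os


def lock_path(path, limit=1):
    # Try slots .lock-0 .. .lock-(limit-1); return an fd holding the
    # first free slot, or raise if every slot is already locked.
    for slot in range(limit):
        fd = os.open(os.path.join(path, '.lock-%d' % slot),
                     os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError:
            os.close(fd)  # slot busy; try the next one
    raise OSError('all %d lock slots busy for %s' % (limit, path))
```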

Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
2017-10-24 16:17:41 +01:00
Clay Gerrard
36a843be73 Preserve X-Static-Large-Object from .data file after POST
You can't modify the X-Static-Large-Object metadata with a POST, an
object being a SLO is a property of the .data file.  Revert the change
from 4500ff which attempts to correctly handle X-Static-Large-Object
metadata on a POST, but is subject to a race if the most recent SLO
.data isn't available during the POST.  Instead this change adjusts the
reading of metadata such that the X-Static-Large-Object metadata is
always preserved from the metadata on the datafile and bleeds through
a .meta if any.

Closes-bug: #1453807
Closes-bug: #1634723

Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: Ie48a38442559229a2993443ab0a04dc84717ca59
2017-09-26 15:14:57 -04:00
Pavel Kvasnička
163fb4d52a Always require device dir for containers
For test purposes (e.g. saio probetests) even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.

This behavior is consistent with the object-server's checks in diskfile.

Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
2017-09-01 10:32:12 -07:00
Jenkins
83b62b4f39 Merge "Add Timestamp.now() helper" 2017-07-18 03:27:50 +00:00
Jenkins
e94b383655 Merge "Add support to increase object ring partition power" 2017-07-05 14:40:42 +00:00
Christian Schwede
e1140666d6 Add support to increase object ring partition power
This patch adds methods to increase the partition power of an existing
object ring without downtime for the users using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.

1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then create a
hard-link in the new location to the original DiskFile object. Already
existing data will be relinked into the new locations, again via
hardlinks, by a new relinker tool.
2. The actual partition power itself will be increased. Servers will now
use the new partition power for reads and writes. Hard links in the old
object locations that are no longer required must now be removed by the
relinker tool; it reads the next_part_power setting to find object
locations that need to be cleaned up.

3. The "next_part_power" flag will be removed.

This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree.  This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.

Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should be simply run and afterwards
the partition power will be increased. This shouldn't take that much
time (it's only walking the filesystem and hardlinking); impact should
be low therefore. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can be even aborted manually and restarted later.

Auditors do not quarantine objects written to a path with a different
partition power and therefore keep working as before (though, in the
worst case, they read each object twice before the no longer needed
hard links are removed).

Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>

[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html

Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
2017-06-15 15:08:48 -07:00
lingyongxu
ee9458a250 Using assertIsNone() instead of assertEqual(None)
Following OpenStack Style Guidelines:
[1] http://docs.openstack.org/developer/hacking/#unit-tests-and-assertraises
[H203] Unit test assertions tend to give better messages for more specific
assertions. As a result, assertIsNone(...) is preferred over
assertEqual(None, ...) and assertIs(..., None)

Change-Id: If4db8872c4f5705c1fff017c4891626e9ce4d1e4
2017-06-07 14:05:53 +08:00
Jenkins
45b17d89c7 Merge "Fix SSYNC failing to replicate unexpired object" 2017-05-31 22:49:55 +00:00
Tim Burke
85d6cd30be Add Timestamp.now() helper
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
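A minimal sketch of the helper (Swift's real Timestamp class does much more, e.g. normalization and offsets):

```python
import time


class Timestamp(object):
    def __init__(self, timestamp):
        self.timestamp = float(timestamp)

    @classmethod
    def now(cls):
        # Ergonomic shortcut for the common Timestamp(time.time())
        return cls(time.time())
```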

Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
2017-04-27 14:19:00 -07:00
Romain LE DISEZ
38d35797df Fix SSYNC failing to replicate unexpired object
Fix a situation where SSYNC would fail to replicate a valid object
because the datafile contains expired X-Delete-At information while a
metafile contains no X-Delete-At information. Example:
 - 1454619054.02968.data => contains X-Delete-At: 1454619654
 - 1454619056.04876.meta => does not contain X-Delete-At info

In this situation, the replicator tries to PUT the datafile, and then
to POST the metadata. Previously, if the receiver has the datafile but
current time is greater than the X-Delete-At, then it considers it to
be expired and requests no updates from the sender, so the metafile is
never synced. If the receiver does not have the datafile then it does
request updates from the sender, but the ssync PUT subrequest is
refused if the current time is greater than the X-Delete-At (the
object is expired). If the datafile is transferred, the ssync POST
subrequest fails because the object does not exist (expired).

This commit allows PUT and POST to work so that the object can be
replicated, by enabling the receiver object server to open expired
diskfiles when handling replication requests.
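The behavior change can be sketched as an open_expired flag that replication requests set (illustrative names; ValueError stands in here for Swift's DiskFileExpired):

```python
import time


def check_x_delete_at(metadata, open_expired=False):
    # Ordinary client requests treat a past X-Delete-At as expired;
    # replication (ssync) requests pass open_expired=True so the
    # datafile and metafile can still be synced.
    delete_at = metadata.get('X-Delete-At')
    if delete_at is not None and not open_expired:
        if int(delete_at) <= time.time():
            raise ValueError('object expired')  # DiskFileExpired stand-in
    return metadata
```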

Closes-Bug: #1683689
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I919994ead2b20dbb6c5671c208823e8b7f513715
2017-04-26 11:29:40 +01:00
Jenkins
cce5482bd8 Merge "Fix encoding issue in ssync_sender.send_put()" 2017-04-19 19:37:38 +00:00
Jenkins
88bca22549 Merge "Follow up tests for get_hashes regression" 2017-04-19 17:32:54 +00:00
Romain LE DISEZ
091157fc7f Fix encoding issue in ssync_sender.send_put()
EC object metadata can currently have a mixture of bytestrings and
unicode.  The ssync_sender.send_put() method raises a
UnicodeDecodeError when it attempts to concatenate the metadata
values, if any bytestring has non-ascii characters.

The root cause of this issue is that the object server uses unicode
for the keys of some object metadata items that are received in the
footer of an EC PUT request, whereas all other object metadata keys
and values are persisted as bytestrings.

This patch fixes the bug by changing diskfile write_metadata()
function to encode all unicode metadata keys and values as utf8
encoded bytes before writing to disk. To cope with existing objects
that have a mixture of unicode and bytestring metadata, the diskfile
read_metadata() function is also changed so that all returned unicode
metadata keys and values are utf8 encoded. This ensures that
ssync_sender.send_put() (and any other caller of diskfile
read_metadata) only reads bytestrings from object metadata.
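The normalization described above can be sketched as follows (py2-era semantics; `str` here plays the role of py2 `unicode`):

```python
def encode_metadata(metadata):
    # Encode any unicode keys/values to UTF-8 bytes so everything
    # persisted to (or read back from) disk is uniformly bytestrings.
    def utf8(value):
        return value.encode('utf8') if isinstance(value, str) else value
    return dict((utf8(k), utf8(v)) for k, v in metadata.items())
```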

Closes-Bug: #1678018
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Ic23c55754ee142f6f5388dcda592a3afc9845c39
2017-04-19 18:05:52 +01:00