146 Commits

Zuul
a495f1e327 Merge "pep8: Turn on E305" 2020-04-10 11:55:07 +00:00
Tim Burke
668242c422 pep8: Turn on E305
Change-Id: Ia968ec7375ab346a2155769a46e74ce694a57fc2
2020-04-03 21:22:38 +02:00
Romain LE DISEZ
804776b379 Optimize obj replicator/reconstructor healthchecks
The DaemonStrategy class calls the Daemon.is_healthy() method every 0.1
seconds to ensure that all workers are running as intended.

On the object replicator/reconstructor daemons, is_healthy() checks whether
the rings have changed to decide if workers must be created or killed. With
large rings, this operation can be CPU intensive, especially on low-end CPUs.

This patch:
- increases the check interval to 5 seconds by default, because none of
  these daemons is performance critical (they are not in the data path),
  while still allowing each daemon to override this value if necessary
- ensures that, before recomputing all devices in the ring, the object
  replicator/reconstructor checks that the ring really changed (by
  checking the mtime of the ring.gz files)

On an Atom N2800 processor, this patch reduced the CPU usage of the main
object replicator/reconstructor from 70% of a core to 0%.

Change-Id: I2867e2be539f325778e2f044a151fd0773a7c390
2020-04-01 08:03:32 -04:00
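
A minimal sketch of that cheap mtime-based check, with illustrative names
(RingChangeDetector is not Swift's actual class; the real check lives in the
replicator/reconstructor daemons):

    import os

    class RingChangeDetector(object):
        """Report a ring change only when a ring.gz file's mtime moves."""

        def __init__(self, ring_paths):
            # e.g. ['/etc/swift/object.ring.gz', '/etc/swift/object-1.ring.gz']
            self.ring_paths = ring_paths
            self._mtimes = dict((p, os.path.getmtime(p)) for p in ring_paths)

        def rings_changed(self):
            changed = False
            for path in self.ring_paths:
                mtime = os.path.getmtime(path)
                if mtime != self._mtimes[path]:
                    self._mtimes[path] = mtime
                    changed = True
            return changed

Only when rings_changed() returns True does it make sense to recompute the
per-device work; otherwise the (now 5-second) health check stays essentially
free.
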
Sean McGinnis
5b26b749b5
Drop use of unittest2
unittest2 was only needed for Python versions <= 2.6, so it hasn't been
needed for quite some time. See the unittest2 note at:

https://docs.python.org/2.7/library/unittest.html

This drops unittest2 in favor of the standard unittest module.

Change-Id: I2e787cfbf1709b7f9c889230a10c03689e032957
Signed-off-by: Sean McGinnis <sean.mcginnis@gmail.com>
2020-01-12 03:13:41 -06:00
Clay Gerrard
286082222d Use less responses from handoffs
Since we no longer use 404s from handoffs, we need to not let errors on
handoffs overwhelm primary responses either.

Change-Id: I2624e113c9d945542f787e5f18f487bd7be3d32e
Closes-Bug: #1857909
2020-01-02 16:44:05 -08:00
Tim Burke
d270596b67 Consistently use io.BytesIO
Change-Id: Ic41b37ac75b5596a8307c4962be86f2a4b0d9731
2019-10-15 15:09:46 +02:00
Thomas Goirand
12a7b42062 Fix test_parse_get_node_args
Looks like xattr_supported_check was missing ERANGE

Change-Id: I82263e48e836f38f77d81593c8435f64a4728b5d
2019-07-19 01:32:25 +02:00
Clay Gerrard
563e1671cf Return 503 when primary containers can't respond
Closes-Bug: #1833612

Change-Id: I53ed04b5de20c261ddd79c98c629580472e09961
2019-06-25 12:23:12 -05:00
Tim Burke
e8e7106d14 py3: port obj/reconstructor tests
All of the swift changes we needed for this were already done elsewhere.

Change-Id: Ib2c26fdf7bd36ed1cccd5dbd1fa208f912f4d8d5
2019-06-10 08:31:41 -07:00
Tim Burke
2e35376c6d py3: symlink follow-up
- Have the unit tests use WSGI strings, like a real system.
- Port the func tests.

Change-Id: I3a6f409208de45ebf9f55f7f59e4fe6ac6fbe163
2019-05-30 16:25:17 -07:00
Tim Burke
b8284538be py3: start porting for unit/proxy/test_server.py
Mostly this amounts to

    Exception.message -> Exception.args[0]
    '...' -> b'...'
    StringIO -> BytesIO
    makefile() -> makefile('rwb')
    iter.next() -> next(iter)
    bytes[n] -> bytes[n:n + 1]
    integer division

Note that the versioning tests are mostly untouched; they seemed to get
a little hairy.

Change-Id: I167b5375e7ed39d4abecf0653f84834ea7dac635
2019-05-04 20:35:05 -07:00
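
Two of the listed pitfalls as tiny before/after illustrations (not taken
from the patch itself):

    data = b'swift'

    # py2: data[0] == 's'; py3: data[0] == 115 (an int), so slice instead:
    first_byte = data[0:1]    # b's' on both py2 and py3

    # py2: 5 / 2 == 2; py3: 5 / 2 == 2.5, so be explicit about integer division:
    half = len(data) // 2
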
Pete Zaitcev
575538b55b py3: port the container
This started with ShardRanges and its CLI. The sharder is at the
bottom of the dependency chain. Even the container backend needs it.
Once we started tinkering with the sharder, it all snowballed to
include the rest of the container services.

Beware, this does affect some of the Python 2 code. Mostly it's trivial
and obviously correct, but it needs checking by reviewers.

About killing the stray "from __future__ import unicode_literals":
we do not do it in general. The specific problem it caused was
a failure of functional tests because unicode leaked into a field
that was supposed to be encoded. It is just too hard to track the
types when rules change from file to file, so off with its head.

Change-Id: Iba4e65d0e46d8c1f5a91feb96c2c07f99ca7c666
2019-02-20 21:30:46 -06:00
Zuul
64e5fd364a Merge "Stop using duplicate dev IDs in write_fake_ring" 2019-02-09 07:08:21 +00:00
Clay Gerrard
ea8e545a27 Rebuild frags for unmounted disks
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.

Each primary node in an EC ring will sync with exactly three primary
peers; in addition to the left and right nodes we now select a third node
from the far side of the ring.  If any of these partners respond
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.

To prevent ssync (which is uninterruptible) from receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive.  In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.

Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until it is remounted or
rebalanced.  After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.

Closes-Bug: #1510342

Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
2019-02-08 18:04:55 +00:00
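
A loose sketch of that decision, with hypothetical names (this is not the
reconstructor's actual job-building code):

    def choose_sync_target(primary_node, handoff_nodes, replicate_status,
                           frag_index):
        if replicate_status == 507:
            # The primary's disk is unmounted: rebuild to a handoff, chosen
            # deterministically per fragment index so several unmounted
            # primaries don't pile rebuilt fragments onto one handoff, and
            # tell the handoff which backend index it will receive so ssync
            # doesn't answer 409 (Conflict).
            node = dict(handoff_nodes[frag_index % len(handoff_nodes)])
            node['backend_index'] = frag_index
            return node
        return primary_node
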
Tim Burke
8a6159f67b Stop using duplicate dev IDs in write_fake_ring
This would cause some weird issues where get_more_nodes() would actually
yield out something, despite us only having two drives.

Change-Id: Ibf658d69fce075c76c0870a542348f220376c87a
2019-02-08 09:36:35 -08:00
Zuul
80b58553e8 Merge "py3: port object controller in proxy" 2019-02-06 21:09:07 +00:00
Zuul
e7b13da497 Merge "Fix socket leak on object-server death" 2019-02-06 19:28:34 +00:00
Pete Zaitcev
988e719232 py3: port object controller in proxy
When looking at the tuple thing, remember that tuples are comparable
with ints in py2, but they do not compare like (100)[0]. Instead, they
are always greater, acting almost like MAX_INT. No wonder py3 banned
that practice.

We had a bunch of other noncomparables, but we dealt with them in
obvious ways, like adding key= to sorted().

Change-Id: I52e96406c3c1f39b98c1d81bdc057805cd1a6278
2019-02-06 00:26:39 -06:00
Zuul
a25b7f9c91 Merge "py3 object-server follow-ups" 2019-02-04 23:42:41 +00:00
Tim Burke
2bd7b7a109 py3 object-server follow-ups
Change-Id: Ief7d85af8d3e1d5e03a6484a889c9146d69f1377
Related-Change: I203a54fddddbd4352be0e6ea476a628e3f747dc1
2019-02-04 09:36:16 -08:00
Samuel Merritt
0e81ffd1e1 Fix socket leak on object-server death
Consider a client that's downloading a large replicated object of size
N bytes. If the object server process dies (e.g. with a segfault)
partway through the download, the proxy will have read fewer than N
bytes, and then read(sockfd) will start returning 0 bytes. At this
point, the proxy believes the object download is complete, and so the
WSGI server waits for a new request to come in. Meanwhile, the client
is waiting for the rest of their bytes. Until the client times out,
that socket will be held open.

The fix is to look at the Content-Length and Content-Range headers in
the response from the object server, then retry with another object
server in case the original GET is truncated. This way, the client
gets all the bytes they should.

Note that ResumingGetter already had retry logic for when an
object-server is slow to send bytes -- this extends it to also cover
unexpected disconnects.

Change-Id: Iab1e07706193ddc86832fd2cff0d7c2cb6d79ad9
Related-Change: I74d8c13eba2a4917b5a116875b51a781b33a7abf
Closes-Bug: 1568650
2019-01-31 18:38:35 +00:00
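
A simplified sketch of the resumption idea (hypothetical helper names; the
real logic extends ResumingGetter in the proxy): compare the bytes actually
received against Content-Length and, on a short read, fetch the remainder
from another object server with a Range request.

    def iter_object_body(get_from_next_server, path, max_retries=3):
        """Yield object bytes, resuming from another server on a short read."""
        resp = get_from_next_server(path)
        expected = int(resp.headers['Content-Length'])
        sent = 0
        retries = 0
        while sent < expected:
            chunk = resp.read(65536)
            if chunk:
                sent += len(chunk)
                yield chunk
            elif retries < max_retries:
                # The backend died mid-stream: pick up where we left off.
                retries += 1
                resp = get_from_next_server(
                    path, headers={'Range': 'bytes=%d-' % sent})
            else:
                raise Exception('GET truncated at %d of %d bytes'
                                % (sent, expected))
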
Thiago da Silva
0668731839 Change how O_TMPFILE support is detected
Previously, O_TMPFILE support was detected by checking the kernel version,
as the feature was officially introduced in XFS in 3.15. The problem is
that RHEL has backported the support to at least RHEL 7.6, but the kernel
version does not reflect that.

This patch changes detection to actually attempt opening a file with the
O_TMPFILE flag, and keeps the result cached in DiskFileManager so that the
check only happens once while the process is running.

Change-Id: I3599e2ab257bcd99467aee83b747939afac639d8
2019-01-31 18:35:39 +00:00
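
A hedged sketch of probe-based detection; the module-level cache below
stands in for the DiskFileManager caching the message describes:

    import os

    _o_tmpfile_supported = None

    def o_tmpfile_supported(dirpath):
        global _o_tmpfile_supported
        if _o_tmpfile_supported is None:
            try:
                # os.O_TMPFILE only exists on Linux with Python >= 3.4;
                # old kernels or filesystems fail the open() instead.
                fd = os.open(dirpath, os.O_TMPFILE | os.O_WRONLY)
                os.close(fd)
                _o_tmpfile_supported = True
            except (OSError, AttributeError):
                _o_tmpfile_supported = False
        return _o_tmpfile_supported
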
Pete Zaitcev
5b5ed29ab4 py3: object server
This does not do anything about replicator or other daemons,
only ports the server.

- wsgi_to_bytes(something.get('X-Foo')) assumes that None is possible
- Dunno if del-in-for-in-dict calls for clone or list(), using list()
- Fixed the zero-copy with EventletPlungerString(bytes)
- Yet another reminder that Request.blank() takes a WSGI string path

Curiously enough, the sleep(0) before checking for logging was
already present in the tests. The py3 scheduling merely forces us
to do it consistently throughout.

Change-Id: I203a54fddddbd4352be0e6ea476a628e3f747dc1
2019-01-11 09:18:08 -06:00
Pete Zaitcev
0d29b01d2b py3: Port the acl, account_quotas, cname_lookup, container_sync
These are trivial, but need to be done sooner or later.

Regarding isEnabledFor: our FakeLogger causes this on py3:
  AttributeError: 'FakeLogger' object has no attribute '_cache'
Adding isEnabledFor sidesteps the need for that private member.

Change-Id: I4d1df857a24801fe2a396dc003719f54d099f72c
2018-12-27 18:55:47 +00:00
John Dickinson
c26d67efcf fixed _check_node() in the container sharder
Previously, _check_node() wouldn't catch the ValueError raised when
a drive was unmounted. The error would therefore bubble up, uncaught,
and stop the shard cycle. The practical effect is that an unmounted
drive on a node would prevent sharding from happening.

This patch updates _check_node() to properly use the check_drive()
method. Furthermore, the _check_node() return value has been modified
to be more similar to what check_drive() actually returns. This
should help prevent similar errors from being introduced in the future.

Closes-Bug: #1806500

Change-Id: I3da9b5b120a5980e77ef5c4dc8fa1697e462ce0d
2018-12-04 16:16:04 -08:00
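
Roughly the shape the method takes, assuming (per the message) that Swift's
check_drive() helper raises ValueError for unmounted drives; this is a
sketch of a method body, not the sharder's actual code:

    from swift.common.constraints import check_drive

    def _check_node(self, node):
        try:
            # check_drive() returns the usable device path, and raises
            # ValueError when the drive is missing or unmounted.
            return check_drive(self.root, node['device'], self.mount_check)
        except ValueError:
            self.logger.warning('Skipping %s as it is not mounted',
                                node['device'])
            return None
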
Tim Burke
37b814657e py3: port encryption
This got away from me a bit with the functional tests masquerading as
unit tests.

Change-Id: I1237c02eff96e53fff8f9661a2d85c4695b73371
2018-11-20 01:30:04 -06:00
Alistair Coles
9a7b46e1e3 swift-ring-builder shows hint about composite builder file
If swift-ring-builder is erroneously given a composite builder
file, which it will fail to load, it will now print a hint
that the file is a composite builder file.

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: If4517f3b61977a7f6ca3e08ed5deb182aa87a366
2018-07-05 15:57:05 +01:00
Matthew Oliver
2641814010 Add sharder daemon, manage_shard_ranges tool and probe tests
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.

The workflow is, in overview:

- perform an audit of the container for sharding purposes.

- move any misplaced objects that do not belong in the container
  to their correct shard.

- move shard ranges from FOUND state to CREATED state by creating
  shard containers.

- move shard ranges from CREATED to CLEAVED state by cleaving objects
  to shard dbs and replicating those dbs. By default this is done in
  batches of 2 shard ranges per visit.

Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.

The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>

Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
2018-05-18 18:48:13 +01:00
Alistair Coles
9d742b85ad Refactoring, test infrastructure changes and cleanup
...in preparation for the container sharding feature.

Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>

Change-Id: I4455677abb114a645cff93cd41b394d227e805de
2018-05-15 18:18:25 +01:00
Thomas Goirand
22b9a4a943 Fix tests using O_TMPFILE
Unit tests using O_TMPFILE relied only on the kernel version to check
for the feature. This is wrong, as some filesystems, like tmpfs, don't
support O_TMPFILE.

So, instead of checking the kernel version, this patch actually attempts
to open a file using O_TMPFILE and sees whether that's supported. If not,
the test is skipped.

Change-Id: I5d652f1634b1ef940838573cfdd799ea17b8b572
2018-03-13 12:06:07 +00:00
Tim Burke
36c42974d6 py3: Port more CLI tools
Bring under test

 - test/unit/cli/test_dispersion_report.py
 - test/unit/cli/test_info.py and
 - test/unit/cli/test_relinker.py

I've verified that swift-*-info (at least) behave reasonably under
py3, even swift-object-info when there's non-utf8 metadata on the
data/meta file.

Change-Id: Ifed4b8059337c395e56f5e9f8d939c34fe4ff8dd
2018-02-28 21:10:01 +00:00
Tim Burke
642f79965a py3: port common/ring/ and common/utils.py
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.

While we're at it (and since they were so close anyway), port

* cli/ringbuilder.py and
* common/linkat.py
* common/daemon.py

Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
2018-02-12 06:42:24 +00:00
Samuel Merritt
728b4ba140 Add checksum to object extended attributes
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.

This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.

Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.

As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.

The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?

Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.

Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (eg tmpfs or more than 4k of xattrs on ext4).

References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.

THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS

With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.

So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.

`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.

The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.

Co-Authored-By: John Dickinson <me@not.mn>

Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
2017-11-03 13:30:05 -04:00
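
An illustrative sketch of the idea, not Swift's diskfile code (the xattr
key names and the use of the xattr package are assumptions): store a
checksum alongside the pickled metadata so later reads can detect
corruption that still unpickles cleanly.

    import hashlib
    import pickle
    import xattr   # the third-party package Swift uses for xattr access

    METADATA_KEY = 'user.swift.metadata'            # illustrative key names
    CHECKSUM_KEY = 'user.swift.metadata_checksum'

    def write_metadata(path, metadata):
        blob = pickle.dumps(metadata, protocol=2)
        xattr.setxattr(path, METADATA_KEY, blob)
        xattr.setxattr(path, CHECKSUM_KEY,
                       hashlib.md5(blob).hexdigest().encode('ascii'))

    def read_metadata(path):
        blob = xattr.getxattr(path, METADATA_KEY)
        expected = xattr.getxattr(path, CHECKSUM_KEY)
        if hashlib.md5(blob).hexdigest().encode('ascii') != expected:
            # A flipped bit anywhere in the pickled metadata is now
            # detectable, so the file can be quarantined instead of served.
            raise IOError('metadata checksum mismatch for %s' % path)
        return pickle.loads(blob)
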
Samuel Merritt
e6cf9b4758 Fix some spelling in a docstring
Change-Id: I6b32238da3381848ae56aed1c570d299be72473e
2017-10-17 15:16:43 -07:00
Pavel Kvasnička
163fb4d52a Always require device dir for containers
For test purposes (e.g. saio probetests) even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.

This behavior is consistent with the object-server's checks in diskfile.

Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
2017-09-01 10:32:12 -07:00
Tim Burke
6f55a5ea94 Always check for unexpected requests in mocked_http_conn
Change-Id: Ie0ea9971129b6d090a0fcb7c5be33acb6c9b512d
Related-Change: Ia72c407247e4525ef071a1728750850807ae8231
2017-08-11 09:54:39 -07:00
Clay Gerrard
701a172afa Add multiple worker processes strategy to reconstructor
This change adds a new Strategy concept to the daemon module similar to
how we manage WSGI workers.  We need to leverage multiple python
processes to get the concurrency properties we need.  More workers will
rebalance much faster on dense chassis with many devices.

Currently the default is still only one process, and no workers.  Set
reconstructor_workers in the [object-reconstructor] section to some
whole number <= the number of devices on a node to get that many
reconstructor workers.

Each worker will operate on a different subset of disks.

Run-once mode works as before, but tends to update the recon drops a
little bit more.

If you change the rings, the strategy will shutdown workers and spawn
new ones.

You can kill the worker pids and the daemon strategy will respawn them.

New per-disk reconstructor stats are dumped to recon under the
object_reconstruction_per_disk key.  To maintain legacy compatibility
and replication monitoring based on cycle times they are aggregated
every stats_interval (default 5 mins).

Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe
2017-07-26 16:55:10 -07:00
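
For example, a node with a dozen devices might run a handful of workers,
each taking a subset of its disks (the value below is illustrative):

    [object-reconstructor]
    # a whole number <= the number of devices on the node; leaving it
    # unset keeps the old single-process, no-worker behaviour
    reconstructor_workers = 4
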
Jenkins
83b62b4f39 Merge "Add Timestamp.now() helper" 2017-07-18 03:27:50 +00:00
Christian Schwede
e1140666d6 Add support to increase object ring partition power
This patch adds methods to increase the partition power of an existing
object ring without downtime for the users using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.

1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then create a
hard link in the new location to the original DiskFile object. Already
existing data will be relinked into the new locations, again using hard
links, by a new relinker tool.

2. The actual partition power itself will be increased. Servers will now
use the new partition power for both reads and writes. Hard links in the
old object locations that are no longer required are then removed by the
relinker tool; it reads the next_part_power setting to find object
locations that need to be cleaned up.

3. The "next_part_power" flag will be removed.

This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree.  This would require
heavy changes to the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.

Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should simply be run and afterwards
the partition power will be increased. This shouldn't take much time
(it's only walking the filesystem and hard-linking), so the impact
should be low. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can even be aborted manually and restarted later.

Auditors do not quarantine objects written to a path with a different
partition power and therefore work as before (though in the worst case
they read each object twice before the no-longer-needed hard links are
removed).

Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>

[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html

Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
2017-06-15 15:08:48 -07:00
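
A much-simplified illustration of why relinking is cheap (hypothetical
helper; the real relinker handles policies, suffix invalidation and
cleanup): the object's hash does not change, only the partition derived
from it, so the new location can be populated with hard links instead of
copies.

    import os

    def relink_object(objects_dir, hash_path, part_power, next_part_power):
        # The partition comes from the top bits of the object's md5 hash,
        # so raising the partition power by one splits each partition in two.
        top32 = int(hash_path[:8], 16)
        old_part = top32 >> (32 - part_power)
        new_part = top32 >> (32 - next_part_power)
        old_dir = os.path.join(objects_dir, str(old_part),
                               hash_path[-3:], hash_path)
        new_dir = os.path.join(objects_dir, str(new_part),
                               hash_path[-3:], hash_path)
        os.makedirs(new_dir, exist_ok=True)
        for filename in os.listdir(old_dir):
            # Hard link, not copy: no data movement, same inode in both trees.
            os.link(os.path.join(old_dir, filename),
                    os.path.join(new_dir, filename))
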
Jenkins
96de9ad126 Merge "Clean up how PatchPolicies works" 2017-05-25 09:20:52 +00:00
Tim Burke
1b991803e8 Clean up how PatchPolicies works
We've got these lovely __enter__ and __exit__ methods; let's use them!

Note that this also changes how we patch classes' setUp methods so we
don't set self._orig_POLICIES when the class is already patched.  I
hope this may fix some sporadic failures that include tracebacks
that look like

  proxy ERROR: ERROR 500 Traceback (most recent call last):
    File ".../swift/obj/server.py", line 1105, in __call__
      res = getattr(self, req.method)(req)
    File ".../swift/common/utils.py", line 1626, in _timing_stats
      resp = func(ctrl, *args, **kwargs)
    File ".../swift/obj/server.py", line 880, in GET
      policy=policy, frag_prefs=frag_prefs)
    File ".../swift/obj/server.py", line 211, in get_diskfile
      return self._diskfile_router[policy].get_diskfile(
    File ".../swift/obj/diskfile.py", line 555, in __getitem__
      return self.policy_to_manager[policy]
  KeyError: ECStoragePolicy(...)

... and try to unpatch more gracefully with TestCase.addCleanup

Change-Id: Iaa3d42ec21758b0707155878a645e665aa36696c
2017-05-19 17:59:36 -07:00
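
A generic sketch of the pattern being leaned on here: a patcher exposing
__enter__/__exit__ that a TestCase can also unwind via addCleanup()
(hypothetical PatchValue and Config classes, not Swift's PatchPolicies):

    import unittest

    class PatchValue(object):
        """Temporarily replace an attribute; restore it on exit."""

        def __init__(self, obj, name, value):
            self.obj, self.name, self.value = obj, name, value

        def __enter__(self):
            self.orig = getattr(self.obj, self.name)
            setattr(self.obj, self.name, self.value)
            return self

        def __exit__(self, *exc_info):
            setattr(self.obj, self.name, self.orig)

    class Config(object):
        policy = 'replicated'

    class ExampleTest(unittest.TestCase):
        def setUp(self):
            patcher = PatchValue(Config, 'policy', 'erasure_coding')
            patcher.__enter__()
            # Unpatch even if later setUp steps or the test itself blow up.
            self.addCleanup(patcher.__exit__, None, None, None)

        def test_sees_patched_value(self):
            self.assertEqual(Config.policy, 'erasure_coding')
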
Jenkins
d7a6d6e1e9 Merge "Do not sync suffixes when remote rejects reconstructor revert" 2017-05-01 20:38:07 +00:00
Tim Burke
85d6cd30be Add Timestamp.now() helper
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.

Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
2017-04-27 14:19:00 -07:00
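
Usage is simply (Timestamp here is swift.common.utils.Timestamp):

    import time
    from swift.common.utils import Timestamp

    old_style = Timestamp(time.time())   # what callers had to write before
    new_style = Timestamp.now()          # the new class-method helper
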
Tim Burke
20072570d9 Fix sporadic failure in test/unit/obj/test_server.py
In particular, in TestObjectController.test_object_delete_at_async_update

Rarely (<0.1% of the time?), it would fail with:

======================================================================
FAIL: test_object_delete_at_async_update
(test.unit.obj.test_server.TestObjectController)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/vagrant/swift/test/unit/obj/test_server.py", line 4826, in
test_object_delete_at_async_update
    resp = req.get_response(self.object_controller)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/vagrant/swift/test/unit/__init__.py", line 1075, in
mocked_http_conn
    raise AssertionError('left over status %r' % left_over_status)
AssertionError: left over status [500, 500]
-------------------- >> begin captured stdout << ---------------------
test INFO: None - - [26/Apr/2017:22:32:13 +0000] "PUT /sda1/p/a/c/o" 400
19 "-" "-" "-" 0.0003 "-" 23801 0

--------------------- >> end captured stdout << ----------------------
>>  raise AssertionError('left over status %r' % [500, 500])

----------------------------------------------------------------------

Related-Bug: 1514111
Change-Id: I1af4a291fb67cf4b1829f167998a08644117a800
2017-04-26 15:51:16 -07:00
Clay Gerrard
a0fcca1e05 Do not sync suffixes when remote rejects reconstructor revert
SSYNC is designed to limit concurrent incoming connections in order to
prevent IO contention.  The reconstructor should expect remote
replication servers to fail ssync_sender when the remote is too busy.
When the remote rejects SSYNC, the reconstructor should avoid forcing
additional IO against the remote with a REPLICATE request, which causes
suffix rehashing.

Suffix rehashing via REPLICATE verbs takes two forms:

1) an initial pre-flight call to REPLICATE /dev/part will cause a remote
primary to rehash any invalid suffixes and return a map for the local
sender to compare so that a sync can be performed on any mis-matched
suffixes.

2) a final call to REPLICATE /dev/part/suf1-suf2-suf3[-sufX[...]] will
cause the remote primary to rehash the *given* suffixes even if they are
*not* invalid.  This is a requirement for rsync replication because
after a suffix is synced via rsync the contents of a suffix dir will
likely have changed and the remote server needs to update its hashes.pkl
to reflect the new data.

SSYNC does not *need* to send a post-sync REPLICATE request.  Any
suffixes that are modified by the SSYNC protocol will call _finalize_put
under the hood as it is syncing.  It is however not harmful and
potentially useful to go ahead and refresh hashes after an SSYNC while the
inodes of those suffixes are warm in the cache.

However, that only makes sense if the SSYNC conversation actually synced
any suffixes - if SSYNC is rejected for concurrency before it ever got
started there is no value in the remote performing a rehash.  It may be
that *another* reconstructor is pushing data into that same partition
and the suffixes will become immediately invalidated.

If ssync_sender does not successfully finish a sync, the reconstructor
should skip the REPLICATE call entirely and move on to the next
partition without causing any useless remote IO.

Closes-Bug: #1665141

Change-Id: Ia72c407247e4525ef071a1728750850807ae8231
2017-04-06 17:37:34 +01:00
Alistair Coles
e4972f5ac7 Fixups for EC frag duplication tests
Follow up for related change:
- fix typos
- use common helper methods
- refactor some tests to reduce duplicate code

Related-Change: Idd155401982a2c48110c30b480966a863f6bd305

Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429
2017-02-25 20:40:04 -08:00
Kota Tsuyuzaki
40ba7f6172 EC Fragment Duplication - Foundational Global EC Cluster Support
This patch enables efficient PUT/GET for a globally distributed cluster[1].

Problem:
Erasure coding can decrease the amount of data actually stored compared
to the replicated model. For example, ec_k=6, ec_m=3 parameters result in
1.5x the original data, which is smaller than 3x for replication.
However, unlike replication, erasure coding requires availability of at
least ec_k fragments of the total ec_k + ec_m fragments to service a
read (e.g. 6 of 9 in the case above). As such, if we stored an
EC object in a swift cluster spanning 2 geographically distributed data
centers which have the same volume of disks, it is likely the fragments
will be stored evenly (about 4 and 5) so we still need to access a
faraway data center to decode the original object. In addition, if one
of the data centers were lost in a disaster, the stored objects would be
lost forever, and we have to cry a lot. To ensure highly durable
storage, you would think of making *more* parity fragments (e.g.
ec_k=6, ec_m=10); unfortunately this causes *significant* performance
degradation due to the cost of the mathematical calculations for erasure
coding encode/decode.

How this resolves the problem:
EC Fragment Duplication extends on the initial solution to add *more*
fragments from which to rebuild an object similar to the solution
described above. The difference is making *copies* of encoded fragments.
With experimental results[1][2], employing small ec_k and ec_m shows
enough performance to store/retrieve objects.

On PUT:

- Encode the incoming object with small ec_k and ec_m  <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
  is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in the Swift Global EC Cluster

The duplicated fragments increase pressure on existing requirements
when decoding objects in service to a read request.  All fragments are
stored with their X-Object-Sysmeta-Ec-Frag-Index.  In this change, the
X-Object-Sysmeta-Ec-Frag-Index represents the actual fragment index
encoded by PyECLib, so there *will* be duplicates.  Anytime we must
decode the original object data, we may only consider ec_k fragments
that are unique according to their X-Object-Sysmeta-Ec-Frag-Index.  On
decode, no duplicate X-Object-Sysmeta-Ec-Frag-Index may be used;
duplicates should be expected and avoided if possible.

On GET:

This patch includes the following changes:
- Change the GET path to sort primary nodes into groups of subsets, so
  that each subset includes unique fragments
- Change the reconstructor to be more aware of possibly duplicate
  fragments

For example, with this change, a policy could be configured such that

swift.conf:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)

At Object-Server:
node index (from object ring):  0 1 2 3 4 5 <- keep node index for
                                               reconstruct decision
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2 <- each object keeps actual
                                               fragment index for
                                               backend (PyEClib)

Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.

1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)

Doc-Impact

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
2017-02-22 10:56:13 -08:00
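
The node-index to fragment-index mapping in that example is simple modular
arithmetic; a tiny illustration using the values from the swift.conf
snippet above:

    ec_n_unique_fragments = 2 + 1      # data + parity fragments
    ec_duplication_factor = 2          # object ring has 6 replicas

    for node_index in range(ec_n_unique_fragments * ec_duplication_factor):
        frag_index = node_index % ec_n_unique_fragments
        print(node_index, '->', frag_index)   # 0->0, 1->1, 2->2, 3->0, 4->1, 5->2
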
Clay Gerrard
f4adb2f28f Fix ZeroDivisionError in reconstructor.stats_line
Despite a check to prevent zero values in the denominator, Python
integer division could still result in a ZeroDivisionError in the
compute_eta helper function.  Make sure we always have a non-zero value,
even if it is small.

NotImplemented:

 * stats calculation is still not great, see lp bug #1488608

Closes-Bug: #1549110
Change-Id: I54f2081c92c2a0b8f02c31e82f44f4250043d837
2016-11-07 18:19:20 -08:00
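
The shape of the guard, sketched with a hypothetical helper (not the
actual compute_eta in swift.common.utils):

    import time

    def compute_eta(start_time, current_count, total_count):
        # Clamp the elapsed time so the rate (and later division by it)
        # can never see a zero denominator, even right after a cycle starts.
        elapsed = max(time.time() - start_time, 0.000001)
        rate = current_count / elapsed
        if not rate:
            return None        # nothing processed yet; no meaningful ETA
        return (total_count - current_count) / rate
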
Alistair Coles
2a75091c58 Make ECDiskFileReader check fragment metadata
This patch makes the ECDiskFileReader check the validity of EC
fragment metadata as it reads chunks from disk and quarantine a
diskfile with bad metadata. This in turn means that both the object
auditor and a proxy GET request will cause bad EC fragments to be
quarantined.

This change is motivated by bug 1631144, which may result in corrupt EC
fragments being written to disk that nevertheless appear valid to the
object auditor's md5 hash and content-length checks.

NotImplemented:

 * perform metadata check when a read starts on any frag_size
   boundary, not just at zero

Related-Bug: #1631144
Closes-Bug: #1633647

Change-Id: Ifa6a7f8aaca94c7d39f4aeb9d4fa3f59c4f6ee13
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
2016-11-01 13:11:02 -07:00
Prashanth Pai
773edb4a5d Make object creation more atomic in Linux
Linux 3.11 introduced O_TMPFILE as a flag to the open() syscall. This
enables callers to get an fd to an unnamed temporary file. As it's
unnamed, it does not require the caller to devise unique names. It is
also not accessible through any path. Hence, file creation is race-free.

This file is initially unreachable. It is then populated with data
(write), metadata (fsetxattr) and fsync'd before being atomically linked
into the filesystem in a fully formed state using the linkat() syscall.
Only after a successful linkat() will the object file be available for
reference.

Caveats
* Unlike os.rename(), linkat() cannot overwrite the destination path if it
  already exists. If the path exists, we unlink and try again.
* XFS support for O_TMPFILE was only added in Linux 3.15.
* If client disconnects during object upload, although there is no
  incomplete/stale file on disk, the object directory would persist
  and is not cleaned up immediately.

Change-Id: I8402439fab3aba5d7af449b5e465f89332f606ec
Signed-off-by: Prashanth Pai <ppai@redhat.com>
2016-08-24 14:56:00 +05:30
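
A rough sketch of the open/write/fsync/linkat flow, assuming a kernel with
O_TMPFILE support and a mounted /proc (Swift's real code uses its own
linkat() wrapper and the DiskFile machinery; names below are illustrative):

    import ctypes
    import ctypes.util
    import os

    AT_FDCWD = -100                    # Linux values
    AT_SYMLINK_FOLLOW = 0x400
    libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

    def write_then_link(datadir, filename, body):
        # Open an unnamed, path-less temp file right in the target directory.
        fd = os.open(datadir, os.O_TMPFILE | os.O_WRONLY, 0o644)
        try:
            os.write(fd, body)
            os.fsync(fd)
            # Atomically give the fully formed file a name.  Unlike rename(),
            # this fails if the destination already exists (unlink and retry).
            ret = libc.linkat(AT_FDCWD, ('/proc/self/fd/%d' % fd).encode(),
                              AT_FDCWD,
                              os.path.join(datadir, filename).encode(),
                              AT_SYMLINK_FOLLOW)
            if ret != 0:
                raise OSError(ctypes.get_errno(), 'linkat failed')
        finally:
            os.close(fd)
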