The DaemonStrategy class calls the Daemon.is_healthy() method every 0.1
seconds to ensure that all workers are running as intended.
On the object replicator/reconstructor daemons, is_healthy() checks whether
the rings have changed to decide if workers must be created/killed. With
large rings, this operation can be CPU intensive, especially on low-end CPUs.
This patch:
- increases the check interval to 5 seconds by default, because none of
these daemons is performance-critical (they are not in the data path),
while still allowing each daemon to override this value if necessary
- ensures that, before recomputing all devices in the ring, the object
replicator/reconstructor checks that the ring really changed (by comparing
the mtime of the ring.gz files)
On an Atom N2800 processor, this patch reduced the CPU usage of the main
object replicator/reconstructor from 70% of a core to 0%.
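A minimal sketch of the mtime-based check from the second bullet above
(hypothetical class; the daemons keep equivalent state internally):

    import os

    class RingChangeDetector(object):
        """Track ring.gz mtimes and report whether any ring file changed."""

        def __init__(self, ring_paths):
            self.mtimes = {path: self._mtime(path) for path in ring_paths}

        @staticmethod
        def _mtime(path):
            try:
                return os.path.getmtime(path)
            except OSError:
                return None

        def rings_changed(self):
            changed = False
            for path, old_mtime in self.mtimes.items():
                new_mtime = self._mtime(path)
                if new_mtime != old_mtime:
                    self.mtimes[path] = new_mtime
                    changed = True
            return changed

Only when rings_changed() returns True does the daemon need to do the
expensive per-device computation.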
Change-Id: I2867e2be539f325778e2f044a151fd0773a7c390
unittest2 was needed only for Python versions <= 2.6, so it hasn't been
needed for quite some time. See the unittest2 note at:
https://docs.python.org/2.7/library/unittest.html
This drops unittest2 in favor of the standard unittest module.
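The mechanical change is just dropping the conditional import; an
illustrative before/after (not a specific Swift file):

    # Before (only needed on Python <= 2.6):
    #     import unittest2 as unittest
    # After: the stdlib module has everything we use on 2.7+
    import unittest

    class TestExample(unittest.TestCase):
        def test_truth(self):
            self.assertTrue(True)

    if __name__ == '__main__':
        unittest.main()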
Change-Id: I2e787cfbf1709b7f9c889230a10c03689e032957
Signed-off-by: Sean McGinnis <sean.mcginnis@gmail.com>
Since we no longer use 404s from handoffs, we also need to keep errors on
handoffs from overwhelming primary responses.
Change-Id: I2624e113c9d945542f787e5f18f487bd7be3d32e
Closes-Bug: #1857909
Mostly this amounts to:
Exception.message -> Exception.args[0]
'...' -> b'...'
StringIO -> BytesIO
makefile() -> makefile('rwb')
iter.next() -> next(iter)
bytes[n] -> bytes[n:n + 1]
integer division
Note that the versioning tests are mostly untouched; they seemed to get
a little hairy.
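For reference, the patterns listed above look roughly like this under py3
(illustrative snippets, not code lifted from this change):

    import io

    # Exception.message -> Exception.args[0]
    try:
        raise ValueError('boom')
    except ValueError as err:
        msg = err.args[0]

    # '...' -> b'...' and StringIO -> BytesIO for wire data
    buf = io.BytesIO(b'payload')

    # iter.next() -> next(iter)
    it = iter([1, 2, 3])
    first = next(it)

    # bytes[n] -> bytes[n:n + 1]: indexing bytes yields an int on py3
    data = b'abc'
    first_byte = data[0:1]          # b'a', not 97

    # integer division: use // where an int result is required
    half = 7 // 2                   # 3, not 3.5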
Change-Id: I167b5375e7ed39d4abecf0653f84834ea7dac635
This started with ShardRanges and its CLI. The sharder is at the
bottom of the dependency chain. Even the container backend needs it.
Once we started tinkering with the sharder, it all snowballed to
include the rest of the container services.
Beware, this does affect some of the Python 2 code. Mostly it's trivial
and obviously correct, but it needs checking by reviewers.
About killing the stray "from __future__ import unicode_literals":
we do not use it in general. The specific problem it caused was
a failure of functional tests because unicode leaked into a field
that was supposed to be encoded. It is just too hard to track the
types when rules change from file to file, so off with its head.
Change-Id: Iba4e65d0e46d8c1f5a91feb96c2c07f99ca7c666
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.
Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left & right nodes we now select a third node
from the far side of the ring. If any of these partners responds
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.
To prevent ssync (which is uninterruptible) from receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive. In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.
Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until it is remounted or
rebalanced. After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
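A rough sketch of the partner selection described above, assuming
primary_nodes is the ordered list of primaries for a partition
(hypothetical helper; the real logic lives in the reconstructor):

    def sync_partners(primary_nodes, node_index):
        """Left and right neighbours plus one node from the far side of
        the ring; if any of them reports itself unmounted (507), the
        fragment for that index is rebuilt to a handoff instead."""
        length = len(primary_nodes)
        left = primary_nodes[(node_index - 1) % length]
        right = primary_nodes[(node_index + 1) % length]
        far = primary_nodes[(node_index + length // 2) % length]
        return [left, right, far]

    # e.g. with 8 primaries, node 0 syncs with nodes 7, 1 and 4
    nodes = [{'index': i} for i in range(8)]
    print([n['index'] for n in sync_partners(nodes, 0)])    # [7, 1, 4]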
Closes-Bug: #1510342
Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
This would cause some weird issues where get_more_nodes() would actually
yield something, despite us having only two drives.
Change-Id: Ibf658d69fce075c76c0870a542348f220376c87a
When looking at the tuple thing, remember that tuples are comparable
with ints in py2, but they do not compare like (100)[0]. Instead, they
are always greater, acting almost like MAX_INT. No wonder py3 banned
that practice.
We had a bunch of other noncomparables, but we dealt with them in
obvious ways, like adding key= to sorted().
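A small illustration of the py2/py3 difference and the kind of fix applied
(not code from this change):

    # py2: any tuple compares greater than any int, so sorted() "works"
    # but the ordering is surprising.  py3 raises TypeError for such
    # comparisons, so we supply an explicit key instead:
    mixed = [3, (1, 'a'), 2]
    ordered = sorted(mixed,
                     key=lambda item: item[0] if isinstance(item, tuple)
                     else item)
    # ordered == [(1, 'a'), 2, 3]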
Change-Id: I52e96406c3c1f39b98c1d81bdc057805cd1a6278
Consider a client that's downloading a large replicated object of size
N bytes. If the object server process dies (e.g. with a segfault)
partway through the download, the proxy will have read fewer than N
bytes, and then read(sockfd) will start returning 0 bytes. At this
point, the proxy believes the object download is complete, and so the
WSGI server waits for a new request to come in. Meanwhile, the client
is waiting for the rest of their bytes. Until the client times out,
that socket will be held open.
The fix is to look at the Content-Length and Content-Range headers in
the response from the object server, then retry with another object
server in case the original GET is truncated. This way, the client
gets all the bytes they should.
Note that ResumingGetter already had retry logic for when an
object-server is slow to send bytes -- this extends it to also cover
unexpected disconnects.
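A minimal sketch of the completeness check (hypothetical helper; the real
resuming logic also has to account for ranged GETs via Content-Range):

    def response_complete(headers, bytes_read):
        """Return True when the backend sent every byte it promised."""
        content_length = headers.get('Content-Length')
        if content_length is None:
            return True           # nothing to compare against
        return bytes_read >= int(content_length)

    # A dead object server that stops sending after 1024 of 4096 bytes:
    assert not response_complete({'Content-Length': '4096'}, 1024)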
Change-Id: Iab1e07706193ddc86832fd2cff0d7c2cb6d79ad9
Related-Change: I74d8c13eba2a4917b5a116875b51a781b33a7abf
Closes-Bug: 1568650
Previously, O_TMPFILE support was detected by checking the kernel
version, as the support was officially introduced in XFS in 3.15.
The problem is that RHEL has backported the support to at least
RHEL 7.6, but the kernel version was not updated.
This patch changes the detection so that O_TMPFILE support is determined
by actually attempting to open a file with the O_TMPFILE flag, and keeps
the result cached in DiskFileManager so that the check only happens once
while the process is running.
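A reduced sketch of the probe-and-cache idea, using a hypothetical class
(the real check lives in DiskFileManager):

    import errno
    import os

    class TmpFileProber(object):
        def __init__(self, device_dir):
            self.device_dir = device_dir
            self._supported = None          # cached after the first probe

        def o_tmpfile_supported(self):
            if self._supported is None:
                self._supported = self._probe()
            return self._supported

        def _probe(self):
            flag = getattr(os, 'O_TMPFILE', None)
            if flag is None:
                return False                # too-old Python or non-Linux
            try:
                fd = os.open(self.device_dir, os.O_WRONLY | flag)
            except OSError as err:
                if err.errno in (errno.EINVAL, errno.EISDIR,
                                 errno.EOPNOTSUPP):
                    return False            # kernel or filesystem says no
                raise
            os.close(fd)
            return True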
Change-Id: I3599e2ab257bcd99467aee83b747939afac639d8
This does not do anything about replicator or other daemons,
only ports the server.
- wsgi_to_bytes(something.get('X-Foo')) assumes that None is possible
  (see the sketch below)
- Dunno if del-in-for-in-dict calls for clone or list(), using list()
- Fixed the zero-copy with EventletPlungerString(bytes)
- Yet another reminder that Request.blank() takes a WSGI string path
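The first point above amounts to something like this (simplified stand-in,
not the real helper):

    def wsgi_to_bytes(wsgi_str):
        # WSGI "native" strings carry latin-1 code points on py3; tolerate
        # None so wsgi_to_bytes(headers.get('X-Foo')) is safe when the
        # header is absent.
        if wsgi_str is None:
            return None
        return wsgi_str.encode('latin-1')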
Curiously enough, the sleep(0) before checking for logging was
already present in the tests. The py3 scheduling merely forces us
to do it consistently throughout.
Change-Id: I203a54fddddbd4352be0e6ea476a628e3f747dc1
These are trivial, but need to be done sooner or later.
About the isEnabledFor, our FakeLogger causes this on py3:
AttributeError: 'FakeLogger' object has no attribute '_cache'
Adding isEnabledFor short-circuits the need for that private member.
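A minimal sketch of the short-cut (the real FakeLogger does far more than
this):

    import logging

    class FakeLogger(logging.Logger):
        def __init__(self):
            logging.Logger.__init__(self, 'fake')

        def isEnabledFor(self, level):
            # Bypass the level/_cache machinery entirely; in tests we
            # always want records emitted so they can be asserted on.
            return True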
Change-Id: I4d1df857a24801fe2a396dc003719f54d099f72c
Previously, _check_node() wouldn't catch the ValueError raised when
a drive was unmounted. The error would therefore bubble up, uncaught,
and stop the shard cycle. The practical effect was that an unmounted
drive on a node would prevent sharding from happening.
This patch updates _check_node() to properly use the check_drive()
method. Furthermore, the _check_node() return value has been modified
to be more similar to what check_drive() actually returns. This
should help prevent similar errors from being introduced in the future.
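A sketch of the fixed shape of the code, with stand-ins for check_drive()
and the logger (names follow the description; the real signatures differ):

    import os

    def check_drive(root, device, mount_check):
        """Stand-in: return the device path when usable, raise ValueError
        otherwise."""
        path = os.path.join(root, device)
        if mount_check and not os.path.ismount(path):
            raise ValueError('%s is not mounted' % device)
        if not os.path.isdir(path):
            raise ValueError('%s is not a directory' % device)
        return path

    def _check_node(root, device, mount_check, logger):
        """Catch the ValueError so one unmounted drive is skipped instead
        of aborting the whole shard cycle."""
        try:
            return check_drive(root, device, mount_check)
        except ValueError as err:
            logger.warning('Skipping %s: %s', device, err)
            return None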
Closes-Bug: #1806500
Change-Id: I3da9b5b120a5980e77ef5c4dc8fa1697e462ce0d
If swift-ring-builder is erroneously given a composite builder
file, which it will fail to load, it will now print a hint
that the file is a composite builder file.
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: If4517f3b61977a7f6ca3e08ed5deb182aa87a366
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
...in preparation for the container sharding feature.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I4455677abb114a645cff93cd41b394d227e805de
Unit tests using O_TMPFILE rely only on the kernel version to check
for the feature. This is wrong, as some filesystems, like tmpfs, don't
support O_TMPFILE.
So, instead of checking the kernel version, this patch actually attempts
to open a file using O_TMPFILE to see whether it's supported. If not,
the test is skipped.
Change-Id: I5d652f1634b1ef940838573cfdd799ea17b8b572
Bring under test
- test/unit/cli/test_dispersion_report.py
- test/unit/cli/test_info.py and
- test/unit/cli/test_relinker.py
I've verified that swift-*-info (at least) behave reasonably under
py3, even swift-object-info when there's non-utf8 metadata on the
data/meta file.
Change-Id: Ifed4b8059337c395e56f5e9f8d939c34fe4ff8dd
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port
* cli/ringbuilder.py and
* common/linkat.py
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
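A sketch of what such a guard can look like (the real helper lives in
test.unit and works somewhat differently; this version probes the temp
dir directly using py3's os.setxattr):

    import os
    import tempfile
    import unittest

    def skip_if_no_xattrs():
        """Raise SkipTest unless the filesystem behind TMPDIR accepts a
        large extended attribute."""
        fd, path = tempfile.mkstemp()
        try:
            os.setxattr(fd, 'user.swift.test', b'x' * 4096)
        except OSError:
            raise unittest.SkipTest(
                'xattrs not supported in %s' % tempfile.gettempdir())
        finally:
            os.close(fd)
            os.unlink(path)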
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
For test purposes (e.g. saio probetests), even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.
This behavior is consistent with the object-server's checks in diskfile.
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
This change adds a new Strategy concept to the daemon module similar to
how we manage WSGI workers. We need to leverage multiple python
processes to get the concurrency properties we need. More workers will
rebalance much faster on dense chassis with many devices.
Currently the default is still only one process, and no workers. Set
reconstructor_workers in the [object-reconstructor] section to some
whole number <= the number of devices on a node to get that many
reconstructor workers.
Each worker will operate on a different subset of disks.
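A toy illustration of carving devices into per-worker subsets (not the
actual assignment logic):

    def assign_devices(devices, reconstructor_workers):
        """Round-robin the node's devices across the configured workers."""
        buckets = [[] for _ in range(reconstructor_workers)]
        for i, dev in enumerate(devices):
            buckets[i % reconstructor_workers].append(dev)
        return buckets

    # 5 devices, 2 workers -> [['d0', 'd2', 'd4'], ['d1', 'd3']]
    print(assign_devices(['d0', 'd1', 'd2', 'd3', 'd4'], 2))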
Run-once mode works as before, but tends to update its recon drops a
little bit more.
If you change the rings, the strategy will shutdown workers and spawn
new ones.
You can kill the worker pids and the daemon strategy will respawn them.
New per-disk reconstructor stats are dumped to recon under the
object_reconstruction_per_disk key. To maintain legacy compatibility
and replication monitoring based on cycle times they are aggregated
every stats_interval (default 5 mins).
Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe
This patch adds methods to increase the partition power of an existing
object ring without downtime for users, using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then also
create a hard link in the new location to the original DiskFile object.
Already existing data will be relinked into the new locations by a new
tool, using hard links.
2. The actual partition power itself will be increased. Servers will now
use the new partition power for reads and writes. Hard links in the old
object locations that are no longer required are then removed by the
relinker tool; the relinker reads the next_part_power setting to find
object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead, the object-replicator is now aware that a partition power
increase is in progress and will skip replication of data in that
storage policy; the relinker tool simply has to be run, and afterwards
the partition power is increased. This shouldn't take much time (it's
only walking the filesystem and hardlinking), so the impact should be
low. The relinker should be run on all storage nodes at the same time,
in parallel, to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can even be aborted manually and restarted later.
Auditors do not quarantine objects written to a path with a different
partition power and therefore work as before (though, in the worst case,
they read each object twice before the no-longer-needed hard links are
removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
We've got these lovely __enter__ and __exit__ methods; let's use them!
Note that this also changes how we patch classes' setUp methods so we
don't set self._orig_POLICIES when the class is already patched. I
hope this may fix some sporadic failures that include tracebacks
that look like
proxy ERROR: ERROR 500 Traceback (most recent call last):
File ".../swift/obj/server.py", line 1105, in __call__
res = getattr(self, req.method)(req)
File ".../swift/common/utils.py", line 1626, in _timing_stats
resp = func(ctrl, *args, **kwargs)
File ".../swift/obj/server.py", line 880, in GET
policy=policy, frag_prefs=frag_prefs)
File ".../swift/obj/server.py", line 211, in get_diskfile
return self._diskfile_router[policy].get_diskfile(
File ".../swift/obj/diskfile.py", line 555, in __getitem__
return self.policy_to_manager[policy]
KeyError: ECStoragePolicy(...)
... and try to unpatch more gracefully with TestCase.addCleanup
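The generic shape of the addCleanup pattern referred to above
(illustrative, not the patch_policies code itself):

    import unittest
    from unittest import mock

    class PatchedTestCase(unittest.TestCase):
        def setUp(self):
            patcher = mock.patch('time.time', return_value=1234.0)
            patcher.start()
            # Registered cleanups run even if setUp or the test body fails
            # part-way, so the patch can never leak into other tests.
            self.addCleanup(patcher.stop)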
Change-Id: Iaa3d42ec21758b0707155878a645e665aa36696c
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
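A reduced sketch of the idea (the real Timestamp in swift.common.utils
has many more features):

    import time

    class Timestamp(object):
        def __init__(self, timestamp):
            self.timestamp = float(timestamp)

        @classmethod
        def now(cls):
            return cls(time.time())

    ts = Timestamp.now()    # instead of Timestamp(time.time()) everywhere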
Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
In particular, TestObjectController.test_object_delete_at_async_update
would rarely (<0.1% of the time?) fail with:
======================================================================
FAIL: test_object_delete_at_async_update
(test.unit.obj.test_server.TestObjectController)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/vagrant/swift/test/unit/obj/test_server.py", line 4826, in
test_object_delete_at_async_update
resp = req.get_response(self.object_controller)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/vagrant/swift/test/unit/__init__.py", line 1075, in
mocked_http_conn
raise AssertionError('left over status %r' % left_over_status)
AssertionError: left over status [500, 500]
-------------------- >> begin captured stdout << ---------------------
test INFO: None - - [26/Apr/2017:22:32:13 +0000] "PUT /sda1/p/a/c/o" 400
19 "-" "-" "-" 0.0003 "-" 23801 0
--------------------- >> end captured stdout << ----------------------
>> raise AssertionError('left over status %r' % [500, 500])
----------------------------------------------------------------------
Related-Bug: 1514111
Change-Id: I1af4a291fb67cf4b1829f167998a08644117a800
SSYNC is designed to limit concurrent incoming connections in order to
prevent IO contention. The reconstructor should expect remote
replication servers to fail ssync_sender when the remote is too busy.
When the remote rejects SSYNC, the reconstructor should avoid forcing
additional IO against the remote with a REPLICATE request, which causes
suffix rehashing.
Suffix rehashing via REPLICATE verbs takes two forms:
1) an initial pre-flight call to REPLICATE /dev/part will cause a remote
primary to rehash any invalid suffixes and return a map for the local
sender to compare so that a sync can be performed on any mis-matched
suffixes.
2) a final call to REPLICATE /dev/part/suf1-suf2-suf3[-sufX[...]] will
cause the remote primary to rehash the *given* suffixes even if they are
*not* invalid. This is a requirement for rsync replication because
after a suffix is synced via rsync the contents of a suffix dir will
likely have changed and the remote server needs to update its hashes.pkl
to reflect the new data.
SSYNC does not *need* to send a post-sync REPLICATE request. Any
suffixes that are modified by the SSYNC protocol will call _finalize_put
under the hood as it syncs. It is, however, not harmful and potentially
useful to go ahead and refresh hashes after an SSYNC while the inodes of
those suffixes are warm in the cache.
However, that only makes sense if the SSYNC conversation actually synced
any suffixes - if SSYNC is rejected for concurrency before it ever got
started there is no value in the remote performing a rehash. It may be
that *another* reconstructor is pushing data into that same partition
and the suffixes will become immediately invalidated.
If an ssync_sender does not successfully finish a sync, the reconstructor
should skip the REPLICATE call entirely and move on to the next
partition without causing any useless remote IO.
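A sketch of the rule, with callables standing in for the real
reconstructor plumbing:

    def sync_then_maybe_rehash(ssync_sender, replicate):
        """Only issue the trailing REPLICATE when the SSYNC conversation
        actually synced some suffixes."""
        success, synced_suffixes = ssync_sender()
        if success and synced_suffixes:
            # Refresh hashes while those suffix inodes are warm in cache.
            replicate(synced_suffixes)
        return success

    # Rejected for concurrency: nothing synced, so no REPLICATE is sent.
    sync_then_maybe_rehash(lambda: (False, []), lambda suffixes: None)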
Closes-Bug: #1665141
Change-Id: Ia72c407247e4525ef071a1728750850807ae8231
Follow up for related change:
- fix typos
- use common helper methods
- refactor some tests to reduce duplicate code
Related-Change: Idd155401982a2c48110c30b480966a863f6bd305
Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429
This patch enables efficient PUT/GET for a globally distributed cluster[1].
Problem:
Erasure coding can decrease the amount of actually stored data compared
with the replicated model. For example, ec_k=6, ec_m=3 stores only 1.5x
the original data, which is smaller than 3x replicated. However, unlike
replication, erasure coding requires the availability of at least ec_k
fragments of the total ec_k + ec_m fragments to service a read (e.g. 6
of 9 in the case above). As such, if we store an EC object in a Swift
cluster spanning 2 geographically distributed data centers with the same
volume of disks, the fragments will likely be stored evenly (about 4 and
5), so we still need to access a faraway data center to decode the
original object. In addition, if one of the data centers is lost in a
disaster, the stored objects will be lost forever, and we will have to
cry a lot. To ensure highly durable storage, you might think of adding
*more* parity fragments (e.g. ec_k=6, ec_m=10); unfortunately this
causes *significant* performance degradation due to the cost of the
mathematical calculation for erasure coding encode/decode.
How this resolves the problem:
EC Fragment Duplication extends the initial solution by adding *more*
fragments from which to rebuild an object, similar to the approach
described above. The difference is that it makes *copies* of the encoded
fragments. Experimental results[1][2] show that employing small ec_k and
ec_m gives enough performance to store/retrieve objects.
On PUT:
- Encode the incoming object with small ec_k and ec_m <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in Swift Global EC Cluster
The duplicated fragments increase pressure on existing requirements
when decoding objects to service a read request. All fragments are
stored with their X-Object-Sysmeta-Ec-Frag-Index. In this change, the
X-Object-Sysmeta-Ec-Frag-Index represents the actual fragment index
encoded by PyECLib, so there *will* be duplicates. Any time we must
decode the original object data, we must only consider ec_k fragments
that are unique according to their X-Object-Sysmeta-Ec-Frag-Index: no
duplicate X-Object-Sysmeta-Ec-Frag-Index may be used when decoding an
object, so duplicates should be expected and avoided where possible.
On GET:
This patch includes the following changes:
- Change the GET path to sort primary nodes into grouped subsets, so
that each subset includes unique fragments
- Change the reconstructor to be more aware of possibly duplicate
fragments
For example, with this change, a policy could be configured such that
swift.conf:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)
At the object-server:
node index (from object ring):  0 1 2 3 4 5  <- keep the node index for
                                                the reconstruct decision
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2  <- each object keeps the
                                                actual fragment index for
                                                the backend (PyECLib)
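The mapping in the table above can be expressed as a one-liner
(illustrative; the real code also has to juggle node index vs. backend
index in the reconstructor):

    def backend_fragment_index(node_index, ec_k, ec_m):
        # With duplication, several ring positions carry the same backend
        # (PyECLib) fragment index.
        return node_index % (ec_k + ec_m)

    # ec_k=2, ec_m=1, ec_duplication_factor=2 -> a 6-replica object ring
    # -> [0, 1, 2, 0, 1, 2]
    print([backend_fragment_index(i, 2, 1) for i in range(6)])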
Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.
1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)
Doc-Impact
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
Despite a check to prevent zero values in the denominator, Python
integer division could result in a ZeroDivisionError in the compute_eta
helper function. Make sure we always have a non-zero value, even if it
is small.
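The guard amounts to clamping the denominator, along these lines
(hypothetical helper; the real compute_eta takes different arguments):

    def safe_ratio(numerator, denominator):
        # Integer arithmetic upstream can round the denominator down to
        # zero despite earlier checks; clamp it to a tiny positive value.
        return numerator / max(float(denominator), 0.00001)

    print(safe_ratio(10, 0))    # large but finite, no ZeroDivisionError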
NotImplemented:
* stats calculation is still not great, see lp bug #1488608
Closes-Bug: #1549110
Change-Id: I54f2081c92c2a0b8f02c31e82f44f4250043d837
This patch makes the ECDiskFileReader check the validity of EC
fragment metadata as it reads chunks from disk and quarantine a
diskfile with bad metadata. This in turn means that both the object
auditor and a proxy GET request will cause bad EC fragments to be
quarantined.
This change is motivated by bug 1631144, which may result in corrupt EC
fragments being written to disk yet appearing valid to the object
auditor's md5 hash and content-length checks.
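A heavily reduced sketch of the reader-side behaviour, with the actual
metadata validation injected as a callable (the real check inspects the
fragment metadata via the EC policy's PyECLib backend):

    class FragCheckingReader(object):
        """Yield chunks, quarantining the diskfile when a fragment fails
        its metadata check (hypothetical shape, not ECDiskFileReader)."""

        def __init__(self, chunks, frag_is_valid, quarantine):
            self.chunks = chunks
            self.frag_is_valid = frag_is_valid
            self.quarantine = quarantine

        def __iter__(self):
            for chunk in self.chunks:
                if not self.frag_is_valid(chunk):
                    self.quarantine('Invalid EC fragment metadata')
                    raise IOError('quarantined while reading')
                yield chunk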
NotImplemented:
* perform metadata check when a read starts on any frag_size
boundary, not just at zero
Related-Bug: #1631144
Closes-Bug: #1633647
Change-Id: Ifa6a7f8aaca94c7d39f4aeb9d4fa3f59c4f6ee13
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Linux 3.11 introduced O_TMPFILE as a flag to the open() sys call. This
enables users to get an fd to an unnamed temporary file. As it's unnamed,
it does not require the caller to devise unique names. It is also not
accessible through any path. Hence, file creation is race-free.
This file is initially unreachable. It is then populated with data
(write), metadata (fsetxattr) and fsync'd before being atomically linked
into the filesystem in a fully formed state using the linkat() sys call.
Only after a successful linkat() will the object file be available for
reference.
Caveats
* Unlike os.rename(), linkat() cannot overwrite the destination path if
it already exists. If the path exists, we unlink it and try again.
* XFS support for O_TMPFILE was only added in Linux 3.15.
* If the client disconnects during an object upload, although there is
no incomplete/stale file on disk, the object directory will persist and
is not cleaned up immediately.
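A minimal sketch of the flow (Linux >= 3.11, Python 3 and filesystem
support assumed; Swift itself uses a ctypes linkat() wrapper and much
fuller error handling):

    import os

    def write_then_link(dirpath, name, data):
        fd = os.open(dirpath, os.O_WRONLY | os.O_TMPFILE)  # anonymous file
        try:
            os.write(fd, data)
            os.fsync(fd)
            # Materialize the fully formed file at its final path; linking
            # the /proc/self/fd entry is one way to reach linkat() from
            # Python.
            os.link('/proc/self/fd/%d' % fd,
                    os.path.join(dirpath, name),
                    follow_symlinks=True)
        finally:
            os.close(fd)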
Change-Id: I8402439fab3aba5d7af449b5e465f89332f606ec
Signed-off-by: Prashanth Pai <ppai@redhat.com>