49 Commits

Author SHA1 Message Date
Jenkins
7f953ce0b9 Merge "Follow up the reconstructor test coverage" 2017-03-07 15:10:03 +00:00
Jenkins
30524392f8 Merge "Follow up on reconstructor handoffs_only" 2017-03-07 06:45:09 +00:00
Kota Tsuyuzaki
0e44770991 Follow up the reconstructor test coverage
This is follow up for https://review.openstack.org/#/c/436522/

I'd like to use the same assertion when the code goes down the same
path. Both Exception and Timeout end up in the exception log, which
starts with "Trying to GET"; "Timeout" is an extra word that appears
in the log.

In addition, this adds assertions that the return value from
get_response is None in the error cases.

Change-Id: Iba86b495a14c15fc6eca8bf8a7df7d110256b0af
2017-03-05 18:02:09 -08:00
Jenkins
cf1c44dff0 Merge "Fixups for EC frag duplication tests" 2017-03-03 23:08:34 +00:00
Jenkins
3891721d59 Merge "Cleanup reconstructor tests" 2017-03-03 21:28:32 +00:00
Kota Tsuyuzaki
54347f92ed Cleanup reconstructor tests
Fixes:

* assertTrue(xxxx in yyyy) -> assertIn(xxxx, yyyy)
* assertTrue(xxxx > yyyy) -> assertGreater(xxxx, yyyy)
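
For illustration, the pattern looks like this (a sketch; the variable
names are hypothetical):

    # before: a failure only reports "False is not true"
    self.assertTrue(local_dev in found_devs)
    self.assertTrue(len(jobs) > 0)

    # after: a failure reports both operands
    self.assertIn(local_dev, found_devs)
    self.assertGreater(len(jobs), 0)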

Change-Id: I353ec389f9abed3427951cd473d7c3ebcbca1669
2017-03-03 00:57:13 -08:00
Mahati Chamarthy
96f8b957ee Increase test coverage for reconstructor
Some of the test coverage was omitted in the related change and some
was missing. This change fixes that.

Change-Id: I403b493bd8e59f6bcb586b4263a8e8c267728505
Related-Change-Id: I69e4c4baee64fd2192cbf5836b0803db1cc71705
2017-02-28 00:54:11 +05:30
Jenkins
1f36b5dd16 Merge "EC Fragment Duplication - Foundational Global EC Cluster Support" 2017-02-26 06:26:08 +00:00
Alistair Coles
e4972f5ac7 Fixups for EC frag duplication tests
Follow up for related change:
- fix typos
- use common helper methods
- refactor some tests to reduce duplicate code

Related-Change: Idd155401982a2c48110c30b480966a863f6bd305

Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429
2017-02-25 20:40:04 -08:00
Kota Tsuyuzaki
40ba7f6172 EC Fragment Duplication - Foundational Global EC Cluster Support
This patch enables efficient PUT/GET in a globally distributed cluster[1].

Problem:
Erasure coding has the capability to decrease the amount of data
actually stored relative to the replicated model. For example, the
parameters ec_k=6, ec_m=3 can store at 1.5x the original data, which
is smaller than 3x replicated. However, unlike replication, erasure
coding requires at least ec_k fragments of the total ec_k + ec_m
fragments to be available to service a read (e.g. 6 of 9 in the case
above). As such, if we store an EC object in a swift cluster spanning
2 geographically distributed data centers which have the same volume
of disks, the fragments will likely be stored evenly (about 4 and 5),
so we still need to access a faraway data center to decode the
original object. In addition, if one of the data centers were lost in
a disaster, the stored objects would be lost forever, and we would
have to cry a lot. To ensure highly durable storage, you might think
of making *more* parity fragments (e.g. ec_k=6, ec_m=10);
unfortunately, this causes *significant* performance degradation due
to the cost of the mathematical calculation for erasure coding
encode/decode.
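
For reference, the overhead arithmetic works out as follows (a quick
sketch, not part of the patch):

    k, m = 6, 3
    overhead = float(k + m) / k      # 1.5x of the original data
    replicated = 3.0                 # vs. 3x for triple replication
    # more parities for durability, e.g. ec_k=6, ec_m=10:
    overhead = float(6 + 10) / 6     # ~2.67x, and much slower encode/decode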

How this resolves the problem:
EC Fragment Duplication extends the initial solution to add *more*
fragments from which to rebuild an object, similar to the solution
described above. The difference is making *copies* of the encoded
fragments. Experimental results[1][2] show that employing small ec_k
and ec_m yields enough performance to store/retrieve objects.

On PUT:

- Encode the incoming object with small ec_k and ec_m  <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
  is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in Swift Global EC Cluster

The duplicated fragments increase the pressure on existing
requirements when decoding objects in service of a read request.  All
fragments are stored with their X-Object-Sysmeta-Ec-Frag-Index.  In
this change, the X-Object-Sysmeta-Ec-Frag-Index represents the actual
fragment index encoded by PyECLib, so there *will* be duplicates.
Anytime we must decode the original object data, we may only consider
ec_k fragments that are unique according to their
X-Object-Sysmeta-Ec-Frag-Index.  No duplicate
X-Object-Sysmeta-Ec-Frag-Index may be used when decoding an object;
duplicates should be expected and avoided where possible.
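
Conceptually, the decode-side uniqueness constraint looks like this
(a simplified sketch; the names are illustrative, not the actual
proxy code):

    def pick_frags_for_decode(responses, ec_k):
        # keep at most one response per actual fragment index; a
        # duplicate X-Object-Sysmeta-Ec-Frag-Index is useless to decode
        unique = {}
        for resp in responses:
            frag_index = resp.headers['X-Object-Sysmeta-Ec-Frag-Index']
            unique.setdefault(frag_index, resp)
            if len(unique) == ec_k:
                break
        return list(unique.values())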

On GET:

This patch includes the following changes:
- Change the GET path to sort primary nodes into grouped subsets, so
  that each subset includes unique fragments
- Change the reconstructor to be more aware of possibly duplicate
  fragments

For example, with this change, a policy could be configured such that

swift.conf:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)

At Object-Server:
node index (from object ring):  0 1 2 3 4 5 <- keep node index for
                                               reconstruct decision
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2 <- each object keeps actual
                                               fragment index for
                                               backend (PyEClib)
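
In other words, under duplication the backend fragment index follows
from the ring node index modulo the number of unique fragments (a
sketch of the mapping above, not the exact helper in the patch):

    ec_k, ec_m, factor = 2, 1, 2
    ec_n_unique_fragments = ec_k + ec_m          # 3
    replicas = ec_n_unique_fragments * factor    # 6 in the object ring

    for node_index in range(replicas):
        # prints 0->0, 1->1, 2->2, 3->0, 4->1, 5->2
        print(node_index, '->', node_index % ec_n_unique_fragments)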

Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.

1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)

Doc-Impact

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
2017-02-22 10:56:13 -08:00
Jenkins
1f3dd83f41 Merge "Remove per-device reconstruction stats" 2017-02-20 21:55:27 +00:00
Tim Burke
8973ceb31a Remove per-device reconstruction stats
Now that we're shuffling parts before going through them, those stats no
longer make sense -- device completion would always be 100%.

Also, always use delete_partition for cleanup, so we have one place to
make improvements. This means we'll properly clean up non-numeric
directories.

Also, move more of delete_partition's I/O into the tpool.

Change-Id: Ie06bb16c130d46ccf887c8fcb252b8d018072d68
Related-Change: I69e4c4baee64fd2192cbf5836b0803db1cc71705
2017-02-20 16:22:45 +00:00
Jenkins
cdd72dd34f Merge "Deprecate broken handoffs_first in favor of handoffs_only" 2017-02-15 03:54:49 +00:00
Kota Tsuyuzaki
600db4841e Follow up on reconstructor handoffs_only
This is a follow-up for https://review.openstack.org/#/c/425493
This patch includes:

- Add more tests on the configuration of handoffs_first and
  handoffs_only
- Remove an unnecessary space in a warning log line (2 places)
- Change the test conf values from True/False to "True"/"False"
  (strings), because values in the conf dict should be strings
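
This matches how the daemons actually read the conf dict - values
arrive as strings and are interpreted with config_true_value, e.g.:

    from swift.common.utils import config_true_value

    conf = {'handoffs_first': 'True'}   # conf values are strings
    config_true_value(conf.get('handoffs_first', 'false'))  # True
    config_true_value('False')                              # False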

Co-Authored-By: Janie Richling <jrichli@us.ibm.com>

Change-Id: Ida90c32d16481a15fa68c9fdb380932526c366f6
2017-02-14 18:21:58 -08:00
Clay Gerrard
da557011ec Deprecate broken handoffs_first in favor of handoffs_only
The handoffs_first mode in the replicator has the useful behavior of
processing all handoff parts across all disks until there aren't any
handoffs anymore on the node [1], and then it seemingly tries to drop
back into normal operation.  In practice I've only ever heard of
handoffs_first used while rebalancing and turned off as soon as the
rebalance finishes - it's not recommended to run with handoffs_first
mode turned on, and it emits a warning on startup if the option is
enabled.

The handoffs_first mode on the reconstructor doesn't work - it was
prioritizing handoffs *per-part* [2] - which is really unfortunate
because in the reconstructor during a rebalance it's often *much* more
attractive, from a disk/network efficiency perspective, to revert a
partition from a handoff than it is to rebuild an entire partition from
another primary using the other EC fragments in the cluster.

This change deprecates handoffs_first in favor of handoffs_only in the
reconstructor, which is far more useful - and, just like the
handoffs_first mode in the replicator, it gives the operator the option
of forcing the consistency engine to focus on rebalance.  The
handoffs_only behavior is somewhat consistent with the replicator's
handoffs_first option (any error on any handoff in the replicator will
make it essentially handoff only forever), but the option does what you
want and is named correctly in the reconstructor.

For consistency with the replicator the reconstructor will mostly honor
the handoffs_first option, but if you set handoffs_only in the config it
always takes precedence.  Having handoffs_first in your config always
results in a warning, but if handoffs_only is not set and handoffs_first
is true the reconstructor will assume you want handoffs_only and behave
as such.

When running in handoffs_only mode the reconstructor will start to log a
warning every cycle if you leave it enabled after it finishes reverting
handoffs.  You should be monitoring on-disk partitions and disable the
option as soon as the cluster finishes the full rebalance cycle.
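
For example, an operator might enable the mode for the duration of a
rebalance with something like this (an object-server.conf sketch):

    [object-reconstructor]
    # remove (or set to False) as soon as the rebalance completes
    handoffs_only = True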

1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator
handoffs_first "mode"

2. Unlike replication, each partition in an EC policy can have a
different kind of job per frag_index, but the cardinality of jobs is
typically only one (either sync or revert) unless there have been a
bunch of errors during write, in which case handoff partitions may hold
a number of different fragments.

Known-Issues:

handoffs_only is not documented outside of the example config, see lp
bug #1626290

Closes-Bug: #1653018

Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
2017-02-13 21:13:29 -08:00
Jenkins
65744c8448 Merge "Shuffle disks and parts in reconstructor" 2017-02-01 07:15:50 +00:00
Clay Gerrard
eadb01b8af Do not revert fragments to handoffs
We're already a handoff - just wait until we can ship it to the right
primary location.

If we timeout talking to a couple of nodes (or more likely get rejected
for connection limits because of contention during a rebalance) we can
actually end up making *more* work if we move the part to another node.
I've seen clusters get stuck on rebalance just passing parts around
handoffs for *days*.

Known-Issues:

If we see a 507 from a primary and we're not in the handoff list (we're
an old primary post rebalance) it'd probably not be so terrible to try
to revert it to the first handoff if it's not already holding a part.
But that's more work and sounds more like lp bug #1510342.

Closes-Bug: #1653169

Change-Id: Ie351d8342fc8e589b143f981e95ce74e70e52784
2017-01-31 02:37:31 +00:00
Clay Gerrard
2f0ab78f9f Shuffle disks and parts in reconstructor
The main problem with going disk by disk is that it means all of your
I/O is only on one spindle at a time and no matter how high you set
concurrency it doesn't go any faster.
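
The fix is conceptually just a shuffle of the (device, partition)
pairs before processing - a simplified sketch of the idea, not the
exact reconstructor code:

    import random

    def build_jobs(local_devices, partitions_of):
        # collect (device, partition) pairs across *all* disks first...
        jobs = [(dev, part)
                for dev in local_devices
                for part in partitions_of(dev)]
        # ...then shuffle, so concurrent workers hit many spindles at
        # once instead of queueing up behind a single disk
        random.shuffle(jobs)
        return jobs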

Closes-Bug: #1491605

Change-Id: I69e4c4baee64fd2192cbf5836b0803db1cc71705
2017-01-25 18:30:17 -08:00
Mahati Chamarthy
69f7be99a6 Move documented reclaim_age option to correct location
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.

As a drive-by, I also clean up the reclaim_age plumbing from get_hashes
down to cleanup_ondisk_files, since it's a method on the Manager and has
access to the configured reclaim_age.  This fixes a bug where
finalize_put wouldn't use the [DEFAULT]/object-server configured
reclaim_age - which is normally benign but leads to weird behavior on
DELETE requests with a really small reclaim_age.

There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age, and the
method could just as reasonably have been implemented on the Manager.

UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup.  All object
services must use the same reclaim_age.  If you require a non-default
reclaim age it should be set in the [DEFAULT] section.  If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.

If you specify a reclaim_age value in any object related config you
should move it to *only* the [DEFAULT] section before you upgrade.  If
you configure a reclaim_age less than your consistency window you are
likely to be eaten by a Grue.
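
That is, after upgrading the setting should live in exactly one place
(an object-server.conf sketch; 604800 is the default of one week):

    [DEFAULT]
    # one value shared by the object-server, replicator, reconstructor
    reclaim_age = 604800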

Closes-Bug: #1626296

Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
2017-01-13 03:10:47 +00:00
Kota Tsuyuzaki
b09360d447 Fix stats calculation in object-reconstructor
This patch fixes the object-reconstructor to calculate device_count
as the total number of local devices across all policies. Previously
Swift counted devices per policy, so reconstruction_device_count -
which is meant to be the number of devices Swift actually needs to
reconstruct - was the sum of the per-policy counts.

With this patch, Swift will first gather all local devices across all
policies, and then collect parts for each device as it does currently.
This way, we can see the status of the remaining jobs/disks
percentage via the stats_line output.

To enable this change, this patch also touches the object replicator
to get a DiskFileManager via the DiskFileRouter class so that
DiskFileManager instances are policy specific. Currently the same
replication policy DiskFileManager class is always used, but this
change future-proofs the replicator for possible other DiskFileManager
implementations.

The change also gives the ObjectReplicator a _df_router variable,
making it consistent with the ObjectReconstructor, and allowing a
common way for ssync.Sender to access DiskFileManager instances via
its daemon's _df_router instance.

Also, remove the use of FakeReplicator from the ssync test suite. It
was not necessary and risked masking divergence between ssync and the
replicator and reconstructor daemon implementations.

Co-Author: Alistair Coles <alistair.coles@hpe.com>

Closes-Bug: #1488608
Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068
2016-12-12 21:26:54 -08:00
Clay Gerrard
f4adb2f28f Fix ZeroDivisionError in reconstructor.stats_line
Despite a check to prevent zero values in the denominator, Python
integer division could result in a ZeroDivisionError in the
compute_eta helper function.  Make sure we always have a non-zero
value, even if it is small.
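
A minimal sketch of the guard (illustrative, not the exact helper):

    import time

    def compute_eta(start_time, current_value, final_value):
        elapsed = time.time() - start_time
        # float division, and fall back to a small non-zero
        # completion so the denominator can never be zero
        completion = (float(current_value) / final_value) or 0.00001
        return elapsed / completion - elapsed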

NotImplemented:

 * stats calculation is still not great, see lp bug #1488608

Closes-Bug: #1549110
Change-Id: I54f2081c92c2a0b8f02c31e82f44f4250043d837
2016-11-07 18:19:20 -08:00
Alistair Coles
6574ce31ee EC: reconstruct using non-durable fragments
Previously the reconstructor would only reconstruct a missing fragment
when a set of ec_ndata other fragments was available, *all* of which
were durable. Since change [1] it has been possible to retrieve
non-durable fragments from object servers. This patch changes the
reconstructor to take advantage of [1] and use non-durable fragments.

A new probe test is added to test scenarios with a mix of failed and
non-durable nodes. The existing probe tests in
test_reconstructor_rebuild.py and test_reconstructor_durable.py were
broken. These were intended to simulate cases where combinations of
nodes were either failed or had non-durable fragments, but the test
scenarios defined were not actually created - every test scenario
broke only one node instead of the intent of breaking multiple
nodes. The existing tests have been refactored to re-use most of their
setup and assertion code, and merged with the new test into a single
class in test_reconstructor_rebuild.py.

test_reconstructor_durable.py is removed.

[1] Related-Change: I2310981fd1c4622ff5d1a739cbcc59637ffe3fc3

Change-Id: Ic0cdbc7cee657cea0330c2eb1edabe8eb52c0567
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Closes-Bug: #1624088
2016-11-03 16:54:09 +00:00
Ondřej Nový
33c18c579e Remove executable flag from some test modules
Change-Id: I36560c2b54c43d1674b007b8105200869b5f7987
2016-10-31 21:22:10 +00:00
Jenkins
264e728364 Merge "Prevent ssync writing bad fragment data to diskfile" 2016-10-14 23:29:29 +00:00
Alistair Coles
3218f8b064 Prevent ssync writing bad fragment data to diskfile
Previously, if a reconstructor sync type job failed to provide
sufficient bytes from a reconstructed fragment body iterator to match
the content-length that the ssync sender had already sent to the ssync
receiver, the sender would still proceed to send the next
subrequest. The ssync receiver might then write the start of the next
subrequest to the partially complete diskfile for the previous
subrequest (including writing subrequest headers to that diskfile)
until it has received content-length bytes.

Since a reconstructor ssync job does not send an ETag header (it
cannot, because it does not know the ETag of a reconstructed fragment
until it has been sent), the receiving object server does not detect
the "bad" data written to the fragment diskfile, and worse, will
label it with an ETag that matches the md5 sum of the bad data. The
bad fragment file will therefore appear good to the auditor.

There is no easy way for the ssync sender to communicate a lack of
source data to the receiver other than by disconnecting the
session. So this patch adds a check in the ssync sender that the sent
byte count is equal to the sent Content-Length header value for each
subrequest, and disconnects if a mismatch is detected.
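
A sketch of the sender-side check (simplified; the names are
illustrative):

    def send_subrequest_body(sock, body_iter, content_length):
        bytes_sent = 0
        for chunk in body_iter:
            sock.sendall(chunk)
            bytes_sent += len(chunk)
        if bytes_sent != content_length:
            # a short reconstructed fragment: disconnect so the
            # receiver cannot mistake the next subrequest for the
            # remainder of this body
            raise ValueError('sent %d bytes, expected %d'
                             % (bytes_sent, content_length))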

The disconnect prevents the receiver finalizing the bad diskfile, but
also prevents subsequent fragments in the ssync job being sync'd until
the next cycle.

Closes-Bug: #1631144
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>

Change-Id: I54068906efdb9cd58fcdc6eae7c2163ea92afb9d
2016-10-13 17:15:10 +01:00
Alistair Coles
b13b49a27c EC - eliminate .durable files
Instead of using a separate .durable file to indicate
the durable status of a .data file, rename the .data
to include a durable marker in the filename. This saves
one inode for every EC fragment archive.

An EC policy PUT will, as before, first rename a temp
file to:

   <timestamp>#<frag_index>.data

but now, when the object is committed, that file will be
renamed:

   <timestamp>#<frag_index>#d.data

with the '#d' suffix marking the data file as durable.
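
A quick sketch of how the new on-disk names decompose (an
illustrative parse, not the DiskFile code itself):

    fname = '1475241603.12345#3#d.data'
    stem = fname[:-len('.data')]
    parts = stem.split('#')            # ['1475241603.12345', '3', 'd']
    timestamp, frag_index = parts[0], parts[1]
    is_durable = len(parts) > 2 and parts[2] == 'd'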

Diskfile suffix hashing returns the same result when the
new durable-data filename or the legacy durable file is
found in an object directory. A fragment archive that has
been created on an upgraded object server will therefore
appear to be in the same state, as far as the consistency
engine is concerned, as the same fragment archive created
on an older object server.

Since legacy .durable files will still exist in deployed
clusters, many of the unit tests scenarios have been
duplicated for both new durable-data filenames and legacy
durable files.

Change-Id: I6f1f62d47be0b0ac7919888c77480a636f11f607
2016-10-10 18:11:02 +01:00
Tim Burke
ad16e2c77b Stop complaining about auditor_status files
Following fd86d5a, the object-auditor would leave status files so it
could resume where it left off if restarted. However, this would also
cause the object-reconstructor to print warnings like:

  Unexpected entity in data dir: u'/srv/node4/sdb8/objects/auditor_status_ZBF.json'

...which isn't actually terribly useful or actionable. The auditor will
clean it up (eventually); the operator doesn't have to do anything.

Now, the reconstructor will specifically ignore those status files.
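
The ignore rule amounts to a filename check while walking the data
dir (a sketch):

    import os

    def is_auditor_status(path):
        # e.g. 'auditor_status_ZBF.json' or 'auditor_status_ALL.json'
        name = os.path.basename(path)
        return name.startswith('auditor_status_') and name.endswith('.json')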

Change-Id: I2f3d0bd2f1e242db6eb263c7755f1363d1430048
2016-05-11 20:13:46 -07:00
Shashirekha Gundur
cf48e75c25 change default ports for servers
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202,
so they do not conflict with X-Windows or other services.

Updated SAIO docs.

DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
2016-04-29 14:47:38 -04:00
Samuel Merritt
9430f4c9f5 Move HeaderKeyDict to avoid an inline import
There was a function in swift.common.utils that was importing
swob.HeaderKeyDict at call time. It couldn't import it at compilation
time since utils can't import from swob or else it blows up with a
circular import error.

This commit just moves HeaderKeyDict into swift.common.header_key_dict
so that we can remove the inline import.

Change-Id: I656fde8cc2e125327c26c589cf1045cb81ffc7e5
2016-03-07 12:26:48 -08:00
Ondřej Nový
f53cf1043d Fixed a few misspellings in comments
Change-Id: I8479c85cb8821c48b5da197cac37c80e5c1c7f05
2016-01-05 20:20:15 +01:00
Tushar Gohad
2d85a3f699 EC: Use best available ec_type in unittests
To minimize external library dependencies for Swift unit
tests and SAIO, PyECLib 1.1.1 introduces a native backend
'liberasurecode_rs_vand.'  This patch migrates the unit tests over to
the new ec_type when it is available.

This change will work with current pyeclib requirements
(==1.0.7) and also future requirements (>=1.0.7).

When we're able to raise *our* requirements to >=1.1.1 we
should remove jerasure from the list of preferred backends.
Related SAIO doc and example config changes should be
included with that patch.

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idf657f0acf0479bc8158972e568a29dbc08eaf3b
2015-11-10 12:18:50 -08:00
Samuel Merritt
e31ecb24b6 Get rid of contextlib.nested() for py3
contextlib.nested() is missing completely in Python 3.

Since 2.7, we can use multiple context managers in a 'with' statement,
like so:

    with thing1() as t1, thing2() as t2:
        do_stuff()

Now, if we had some code that needed to nest an arbitrary number of
context managers, there's stuff we could do with contextlib.ExitStack
and such... but we don't. We only use contextlib.nested() in tests to
set up bunches of mocks without crazy-deep indentation, and all that
stuff fits perfectly into multiple-context-manager 'with' statements.
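
For reference, the arbitrary-nesting case would look like this with
contextlib.ExitStack (py3; a sketch of the thing we don't actually
need):

    import contextlib
    from unittest import mock

    with contextlib.ExitStack() as stack:
        mocks = [stack.enter_context(mock.patch('os.listdir'))
                 for _ in range(3)]
        # all three patches are active here and are unwound in
        # reverse order on exit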

Change-Id: Id472958b007948f05dbd4c7fb8cf3ffab58e2681
2015-10-23 11:44:54 -07:00
Alistair Coles
29c10db0cb Add POST capability to ssync for .meta files
ssync currently does the wrong thing when replicating object dirs
containing both a .data and a .meta file. The ssync sender uses a
single PUT to send both object content and metadata to the receiver,
using the metadata (.meta file) timestamp. This results in the object
content timestamp being advanced to the metadata timestamp,
potentially overwriting newer object data on the receiver and causing
an inconsistency with the container server record for the object.

For example, replicating an object dir with {t0.data(etag=x), t2.meta}
to a receiver with t1.data(etag=y) will result in the creation of
t2.data(etag=x) on the receiver. However, the container server will
continue to list the object as t1(etag=y).

This patch modifies ssync to replicate the content of .data and .meta
separately using a PUT request for the data (no change) and a POST
request for the metadata. In effect, ssync replication replicates the
client operations that generated the .data and .meta files so that
the result of replication is the same as if the original client requests
had persisted on all object servers.

Apart from maintaining correct timestamps across sync'd nodes, this has
the added benefit of not needing to PUT objects when only the metadata
has changed and a POST will suffice.

Taking the same example, ssync sender will no longer PUT t0.data but will
POST t2.meta resulting in the receiver having t1.data and t2.meta.

The changes are backwards compatible: an upgraded sender will only sync
data files to a legacy receiver and will not sync meta files (fixing the
erroneous behavior described above); a legacy sender will operate as
before when sync'ing to an upgraded receiver.

Changes:
- diskfile API provides methods to get the data file timestamp
  as distinct from the diskfile timestamp.

- diskfile yield_hashes return tuple now passes a dict mapping data and
  meta (if any) timestamps to their respective values in the timestamp
  field.

- ssync_sender will encode data and meta timestamps in the
  (hash_path, timestamp) tuple sent to the receiver during
  missing_checks.

- ssync_receiver compares sender's data and meta timestamps to any
  local diskfile and may specify that only data or meta parts are sent
  during updates phase by appending a qualifier to the hash returned
  in its 'wanted' list.

- ssync_sender now sends POST subrequests when a meta file
  exists and its content needs to be replicated.

- ssync_sender may send *only* a POST if the receiver indicates that
  is the only part required to be sync'd.

- object server will allow PUT and DELETE with earlier timestamp than
  a POST

- Fixed TODO related to replicated objects with fast-POST and ssync

Related spec change-id: I60688efc3df692d3a39557114dca8c5490f7837e

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Closes-Bug: 1501528
Change-Id: I97552d194e5cc342b0a3f4b9800de8aa6b9cb85b
2015-10-02 11:24:19 +00:00
Jenkins
f1b5a1f4c5 Merge "Reconstructor logging to omit 404 warnings" 2015-09-19 18:45:31 +00:00
Jenkins
227e1f8235 Merge "Fix purge for tombstone only REVERT job" 2015-09-19 18:42:30 +00:00
Minwoo Bae
a63f70c17d Reconstructor logging to omit 404 warnings
Currently, the replicator does not log warning messages
for 404 responses. We would like the reconstructor to
do the same, as 404s are not considered unusual, and
are already handled by the object server.

Change-Id: Ia927bf30362548832e9f451923ff94053e11b758
Closes-Bug: #1491883
2015-09-18 15:25:32 -05:00
Bill Huber
9324ce83c6 Reconstructor GET excludes user_agent in log
To make it easier for users to deduce in the log to find out
where the request originates from, it is necessary to include
the user_agent field in the reconstructor for a GET method
and to have this particular log consistent with other servers'
methods.

Change-Id: I0ca7443436e97c2db64c966ab4d73c5c12a1f059
Closes-Bug: 1491871
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
2015-09-11 14:49:41 -05:00
Clay Gerrard
369447ec47 Fix purge for tombstone only REVERT job
When we revert a partition we normally push it off to the specific
primary node for the index of the data files in the partition.  However,
when a partition is devoid of any data files (only tombstones) we build
a REVERT job with a frag_index of None.

This change updates the ECDiskFile's purge method to be robust to
purging tombstones when the frag_index is None.
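
That is, roughly (a sketch assuming ECDiskFile's (timestamp,
frag_index) purge signature):

    def purge_reverted(df, timestamp):
        # tombstone-only revert job: there are no data files, hence no
        # frag index - purge must tolerate None and still remove the
        # .ts file
        df.purge(timestamp, frag_index=None)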

Add a probe test to validate that tombstone-only revert jobs will
clean themselves up if they can validate they're in sync with
part-replica count nodes - even if one of the primaries is down (in
which case they sync tombstones with other handoffs to fill in for
the primaries).

Change-Id: Ib9a42f412fb90d51959efce886c0f8952aba8d85
2015-09-10 11:07:04 +01:00
janonymous
9456af35a2 pep8 fix: assertEquals -> assertEqual
assertEquals is deprecated in py3; changes in dirs:

* test/unit/obj/*
* test/unit/test_locale/*

Change-Id: I3dd0c1107165ac529f1cd967363e5cf408a1d02b
2015-08-07 19:28:35 +05:30
Jenkins
e7205fd7d6 Merge "cPickle is deprecated in py3, replacing it with six.moves" 2015-07-28 12:33:24 +00:00
Charles Hsu
39b6ef6e4f Fix reconstructor stats mssage.
Calculate reconstruction job count and remaining time that
would be inappropriate for user. Use real partition count would
be suitable for user.

Change-Id: I6b025854baf4757dddf9d7fe7bc2cece58a49157
Closes-Bug: #1468298
2015-07-08 12:52:30 +08:00
Jenkins
131668f359 Merge "EC Reconstructor: Do not reconstruct existing fragments." 2015-07-07 22:24:16 +00:00
janonymous
c907107fe4 cPickle is deprecated in py3, replacing it with six.moves
cPickle is deprecated and should be replaced with six.moves
to provide py2 and py3 compatibility.
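
The replacement is mechanical:

    from six.moves import cPickle as pickle   # was: import cPickle as pickle

    data = pickle.loads(pickle.dumps({'x': 1}))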

Change-Id: Ibad990708722360d188c641e61444d50a16a1e93
2015-07-07 22:46:37 +05:30
Minwoo Bae
44b76a1b1b EC Reconstructor: Do not reconstruct existing fragments.
The EC reconstructor needs to verify that the fragment needing to
be reconstructed does not reside in the collection of node responses.
Otherwise, resources will be spent unnecessarily reconstructing
the fragment. Moreover, this could cause a segfault on some backends.

This change adds the necessary verification steps to make sure
that a fragment will only be rebuilt in the case it is missing from
the other fragment archives.
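
The verification step amounts to checking the peer responses for the
index being rebuilt before invoking the backend (a simplified sketch;
names are illustrative):

    def needs_rebuild(frag_index, responses):
        # if some node already returned our fragment index, there is
        # nothing to reconstruct (and feeding it back into the backend
        # decode could even segfault on some schemes)
        have = set(int(r.headers['X-Object-Sysmeta-Ec-Frag-Index'])
                   for r in responses)
        return frag_index not in have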

Added some tests to provide coverage for these scenarios.

Change-Id: I91f3d4af52cbc66c9f7ce00726f247b5462e66f9
Closes-Bug: #1452553
2015-06-26 16:46:58 -05:00
Darrell Bishop
df134df901 Allow 1+ object-servers-per-disk deployment
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs.  The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring.  In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket.  The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk.  The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring.  The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
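
For example (an object-server.conf sketch):

    [DEFAULT]
    # fork 3 object-server workers per unique ring port on this host
    servers_per_port = 3
    ring_check_interval = 15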

Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM.  Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.

The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True.  This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.

A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses).  The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.

This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server".  The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks.  Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None.  Then,
object-replicator and object-reconstructor both set self.bind_port to
None if server_per_port is enabled.  Bonus improvement for IPv6
addresses in is_local_device().

This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/

Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").

However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.

This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.

In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.

Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.

For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md

Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md

If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md

DocImpact

Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
2015-06-18 12:43:50 -07:00
Clay Gerrard
a3559edc23 Exclude local_dev from sync partners on failure
If the primary left or right hand partners are down, the next best thing
is to validate the rest of the primary nodes.  Where the rest should
exclude not just the left and right hand partners - but ourself as well.

This fixes a accidental noop when partner node is unavailable and
another node is missing data.

Validation:

Add probetests to cover ssync failures for the primary sync_to nodes for
sync jobs.

Drive-by:

Plumb the check_mount and check_dir constraints into the remaining
daemons.

Change-Id: I4d1c047106c242bca85c94b569d98fd59bb255f4
2015-05-26 12:50:31 -07:00
Kota Tsuyuzaki
27f6fba5c3 Use reconstruct instead of decode/encode
With PyECLib bumped up to 1.0.7 in global requirements,
we can use the "reconstruct" function directly instead
of the current hack of doing decode/encode in the reconstructor.
The hack was there to work around a PyECLib < 1.0.7
(strictly, the jerasure scheme) reconstruction bug, so we
no longer have to do decode/encode.
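
With 1.0.7 the reconstructor can call the backend directly; a sketch
against the public PyECLib interface:

    from pyeclib.ec_iface import ECDriver

    ec_driver = ECDriver(k=6, m=3, ec_type='jerasure_rs_vand')
    frags = ec_driver.encode(b'some object data' * 1024)

    # rebuild fragment #4 from any k available fragments in one call,
    # instead of decode()ing the whole object and re-encode()ing it
    available = frags[:4] + frags[5:]
    rebuilt = ec_driver.reconstruct(available, [4])[0]
    assert rebuilt == frags[4]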

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I69aae495670e3d0bdebe665f73915547a4d56f99
2015-04-20 16:48:10 -07:00
Clay Gerrard
52b102163e Don't apply the wrong Etag validation to rebuilt fragments
Because of the object-server's interaction with the ssync sender's
X-Backend-Replication-Headers, when an object (or fragment archive) is
pushed unmodified to another node its ETag value is duped into the
receiving end's metadata as Etag.  This interacts poorly with the
reconstructor's RebuildingECDiskFileStream, which cannot know ahead of
time the ETag of the fragment archive being rebuilt.

Don't send the Etag from the local source fragment archive being used as
the basis for the rebuilt fragment archive's metadata along to ssync.

Change-Id: Ie59ad93a67a7f439c9a84cd9cff31540f97f334a
2015-04-15 23:33:32 +01:00
paul luse
647b66a2ce Erasure Code Reconstructor
This patch adds the erasure code reconstructor. It follows the
design of the replicator but:
  - There is no notion of update() or update_deleted().
  - There is a single job processor
  - Jobs are processed partition by partition.
  - At the end of processing a rebalanced or handoff partition, the
    reconstructor will remove successfully reverted objects if any.

It also makes various ssync changes, such as the addition of the
reconstruct_fa() function, called from ssync_sender, which performs
the actual reconstruction while sending the object to the receiver.

Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Samuel Merritt <sam@swiftstack.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
blueprint ec-reconstructor
Change-Id: I7d15620dc66ee646b223bb9fff700796cd6bef51
2015-04-14 00:52:17 -07:00