I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port:
* cli/ringbuilder.py
* common/linkat.py
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
...by using a random ring, looking at *all* partitions, and making
assertions about the distribution of how many times we have to call
next().
Change-Id: Ia5feb9396d4bf6fd35f16bbc5280e63022ed2c47
Unit tests use the random module in places to randomise
test inputs, but once the tests in test_builder.py or
test_ring_builder_analyzer.py have been run the random
module is left in a repeatable state because calls are made
to RingBuilder.rebalance with a seed value. Consequently,
subsequent calls to random in other tests get repeatable
results from one test run to another.
This patch resets the state of the random module before
returning from RingBuilder.rebalance.
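A minimal sketch of the approach (illustrative; _do_rebalance is a
hypothetical stand-in for the real work):

    import random

    def rebalance(self, seed=None):
        if seed is not None:
            random.seed(seed)
        try:
            return self._do_rebalance()  # hypothetical helper
        finally:
            random.seed()  # re-seed from system entropy before returning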
Closes-Bug: #1639755
Change-Id: I4b74030afc654e60452e65b3e0f1b45a189c16e3
On-disk, serialized ring files are byteorder dependent, which makes them
unportable between different endian architectures. Add a field to the
ring dictionary in the file indicating the byteorder used to generate
the file, and then byteswap if necessary when deserializing the file.
This patch only allows newly created ring files to be byteorder
agnostic. Previously generated ring files will still fail on different
endian architectures, and will need to be regenerated with this patch.
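A rough sketch of the deserialization side, assuming a 'byteorder' key in
the ring dict (field name illustrative):

    import sys
    from array import array

    def load_part2dev_row(ring_dict, raw_bytes):
        row = array('H', raw_bytes)
        # byteswap when the file was written on a machine with the
        # opposite endianness
        if ring_dict.get('byteorder', sys.byteorder) != sys.byteorder:
            row.byteswap()
        return row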
Change-Id: I23b5e0a8082b30ca257aeb1fab03ab74e6f0b2d4
Closes-Bug: #1639980
Replaced asserts with more specific assert methods,
e.g.: from assertTrue(sth == None) to assertIsNone(*) or
assertTrue(isinstance(inst, type)) to assertIsInstance(inst, type)
or assertTrue(not sth) to assertFalse(sth).
The code gets more readable, and a better description will be shown on failure.
Change-Id: I9531c9939aa7c2dac127b5dc865b8d396dab318f
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202;
so they do not conflict with X-Windows or other services.
Updated SAIO docs.
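For reference, the recommended bindings become (illustrative SAIO values):

    object-server:    bind_port = 6200  (was 6000)
    container-server: bind_port = 6201  (was 6001)
    account-server:   bind_port = 6202  (was 6002)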
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
It's harder than it sounds. There were really three challenges.
Challenge #1 Initial Assignment
===============================
Before starting to assign parts on this new shiny ring you've
constructed, maybe we'll pause for a moment up front and consider the
lay of the land. This process is called the replica_plan.
The replica_plan approach is separating part assignment failures into
two modes:
1) we considered the cluster topology and its weights and came up with
the wrong plan
2) we failed to execute on the plan
I failed at both parts plenty of times before I got it this close. I'm
sure a counter example still exists, but when we find it the new helper
methods will let us reason about where things went wrong.
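To make that concrete, a replica_plan is roughly a map from tier to how
many replicas belong under that tier (shape illustrative):

    replica_plan = {
        ():    {'min': 0, 'target': 3.0, 'max': 3},  # the whole ring
        (1,):  {'min': 2, 'target': 2.0, 'max': 2},  # region 1
        (2,):  {'min': 1, 'target': 1.0, 'max': 1},  # region 2
        # ... and so on down through the zone, ip and device tiers
    }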
Challenge #2 Fixing Placement
=============================
With a sound plan in hand, the less material you have to execute with,
the easier it is to fail to execute on it - so we gather up as many parts
as we can, as long as we think we can find them a better home.
Picking the right parts for gather is a black art - when you notice a
balance is slow it's because it's spending so much time iterating over
replica2part2dev trying to decide just the right parts to gather.
The replica plan can help at least in the gross dispersion collection to
gather up the worst offenders first before considering balance. I think
trying to avoid picking up parts that are stuck to the tier before
falling into a forced grab on anything over parts_wanted helps with
stability generally - but depending on where the parts_wanted are in
relation to the full devices it's pretty easy to pick up something that'll
end up really close to where it started.
I tried to break the gather methods into smaller pieces so it looked
like I knew what I was doing.
Going with a MAXIMUM gather iteration instead of balance (which doesn't
reflect the replica_plan) doesn't seem to be costing me anything - most
of the time the exit condition is either solved or all the parts overly
aggressively locked up on min_part_hours. So far, it mostly seems that if
the thing is going to balance this round it'll get it in the first
couple of shakes.
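A rough sketch of that gather ordering (method names illustrative, bodies
elided):

    def _gather_parts(self, assign_parts, replica_plan):
        # worst dispersion offenders first...
        self._gather_parts_for_dispersion(assign_parts, replica_plan)
        # ...then anything over parts_wanted, forced grabs last
        self._gather_parts_for_balance(assign_parts, replica_plan)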
Challenge #3 Crazy replica2part2dev tables
==========================================
I think there's lots of ways "scars" can build up a ring which can
result in very particular replica2part2dev tables that are physically
difficult to dig out of. It's repairing these scars that will take
multiple rebalances to resolve.
... but at this point ...
... lacking a counter example ...
I've been able to close up all the edge cases I was able to find. It
may not be quick, but progress will be made.
Basically my strategy just required a better understanding of how
previous algorithms were able to *mostly* keep things moving by brute
forcing the whole mess with a bunch of randomness. Then when we detect
our "elegant" careful part selection isn't making progress - we can fall
back to the same old tricks.
Validation
==========
We validate against duplicate part replica assignment after rebalance
and raise an ERROR if we detect more than one replica of a part assigned
to the same device.
In order to meet that requirement we have to have as many devices as
replicas, so attempting to rebalance with too few devices w/o changing
your replica_count is also an ERROR not a warning.
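A sketch of the duplicate-replica check (illustrative, not the exact code):

    for part in range(2 ** self.part_power):
        devs = [r2p2d[part] for r2p2d in self._replica2part2dev
                if part < len(r2p2d)]
        if len(devs) != len(set(devs)):
            raise RingValidationError(
                'part %s assigned to duplicate devices %r' % (part, devs))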
Random Thoughts
===============
As usual with rings, the test diff can be hard to reason about -
hopefully I've added enough comments to assure future me that these
assertions make sense.
Despite being a large rewrite of a lot of important code, the existing
code is known to have failed us. This change fixes a critical bug that's
trivial to reproduce in a critical component of the system.
There's probably a bunch of error messages and exit status stuff that's
not as helpful as it could be considering the new behaviors.
Change-Id: I1bbe7be38806fc1c8b9181a722933c18a6c76e05
Closes-Bug: #1452431
assertEquals is deprecated in py3, so replace it with assertEqual.
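For example:

    self.assertEquals(foo, bar)  # deprecated alias
    self.assertEqual(foo, bar)   # preferred spelling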
Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs. The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring. In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket. The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk. The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring. The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
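For example (illustrative object-server config):

    [DEFAULT]
    servers_per_port = 4
    ring_check_interval = 15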
Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.
The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True. This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.
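Usage looks like (path illustrative):

    from swift.common.ring import RingData

    # loads everything except replica2part2dev_id
    ring_meta = RingData.load('/etc/swift/object.ring.gz', header_only=True)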
A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses). The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.
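Usage looks roughly like (arguments illustrative):

    from swift.common.storage_policy import BindPortsCache

    cache = BindPortsCache('/etc/swift', bind_ip)
    ports = cache.all_bind_ports_for_node()  # re-checks ring mtimes first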
This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server". The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks. Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None. Then,
object-replicator and object-reconstructor both set self.bind_port to
None if servers_per_port is enabled. Bonus improvement for IPv6
addresses in is_local_device().
This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/
Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").
However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.
This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.
In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.
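Roughly (return values illustrative):

    from swift.common.utils import whataremyips

    whataremyips()             # all IPs, as before
    whataremyips('0.0.0.0')    # bind-to-everything: still all IPs
    whataremyips('::')         # same for IPv6
    whataremyips('127.0.0.2')  # a specific bind_ip: just ['127.0.0.2']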
Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.
For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md
Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md
If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md
DocImpact
Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
In the ring builder, we place partitions with maximum possible
dispersion across tiers, where a "tier" is region, then zone, then
IP/port, then device. Now, instead of IP/port, just use IP. The port
wasn't really getting us anything; two different object servers on two
different ports on one machine aren't separate failure
domains. However, if someone has only a few machines and is using one
object server on its own port per disk, then the ring builder would
end up with every disk in its own IP/port tier, resulting in bad (with
respect to durability) partition placement.
For example: assume 1 region, 1 zone, 4 machines, 48 total disks (12
per machine), and one object server (and hence one port) per
disk. With the old behavior, partition replicas will all go in the one
region, then the one zone, then pick one of 48 IP/port pairs, then
pick the one disk therein. This gives the same result as randomly
picking 3 disks (without replacement) to store data on; it completely
ignores machine boundaries.
With the new behavior, the replica placer will pick the one region,
then the one zone, then one of 4 IPs, then one of 12 disks
therein. This gives the optimal placement with respect to durability.
The same applies to Ring.get_more_nodes().
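Schematically, the dispersion tiers change like this (tuples illustrative):

    before: (region, zone, (ip, port), device_id)
    after:  (region, zone, ip, device_id)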
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: Ibbd740c51296b7e360845b5309d276d7383a3742
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.
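For example:

    it = iter(partitions)
    it.next()   # Python 2 only
    next(it)    # Python 2 and Python 3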
Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
This commit makes it possible to PUT an object into Swift and have it
stored using erasure coding instead of replication, and also to GET
the object back from Swift at a later time.
This works by splitting the incoming object into a number of segments,
erasure-coding each segment in turn to get fragments, then
concatenating the fragments into fragment archives. Segments are 1 MiB
in size, except the last, which is between 1 B and 1 MiB.
+====================================================================+
|                            object data                             |
+====================================================================+
                                  |
          +-----------------------+----------------------+
          |                       |                      |
          v                       v                      v
+===================+   +===================+     +==============+
|     segment 1     |   |     segment 2     | ... |  segment N   |
+===================+   +===================+     +==============+
          |                       |
          |                       |
          v                       v
     /=========\             /=========\
     | pyeclib |             | pyeclib |     ...
     \=========/             \=========/
          |                       |
          |                       |
          +--> fragment A-1       +--> fragment A-2
          |                       |
          |                       |
          +--> fragment B-1       +--> fragment B-2
          |                       |
          |                       |
         ...                     ...
Then, object server A gets the concatenation of fragment A-1, A-2,
..., A-N, so its .data file looks like this (called a "fragment archive"):
+=====================================================================+
| fragment A-1 | fragment A-2 | ... | fragment A-N |
+=====================================================================+
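The per-segment encode step maps onto pyeclib roughly like this (parameters
illustrative):

    from pyeclib.ec_iface import ECDriver

    ec = ECDriver(k=10, m=4, ec_type='liberasurecode_rs_vand')
    fragments = ec.encode(segment)       # k + m fragments per segment
    segment = ec.decode(fragments[:10])  # any k fragments recover the segment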
Since this means that the object server never sees the object data as
the client sent it, we have to do a few things to ensure data
integrity.
First, the proxy has to check the Etag if the client provided it; the
object server can't do it since the object server doesn't see the raw
data.
Second, if the client does not provide an Etag, the proxy computes it
and uses the MIME-PUT mechanism to provide it to the object servers
after the object body. Otherwise, the object would not have an Etag at
all.
Third, the proxy computes the MD5 of each fragment archive and sends
it to the object server using the MIME-PUT mechanism. With replicated
objects, the proxy checks that the Etags from all the object servers
match, and if they don't, returns a 500 to the client. This mitigates
the risk of data corruption in one of the proxy --> object connections,
and signals to the client when it happens. With EC objects, we can't
use that same mechanism, so we must send the checksum with each
fragment archive to get comparable protection.
On the GET path, the inverse happens: the proxy connects to a bunch of
object servers (M of them, for an M+K scheme), reads one fragment at a
time from each fragment archive, decodes those fragments into a
segment, and serves the segment to the client.
When an object server dies partway through a GET response, any
partially-fetched fragment is discarded, the resumption point is wound
back to the nearest fragment boundary, and the GET is retried with the
next object server.
GET requests for a single byterange work; GET requests for multiple
byteranges do not.
There are a number of things _not_ included in this commit. Some of
them are listed here:
* multi-range GET
* deferred cleanup of old .data files
* durability (daemon to reconstruct missing archives)
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
Prior to this commit, swift-ring-builder would place partitions on
devices by first going for maximal dispersion and breaking ties with
device weight. This commit flips the order so that device weight
trumps dispersion.
Note: if your ring can be balanced, you won't see a behavior
change. It's only when device weights and maximal-dispersion come into
conflict that this commit changes anything.
Example: a cluster with two regions. Region 1 has a combined weight of
1000, while region 2 has a combined weight of only 400. The ring has 3
replicas and 2^16 partitions.
Prior to this commit, the balance would look like so:
Region 1: 2 * 2^16 partitions
Region 2: 2^16 partitions
After this commit, the balance will be:
Region 1: 15/7 * 2^16 partitions (more than before)
Region 2: 6/7 * 2^16 partitions (fewer than before)
(The weights split 1000:400 = 5:2, so of the 3 * 2^16 partition replicas
region 1 gets 5/7 * 3 * 2^16 = 15/7 * 2^16.)
One consequence of this is that some partitions will not have a
replica in region 2, since it's not big enough to hold all of them.
This way, a cluster operator can add a new region to a single-region
cluster in a gradual fashion so as not to destroy their WAN link with
replication traffic. As device weights are increased in the second
region, more replicas will shift over to it. Once its weight is half
that of the first region's, every partition will have a replica there.
DocImpact
Change-Id: I945abcc4a2917bb12be554b640f7507dd23cd0da
Containers now have a storage policy index associated with them,
stored in the container_stat table. This index is only settable at
container creation time (PUT request), and cannot be changed without
deleting and recreating the container. This is because a container's
policy index will apply to all its objects, so changing a container's
policy index would require moving large amounts of object data
around. If a user wants to change the policy for data in a container,
they must create a new container with the desired policy and move the
data over.
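For example, the policy is chosen when the container is created (header
name as in Swift's API; policy name illustrative):

    PUT /v1/AUTH_test/new_container HTTP/1.1
    X-Storage-Policy: gold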
Keep status_changed_at up-to-date with status changes.
In particular during container recreation and replication.
When a container-server receives a PUT for a deleted database, an extra UPDATE
is issued against the container_stat table to notate the x-timestamp of the
request.
During replication if merge_timestamps causes a container's status to change
(from DELETED to ACTIVE or vice-versa) the status_changed_at field is set to
the current time.
Accurate reporting of status_changed_at is useful for container replication
forensics and allows resolution of "set on create" attributes like the
upcoming storage_policy_index.
Expose Backend container info on deleted containers.
Include basic container info in backend headers on 404 responses from the
container server. Default empty values are used as placeholders if the
database does not exist.
Specifically, the X-Backend-Status-Changed-At, X-Backend-DELETE-Timestamp and
X-Backend-Storage-Policy-Index values will be needed by the reconciler to
deal with reconciling out of order object writes in the face of recently
deleted containers.
* Add "status_changed_at" key to the response from ContainerBroker.get_info.
* Add "Status Timestamp" field to swift.cli.info.print_db_info_metadata.
* Add "status_changed_at" key to the response from AccountBroker.get_info.
DocImpact
Implements: blueprint storage-policies
Change-Id: Ie6d388f067f5b096b0f96faef151120ba23c8748
The use of NamedTemporaryFile creates rings with permissions 0600;
however, most installs probably generate the rings as root while the
swift-proxy runs as user swift.
Set the permissions on the generated ring to 0644 prior to rename so
that the swift user can read the rings.
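A sketch of the fix (illustrative; write_ring is a hypothetical stand-in
for serialization):

    import os
    from tempfile import NamedTemporaryFile

    tempf = NamedTemporaryFile(dir=os.path.dirname(path), delete=False)
    write_ring(tempf)                 # hypothetical serialization step
    tempf.flush()
    os.fchmod(tempf.fileno(), 0o644)  # world-readable before the rename
    os.rename(tempf.name, path)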
Change-Id: Ia511931f471c5c9840012c3a75b89c1f35b1b245
Closes-Bug: #1302700
A new utils function, validate_hash_conf, allows you to programmatically
reload swift.conf and the hash path global vars HASH_PATH_SUFFIX and
HASH_PATH_PREFIX when they are invalid.
When you load swift.common.utils before you have a swift.conf there's no good
way to force a re-read of swift.conf and repopulate the hash path config
options - short of restarting the process or reloading the module - both of
which are hard to unittest. This should be no worse in general and in some
cases easier.
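Usage is simply (sketch):

    from swift.common.utils import validate_hash_conf

    # re-reads swift.conf and repopulates HASH_PATH_PREFIX and
    # HASH_PATH_SUFFIX, complaining if they are still invalid
    validate_hash_conf()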
Change-Id: I1ff22c5647f127f65589762b3026f82c9f9401c1
The replica placement algorithm works on regions, then zones, then
IP/port, then device ID. The handoff algorithm worked on regions, then
zones, then device ID, completely skipping IP/port. It's now been
updated to take IP/port into consideration.
This means you get one handoff on each machine in the cluster before
you start getting handoffs that share a machine with a previous
one. In small clusters, this can help with durability.
Because this is performance-critical code, here are some quick
benchmark results:
Run time averages over 25000 trials on a 1200-device ring (20 part
power, 3 replicas, 2 regions, 10 zones, 120 nodes):
                   |   master    |   branch
===================+=============+============
get 1 more node    |  2.727e-05  |  3.076e-05
get 6 more nodes   |  3.55e-05   |  4.214e-05
get all more nodes |  0.002247   |  0.002691
There's a small slowdown from the additional bookkeeping, but nothing
too awful. The time to get 6 more nodes (for handoff checks on 404,
it's 2x replica count by default, hence 6) went from 35 to 42
microseconds, so it remains small.
Change-Id: Ie7da4dfcb0fcf1a38e2fb13f60c204540fadbf06
This is to support upgrades from swift < 1.8 using old-style pickled
rings to 1.10. Old-style pickled rings won't have region information.
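A sketch of the compatibility shim (illustrative):

    # when loading an old-style pickled ring, default missing regions
    for dev in ring_data.devs:
        if dev is not None and 'region' not in dev:
            dev['region'] = 1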
Change-Id: I18b2acba3d346e41def9d25d3d4dbd12705e5375
Closes-Bug: #1248919
This reverts commit 7760f41c3ce436cb23b4b8425db3749a3da33d32
Change-Id: I95e57a2563784a8cd5e995cc826afeac0eadbe62
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Place all the methods related to on-disk layout and / or configuration
into a new common module that can be shared by the various modules
using the same on-disk layout.
Change-Id: I27ffd4665d5115ffdde649c48a4d18e12017e6a9
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Support a separate replication IP address:
- Added a new function in utils that provides the ability to select a
  separate IP address for the replication service.
- Changed db_replicator and the object replicators; the replication
  process now uses the new function.
Replication network parameters:
- Added support for the replication network fields (replication_ip,
  replication_port) to the device dictionary in the swift-ring-builder
  script.
- Made changes to support the new fields in the search, show and
  set_info functions.
Implementation of replication servers:
- Separate replication servers use the same code as normal replication
servers, but with replication_server parameter = True. When using a
separate replication network, the non-replication servers set
replication_server = False. When there is no separate replication
network (the default case), replication_server is not included in the config.
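With the new fields, adding a device with a dedicated replication endpoint
looks something like (values illustrative):

    $ swift-ring-builder object.builder add \
        r1z1-192.168.1.1:6000R10.0.0.1:6000/sda 100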
DocImpact
Change-Id: Ie9af5bdcdf9241c355e36053ca4adfe49dc35bd0
Implements: blueprint dedicated-replication-network
Extensive refactor here to consolidate what nodes are contacted for
any request. This consolidation means reads will contact the same set
of nodes that writes would, giving a very good chance that
read-your-write behavior will succeed. This also means that writes
will not necessarily try all nodes in the cluster as it would
previously, which really wasn't desirable anyway. (If you really want
that, you can set request_node_count to a really big number, but
understand that also means reads will contact every node looking for
something that might not exist.)
* Added a request_node_count proxy-server conf value that allows
control of how many nodes are contacted for a normal request.
In proxy.controllers.base.Controller:
* Got rid of error_increment since it was only used in one spot by
another method and just served to confuse.
* Made error_occurred also log the device name.
* Made error_limit require an error message and also documented a bit
better.
* Changed iter_nodes to just take a ring and a partition and yield
all the nodes itself so it could control the number of nodes used
in a given request. Also happens to consolidate where sort_nodes is
called.
* Updated account_info and container_info to use all nodes from
iter_nodes and to call error_occurred appropriately.
* Updated GETorHEAD_base to not track attempts on its own and just
stop when iter_nodes tells it to stop. Also, it doesn't take the
nodes to contact anymore; instead it takes the ring and gets the
nodes from iter_nodes itself.
Elsewhere:
* Ring now has a get_part method.
* Made changes to reflect all of the above.
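A rough sketch of the reshaped iter_nodes (not the exact code):

    import itertools

    def iter_nodes(self, ring, partition):
        primaries = self.app.sort_nodes(ring.get_part_nodes(partition))
        nodes_left = self.app.request_node_count(ring)  # signature illustrative
        for node in itertools.chain(primaries, ring.get_more_nodes(partition)):
            if not self.error_limited(node):
                yield node
                nodes_left -= 1
                if nodes_left <= 0:
                    return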
Change-Id: I37f76c99286b6456311abf25167cd0485bfcafac
A new configuration parameter is added to /etc/swift/swift.conf
[swift-hash]
swift_hash_path_prefix = 'random unique string'
New installations are advised to set this parameter to a random secret,
which would not be disclosed outside the organization.
The same secret needs to be used by all swift servers of the same cluster.
Existing installations should set this parameter to an empty string
(the default).
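The prefix (like the existing suffix) is mixed into every path hash,
roughly:

    from hashlib import md5

    # sketch of how hash_path incorporates the new prefix
    key = HASH_PATH_PREFIX + '/account/container/object' + HASH_PATH_SUFFIX
    partition_hash = md5(key).hexdigest()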
DocImpact
Fixes: Bug #1157454
Change-Id: I63b10d0b7d6dd3f74e0f10bb41b5f240fa03578a
The region is one level above the zone; it is intended to represent a
chunk of machines that is distant from others with respect to
bandwidth and latency.
Old rings will default to having all their devices in region 1. Since
everything is in the same region by default, the ring builder will
simply distribute across zones as it did before, so your partition
assignment won't move because of this change. If you start adding
devices in other regions, of course, the assignment will change to
take that into account.
swift-ring-builder still accepts the same syntax as before, but will
default added devices to region 1 if no region is specified.
Examples:
$ swift-ring-builder foo.builder add r2z1-1.2.3.4:555/sda
$ swift-ring-builder foo.builder add r1z3-1.2.3.4:555/sda
$ swift-ring-builder foo.builder add z3-1.2.3.4:555/sda
Also, some updates to ring-overview doc.
Change-Id: Ifefbb839cdcf033e6c9201fadca95224c7303a29
Before, you'd get your 3* primary nodes in 3 different zones, and then
get_more_nodes would give you everything it could from a non-primary
zone, and then finish up with stuff from the primary zones. It would
sort of look like this:
P: device in a primary node's zone
N: device not in a primary node's zone
PPPNNNNNNNNNNNNNNNNNNN...NNNNNNNNNPPP...PPPPPP
(The first three Ps are the primary nodes; they don't actually come
out of get_more_nodes(), but they're included for clarity.)
Now, the first few handoffs from get_more_nodes are in non-primary
zones, but only one per zone, and then the rest of the handoffs ignore
zones. It's still sampling the ring, so it's still taking weights into
consideration, but the zone distribution is more even early in the
handoff chain. It looks like this, assuming 10 zones:
P: device in a primary node's zone
N: device not in a primary node's zone
D: zone doesn't matter
PPPNNNNNNNDDDDDDDDDDD...DDD
* or whatever your replica count is
Change-Id: I31d2a2bc2cd6038386a2df85cd4fa37ccf2f650e
The handoff nodes will try to be in zones other than the primary
zones, will take into account the device weights, and will usually
keep the same sequences of handoffs even with ring changes.
On a real ring test the old get_more_nodes placed data mostly evenly
across zones, which is a problem for differently weighted zones. But
the real problem was that the extra partitions given to each device
was 0% to 0.77% with only 46.05% of the candidate devices getting
anything. Some of the devices increased in effective weight over 50%
in the test.
The new get_more_nodes placed closer to what the zone weights were
and the extra partitions given to each device was 0% to 0.24% with
90.58% of the candidate devices getting something. The worst off
device only increased in effective weight by 10.71%.
Change-Id: Iffb133a22db69074acaa2b90854cbfa92e4c2b9e
The (account|container|object).ring.gz files contain, among other
things, a JSON-encoded dictionary. This change simply makes the JSON
serializer sort the keys of that dictionary so that two
Python-identical rings will result in two bytewise-identical ring
files. Also, to get repeatable compression, we lock down the timestamp
in the gzip output stream to a fixed value. (There's a timestamp value
in a gzip stream header; by default, gzip.GzipFile sticks time.time()
in there.)
This only works on Python 2.7; on 2.6, the 'mtime' argument to
gzip.GzipFile() is unsupported. Don't worry, serialization still works
on 2.6. It just doesn't always produce the same bytes for the same
ring.
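A sketch of both tricks together (the constant mtime value is illustrative):

    import gzip
    import json

    with open('object.ring.gz', 'wb') as fp:
        gz = gzip.GzipFile(fileobj=fp, mode='wb', mtime=0)  # fixed timestamp
        gz.write(json.dumps(ring_meta, sort_keys=True))     # stable key order
        gz.close()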
Change-Id: Ide446413d0aeb78536883933fd0caf440b8f54ad
Serialize RingData in a versioned, custom format which is a combination
of a JSON-encoded header and .tostring() dumps of the
replica2part2dev_id arrays. This format deserializes hundreds of times
faster than rings serialized with Python 2.7's pickle (a significant
performance regression for ring loading between Python 2.6 and Python
2.7). Fixes bug 1031954.
swift.common.ring.ring.RingData is now responsible for serialization and
deserialization of its data via a new load() class method and save()
object method. The new implementation is backward-compatible; if a ring
does not begin with a new-style magic string, it is assumed to be an
old-style pickle-dumped ring and is handled as before. So new Swift
code can read old rings, but old Swift code will not be able to read
newly-serialized rings. THIS SHOULD BE MENTIONED PROMINENTLY IN THE
RELEASE NOTES.
I didn't want to bite off more than necessary, so I didn't mess with
builder file serialization.
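The new entry points (sketch; path illustrative):

    from swift.common.ring import RingData

    ring_data = RingData.load('/etc/swift/object.ring.gz')  # old or new format
    ring_data.save('/etc/swift/object.ring.gz')             # always new format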
Change-Id: I799b9a4c894d54fb16592443904ac055b2638e2d
This commit introduces a new algorithm for assigning partition
replicas to devices. Basically, the ring builder organizes the devices
into tiers (first zone, then IP/port, then device ID). When placing a
replica, the ring builder looks for the emptiest device (biggest
parts_wanted) in the furthest-away tier.
In the case where zone-count >= replica-count, the new algorithm will
give the same results as the one it replaces. Thus, no migration is
needed.
In the case where zone-count < replica-count, the new algorithm
behaves differently from the old algorithm. The new algorithm will
distribute things evenly at each tier so that the replication is as
high-quality as possible, given the circumstances. The old algorithm
would just crash, so again, no migration is needed.
Handoffs have also been updated to use the new algorithm. When
generating handoff nodes, first the ring looks for nodes in other
zones, then other ips/ports, then any other drive. The first handoff
nodes (the ones in other zones) will be the same as before; this
commit just extends the list of handoff nodes.
The proxy server and replicators have been altered to avoid looking at
the ring's replica count directly. Previously, with a replica count of
C, RingData.get_nodes() and RingData.get_part_nodes() would return
lists of length C, so some other code used the replica count when it
needed the number of nodes. If two of a partition's replicas are on
the same device (e.g. with 3 replicas, 2 devices), then that
assumption is no longer true. Fortunately, all the proxy server and
replicators really needed was the number of nodes returned, which they
already had. (Bonus: now the only code that mentions replica_count
directly is in the ring and the ring builder.)
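Schematically (tier tuples illustrative):

    # a device lives at the bottom of a nested tier hierarchy
    tiers_of_dev = [
        (zone,),
        (zone, (ip, port)),
        (zone, (ip, port), dev_id),
    ]
    # placement walks from the furthest-away tier toward the device,
    # preferring the subtree whose devices have the biggest parts_wanted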
Change-Id: Iba2929edfc6ece89791890d0635d4763d821a3aa
To make sure that node lookups match what the servers return, the generated
hashes need to match. All the utils that use the ring should validate their
HASH_PATH_SUFFIX.