53 Commits

Author SHA1 Message Date
Tim Burke
642f79965a py3: port common/ring/ and common/utils.py
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.

While we're at it (and since they were so close anyway), port

* cli/ringbuilder.py,
* common/linkat.py, and
* common/daemon.py

Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
2018-02-12 06:42:24 +00:00
Tim Burke
89a5c9d56f Disallow fractional replicas in EC policies
Change-Id: I873d7bf7de54e4b1dccdafc8a61f03c09a65dfbc
Closes-Bug: 1554391
Closes-Bug: 1677547
2018-01-29 16:57:46 -08:00
Samuel Merritt
9c97c80b26 Clean up a couple hand-rolled mocks.
Change-Id: I6582985990e8b5e3a0c65bce5d3bb1e39d58dfb9
2017-10-20 15:31:07 -07:00
Tim Burke
0d4d940243 Make test_get_more_nodes_with_zero_weight_region more robust
...by using a random ring, looking at *all* partitions, and making
assertions about the distribution of how many times we have to call
next().

Change-Id: Ia5feb9396d4bf6fd35f16bbc5280e63022ed2c47
2017-08-11 09:53:20 -07:00
Alistair Coles
7fa9d6e5b4 Unset random seed after rebalancing ring
Unit tests use the random module in places to randomise
test inputs, but once the tests in test_builder.py or
test_ring_builder_analyzer.py have been run the random
seed is left in a repeatable state because calls are made
to RingBuilder.rebalance with a seed value. Consequently,
subsequent calls to random in other tests get repeatable
results from one test run to another.

This patch resets the state of the random module before
returning from RingBuilder.rebalance.

Closes-Bug: #1639755
Change-Id: I4b74030afc654e60452e65b3e0f1b45a189c16e3
2017-08-08 22:34:04 +00:00
Drew Balfour
1ec6e2bb0a add byteorder information and logic to ring files
On-disk, serialized ring files are byteorder dependent, which makes them
unportable between different endian architectures. Add a field to the
ring dictionary in the file indicating the byteorder used to generate
the file, and then byteswap if necessary when deserializing the file.

This patch only allows newly created ring files to be byteorder
agnostic. Previously generated ring files will still fail on different
endian architectures, and will need to be regenerated with this patch.

Change-Id: I23b5e0a8082b30ca257aeb1fab03ab74e6f0b2d4
Closes-Bug: #1639980
2016-11-29 09:07:46 -08:00
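
A minimal sketch of the byteswap handling the commit above describes, assuming a hypothetical 'byteorder' field in the ring metadata and array('H') rows for the part-to-device table (illustrative only, not Swift's actual loader):

    import array
    import sys

    def read_part2dev_row(raw_bytes, ring_byteorder):
        # Deserialize one replica2part2dev row and byteswap it when the
        # ring was written on a machine with different endianness.
        row = array.array('H')          # device ids as unsigned shorts
        row.frombytes(raw_bytes)        # row.fromstring(raw_bytes) on py2
        if ring_byteorder != sys.byteorder:
            row.byteswap()              # make ids native-endian in memory
        return row
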
Gábor Antal
77d6d015f6 Use more specific asserts in test/unit/common/ring
Replaced generic asserts with more specific assert methods,
e.g. assertTrue(sth == None) became assertIsNone(sth),
assertTrue(isinstance(inst, type)) became assertIsInstance(inst, type),
and assertTrue(not sth) became assertFalse(sth).

The code gets more readable, and a better description will be shown on failure.

Change-Id: I9531c9939aa7c2dac127b5dc865b8d396dab318f
2016-07-15 13:28:31 +00:00
Shashirekha Gundur
cf48e75c25 change default ports for servers
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202,
so they do not conflict with X-Windows or other services.

Updated SAIO docs.

DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
2016-04-29 14:47:38 -04:00
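
For illustration, a hedged snippet of adding devices on the new default object-server port with the ring builder (the device dictionary values below are made up for this example):

    from swift.common.ring import RingBuilder

    builder = RingBuilder(10, 3, 1)     # part_power, replicas, min_part_hours
    for i in range(3):
        builder.add_dev({'region': 1, 'zone': 1,
                         'ip': '10.0.0.%d' % (i + 1),
                         'port': 6200,              # object-server default, was 6000
                         'device': 'sdb1', 'weight': 100.0})
    builder.rebalance()
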
Jenkins
5ec586bd34 Merge "add test for zero weight region get_more_nodes" 2016-01-28 12:50:58 +00:00
Clay Gerrard
165fa1fd40 add test for zero weight region get_more_nodes
Change-Id: If537981e8deadd9c3528dcb30a15011c7781e334
2016-01-15 11:44:21 -08:00
Samuel Merritt
5d449471b1 Remove some Python 2.6 leftovers
Change-Id: I798d08722c90327c66759aa0bb4526851ba38d41
2016-01-14 17:26:01 -08:00
Clay Gerrard
7035639dfd Put part-replicas where they go
It's harder than it sounds.  There were really three challenges.

Challenge #1 Initial Assignment
===============================

Before starting to assign parts on this new shiny ring you've
constructed, maybe we'll pause for a moment up front and consider the
lay of the land.  This process is called the replica_plan.

The replica_plan approach separates part assignment failures into
two modes:

 1) we considered the cluster topology and its weights and came up with
    the wrong plan

 2) we failed to execute on the plan

I failed at both parts plenty of times before I got it this close.  I'm
sure a counter example still exists, but when we find it the new helper
methods will let us reason about where things went wrong.

Challenge #2 Fixing Placement
=============================

With a sound plan in hand, the less material you have to execute with,
the easier it is to fail to execute on it - so we gather up as many parts
as we can, as long as we think we can find them a better home.

Picking the right parts for gather is a black art - when you notice a
balance is slow it's because it's spending so much time iterating over
replica2part2dev trying to decide just the right parts to gather.

The replica plan can help at least in the gross dispersion collection to
gather up the worst offenders first before considering balance.  I think
trying to avoid picking up parts that are stuck to the tier before
falling into a forced grab on anything over parts_wanted helps with
stability generally - but depending on where the parts_wanted are in
relation to the full devices it's pretty easy to pick up something that'll
end up really close to where it started.

I tried to break the gather methods into smaller pieces so it looked
like I knew what I was doing.

Going with a MAXIMUM gather iteration instead of balance (which doesn't
reflect the replica_plan) doesn't seem to be costing me anything - most
of the time the exit condition is either solved or all the parts overly
aggressively locked up on min_part_hours.  So far, it mostly seems that if
the thing is going to balance this round it'll get it in the first
couple of shakes.

Challenge #3 Crazy replica2part2dev tables
==========================================

I think there's lots of ways "scars" can build up in a ring which can
result in very particular replica2part2dev tables that are physically
difficult to dig out of.  It's repairing these scars that will take
multiple rebalances to resolve.

... but at this point ...

... lacking a counter example ...

I've been able to close up all the edge cases I was able to find.  It
may not be quick, but progress will be made.

Basically my strategy just required a better understanding of how
previous algorithms were able to *mostly* keep things moving by brute
forcing the whole mess with a bunch of randomness.  Then when we detect
our "elegant" careful part selection isn't making progress - we can fall
back to the same old tricks.

Validation
==========

We validate against duplicate part replica assignment after rebalance
and raise an ERROR if we detect more than one replica of a part assigned
to the same device.

In order to meet that requirement we have to have as many devices as
replicas, so attempting to rebalance with too few devices without changing
your replica_count is also an ERROR, not a warning.

Random Thoughts
===============

As usual with rings, the test diff can be hard to reason about -
hopefully I've added enough comments to assure future me that these
assertions make sense.

Despite being a large rewrite of a lot of important code, the existing
code is known to have failed us.  This change fixes a critical bug that's
trivial to reproduce in a critical component of the system.

There's probably a bunch of error messages and exit status stuff that's
not as helpful as it could be considering the new behaviors.

Change-Id: I1bbe7be38806fc1c8b9181a722933c18a6c76e05
Closes-Bug: #1452431
2015-12-07 16:06:42 -08:00
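
The "Validation" step in the commit above can be illustrated with a small, self-contained check (not Swift's actual validator, just the duplicate-replica idea):

    from collections import Counter

    def find_duplicate_assignments(replica2part2dev):
        # replica2part2dev is a list of rows, one per replica;
        # row[part] is the device id holding that replica of that part.
        duplicates = []
        for part in range(len(replica2part2dev[0])):
            devs = Counter(row[part] for row in replica2part2dev)
            duplicates.extend((part, dev_id)
                              for dev_id, n in devs.items() if n > 1)
        return duplicates

    # e.g. replicas 0 and 1 of part 2 both landed on device 7:
    print(find_duplicate_assignments([[3, 5, 7, 1], [4, 6, 7, 2], [8, 9, 0, 3]]))
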
janonymous
1882801be1 pep8 fix: assertNotEquals -> assertNotEqual
assertNotEquals is deprecated in py3

Change-Id: Ib611351987bed1199fb8f73a750955a61d022d0a
2015-10-12 07:40:07 +00:00
janonymous
f5f9d791b0 pep8 fix: assertEquals -> assertEqual
assertEquals is deprecated in py3, replacing it.

Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
2015-10-11 12:57:25 +02:00
Victor Stinner
d719064e78 Fix pep8 E128 warning of hacking 0.10
Fix the warning E128: "continuation line under-indented for visual
indent" of pep8.

Change-Id: Ie6c6ae341fe3d6281f2095c1d756d552fa5937f9
2015-07-30 09:33:41 +02:00
janonymous
c907107fe4 cPickle is deprecated in py3, replacing it with six.moves
cPickle is deprecated and should be replaced with six.moves
to provide py2 and py3 compatibility.

Change-Id: Ibad990708722360d188c641e61444d50a16a1e93
2015-07-07 22:46:37 +05:30
Victor Stinner
e5c962a28c Replace xrange() with six.moves.range()
Patch generated by the xrange operation of the sixer tool:
https://pypi.python.org/pypi/sixer

Manual changes:

* Fix indentation for pep8 checks
* Fix TestGreenthreadSafeIterator.test_access_is_serialized of
  test.unit.common.test_utils:
  replace range(1, 11) with list(range(1, 11))
* Fix UnsafeXrange docstring, revert change

Change-Id: Icb7e26135c5e57b5302b8bfe066b33cafe69fe4d
2015-06-23 07:29:15 +00:00
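
The pattern of the sixer rewrite above, as a short hedged example:

    from six.moves import range   # xrange on Python 2, builtin range on Python 3

    # Where the old code needed an actual list, wrap the iterator explicitly,
    # as in the test_utils fix: list(range(1, 11)) instead of range(1, 11).
    first_ten = list(range(1, 11))
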
Darrell Bishop
df134df901 Allow 1+ object-servers-per-disk deployment
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs.  The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring.  In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket.  The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk.  The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring.  The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).

Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM.  Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.

The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True.  This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.

A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses).  The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.

This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server".  The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks.  Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None.  Then,
object-replicator and object-reconstructor both set self.bind_port to
None if server_per_port is enabled.  Bonus improvement for IPv6
addresses in is_local_device().

This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/

Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").

However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.

This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.

In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.

Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.

For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md

Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md

If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md

DocImpact

Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
2015-06-18 12:43:50 -07:00
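
A short example of the header_only loading the commit above adds (the ring path is just an example; RingData otherwise behaves as before):

    from swift.common.ring.ring import RingData

    # Load only the ring metadata (devs, part shift, ...), skipping the large
    # replica2part2dev_id arrays -- what the parent process above wants.
    ring_meta = RingData.load('/etc/swift/object.ring.gz', header_only=True)
    ports_in_ring = {dev['port'] for dev in ring_meta.devs if dev}
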
Jenkins
65fce49b3b Merge "Use just IP, not port, when determining partition placement" 2015-06-18 18:10:17 +00:00
Samuel Merritt
af72881d1d Use just IP, not port, when determining partition placement
In the ring builder, we place partitions with maximum possible
dispersion across tiers, where a "tier" is region, then zone, then
IP/port, then device. Now, instead of IP/port, just use IP. The port
wasn't really getting us anything; two different object servers on two
different ports on one machine aren't separate failure
domains. However, if someone has only a few machines and is using one
object server on its own port per disk, then the ring builder would
end up with every disk in its own IP/port tier, resulting in bad (with
respect to durability) partition placement.

For example: assume 1 region, 1 zone, 4 machines, 48 total disks (12
per machine), and one object server (and hence one port) per
disk. With the old behavior, partition replicas will all go in the one
region, then the one zone, then pick one of 48 IP/port pairs, then
pick the one disk therein. This gives the same result as randomly
picking 3 disks (without replacement) to store data on; it completely
ignores machine boundaries.

With the new behavior, the replica placer will pick the one region,
then the one zone, then one of 4 IPs, then one of 12 disks
therein. This gives the optimal placement with respect to durability.

The same applies to Ring.get_more_nodes().

Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>

Change-Id: Ibbd740c51296b7e360845b5309d276d7383a3742
2015-06-17 11:31:55 -07:00
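
An illustrative (non-Swift) helper showing the tier hierarchy after this change, where the third level is the IP alone rather than IP/port:

    def tiers_for_dev_sketch(dev):
        return (
            (dev['region'],),
            (dev['region'], dev['zone']),
            (dev['region'], dev['zone'], dev['ip']),             # port dropped
            (dev['region'], dev['zone'], dev['ip'], dev['id']),
        )

    # Two object servers on one machine, different ports: same failure domain now.
    a = {'region': 1, 'zone': 1, 'ip': '10.0.0.1', 'port': 6010, 'id': 0}
    b = {'region': 1, 'zone': 1, 'ip': '10.0.0.1', 'port': 6020, 'id': 1}
    assert tiers_for_dev_sketch(a)[2] == tiers_for_dev_sketch(b)[2]
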
janonymous
09e7477a39 Replace it.next() with next(it) for py3 compat
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.

Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
2015-06-15 22:10:45 +05:30
Samuel Merritt
decbcd24d4 Foundational support for PUT and GET of erasure-coded objects
This commit makes it possible to PUT an object into Swift and have it
stored using erasure coding instead of replication, and also to GET
the object back from Swift at a later time.

This works by splitting the incoming object into a number of segments,
erasure-coding each segment in turn to get fragments, then
concatenating the fragments into fragment archives. Segments are 1 MiB
in size, except the last, which is between 1 B and 1 MiB.

+====================================================================+
|                             object data                            |
+====================================================================+

                                   |
          +------------------------+----------------------+
          |                        |                      |
          v                        v                      v

+===================+    +===================+         +==============+
|     segment 1     |    |     segment 2     |   ...   |   segment N  |
+===================+    +===================+         +==============+

          |                       |
          |                       |
          v                       v

     /=========\             /=========\
     | pyeclib |             | pyeclib |         ...
     \=========/             \=========/

          |                       |
          |                       |
          +--> fragment A-1       +--> fragment A-2
          |                       |
          |                       |
          |                       |
          |                       |
          |                       |
          +--> fragment B-1       +--> fragment B-2
          |                       |
          |                       |
         ...                     ...

Then, object server A gets the concatenation of fragment A-1, A-2,
..., A-N, so its .data file looks like this (called a "fragment archive"):

+=====================================================================+
|     fragment A-1     |     fragment A-2     |  ...  |  fragment A-N |
+=====================================================================+

Since this means that the object server never sees the object data as
the client sent it, we have to do a few things to ensure data
integrity.

First, the proxy has to check the Etag if the client provided it; the
object server can't do it since the object server doesn't see the raw
data.

Second, if the client does not provide an Etag, the proxy computes it
and uses the MIME-PUT mechanism to provide it to the object servers
after the object body. Otherwise, the object would not have an Etag at
all.

Third, the proxy computes the MD5 of each fragment archive and sends
it to the object server using the MIME-PUT mechanism. With replicated
objects, the proxy checks that the Etags from all the object servers
match, and if they don't, returns a 500 to the client. This mitigates
the risk of data corruption in one of the proxy --> object connections,
and signals to the client when it happens. With EC objects, we can't
use that same mechanism, so we must send the checksum with each
fragment archive to get comparable protection.

On the GET path, the inverse happens: the proxy connects to a bunch of
object servers (M of them, for an M+K scheme), reads one fragment at a
time from each fragment archive, decodes those fragments into a
segment, and serves the segment to the client.

When an object server dies partway through a GET response, any
partially-fetched fragment is discarded, the resumption point is wound
back to the nearest fragment boundary, and the GET is retried with the
next object server.

GET requests for a single byterange work; GET requests for multiple
byteranges do not.

There are a number of things _not_ included in this commit. Some of
them are listed here:

 * multi-range GET

 * deferred cleanup of old .data files

 * durability (daemon to reconstruct missing archives)

Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
2015-04-14 00:52:17 -07:00
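
A hedged sketch of the per-segment encode/decode cycle described above, using pyeclib (the k/m values and ec_type here are examples, not Swift defaults):

    from pyeclib.ec_iface import ECDriver

    driver = ECDriver(k=4, m=2, ec_type='liberasurecode_rs_vand')

    segment = b'x' * 1048576              # one 1 MiB segment of object data
    fragments = driver.encode(segment)    # k + m fragments for this segment

    # On GET, any k fragments are enough to rebuild the segment.
    assert driver.decode(fragments[:4]) == segment
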
Samuel Merritt
6d77c379bd Let admins add a region without melting their cluster
Prior to this commit, swift-ring-builder would place partitions on
devices by first going for maximal dispersion and breaking ties with
device weight. This commit flips the order so that device weight
trumps dispersion.

Note: if your ring can be balanced, you won't see a behavior
change. It's only when device weights and maximal-dispersion come into
conflict that this commit changes anything.

Example: a cluster with two regions. Region 1 has a combined weight of
1000, while region 2 has a combined weight of only 400. The ring has 3
replicas and 2^16 partitions.

Prior to this commit, the balance would look like so:

  Region 1: 2 * 2^16 part-replicas
  Region 2: 1 * 2^16 part-replicas

After this commit, the balance will be:

  Region 1: 10/14 * 3 * 2^16 part-replicas  (more than before)
  Region 2: 4/14 * 3 * 2^16 part-replicas  (fewer than before)

One consequence of this is that some partitions will not have a
replica in region 2, since it's not big enough to hold all of them.

This way, a cluster operator can add a new region to a single-region
cluster in a gradual fashion so as not to destroy their WAN link with
replication traffic. As device weights are increased in the second
region, more replicas will shift over to it. Once its weight is half
that of the first region's, every partition will have a replica there.

DocImpact

Change-Id: I945abcc4a2917bb12be554b640f7507dd23cd0da
2014-08-21 08:56:29 -07:00
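
Working the example in the commit above through numerically (a quick sketch, assuming strictly weight-proportional placement of the 3 * 2^16 part-replicas):

    part_count = 2 ** 16
    replicas = 3
    weights = {'region 1': 1000, 'region 2': 400}
    total_weight = sum(weights.values())

    for region, weight in weights.items():
        share = replicas * part_count * weight / float(total_weight)
        print('%s: %.2f * 2^16 part-replicas' % (region, share / part_count))

    # region 1: 2.14 * 2^16 part-replicas   (10/14 of the total; more than 2 * 2^16)
    # region 2: 0.86 * 2^16 part-replicas   (4/14 of the total; fewer than 1 * 2^16)
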
Clay Gerrard
4321bb0af6 Add Storage Policy support to Containers
Containers now have a storage policy index associated with them,
stored in the container_stat table. This index is only settable at
container creation time (PUT request), and cannot be changed without
deleting and recreating the container. This is because a container's
policy index will apply to all its objects, so changing a container's
policy index would require moving large amounts of object data
around. If a user wants to change the policy for data in a container,
they must create a new container with the desired policy and move the
data over.

Keep status_changed_at up-to-date with status changes.

In particular during container recreation and replication.

When a container-server receives a PUT for a deleted database, an extra UPDATE
is issued against the container_stat table to record the x-timestamp of the
request.

During replication if merge_timestamps causes a container's status to change
(from DELETED to ACTIVE or vice-versa) the status_changed_at field is set to
the current time.

Accurate reporting of status_changed_at is useful for container replication
forensics and allows resolution of "set on create" attributes like the
upcoming storage_policy_index.

Expose Backend container info on deleted containers.

Include basic container info in backend headers on 404 responses from the
container server.  Default empty values are used as placeholders if the
database does not exist.

Specifically the X-Backend-Status-Changed-At, X-Backend-DELETE-Timestamp and
the X-Backend-Storage-Policy-Index value will be needed by the reconciler to
deal with reconciling out of order object writes in the face of recently
deleted containers.

 * Add "status_changed_at" key to the response from ContainerBroker.get_info.
 * Add "Status Timestamp" field to swift.cli.info.print_db_info_metadata.
 * Add "status_changed_at" key to the response from AccountBroker.get_info.

DocImpact
Implements: blueprint storage-policies
Change-Id: Ie6d388f067f5b096b0f96faef151120ba23c8748
2014-06-18 17:31:38 -07:00
James Page
b9b5fef89a Set permissions on generated ring files
The use of NamedTemporaryFile creates rings with permissions 0600;
however, most installs generate the rings as root while the
swift-proxy runs as user swift.

Set the permissions on the generated ring to 0644 prior to rename so
that the swift user can read the rings.

Change-Id: Ia511931f471c5c9840012c3a75b89c1f35b1b245
Closes-Bug: #1302700
2014-04-05 16:54:11 +01:00
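
A sketch of the fix above, assuming the ring was serialized to a temporary file first (the paths in the usage comment are examples):

    import os

    def publish_ring(tmp_path, final_path):
        # Make the freshly written ring world-readable before atomically
        # renaming it into place; NamedTemporaryFile left it at 0600.
        os.chmod(tmp_path, 0o644)
        os.rename(tmp_path, final_path)

    # e.g. publish_ring('/etc/swift/object.ring.gz.tmp', '/etc/swift/object.ring.gz')
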
Peter Portante
ca87827db9 Use a tempfile.mkdtemp() based temporary directory
Change-Id: Ie0e8137615348b130b67323d0d2913dc5ebfd5fb
2014-01-19 16:29:45 -05:00
Clay Gerrard
bad52f1121 Allow programmatic reloading of Swift hash path config
The new utils function validate_hash_conf allows you to programmatically reload
swift.conf and the hash path global vars HASH_PATH_SUFFIX and HASH_PATH_PREFIX
when they are invalid.

When you load swift.common.utils before you have a swift.conf there's no good
way to force a re-read of swift.conf and repopulate the hash path config
options - short of restarting the process or reloading the module - both of
which are hard to unittest.  This should be no worse in general and in some
cases easier.

Change-Id: I1ff22c5647f127f65589762b3026f82c9f9401c1
2014-01-10 16:17:44 -08:00
Samuel Merritt
68db481ae5 Update handoff algorithm to use IP/port pairs
The replica placement algorithm works on regions, then zones, then
IP/port, then device ID. The handoff algorithm worked on regions, then
zones, then device ID, completely skipping IP/port. It's now been
updated to take IP/port into consideration.

This means you get one handoff on each machine in the cluster before
you start getting handoffs that share a machine with a previous
one. In small clusters, this can help with durability.

Because this is performance-critical code, here are some quick
benchmark results:

Run time averages over 25000 trials on a 1200-device ring (20 part
power, 3 replicas, 2 regions, 10 zones, 120 nodes):

                   |   master    |   branch
===================+=============+============
get 1 more node    |  2.727e-05  |  3.076e-05
get 6 more nodes   |  3.55e-05   |  4.214e-05
get all more nodes |  0.002247   |  0.002691

There's a small slowdown from the additional bookkeeping, but nothing
too awful. The time to get 6 more nodes (for handoff checks on 404,
it's 2x replica count by default, hence 6) went from 35 to 42
microseconds, so it remains small.

Change-Id: Ie7da4dfcb0fcf1a38e2fb13f60c204540fadbf06
2013-12-02 17:41:38 -08:00
Juan J. Martinez
845b8beeb5 Default region when loading an old-style pickled ring
This is to support upgrades from swift < 1.8 using old-style pickled
rings to 1.10. Old-style pickled rings won't have region information.

Change-Id: I18b2acba3d346e41def9d25d3d4dbd12705e5375
Closes-Bug: #1248919
2013-11-07 15:55:39 +00:00
Peter Portante
9411a24ba7 Revert "Refactor common/utils methods to common/ondisk"
This reverts commit 7760f41c3ce436cb23b4b8425db3749a3da33d32

Change-Id: I95e57a2563784a8cd5e995cc826afeac0eadbe62
Signed-off-by: Peter Portante <peter.portante@redhat.com>
2013-10-07 17:18:09 -04:00
ZhiQiang Fan
f72704fc82 Change OpenStack LLC to Foundation
Change-Id: I7c3df47c31759dbeb3105f8883e2688ada848d58
Closes-bug: #1214176
2013-09-20 01:02:31 +08:00
Peter Portante
7760f41c3c Refactor common/utils methods to common/ondisk
Place all the methods related to on-disk layout and / or configuration
into a new common module that can be shared by the various modules
using the same on-disk layout.

Change-Id: I27ffd4665d5115ffdde649c48a4d18e12017e6a9
Signed-off-by: Peter Portante <peter.portante@redhat.com>
2013-09-17 17:32:04 -04:00
Peter Portante
b5a0b830e2 Pep8 remaining unit test modules in common (8 of 12)
Change-Id: I6fa3291eeacb7ee5c095ad9bccbd33f027bf11e3
Signed-off-by: Peter Portante <peter.portante@redhat.com>
2013-09-01 16:12:51 -04:00
Alex Gaynor
c1f8f266d0 Ensure that files in tests are closed.
This is needed on Pythons which do not have
reference counting GCs (e.g. PyPy).

Change-Id: I5a613e832e9a7a149b3e9317c053c3048f34afcb
2013-07-20 16:08:53 -07:00
Sergey Kraynev
ea7858176b Implementation of replication servers
Support separate replication IP address:
- Added a new function in utils. This function provides the ability
  to select a separate IP address for the replication service.
- Db_replicator and object replicators were changed.
  Replication process uses new function now.

Replication network parameters:
- Replication network fields (replication_ip, replication_port)
  support was added to device dictionary in swift-ring-builder script.
- Changes were made to support new fields in search, show and set_info
  functions.

Implementation of replication servers:
- Separate replication servers use the same code as normal replication
  servers, but with replication_server parameter = True.  When using a
  separate replication network, the non-replication servers set
  replication_server = False.  When there is no separate replication
  network (the default case), replication_server is not included in the config.

DocImpact
Change-Id: Ie9af5bdcdf9241c355e36053ca4adfe49dc35bd0
Implements: blueprint dedicated-replication-network
2013-04-21 18:14:42 -04:00
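
A hedged example of a device dictionary carrying the new replication network fields next to the client-facing ip/port (all values are made up):

    dev = {
        'region': 1, 'zone': 2, 'weight': 100.0, 'device': 'sdb1',
        'ip': '10.0.0.21', 'port': 6000,                          # client/API traffic
        'replication_ip': '10.1.0.21', 'replication_port': 6000,  # replication traffic
    }
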
gholt
d79a67ebf6 Refactored lists of nodes to contact for requests
Extensive refactor here to consolidate what nodes are contacted for
any request. This consolidation means reads will contact the same set
of nodes that writes would, giving a very good chance that
read-your-write behavior will succeed. This also means that writes
will not necessarily try all nodes in the cluster as it would
previously, which really wasn't desirable anyway. (If you really want
that, you can set request_node_count to a really big number, but
understand that also means reads will contact every node looking for
something that might not exist.)

* Added a request_node_count proxy-server conf value that allows
  control of how many nodes are contacted for a normal request.

In proxy.controllers.base.Controller:

* Got rid of error_increment since it was only used in one spot by
  another method and just served to confuse.

* Made error_occurred also log the device name.

* Made error_limit require an error message and also documented a bit
  better.

* Changed iter_nodes to just take a ring and a partition and yield
  all the nodes itself so it could control the number of nodes used
  in a given request. Also happens to consolidate where sort_nodes is
  called.

* Updated account_info and container_info to use all nodes from
  iter_nodes and to call error_occurred appropriately.

* Updated GETorHEAD_base to not track attempts on its own and just
  stop when iter_nodes tells it to stop. Also, it doesn't take the
  nodes to contact anymore; instead it takes the ring and gets the
  nodes from iter_nodes itself.

Elsewhere:

* Ring now has a get_part method.

* Made changes to reflect all of the above.

Change-Id: I37f76c99286b6456311abf25167cd0485bfcafac
2013-04-08 20:48:32 +00:00
Greg Lange
44f00a23c1 fixed some minor things in tests that pyflakes complained about
Change-Id: Ifeab56a964630bcf941e932fcbe39e6572e62975
2013-03-26 20:42:26 +00:00
David Hadas
a979c8007b Add support for Hash Prefix
A new configuration parameter is added to /etc/swift/swift.conf
[swift-hash]
swift_hash_path_prefix = 'random unique string'

New installations are advised to set this parameter to a random secret,
which would not be disclosed outside the organization.
The same secret needs to be used by all swift servers of the same cluster.

Existing installations should set this parameter to an empty string
(the default)

DocImpact

Fixes: Bug #1157454

Change-Id: I63b10d0b7d6dd3f74e0f10bb41b5f240fa03578a
2013-03-22 19:41:55 +02:00
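
An illustrative version of how the prefix (and the existing suffix) wrap the path before hashing; a sketch only, not a byte-for-byte copy of swift.common.utils.hash_path:

    from hashlib import md5

    def hash_path_sketch(account, container=None, obj=None,
                         prefix=b'random unique string', suffix=b''):
        path = '/' + '/'.join(p for p in (account, container, obj) if p)
        return md5(prefix + path.encode('utf-8') + suffix).hexdigest()

    # Every server in the cluster must use the same prefix/suffix, or the
    # proxy's lookups will not match the paths the storage nodes compute.
    print(hash_path_sketch('AUTH_test', 'photos', 'cat.jpg'))
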
Samuel Merritt
ebcd60f7d9 Add a region tier to Swift's ring.
The region is one level above the zone; it is intended to represent a
chunk of machines that is distant from others with respect to
bandwidth and latency.

Old rings will default to having all their devices in region 1. Since
everything is in the same region by default, the ring builder will
simply distribute across zones as it did before, so your partition
assignment won't move because of this change. If you start adding
devices in other regions, of course, the assignment will change to
take that into account.

swift-ring-builder still accepts the same syntax as before, but will
default added devices to region 1 if no region is specified.

Examples:

$ swift-ring-builder foo.builder add r2z1-1.2.3.4:555/sda

$ swift-ring-builder foo.builder add r1z3-1.2.3.4:555/sda

$ swift-ring-builder foo.builder add z3-1.2.3.4:555/sda

Also, some updates to ring-overview doc.

Change-Id: Ifefbb839cdcf033e6c9201fadca95224c7303a29
2013-03-13 10:00:58 -07:00
Samuel Merritt
27dcaf2636 Spread handoffs out better around zones.
Before, you'd get your 3* primary nodes in 3 different zones, and then
get_more_nodes would give you everything it could from a non-primary
zone, and then finish up with stuff from the primary zones. It would
sort of look like this:

P: device in a primary node's zone
N: device not in a primary node's zone

PPPNNNNNNNNNNNNNNNNNNN...NNNNNNNNNPPP...PPPPPP

(The first three Ps are the primary nodes; they don't actually come
out of get_more_nodes(), but they're included for clarity.)

Now, the first few handoffs from get_more_nodes are in non-primary
zones, but only one per zone, and then the rest of the handoffs ignore
zones. It's still sampling the ring, so it's still taking weights into
consideration, but the zone distribution is more even early in the
handoff chain. It looks like this, assuming 10 zones:

P: device in a primary node's zone
N: device not in a primary node's zone
D: zone doesn't matter

PPPNNNNNNNDDDDDDDDDDD...DDD

* or whatever your replica count is

Change-Id: I31d2a2bc2cd6038386a2df85cd4fa37ccf2f650e
2013-03-05 13:28:12 -08:00
gholt
e064ba1915 Updated get_more_nodes algorithm
The handoff nodes will try to be in zones other than the primary
zones, will take into account the device weights, and will usually
keep the same sequences of handoffs even with ring changes.

On a real ring test the old get_more_nodes placed data mostly evenly
across zones, which is a problem for differently weighted zones. But
the real problem was that the extra partitions given to each device
ranged from 0% to 0.77%, with only 46.05% of the candidate devices getting
anything.  Some of the devices increased in effective weight by over 50%
in the test.

The new get_more_nodes placed data closer to the zone weights, and
the extra partitions given to each device ranged from 0% to 0.24%, with
90.58% of the candidate devices getting something.  The worst-off
device only increased in effective weight by 10.71%.

Change-Id: Iffb133a22db69074acaa2b90854cbfa92e4c2b9e
2013-03-04 08:52:24 +00:00
Samuel Merritt
156cdc8edf Deterministic, repeatable serialization for rings.
The (account|container|object).ring.gz files contain, among other
things, a JSON-encoded dictionary. This change simply makes the JSON
serializer sort the keys of that dictionary so that two
Python-identical rings will result in two bytewise-identical ring
files. Also, to get repeatable compression, we lock down the timestamp
in the gzip output stream to a fixed value. (There's a timestamp value
in a gzip stream header; by default, gzip.GzipFile sticks time.time()
in there.)

This only works on Python 2.7; on 2.6, the 'mtime' argument to
gzip.GzipFile() is unsupported. Don't worry, serialization still works
on 2.6. It just doesn't always produce the same bytes for the same
ring.

Change-Id: Ide446413d0aeb78536883933fd0caf440b8f54ad
2013-01-31 16:55:10 -08:00
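
A sketch of the two tricks described above, assuming the metadata dictionary has already been built (the pinned mtime can be any fixed constant):

    import gzip
    import json

    def write_ring_metadata(path, metadata):
        with open(path, 'wb') as fp:
            # Pin the gzip header timestamp and sort the JSON keys so that
            # identical rings serialize to identical bytes.
            gz = gzip.GzipFile(fileobj=fp, mode='wb', mtime=1300507380.0)
            gz.write(json.dumps(metadata, sort_keys=True).encode('ascii'))
            gz.close()
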
Darrell Bishop
f8ce43a218 Use custom encoding for RingData, not pickle.
Serialize RingData in a versioned, custom format which is a combination
of a JSON-encoded header and .tostring() dumps of the
replica2part2dev_id arrays.  This format deserializes hundreds of times
faster than rings serialized with Python 2.7's pickle (a significant
performance regression for ring loading between Python 2.6 and Python
2.7).  Fixes bug 1031954.

swift.common.ring.ring.RingData is now responsible for serialization and
deserialization of its data via a new load() class method and save()
object method.  The new implementation is backward-compatible; if a ring
does not begin with a new-style magic string, it is assumed to be an
old-style pickle-dumped ring and is handled as before.  So new Swift
code can read old rings, but old Swift code will not be able to read
newly-serialized rings. THIS SHOULD BE MENTIONED PROMINENTLY IN THE
RELEASE NOTES.

I didn't want to bite off more than necessary, so I didn't mess with
builder file serialization.

Change-Id: I799b9a4c894d54fb16592443904ac055b2638e2d
2012-08-05 00:51:49 -07:00
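
An illustrative writer for a magic-string + JSON-header + raw-array layout like the one described above; the field names and magic value here are assumptions, not Swift's exact on-disk format:

    import json
    import struct

    def dump_ring_sketch(fp, devs, part_shift, replica2part2dev_id):
        header = json.dumps({'devs': devs, 'part_shift': part_shift,
                             'replica_count': len(replica2part2dev_id)},
                            sort_keys=True).encode('ascii')
        fp.write(b'R1NG')                           # magic / version marker
        fp.write(struct.pack('!I', len(header)))    # header length
        fp.write(header)
        for row in replica2part2dev_id:             # each row: array.array('H')
            fp.write(row.tobytes())                 # row.tostring() on Python 2
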
John Dickinson
de3b663a73 ensure that accessing the ring devs reloads the ring if necessary
Change-Id: If5a6d32c40de02183a2eed6e2a32d62ba38df32d
2012-07-20 09:15:34 -07:00
Samuel Merritt
bb509dd863 As-unique-as-possible partition replica placement.
This commit introduces a new algorithm for assigning partition
replicas to devices. Basically, the ring builder organizes the devices
into tiers (first zone, then IP/port, then device ID). When placing a
replica, the ring builder looks for the emptiest device (biggest
parts_wanted) in the furthest-away tier.

In the case where zone-count >= replica-count, the new algorithm will
give the same results as the one it replaces. Thus, no migration is
needed.

In the case where zone-count < replica-count, the new algorithm
behaves differently from the old algorithm. The new algorithm will
distribute things evenly at each tier so that the replication is as
high-quality as possible, given the circumstances. The old algorithm
would just crash, so again, no migration is needed.

Handoffs have also been updated to use the new algorithm. When
generating handoff nodes, first the ring looks for nodes in other
zones, then other ips/ports, then any other drive. The first handoff
nodes (the ones in other zones) will be the same as before; this
commit just extends the list of handoff nodes.

The proxy server and replicators have been altered to avoid looking at
the ring's replica count directly. Previously, with a replica count of
C, RingData.get_nodes() and RingData.get_part_nodes() would return
lists of length C, so some other code used the replica count when it
needed the number of nodes. If two of a partition's replicas are on
the same device (e.g. with 3 replicas, 2 devices), then that
assumption is no longer true. Fortunately, all the proxy server and
replicators really needed was the number of nodes returned, which they
already had. (Bonus: now the only code that mentions replica_count
directly is in the ring and the ring builder.)

Change-Id: Iba2929edfc6ece89791890d0635d4763d821a3aa
2012-05-09 15:56:06 -07:00
Jenkins
6682138b0a Merge "Make ring class interface slightly more abstracted from implementation." 2012-03-22 20:25:06 +00:00
John Dickinson
1ecf5ebba1 updated copyright date for all files
Change-Id: Ifd909d3561c2647770a7e0caa3cd91acd1b4f298
2012-03-19 13:45:34 -05:00
Michael Barton
e008c2ebb8 Make ring class interface slightly more abstracted from implementation.
Change-Id: I0f55d61c7b8de30460f17a69e5d9946494dbda6e
2012-03-14 22:00:30 +00:00
David Goetz
aa14afe2bb nodes with a weight of zero should not be valid for handoff 2011-02-22 16:02:36 +00:00
Clay Gerrard
ae1c2d73ab creating a Ring will ensure a valid HASH_PATH_SUFFIX
To make sure that node lookups match what the servers return, the generated
hashes need to match.  All the utils that use the ring should validate their
HASH_PATH_SUFFIX.
2011-02-16 09:02:38 -06:00