I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port:
* cli/ringbuilder.py
* common/linkat.py
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
...by using a random ring, looking at *all* partitions, and making
assertions about the distribution of how many times we have to call
next().
Change-Id: Ia5feb9396d4bf6fd35f16bbc5280e63022ed2c47
Unit tests use the random module in places to randomise
test inputs, but once the tests in test_builder.py or
test_ring_builder_analyzer.py have been run the random
module is left in a repeatable state because calls are made
to RingBuilder.rebalance with a seed value. Consequently,
subsequent calls to random in other tests get repeatable
results from one test run to another.
This patch resets the state of the random module before
returning from RingBuilder.rebalance.
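A minimal sketch of the approach (illustrative; _do_rebalance is a
hypothetical stand-in for the real work):

    import random

    def rebalance(self, seed=None):
        if seed is not None:
            random.seed(seed)
        try:
            return self._do_rebalance()  # hypothetical helper
        finally:
            random.seed()  # re-seed from system entropy before returning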
Closes-Bug: #1639755
Change-Id: I4b74030afc654e60452e65b3e0f1b45a189c16e3
On-disk, serialized ring files are byteorder dependent, which makes them
unportable between different endian architectures. Add a field to the
ring dictionary in the file indicating the byteorder used to generate
the file, and then byteswap if necessary when deserializing the file.
This patch only allows newly created ring files to be byteorder
agnostic. Previously generated ring files will still fail on different
endian architectures, and will need to be regenerated with this patch.
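A rough sketch of the deserialization side, assuming a 'byteorder' key in
the ring dict (field name illustrative):

    import sys
    from array import array

    def load_part2dev_row(ring_dict, raw_bytes):
        row = array('H', raw_bytes)
        # byteswap when the file was written on a machine with the
        # opposite endianness
        if ring_dict.get('byteorder', sys.byteorder) != sys.byteorder:
            row.byteswap()
        return row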
Change-Id: I23b5e0a8082b30ca257aeb1fab03ab74e6f0b2d4
Closes-Bug: #1639980
Replaced asserts with more specific assert methods,
e.g.: from assertTrue(sth == None) to assertIsNone(*) or
assertTrue(isinstance(inst, type)) to assertIsInstance(inst, type)
or assertTrue(not sth) to assertFalse(sth).
The code gets more readable, and a better description will be shown on failure.
Change-Id: I9531c9939aa7c2dac127b5dc865b8d396dab318f
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202;
so they do not conflict with X-Windows or other services.
Updated SAIO docs.
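For reference, the recommended bindings become (illustrative SAIO values):

    object-server:    bind_port = 6200  (was 6000)
    container-server: bind_port = 6201  (was 6001)
    account-server:   bind_port = 6202  (was 6002)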
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
It's harder than it sounds. There were really three challenges.
Challenge #1 Initial Assignment
===============================
Before starting to assign parts on this new shiny ring you've
constructed, maybe we'll pause for a moment up front and consider the
lay of the land. This process is called the replica_plan.
The replica_plan approach is separating part assignment failures into
two modes:
1) we considered the cluster topology and its weights and came up with
the wrong plan
2) we failed to execute on the plan
I failed at both parts plenty of times before I got it this close. I'm
sure a counter example still exists, but when we find it the new helper
methods will let us reason about where things went wrong.
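To make that concrete, a replica_plan is roughly a map from tier to how
many replicas belong under that tier (shape illustrative):

    replica_plan = {
        ():    {'min': 0, 'target': 3.0, 'max': 3},  # the whole ring
        (1,):  {'min': 2, 'target': 2.0, 'max': 2},  # region 1
        (2,):  {'min': 1, 'target': 1.0, 'max': 1},  # region 2
        # ... and so on down through the zone, ip and device tiers
    }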
Challenge #2 Fixing Placement
=============================
With a sound plan in hand, the less material you have to execute with,
the easier it is to fail to execute on it - so we gather up as many parts
as we can, as long as we think we can find them a better home.
Picking the right parts for gather is a black art - when you notice a
balance is slow it's because it's spending so much time iterating over
replica2part2dev trying to decide just the right parts to gather.
The replica plan can help at least in the gross dispersion collection to
gather up the worst offenders first before considering balance. I think
trying to avoid picking up parts that are stuck to the tier before
falling into a forced grab on anything over parts_wanted helps with
stability generally - but depending on where the parts_wanted are in
relation to the full devices it's pretty easy to pick up something that'll
end up really close to where it started.
I tried to break the gather methods into smaller pieces so it looked
like I knew what I was doing.
Going with a MAXIMUM gather iteration instead of balance (which doesn't
reflect the replica_plan) doesn't seem to be costing me anything - most
of the time the exit condition is either solved or all the parts overly
aggressively locked up on min_part_hours. So far, it mostly seems that if
the thing is going to balance this round it'll get it in the first
couple of shakes.
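A rough sketch of that gather ordering (method names illustrative, bodies
elided):

    def _gather_parts(self, assign_parts, replica_plan):
        # worst dispersion offenders first...
        self._gather_parts_for_dispersion(assign_parts, replica_plan)
        # ...then anything over parts_wanted, forced grabs last
        self._gather_parts_for_balance(assign_parts, replica_plan)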
Challenge #3 Crazy replica2part2dev tables
==========================================
I think there's lots of ways "scars" can build up a ring which can
result in very particular replica2part2dev tables that are physically
difficult to dig out of. It's repairing these scars that will take
multiple rebalances to resolve.
... but at this point ...
... lacking a counter example ...
I've been able to close up all the edge cases I was able to find. It
may not be quick, but progress will be made.
Basically my strategy just required a better understanding of how
previous algorithms were able to *mostly* keep things moving by brute
forcing the whole mess with a bunch of randomness. Then when we detect
our "elegant" careful part selection isn't making progress - we can fall
back to the same old tricks.
Validation
==========
We validate against duplicate part replica assignment after rebalance
and raise an ERROR if we detect more than one replica of a part assigned
to the same device.
In order to meet that requirement we have to have as many devices as
replicas, so attempting to rebalance with too few devices w/o changing
your replica_count is also an ERROR not a warning.
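A sketch of the duplicate-replica check (illustrative, not the exact code):

    for part in range(2 ** self.part_power):
        devs = [r2p2d[part] for r2p2d in self._replica2part2dev
                if part < len(r2p2d)]
        if len(devs) != len(set(devs)):
            raise RingValidationError(
                'part %s assigned to duplicate devices %r' % (part, devs))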
Random Thoughts
===============
As usual with rings, the test diff can be hard to reason about -
hopefully I've added enough comments to assure future me that these
assertions make sense.
Despite being a large rewrite of a lot of important code, the existing
code is known to have failed us. This change fixes a critical bug that's
trivial to reproduce in a critical component of the system.
There's probably a bunch of error messages and exit status stuff that's
not as helpful as it could be considering the new behaviors.
Change-Id: I1bbe7be38806fc1c8b9181a722933c18a6c76e05
Closes-Bug: #1452431
assertEquals is deprecated in py3, so replace it with assertEqual.
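For example:

    self.assertEquals(foo, bar)  # deprecated alias
    self.assertEqual(foo, bar)   # preferred spelling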
Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs. The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring. In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket. The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk. The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring. The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
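For example (illustrative object-server config):

    [DEFAULT]
    servers_per_port = 4
    ring_check_interval = 15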
Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.
The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True. This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.
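Usage looks like (path illustrative):

    from swift.common.ring import RingData

    # loads everything except replica2part2dev_id
    ring_meta = RingData.load('/etc/swift/object.ring.gz', header_only=True)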
A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses). The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.
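Usage looks roughly like (arguments illustrative):

    from swift.common.storage_policy import BindPortsCache

    cache = BindPortsCache('/etc/swift', bind_ip)
    ports = cache.all_bind_ports_for_node()  # re-checks ring mtimes first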
This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server". The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks. Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None. Then,
object-replicator and object-reconstructor both set self.bind_port to
None if servers_per_port is enabled. Bonus improvement for IPv6
addresses in is_local_device().
This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/
Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").
However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.
This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.
In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.
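Roughly (return values illustrative):

    from swift.common.utils import whataremyips

    whataremyips()             # all IPs, as before
    whataremyips('0.0.0.0')    # bind-to-everything: still all IPs
    whataremyips('::')         # same for IPv6
    whataremyips('127.0.0.2')  # a specific bind_ip: just ['127.0.0.2']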
Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.
For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md
Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md
If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md
DocImpact
Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
In the ring builder, we place partitions with maximum possible
dispersion across tiers, where a "tier" is region, then zone, then
IP/port, then device. Now, instead of IP/port, just use IP. The port
wasn't really getting us anything; two different object servers on two
different ports on one machine aren't separate failure
domains. However, if someone has only a few machines and is using one
object server on its own port per disk, then the ring builder would
end up with every disk in its own IP/port tier, resulting in bad (with
respect to durability) partition placement.
For example: assume 1 region, 1 zone, 4 machines, 48 total disks (12
per machine), and one object server (and hence one port) per
disk. With the old behavior, partition replicas will all go in the one
region, then the one zone, then pick one of 48 IP/port pairs, then
pick the one disk therein. This gives the same result as randomly
picking 3 disks (without replacement) to store data on; it completely
ignores machine boundaries.
With the new behavior, the replica placer will pick the one region,
then the one zone, then one of 4 IPs, then one of 12 disks
therein. This gives the optimal placement with respect to durability.
The same applies to Ring.get_more_nodes().
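Schematically, the dispersion tiers change like this (tuples illustrative):

    before: (region, zone, (ip, port), device_id)
    after:  (region, zone, ip, device_id)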
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Change-Id: Ibbd740c51296b7e360845b5309d276d7383a3742
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.
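For example:

    it = iter(partitions)
    it.next()   # Python 2 only
    next(it)    # Python 2 and Python 3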
Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
This commit makes it possible to PUT an object into Swift and have it
stored using erasure coding instead of replication, and also to GET
the object back from Swift at a later time.
This works by splitting the incoming object into a number of segments,
erasure-coding each segment in turn to get fragments, then
concatenating the fragments into fragment archives. Segments are 1 MiB
in size, except the last, which is between 1 B and 1 MiB.
+====================================================================+
|                            object data                             |
+====================================================================+
                                  |
          +-----------------------+----------------------+
          |                       |                      |
          v                       v                      v
+===================+   +===================+     +==============+
|     segment 1     |   |     segment 2     | ... |  segment N   |
+===================+   +===================+     +==============+
          |                       |
          |                       |
          v                       v
     /=========\             /=========\
     | pyeclib |             | pyeclib |     ...
     \=========/             \=========/
          |                       |
          |                       |
          +--> fragment A-1       +--> fragment A-2
          |                       |
          |                       |
          +--> fragment B-1       +--> fragment B-2
          |                       |
          |                       |
         ...                     ...
Then, object server A gets the concatenation of fragment A-1, A-2,
..., A-N, so its .data file looks like this (called a "fragment archive"):
+=====================================================================+
| fragment A-1 | fragment A-2 | ... | fragment A-N |
+=====================================================================+
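The per-segment encode step maps onto pyeclib roughly like this (parameters
illustrative):

    from pyeclib.ec_iface import ECDriver

    ec = ECDriver(k=10, m=4, ec_type='liberasurecode_rs_vand')
    fragments = ec.encode(segment)       # k + m fragments per segment
    segment = ec.decode(fragments[:10])  # any k fragments recover the segment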
Since this means that the object server never sees the object data as
the client sent it, we have to do a few things to ensure data
integrity.
First, the proxy has to check the Etag if the client provided it; the
object server can't do it since the object server doesn't see the raw
data.
Second, if the client does not provide an Etag, the proxy computes it
and uses the MIME-PUT mechanism to provide it to the object servers
after the object body. Otherwise, the object would not have an Etag at
all.
Third, the proxy computes the MD5 of each fragment archive and sends
it to the object server using the MIME-PUT mechanism. With replicated
objects, the proxy checks that the Etags from all the object servers
match, and if they don't, returns a 500 to the client. This mitigates
the risk of data corruption in one of the proxy --> object connections,
and signals to the client when it happens. With EC objects, we can't
use that same mechanism, so we must send the checksum with each
fragment archive to get comparable protection.
On the GET path, the inverse happens: the proxy connects to a bunch of
object servers (M of them, for an M+K scheme), reads one fragment at a
time from each fragment archive, decodes those fragments into a
segment, and serves the segment to the client.
When an object server dies partway through a GET response, any
partially-fetched fragment is discarded, the resumption point is wound
back to the nearest fragment boundary, and the GET is retried with the
next object server.
GET requests for a single byterange work; GET requests for multiple
byteranges do not.
There are a number of things _not_ included in this commit. Some of
them are listed here:
* multi-range GET
* deferred cleanup of old .data files
* durability (daemon to reconstruct missing archives)
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
Prior to this commit, swift-ring-builder would place partitions on
devices by first going for maximal dispersion and breaking ties with
device weight. This commit flips the order so that device weight
trumps dispersion.
Note: if your ring can be balanced, you won't see a behavior
change. It's only when device weights and maximal-dispersion come into
conflict that this commit changes anything.
Example: a cluster with two regions. Region 1 has a combined weight of
1000, while region 2 has a combined weight of only 400. The ring has 3
replicas and 2^16 partitions.
Prior to this commit, the balance would look like so:
Region 1: 2 * 2^16 partitions
Region 2: 2^16 partitions
After this commit, the balance will be:
Region 1: 15/7 * 2^16 partitions (more than before)
Region 2: 6/7 * 2^16 partitions (fewer than before)
(The weights split 1000:400 = 5:2, so of the 3 * 2^16 partition replicas
region 1 gets 5/7 * 3 * 2^16 = 15/7 * 2^16.)
One consequence of this is that some partitions will not have a
replica in region 2, since it's not big enough to hold all of them.
This way, a cluster operator can add a new region to a single-region
cluster in a gradual fashion so as not to destroy their WAN link with
replication traffic. As device weights are increased in the second
region, more replicas will shift over to it. Once its weight is half
that of the first region's, every partition will have a replica there.
DocImpact
Change-Id: I945abcc4a2917bb12be554b640f7507dd23cd0da
Containers now have a storage policy index associated with them,
stored in the container_stat table. This index is only settable at
container creation time (PUT request), and cannot be changed without
deleting and recreating the container. This is because a container's
policy index will apply to all its objects, so changing a container's
policy index would require moving large amounts of object data
around. If a user wants to change the policy for data in a container,
they must create a new container with the desired policy and move the
data over.
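For example, the policy is chosen when the container is created (header
name as in Swift's API; policy name illustrative):

    PUT /v1/AUTH_test/new_container HTTP/1.1
    X-Storage-Policy: gold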
Keep status_changed_at up-to-date with status changes.
In particular during container recreation and replication.
When a container-server receives a PUT for a deleted database, an extra UPDATE
is issued against the container_stat table to notate the x-timestamp of the
request.
During replication if merge_timestamps causes a container's status to change
(from DELETED to ACTIVE or vice-versa) the status_changed_at field is set to
the current time.
Accurate reporting of status_changed_at is useful for container replication
forensics and allows resolution of "set on create" attributes like the
upcoming storage_policy_index.
Expose Backend container info on deleted containers.
Include basic container info in backend headers on 404 responses from the
container server. Default empty values are used as placeholders if the
database does not exist.
Specifically, the X-Backend-Status-Changed-At, X-Backend-DELETE-Timestamp and
X-Backend-Storage-Policy-Index values will be needed by the reconciler to
deal with reconciling out of order object writes in the face of recently
deleted containers.
* Add "status_changed_at" key to the response from ContainerBroker.get_info.
* Add "Status Timestamp" field to swift.cli.info.print_db_info_metadata.
* Add "status_changed_at" key to the response from AccountBroker.get_info.
DocImpact
Implements: blueprint storage-policies
Change-Id: Ie6d388f067f5b096b0f96faef151120ba23c8748
The use of NamedTemporaryFile creates rings with permissions 0600;
however, most installs probably generate the rings as root while the
swift-proxy runs as user swift.
Set the permissions on the generated ring to 0644 prior to rename so
that the swift user can read the rings.
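A sketch of the fix (illustrative; write_ring is a hypothetical stand-in
for serialization):

    import os
    from tempfile import NamedTemporaryFile

    tempf = NamedTemporaryFile(dir=os.path.dirname(path), delete=False)
    write_ring(tempf)                 # hypothetical serialization step
    tempf.flush()
    os.fchmod(tempf.fileno(), 0o644)  # world-readable before the rename
    os.rename(tempf.name, path)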
Change-Id: Ia511931f471c5c9840012c3a75b89c1f35b1b245
Closes-Bug: #1302700
A new utils function, validate_hash_conf, allows you to programmatically
reload swift.conf and the hash path global vars HASH_PATH_SUFFIX and
HASH_PATH_PREFIX when they are invalid.
When you load swift.common.utils before you have a swift.conf there's no good
way to force a re-read of swift.conf and repopulate the hash path config
options - short of restarting the process or reloading the module - both of
which are hard to unittest. This should be no worse in general and in some
cases easier.
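Usage is simply (sketch):

    from swift.common.utils import validate_hash_conf

    # re-reads swift.conf and repopulates HASH_PATH_PREFIX and
    # HASH_PATH_SUFFIX, complaining if they are still invalid
    validate_hash_conf()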
Change-Id: I1ff22c5647f127f65589762b3026f82c9f9401c1
The replica placement algorithm works on regions, then zones, then
IP/port, then device ID. The handoff algorithm worked on regions, then
zones, then device ID, completely skipping IP/port. It's now been
updated to take IP/port into consideration.
This means you get one handoff on each machine in the cluster before
you start getting handoffs that share a machine with a previous
one. In small clusters, this can help with durability.
Because this is performance-critical code, here are some quick
benchmark results:
Run time averages over 25000 trials on a 1200-device ring (20 part
power, 3 replicas, 2 regions, 10 zones, 120 nodes):
                   |   master    |   branch
===================+=============+============
get 1 more node    |  2.727e-05  |  3.076e-05
get 6 more nodes   |  3.55e-05   |  4.214e-05
get all more nodes |  0.002247   |  0.002691
There's a small slowdown from the additional bookkeeping, but nothing
too awful. The time to get 6 more nodes (for handoff checks on 404,
it's 2x replica count by default, hence 6) went from 35 to 42
microseconds, so it remains small.
Change-Id: Ie7da4dfcb0fcf1a38e2fb13f60c204540fadbf06
This is to support upgrades from swift < 1.8 using old-style pickled
rings to 1.10. Old-style pickled rings won't have region information.
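A sketch of the compatibility shim (illustrative):

    # when loading an old-style pickled ring, default missing regions
    for dev in ring_data.devs:
        if dev is not None and 'region' not in dev:
            dev['region'] = 1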
Change-Id: I18b2acba3d346e41def9d25d3d4dbd12705e5375
Closes-Bug: #1248919
This reverts commit 7760f41c3ce436cb23b4b8425db3749a3da33d32
Change-Id: I95e57a2563784a8cd5e995cc826afeac0eadbe62
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Place all the methods related to on-disk layout and / or configuration
into a new common module that can be shared by the various modules
using the same on-disk layout.
Change-Id: I27ffd4665d5115ffdde649c48a4d18e12017e6a9
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Support a separate replication IP address:
- Added a new function in utils that provides the ability to select a
  separate IP address for the replication service.
- Changed db_replicator and the object replicators; the replication
  process now uses the new function.
Replication network parameters:
- Added support for the replication network fields (replication_ip,
  replication_port) to the device dictionary in the swift-ring-builder
  script.
- Made changes to support the new fields in the search, show and
  set_info functions.
Implementation of replication servers:
- Separate replication servers use the same code as normal replication
servers, but with replication_server parameter = True. When using a
separate replication network, the non-replication servers set
replication_server = False. When there is no separate replication
network (the default case), replication_server is not included in the config.
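With the new fields, adding a device with a dedicated replication endpoint
looks something like (values illustrative):

    $ swift-ring-builder object.builder add \
        r1z1-192.168.1.1:6000R10.0.0.1:6000/sda 100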
DocImpact
Change-Id: Ie9af5bdcdf9241c355e36053ca4adfe49dc35bd0
Implements: blueprint dedicated-replication-network
Extensive refactor here to consolidate what nodes are contacted for
any request. This consolidation means reads will contact the same set
of nodes that writes would, giving a very good chance that
read-your-write behavior will succeed. This also means that writes
will not necessarily try all nodes in the cluster as it would
previously, which really wasn't desirable anyway. (If you really want
that, you can set request_node_count to a really big number, but
understand that also means reads will contact every node looking for
something that might not exist.)
* Added a request_node_count proxy-server conf value that allows
control of how many nodes are contacted for a normal request.
In proxy.controllers.base.Controller:
* Got rid of error_increment since it was only used in one spot by
another method and just served to confuse.
* Made error_occurred also log the device name.
* Made error_limit require an error message and also documented a bit
better.
* Changed iter_nodes to just take a ring and a partition and yield
all the nodes itself so it could control the number of nodes used
in a given request. Also happens to consolidate where sort_nodes is
called.
* Updated account_info and container_info to use all nodes from
iter_nodes and to call error_occurred appropriately.
* Updated GETorHEAD_base to not track attempts on its own and just
stop when iter_nodes tells it to stop. Also, it doesn't take the
nodes to contact anymore; instead it takes the ring and gets the
nodes from iter_nodes itself.
Elsewhere:
* Ring now has a get_part method.
* Made changes to reflect all of the above.
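A rough sketch of the reshaped iter_nodes (not the exact code):

    import itertools

    def iter_nodes(self, ring, partition):
        primaries = self.app.sort_nodes(ring.get_part_nodes(partition))
        nodes_left = self.app.request_node_count(ring)  # signature illustrative
        for node in itertools.chain(primaries, ring.get_more_nodes(partition)):
            if not self.error_limited(node):
                yield node
                nodes_left -= 1
                if nodes_left <= 0:
                    return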
Change-Id: I37f76c99286b6456311abf25167cd0485bfcafac
A new configuration parameter is added to /etc/swift/swift.conf
[swift-hash]
swift_hash_path_prefix = 'random unique string'
New installations are advised to set this parameter to a random secret,
which would not be disclosed outside the organization.
The same secret needs to be used by all swift servers of the same cluster.
Existing installations should set this parameter to an empty string
(the default).
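The prefix (like the existing suffix) is mixed into every path hash,
roughly:

    from hashlib import md5

    # sketch of how hash_path incorporates the new prefix
    key = HASH_PATH_PREFIX + '/account/container/object' + HASH_PATH_SUFFIX
    partition_hash = md5(key).hexdigest()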
DocImpact
Fixes: Bug #1157454
Change-Id: I63b10d0b7d6dd3f74e0f10bb41b5f240fa03578a
The region is one level above the zone; it is intended to represent a
chunk of machines that is distant from others with respect to
bandwidth and latency.
Old rings will default to having all their devices in region 1. Since
everything is in the same region by default, the ring builder will
simply distribute across zones as it did before, so your partition
assignment won't move because of this change. If you start adding
devices in other regions, of course, the assignment will change to
take that into account.
swift-ring-builder still accepts the same syntax as before, but will
default added devices to region 1 if no region is specified.
Examples:
$ swift-ring-builder foo.builder add r2z1-1.2.3.4:555/sda
$ swift-ring-builder foo.builder add r1z3-1.2.3.4:555/sda
$ swift-ring-builder foo.builder add z3-1.2.3.4:555/sda
Also, some updates to ring-overview doc.
Change-Id: Ifefbb839cdcf033e6c9201fadca95224c7303a29
Before, you'd get your 3* primary nodes in 3 different zones, and then
get_more_nodes would give you everything it could from a non-primary
zone, and then finish up with stuff from the primary zones. It would
sort of look like this:
P: device in a primary node's zone
N: device not in a primary node's zone
PPPNNNNNNNNNNNNNNNNNNN...NNNNNNNNNPPP...PPPPPP
(The first three Ps are the primary nodes; they don't actually come
out of get_more_nodes(), but they're included for clarity.)
Now, the first few handoffs from get_more_nodes are in non-primary
zones, but only one per zone, and then the rest of the handoffs ignore
zones. It's still sampling the ring, so it's still taking weights into
consideration, but the zone distribution is more even early in the
handoff chain. It looks like this, assuming 10 zones:
P: device in a primary node's zone
N: device not in a primary node's zone
D: zone doesn't matter
PPPNNNNNNNDDDDDDDDDDD...DDD
* or whatever your replica count is
Change-Id: I31d2a2bc2cd6038386a2df85cd4fa37ccf2f650e
The handoff nodes will try to be in zones other than the primary
zones, will take into account the device weights, and will usually
keep the same sequences of handoffs even with ring changes.
On a real ring test the old get_more_nodes placed data mostly evenly
across zones, which is a problem for differently weighted zones. But
the real problem was that the extra partitions given to each device
was 0% to 0.77% with only 46.05% of the candidate devices getting
anything. Some of the devices increased in effective weight over 50%
in the test.
The new get_more_nodes placed closer to what the zone weights were
and the extra partitions given to each device was 0% to 0.24% with
90.58% of the candidate devices getting something. The worst off
device only increased in effective weight by 10.71%.
Change-Id: Iffb133a22db69074acaa2b90854cbfa92e4c2b9e
The (account|container|object).ring.gz files contain, among other
things, a JSON-encoded dictionary. This change simply makes the JSON
serializer sort the keys of that dictionary so that two
Python-identical rings will result in two bytewise-identical ring
files. Also, to get repeatable compression, we lock down the timestamp
in the gzip output stream to a fixed value. (There's a timestamp value
in a gzip stream header; by default, gzip.GzipFile sticks time.time()
in there.)
This only works on Python 2.7; on 2.6, the 'mtime' argument to
gzip.GzipFile() is unsupported. Don't worry, serialization still works
on 2.6. It just doesn't always produce the same bytes for the same
ring.
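A sketch of both tricks together (the constant mtime value is illustrative):

    import gzip
    import json

    with open('object.ring.gz', 'wb') as fp:
        gz = gzip.GzipFile(fileobj=fp, mode='wb', mtime=0)  # fixed timestamp
        gz.write(json.dumps(ring_meta, sort_keys=True))     # stable key order
        gz.close()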
Change-Id: Ide446413d0aeb78536883933fd0caf440b8f54ad
Serialize RingData in a versioned, custom format which is a combination
of a JSON-encoded header and .tostring() dumps of the
replica2part2dev_id arrays. This format deserializes hundreds of times
faster than rings serialized with Python 2.7's pickle (a significant
performance regression for ring loading between Python 2.6 and Python
2.7). Fixes bug 1031954.
swift.common.ring.ring.RingData is now responsible for serialization and
deserialization of its data via a new load() class method and save()
object method. The new implementation is backward-compatible; if a ring
does not begin with a new-style magic string, it is assumed to be an
old-style pickle-dumped ring and is handled as before. So new Swift
code can read old rings, but old Swift code will not be able to read
newly-serialized rings. THIS SHOULD BE MENTIONED PROMINENTLY IN THE
RELEASE NOTES.
I didn't want to bite off more than necessary, so I didn't mess with
builder file serialization.
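The new entry points (sketch; path illustrative):

    from swift.common.ring import RingData

    ring_data = RingData.load('/etc/swift/object.ring.gz')  # old or new format
    ring_data.save('/etc/swift/object.ring.gz')             # always new format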
Change-Id: I799b9a4c894d54fb16592443904ac055b2638e2d
This commit introduces a new algorithm for assigning partition
replicas to devices. Basically, the ring builder organizes the devices
into tiers (first zone, then IP/port, then device ID). When placing a
replica, the ring builder looks for the emptiest device (biggest
parts_wanted) in the furthest-away tier.
In the case where zone-count >= replica-count, the new algorithm will
give the same results as the one it replaces. Thus, no migration is
needed.
In the case where zone-count < replica-count, the new algorithm
behaves differently from the old algorithm. The new algorithm will
distribute things evenly at each tier so that the replication is as
high-quality as possible, given the circumstances. The old algorithm
would just crash, so again, no migration is needed.
Handoffs have also been updated to use the new algorithm. When
generating handoff nodes, first the ring looks for nodes in other
zones, then other ips/ports, then any other drive. The first handoff
nodes (the ones in other zones) will be the same as before; this
commit just extends the list of handoff nodes.
The proxy server and replicators have been altered to avoid looking at
the ring's replica count directly. Previously, with a replica count of
C, RingData.get_nodes() and RingData.get_part_nodes() would return
lists of length C, so some other code used the replica count when it
needed the number of nodes. If two of a partition's replicas are on
the same device (e.g. with 3 replicas, 2 devices), then that
assumption is no longer true. Fortunately, all the proxy server and
replicators really needed was the number of nodes returned, which they
already had. (Bonus: now the only code that mentions replica_count
directly is in the ring and the ring builder.)
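Schematically (tier tuples illustrative):

    # a device lives at the bottom of a nested tier hierarchy
    tiers_of_dev = [
        (zone,),
        (zone, (ip, port)),
        (zone, (ip, port), dev_id),
    ]
    # placement walks from the furthest-away tier toward the device,
    # preferring the subtree whose devices have the biggest parts_wanted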
Change-Id: Iba2929edfc6ece89791890d0635d4763d821a3aa
To make sure that node lookups match what the servers return, the generated
hashes need to match. All the utils that use the ring should validate their
HASH_PATH_SUFFIX.