On a full disk, a call to delete an object will fail when it tries to
write the tombstone. Handle the DiskFileNoSpace exception raised by
swift.common.utils.
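A minimal sketch of the intended handling, assuming a pared-down
object-server DELETE path; DiskFileNoSpace, HTTPInsufficientStorage and
HTTPNoContent are real Swift names, but the handler shape is a
simplification rather than the actual implementation:

    # Sketch only: simplified from the real object-server DELETE handler.
    from swift.common.exceptions import DiskFileNoSpace
    from swift.common.swob import HTTPInsufficientStorage, HTTPNoContent

    def DELETE(self, request, device, disk_file):
        try:
            # writing the tombstone raises DiskFileNoSpace on a full disk
            disk_file.delete(request.headers['x-timestamp'])
        except DiskFileNoSpace:
            # answer 507 instead of letting the error escape as a 500
            return HTTPInsufficientStorage(drive=device, request=request)
        return HTTPNoContent(request=request)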
Change-Id: I8f0cfcc4159ee154fcd3e7ca90c422aa5aadf0b3
Signed-off-by: Ganesh Maharaj Mahalingam <ganesh.mahalingam@intel.com>
Closes-Bug: 1491675
The current rule inside the db_replicator is to rsync+merge
containers during replication if the rowids differ by more than 50%:
    # if the difference in rowids between the two differs by
    # more than 50%, rsync then do a remote merge.
    if rinfo['max_row'] / float(info['max_row']) < 0.5:
This means smaller containers, which only have a few rows and differ
by a small number, still rsync+merge rather than copying rows.
This change adds a new condition: the difference in the rowids must
also be greater than the configured per_diff, otherwise usync will be used:
    # if the difference in rowids between the two differs by
    # more than 50% and the difference is greater than per_diff,
    # rsync then do a remote merge.
    # NOTE: difference > per_diff stops us from dropping to rsync
    # on smaller containers, who have only a few rows to sync.
    if rinfo['max_row'] / float(info['max_row']) < 0.5 and \
            info['max_row'] - rinfo['max_row'] > self.per_diff:
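For illustration, with the default per_diff of 1000 the new condition
keeps small containers on usync (the numbers below are made up):

    per_diff = 1000                  # default per_diff for the replicators
    info = {'max_row': 4}            # local DB has 4 rows
    rinfo = {'max_row': 1}           # remote DB has 1 row

    big_ratio = rinfo['max_row'] / float(info['max_row']) < 0.5      # True
    big_difference = info['max_row'] - rinfo['max_row'] > per_diff   # False
    # old rule (ratio only): rsync+merge; new rule: usync the 3 missing rows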
Change-Id: I9e779f71bf37714919a525404565dd075762b0d4
Closes-bug: #1019712
This commit tries to give the user a reason that their SLO manifest
was invalid instead of just saying "Invalid SLO Manifest File". It
doesn't get every error condition, but it's better than before.
Examples of things that now have real error messages include (see the
sketch after this list):
* bad keys in manifest (e.g. using "name" instead of "path")
* bogus range (e.g. "bytes=123-taco")
* multiple ranges (e.g. "bytes=10-20,30-40")
* bad JSON structure (i.e. not a list of objects)
* non-integer size_bytes
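A rough sketch of the error-collecting idea; this is a simplified
stand-in for the real parsing in the SLO middleware, not its actual code:

    # Illustrative only -- a per-segment validator that gathers
    # human-readable errors instead of failing with a generic message.
    def validate_manifest(parsed_data):
        errors = []
        if not isinstance(parsed_data, list):
            return ["Manifest must be a list of segment descriptions"]
        for index, seg_dict in enumerate(parsed_data):
            if not isinstance(seg_dict, dict):
                errors.append("Index %d: not a JSON object" % index)
                continue
            missing = {'path'} - set(seg_dict)
            extraneous = set(seg_dict) - {'path', 'etag', 'size_bytes', 'range'}
            if missing:
                errors.append("Index %d: missing keys %s"
                              % (index, sorted(missing)))
            if extraneous:
                errors.append("Index %d: extraneous keys %s"
                              % (index, sorted(extraneous)))
            if seg_dict.get('size_bytes') is not None:
                try:
                    int(seg_dict['size_bytes'])
                except (TypeError, ValueError):
                    errors.append("Index %d: size_bytes is not an integer"
                                  % index)
        return errors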
Also fixed an annoyance with unspecified-size segments that are too
small. Previously, if you uploaded a segment reference with
'{"size_bytes": null, ...}' in it and the referenced segment was less
than 1 MiB, you'd get a response that looked like this:
HTTP/1.1 400 Bad Request
Content-Length: 62
Content-Type: text/html; charset=UTF-8
X-Trans-Id: txd9ee3b25896642098e4d9-0055dd095a
Date: Wed, 26 Aug 2015 00:33:30 GMT
Each segment, except the last, must be at least 1048576 bytes.
This is true, but not particularly helpful, since it doesn't tell you
which of your segments violated the rule.
Now you get something more like this:
HTTP/1.1 400 Bad Request
Content-Length: 49
Content-Type: text/plain
X-Trans-Id: tx586e52580bac4956ad8e2-0055dd09c2
Date: Wed, 26 Aug 2015 00:35:14 GMT
Errors:
/segs/small, Too Small; each segment, except the last...
It's not exactly a tutorial on SLO manifests, but at least it names
the problematic segment.
This also changes the status code for a self-referential manifest from
409 to 400. The rest of the error machinery was using 400, and
special-casing self-reference would be really annoying. Besides, now
that we're showing more than one error message at a time, what would
the right status code be for a manifest with a self-referential
segment *and* a segment with a bad range? 400? 409? 404.5? It's much
more consistent to just say invalid manifest --> 400.
Change-Id: I2275683230b36bc273319254e37c16b9e9b9d69c
If a server is overloaded or the timeout is set too low, swift-recon fails with
an unhandled socket.timeout exception.
This error should be processed the same way as HTTPError/URLError.
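A hedged sketch of the handling, loosely modeled on the shape of recon's
scout request; the real code in swift/cli/recon.py differs in detail:

    import json
    import socket
    from eventlet.green import urllib2

    def scout(base_url, recon_type, timeout=5):
        url = base_url + 'recon/' + recon_type
        try:
            body = urllib2.urlopen(url, timeout=timeout).read()
            content = json.loads(body)
            status = 200
        except (urllib2.HTTPError, urllib2.URLError, socket.timeout) as err:
            # treat a timed-out server like any other scout failure
            content = err
            status = -1
        return url, content, status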
Change-Id: Ide8843977ab224fa866097d0f0b765d6899c66b8
assertEquals is deprecated in py3; replace it with assertEqual.
Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
The builtin basestring type was removed in Python 3. Replace it with
six.string_types which works on Python 2 and Python 3.
Change-Id: Ib92a729682322cc65b41050ae169167be2899e2c
Currently, it is not possible to change the weight of the device with id=0
via the swift-ring-builder CLI. Instead of applying the change, the help is shown.
Example:
$ swift-ring-builder object.builder set_weight --id 0 1.00
But id=0 is what swift generates for the first device if no id is provided.
The --weight, --zone and --region options trigger the same bug.
The problem is that the validate_args function cannot detect the new
command format when zero is a valid value for some of the args.
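The underlying pattern is the classic truthiness check that treats 0 as
"not supplied"; a minimal illustration (the option names are stand-ins,
not the actual validate_args code):

    from argparse import Namespace

    def new_cmd_format(opts):
        # broken: an id (or weight/zone/region) of 0 is falsy, so the
        # new-style command is not recognized and the help text is shown
        return bool(opts.id or opts.weight or opts.zone or opts.region)

    def new_cmd_format_fixed(opts):
        # fixed: test "was the option supplied" instead of truthiness
        return any(v is not None
                   for v in (opts.id, opts.weight, opts.zone, opts.region))

    opts = Namespace(id=0, weight=None, zone=None, region=None)
    assert not new_cmd_format(opts)     # id 0 wrongly looks like "no option"
    assert new_cmd_format_fixed(opts)   # 0 is correctly treated as a value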
Change-Id: I4ee379c242f090d116cd2504e21d0e1904cdc2fc
The next() method of Python 2 generators was renamed to __next__().
Call the builtin next() function instead which works on Python 2 and
Python 3.
The patch was generated by the next operation of the sixer tool.
Change-Id: Id12bc16cba7d9b8a283af0d392188a185abe439d
The urllib, urllib2 and urlparse modules of Python 2 were reorganized
into a new urllib namespace on Python 3. Replace urllib, urllib2 and
urlparse imports with six.moves.urllib to make the modified code
compatible with Python 2 and Python 3.
The initial patch was generated by the urllib operation of the sixer
tool on: bin/* swift/ test/.
Change-Id: I61a8c7fb7972eabc7da8dad3b3d34bceee5c5d93
The unicode type was renamed to str in Python 3. Use six.text_type to
make the modified code compatible with Python 2 and Python 3.
The initial patch was generated by the unicode operation of the sixer
tool on: bin/* swift/ test/.
Change-Id: I9e13748ccde36ee8110756202d55d3ae945d4860
If a device has been removed from one of the rings, it is actually set to None
within the ring. In that case the device count is wrong unless the None
devices are filtered out. Moreover, if the length matched the condition but
included a removed device, the probetests would fail with a TypeError.
This could also be fixed in swift/common/ring/ring.py, but it seems to only
affect probetests right now, so the fix is applied there without changing the
current ring behavior.
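A minimal sketch of the guard, assuming ring.devs may contain None
placeholders for removed devices:

    # Removed devices remain in ring.devs as None; count only real entries.
    def live_devices(ring):
        return [dev for dev in ring.devs if dev is not None]

    # A probe test should compare against len(live_devices(ring)) rather
    # than len(ring.devs) before iterating over the devices.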
Change-Id: I8ccf9b32a51957e040dd370bc9f711d4328d17b1
This patch fixes the exception (AttributeError: 'list' object has no
attribute 'intersection') raised when the replicator tries to sync data
from a handoff to primary partitions in more than one remote region.
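For context, the traceback means a plain list was used where a set was
expected; a minimal reproduction of that failure mode (not the
replicator code itself):

    # Lists have no intersection() method; sets do.
    local_regions = [1, 2]
    remote_regions = [2, 3]
    try:
        local_regions.intersection(remote_regions)
    except AttributeError:
        pass                                         # the bug in a nutshell
    shared = set(local_regions).intersection(remote_regions)   # set([2])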
Change-Id: I565c45dda8c99d36e24dbf1145f2d2527d593ac0
Closes-Bug: 1503152
... which ensures no Timeouts remain pending after the parent generator
is closed when a client disconnects before being able to read the entire
body.
Also tighten up a few tests that may have left some open ECAppIter
generators lying about after the tests themselves had finished. This
has the side effect of preventing the extraneous printing of the Timeout
errors being raised by the eventlet hub in the background while our
unittests are running.
Change-Id: I156d873c72c19623bcfbf39bf120c98800b3cada
The rfc822 module has been deprecated since Python 2.3, and in
particular is absent from the Python 3 standard library. However, Swift
uses instances of rfc822.Message in a number of places, relying on its
behavior of immediately parsing the headers of a file-like object
without consuming the body, leaving the position of the file at the
start of the body. Python 3's http.client has an undocumented
parse_headers function with the same behavior, which inspired the new
parse_mime_headers utility introduced here. (The HeaderKeyDict returned
by parse_mime_headers doesn't have a `.getheader(key)` method like
rfc822.Message did; the dictionary-like `[key]` or `.get(key)` interface
should be used exclusively.)
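A usage sketch of the new utility as introduced here in
swift.common.utils (Python 2 semantics; details may differ):

    from io import BytesIO
    from swift.common.utils import parse_mime_headers

    doc = BytesIO(b"Content-Type: text/plain\r\n"
                  b"X-Object-Meta-Color: blue\r\n"
                  b"\r\n"
                  b"the body starts here")
    headers = parse_mime_headers(doc)
    assert headers['Content-Type'] == 'text/plain'   # dict-style access only
    assert doc.read() == b"the body starts here"     # body left unconsumed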
The implementation in this commit won't actually work on Python 3,
where email.parser.Parser().parsestr expects a Unicode string, but it
is believed that this can be addressed in followup work.
Change-Id: Ia5ee2ead67e36e8c6416183667f64ae255887736
We should never assign multiple replicas of the same partition to the
same device - our on-disk layout can only support a single replica of a
given part on a single device. We should not do this, so we validate
against it and raise a loud warning if this terrible state is ever
observed after a rebalance.
Unfortunately there are currently a couple of not-uncommon scenarios
which will trigger this state today:
1. If we have fewer devices than replicas
2. If a server's or zone's aggregate device weight makes it the most
appropriate candidate for multiple replicas and you're a bit unlucky
Fixing #1 would be easy, we should just not allow that state anymore.
Really we never did - if you have a 3 replica ring with one device - you
have one replica. Everything that iter_nodes'd would de-dupe. We
should just be insisting that you explicitly acknowledge your replica
count with set_replicas.
I have been lost in the abyss for days searching for a general solution
to #2. I'm sure it exists, but I will not have wrestled it to
submission by RC1. In the meantime we can eliminate a great deal of the
luck required simply by refusing to place more than one replica of a
part on a device in assign_parts.
The meat of the change is a small update to the .validate method in
RingBuilder. It basically unrolls a pre-existing (part, replica) loop
so that all the replicas of a part come out in order, letting us
build up the set of dev_ids to which all the replicas of a given part
are assigned, part by part.
If we observe any duplicates - we raise a warning.
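A simplified rendition of that pigeonhole check, assuming
_replica2part2dev is the usual list (one array per replica) mapping
partitions to device ids:

    # Simplified version of the duplicate-device check described above.
    def find_duplicate_assignments(replica2part2dev, num_parts):
        warnings = []
        for part in range(num_parts):
            devs_for_part = [r2p2d[part] for r2p2d in replica2part2dev
                             if part < len(r2p2d)]
            seen = set()
            for dev_id in devs_for_part:
                if dev_id in seen:
                    warnings.append('part %d assigned to dev %d more than once'
                                    % (part, dev_id))
                seen.add(dev_id)
        return warnings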
To clean the cobwebs out of the rest of the corner cases we're going to
delay get_required_overload from kicking in until we achieve dispersion,
and a small check was added when selecting a device subtier to validate
if it's already being used - picking any other device in the tier works
out much better. If no other devices are available in the tier - we
raise a warning. A more elegant or optimized solution may exist.
Many unittests did not meet criterion #1, but the fix was
straightforward after being identified by the pigeonhole check.
However, many more tests were affected by #2 - but again the fix was
simply adding more devices. The fantasy that all failure domains
contain at least replica-count devices is prevalent in both our ring
placement algorithm and its tests. These tests were trying to
demonstrate some complex characteristics of our ring placement algorithm
and I believe we just got a bit too carried away trying to find the
simplest possible example to demonstrate the desirable trait. I think
a better example looks more like a real ring - with many devices in each
server and many servers in each zone - I think more devices makes the
tests better. As much as possible I've tried to maintain the original
intent of the tests - when adding devices I've either spread the weight
out amongst them or added proportional weights to the other tiers.
I added an example straw man test to validate that three devices with
different weights in three different zones won't blow up. Once we can
do that without raising warnings and assigning duplicate device part
replicas - we can add more. And more importantly change the warnings to
errors - because we would much prefer to not do that #$%^ anymore.
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Related-Bug: #1452431
Change-Id: I592d5b611188670ae842fe3d030aa3b340ac36f9
Increase the number of nodes from which we require final successful
HTTP responses before we return success to the client on a write - to
the same number of nodes from which we'll require successful responses
to service a client read request.
Change-Id: Ifd36790faa0a5d00ec79c23d1f96a332a0ca0f0b
Related-Bug: #1469094
ssync currently does the wrong thing when replicating object dirs
containing both a .data and a .meta file. The ssync sender uses a
single PUT to send both object content and metadata to the receiver,
using the metadata (.meta file) timestamp. This results in the object
content timestamp being advanced to the metadata timestamp,
potentially overwriting newer object data on the receiver and causing
an inconsistency with the container server record for the object.
For example, replicating an object dir with {t0.data(etag=x), t2.meta}
to a receiver with t1.data(etag=y) will result in the creation of
t2.data(etag=x) on the receiver. However, the container server will
continue to list the object as t1(etag=y).
This patch modifies ssync to replicate the content of .data and .meta
separately using a PUT request for the data (no change) and a POST
request for the metadata. In effect, ssync replication replicates the
client operations that generated the .data and .meta files so that
the result of replication is the same as if the original client requests
had persisted on all object servers.
Apart from maintaining correct timestamps across sync'd nodes, this has
the added benefit of not needing to PUT objects when only the metadata
has changed and a POST will suffice.
Taking the same example, ssync sender will no longer PUT t0.data but will
POST t2.meta resulting in the receiver having t1.data and t2.meta.
The changes are backwards compatible: an upgraded sender will only sync
data files to a legacy receiver and will not sync meta files (fixing the
erroneous behavior described above); a legacy sender will operate as
before when sync'ing to an upgraded receiver.
Changes:
- diskfile API provides methods to get the data file timestamp
as distinct from the diskfile timestamp.
- diskfile yield_hashes return tuple now passes a dict mapping data and
meta (if any) timestamps to their respective values in the timestamp
field.
- ssync_sender will encode data and meta timestamps in the
  (hash_path, timestamp) tuple sent to the receiver during
  missing_checks (see the sketch after this list).
- ssync_receiver compares sender's data and meta timestamps to any
local diskfile and may specify that only data or meta parts are sent
during updates phase by appending a qualifier to the hash returned
in its 'wanted' list.
- ssync_sender now sends POST subrequests when a meta file
exists and its content needs to be replicated.
- ssync_sender may send *only* a POST if the receiver indicates that
is the only part required to be sync'd.
- object server will allow PUT and DELETE with earlier timestamp than
a POST
- Fixed TODO related to replicated objects with fast-POST and ssync
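As a rough illustration only (the encoding below is invented for
clarity and is not the actual ssync wire format; Timestamp is
swift.common.utils.Timestamp), the sender describes both timestamps
during missing_check and the receiver's 'wanted' reply says which parts
it needs:

    from swift.common.utils import Timestamp

    def encode_missing(object_hash, ts_data, ts_meta=None):
        # Sender: describe an object by its data and (optional) meta times.
        msg = '%s %s' % (object_hash, Timestamp(ts_data).internal)
        if ts_meta and Timestamp(ts_meta).internal != Timestamp(ts_data).internal:
            msg += ' m:%s' % Timestamp(ts_meta).internal
        return msg

    def parse_wanted(line):
        # Receiver: 'd' and/or 'm' qualifiers say which parts to send.
        object_hash, _, qualifiers = line.partition(' ')
        return object_hash, 'd' in qualifiers, 'm' in qualifiers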
Related spec change-id: I60688efc3df692d3a39557114dca8c5490f7837e
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Closes-Bug: 1501528
Change-Id: I97552d194e5cc342b0a3f4b9800de8aa6b9cb85b
Not that the current implementation is broken, just wasteful.
When a client specifies a range for an SLO segment that includes the
entire referenced object, we should drop the 'range' key from the
manifest that's stored on disk.
Previously, we would do this if the uploaded manifest included the
object-length for validation, but not if it didn't. Now we will
always drop the 'range' key if the entire segment is being used.
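A sketch of the normalization; 'range' is treated here as a
(first_byte, last_byte) pair, which is a simplification of what the
middleware actually tracks:

    def normalize_segment(seg_dict, segment_length):
        rng = seg_dict.get('range')
        if rng is not None:
            first_byte, last_byte = rng
            if first_byte == 0 and last_byte == segment_length - 1:
                # the range names the entire object -- don't store it in
                # the on-disk manifest
                del seg_dict['range']
        return seg_dict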
Change-Id: I69d2fff8c7c59b81e9e4777bdbefcd3c274b59a9
Related-Change: Ia21d51c2cef4e2ee5162161dd2c1d3069009b52c
In case of a COPY request the swift_owner flag was already set to True, and the
subsequent PUT request was granted access regardless of whether a service token
was used. This allowed copying data to service accounts without any service
token.
Service token unit tests have been added to verify that when
swift_owner is set to True in a request environ, this setting is
ignored when authorizing another request based on the same
environ. Applying only this test change on master fails currently, and
only passes with the fix in this patch.
Tempauth does not seem to be affected; however, a small doc update has been
added to make it clearer that a service token is not needed to access a service
account when an ACL is used.
Further details with an example are available in the bug report
(https://bugs.launchpad.net/swift/+bug/1483007).
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Hisashi Osanai <osanai.hisashi@jp.fujitsu.com>
Co-Authored-By: Donagh McCabe <donagh.mccabe@hp.com>
Closes-Bug: 1483007
Change-Id: I1207b911f018b855362b1078f68c38615be74bbd
In the EC PUT request case, the proxy-server may send commits to the
object-servers and thereby create a .durable file even though the request
failed due to a lack of a successful quorum.
For example:
- Consider the case where almost all object-servers fail with 422
Unprocessable Entity
- Using EC scheme 4 + 2
- 5 (quorum size) object-servers failed with 422, 1 object-server
succeeded with 201 Created
How it works:
- Client creates a PUT request
- Proxy will open connections to backend object-servers
- Proxy will send whole encoded chunks to object-servers
- Proxy will send content-md5 as footers.
- Proxy will get responses [422, 422, 422, 422, 422, 201] (currently
this list is regarded as "we have a quorum of responses")
- And then proxy will send commits to the object-servers (only the
object-server with the 201 will create a .durable file)
- Proxy will return 503 because the commit phase yields no successful
responses from the object-servers except the 201 node.
This patch fixes the quorum handling at ObjectController to check
that it has *successful* quorum responses before sending durable commits.
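A minimal sketch of the check, where statuses are the backend response
codes and quorum is the policy's quorum size:

    # Only 2xx statuses count toward the commit decision, so
    # [422, 422, 422, 422, 422, 201] no longer triggers the durable commit.
    def have_successful_quorum(statuses, quorum):
        successes = [s for s in statuses if 200 <= s < 300]
        return len(successes) >= quorum

    assert not have_successful_quorum([422, 422, 422, 422, 422, 201], quorum=5)
    assert have_successful_quorum([201, 201, 201, 201, 201, 422], quorum=5)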
Closes-Bug: #1491748
Change-Id: Icc099993be76bcc687191f332db56d62856a500f
This is a follow-up patch that fixes small nits from inline comments on
https://review.openstack.org/#/c/211338
plus some other typos in comments.
Change-Id: Ibf7dc5683b39d6662573dbb036da146174a965fd
For an object PUT request, the proxy server creates backend headers
(e.g. X-Container-Partition) which tell the object-servers which
container-server they should update. As many sets of these backend
headers are created as there are container replicas (i.e. with 3
replicas in the container ring, 3 sets of backend headers are created).
In the EC case, Swift fans out fragment archives to the backend
object-servers. The number of fragment archives is usually larger than
the container replica count, and the proxy-server treats a request as
successful when a quorum of object-servers has stored its fragment.
That can produce an orphaned object which is stored but never recorded
in the container.
For example, assuming k=10, m=4 and container replica=3:
the proxy-server attempts to open 14 backend streams but unfortunately
the first 3 nodes return 507 (disk failure) and Swift has no other
disks available.
In that case, the proxy keeps 11 backend streams and current Swift
considers that sufficient because it is greater than or equal to the
quorum (right now k+1 is sufficient, i.e. 11 backend streams are enough
to store). However, none of those 11 streams carries the container
update headers, so the request will succeed but the container will
never be updated.
This patch spreads the container update headers over quorum_size + 1
object nodes to ensure the updates happen. This costs a bit more in
duplicated container updates, but quorum_size + 1 seems a reasonable
price to pay (even in the replicated case) to ensure the updates, short
of making every object request include the update headers.
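A rough sketch of the idea, assuming a simple assignment of the update
headers to the first quorum_size + 1 backend requests (the real code
builds per-node X-Container-* headers and may spread them differently):

    import itertools

    def assign_container_updates(container_nodes, num_object_nodes,
                                 quorum_size):
        # Map each object node index to a container node (or None),
        # cycling through the container replicas for quorum_size + 1 nodes.
        num_with_updates = min(num_object_nodes, quorum_size + 1)
        container_iter = itertools.cycle(container_nodes)
        return [next(container_iter) if i < num_with_updates else None
                for i in range(num_object_nodes)]

    # k=10, m=4 -> 14 object nodes, quorum 11, 3 container replicas:
    assignments = assign_container_updates(['c1', 'c2', 'c3'], 14, 11)
    assert sum(a is not None for a in assignments) == 12   # quorum_size + 1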
Now Swift works as follows.
For example:
k=10, m=4, quorum_size=11 (k+1), 3 container replicas.
CU: container update
CA: commit ack
That results in something like:
CU CU CU CU CU CU CU CU CU CU CU CU
[507, 507, 507, 201, 201, 201, 201, 201, 201, 201, 201, 201, 201, 201]
CA CA CA CA CA
In this case, at least 3 container updates are saved.
For another example:
7 replicated objects, quorum_size=4 (7//2+1), 3 container replicas.
CU: container update
CA: commit ack (201s for successful PUT on replicated)
CU CU CU CU CU
[507, 507, 507, 201, 201, 201, 201]
CA CA CA CA
In this replicated case, at least 2 container updates are saved.
Cleaned up some unit tests so that modifying policies doesn't leak
between tests.
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Sam Merritt <sam@swiftstack.com>
Closes-Bug: #1460920
Change-Id: I04132858f44b42ee7ecf3b7994cb22a19d001d70
There are a few places in the PUT path where the object server is
reading WSGI input and can find that there's nothing there. e.g. in the
middle of a 2 phase commit and the proxy goes away for whatever reason,
like maybe it timed out because things are really busy. Anyway, this
results in the ugly ValueError coming out of eventlet.wsgi about a
zillion levels away from the PUT path.
Expanding on the test cases from lp bug #1496205 and lp bug #1469094,
this change carefully narrows in on our read/readline calls to
wsgi_input and makes sure to translate the ValueError to a
ChunkReadError - which the object.server can handle along with
ChunkReadTimeout. When it made sense, this change attempts to stay
consistent throughout the code path in logging/raising client disconnect
instead of timeout.
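A hedged sketch of the translation around the wsgi_input reads;
ChunkReadError is the real exception from swift.common.exceptions, but
the wrapper itself is a simplification of the object-server changes:

    from swift.common.exceptions import ChunkReadError

    def read_wsgi_chunk(wsgi_input, size):
        # Turn eventlet.wsgi's generic errors for a vanished client into
        # something the object server can handle like ChunkReadTimeout.
        try:
            return wsgi_input.read(size)
        except (ValueError, IOError) as err:
            raise ChunkReadError(str(err))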
It's unfortunate the error coming out of eventlet is so generic, but
that will be improved in future versions [1].
1. c3ce3eef0b
Related-Bug: #1469094
Related-Bug: #1496205
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I9e4dbf26623c0c6fc5c87afd14349466aa157385
Users can now include an optional 'range' field in segment descriptions
to specify which bytes from the underlying object should be used for the
segment data. Only one range may be specified per segment. Note that the
'etag' and 'size_bytes' fields still describe the backing object as a
whole. So, if a user uploads a manifest like:
[{"path": "/con/obj_seg_1", "etag": null, "size_bytes": 1048576,
"range": "0-1023"},
{"path": "/con/obj_seg_2", "etag": null, "size_bytes": 1048576,
"range": "512-4095"},
{"path": "/con/obj_seg_1", "etag": null, "size_bytes": 1048576,
"range": "-2048"}]
then the segment will consist of the first 1024 bytes of /con/obj_seg_1,
followed by bytes 513 through 4096 (inclusive) of /con/obj_seg_2, and
finally bytes 1046529 through 1048576 (i.e., the last 2048 bytes) of
/con/obj_seg_1.
ETag generation for SLOs has been updated to prevent collisions when
using different ranges for the same set of objects.
Additionally, there are two performance enhancements:
* On download, multiple sequential requests for segments from the same
underlying object will be coalesced into a single ranged request,
provided it still does not meet Swift's "egregious range requests"
criteria (see the sketch after this list).
* On upload, multiple sequential segments referencing the same object
will be validated against the response from a single HEAD request.
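As an illustration of the download-side coalescing (a simplification;
the cap below is a hypothetical stand-in for Swift's "egregious range
requests" limit):

    MAX_RANGES_PER_REQUEST = 50   # hypothetical cap for the sketch

    def coalesce_segments(segments):
        # segments: list of (path, (first_byte, last_byte)) tuples.
        # Returns a list of (path, [ranges]) request descriptions.
        requests = []
        for path, byte_range in segments:
            if (requests and requests[-1][0] == path
                    and len(requests[-1][1]) < MAX_RANGES_PER_REQUEST):
                requests[-1][1].append(byte_range)
            else:
                requests.append((path, [byte_range]))
        return requests

    segs = [('/con/obj', (0, 1023)), ('/con/obj', (512, 4095)),
            ('/con/other', (0, 99))]
    assert coalesce_segments(segs) == [('/con/obj', [(0, 1023), (512, 4095)]),
                                       ('/con/other', [(0, 99)])]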
Change-Id: Ia21d51c2cef4e2ee5162161dd2c1d3069009b52c
DocImpact
The ECObjectController was unconditionally sending down the frag archive
commit document after the client source stream terminated - even if the
client disconnected early.
We can detect early disconnect in two ways:
1. Content-Length and not enough bytes_transfered
When eventlet.wsgi is reading from a Content-Length body the
read call returns the empty string and our iterable raises
StopIteration - but we can check content-length against
bytes_transfered and know if the client disconnected.
2. Transfer-Encoding: chunked - w/o a 0\r\n\r\n
When eventlet.wsgi is reading from a Transfer-Encoding: chunked
body the socket read returns the empty string, eventlet.wsgi's
chunked parser raises ValueError (which we translate to
ChunkReadError*) and we know the client disconnected.
... if we detect either of these conditions the proxy should:
1. *not* send down the commit document to object servers
2. disconnect from backend servers
3. log the client disconnect
Oddly the code on master was only messing up the first part. Backend
connections were terminated (gracefully after the commit document), and
then the disconnect was being logged as 499.
So now we only send down the commit document on a successful complete
client HTTP request (either whole Content-Length, or clean
Transfer-Encoding: chunked 0\r\n\r\n).
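For the Content-Length case, a minimal sketch of the gate on the commit
document (names simplified from the proxy's EC PUT path):

    # Only a complete client body should lead to the commit document.
    def client_sent_complete_body(content_length, bytes_transferred,
                                  chunked_terminated):
        if content_length is not None:
            return bytes_transferred == content_length
        # Transfer-Encoding: chunked -- rely on having seen the 0\r\n\r\n
        return chunked_terminated

    # e.g. a disconnect after 100 of 200 bytes:
    assert not client_sent_complete_body(200, 100, chunked_terminated=False)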
* To detect the early disconnect on Transfer-Encoding: chunked a new
swift.common.exceptions.ChunkReadError is used to translate
eventlet.wsgi's more general IOError and ValueErrors into something
more appropriate to catch and handle closer to our generic
ChunkReadTimeout handling.
Co-Author: Alistair Coles <alistair.coles@hp.com>
Closes-Bug: #1496205
Change-Id: I028a530aba82d50baa4ee1d05ddce18d4cce4e81