There was a function in swift.common.utils that was importing
swob.HeaderKeyDict at call time. It couldn't import it at module import
time, since utils can't import from swob without blowing up with a
circular import error.
This commit just moves HeaderKeyDict into swift.common.header_key_dict
so that we can remove the inline import.
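For illustration only (the helper name below is hypothetical), this is the
shape of the pattern being removed and its replacement:

    # old pattern: defer the import to call time to dodge the utils <-> swob cycle
    def some_utils_helper(headers):
        from swift.common.swob import HeaderKeyDict
        return HeaderKeyDict(headers)

    # new pattern: header_key_dict has no dependency back on utils, so a
    # top-level import is safe
    from swift.common.header_key_dict import HeaderKeyDict

    def some_utils_helper(headers):
        return HeaderKeyDict(headers)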
Change-Id: I656fde8cc2e125327c26c589cf1045cb81ffc7e5
The proxy-server currently requires a Content-Length header in the
response when getting an object and does not support chunked transfer
with "Transfer-Encoding: chunked".
This doesn't matter in normal Swift, but it prevents us from adding
middlewares that do streaming processing of objects, since they cannot
calculate the length of their response body before they start to send
their response.
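As a hedged sketch (the filter class and transform() below are hypothetical,
not part of Swift), this is the kind of middleware that is blocked: it
streams a transformed body and so cannot know its final Content-Length
up front.

    class StreamingFilter(object):
        """Hypothetical WSGI middleware that rewrites the body as a stream."""

        def __init__(self, app):
            self.app = app

        def __call__(self, env, start_response):
            def _start_response(status, headers, exc_info=None):
                # the transformed length is unknown, so Content-Length must go;
                # the response would have to be sent with chunked transfer
                headers = [(h, v) for h, v in headers
                           if h.lower() != 'content-length']
                return start_response(status, headers, exc_info)

            app_iter = self.app(env, _start_response)
            return (transform(chunk) for chunk in app_iter)  # hypothetical transform()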
Change-Id: I60fc6c86338d734e39b7e5f1e48a2647995045ef
This commit ensures that the logger's thread_locals value is passed to,
and set in, the _get_conn_response methods executed in a green thread.
A Partial-Bug tag is used because the bug description suggests a more
complete fix; for now this commit is still worthwhile for the sake of
logging.
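A minimal sketch of the pattern (assuming a GreenAsyncPile-style pile and
Swift's LogAdapter, which exposes a thread_locals attribute):

    logger_thread_locals = self.app.logger.thread_locals

    def _get_conn_response(conn):
        # restore the originating request's logging context (txn id, client ip)
        # inside this green thread before any logging happens
        self.app.logger.thread_locals = logger_thread_locals
        return conn.getresponse()

    pile.spawn(_get_conn_response, conn)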
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I13bbf174fdca89318d69bb0674ed23dc9ec25b9a
Partial-Bug: #1409302
Since commit 4f2ed8bcd0468f3b69d5fded274d8d6b02ac3d10, the response
headers for a GET of an EC object don't include the Accept-Ranges header.
This patch fixes that and also adds a few unit tests to prevent regression.
Closes-Bug: #1542168
Change-Id: Ibafe56ac87b14bc0028953e620a653cd68dd3f84
Follow up for https://review.openstack.org/#/c/236007
This fixes the following minor items:
- Fix 'raise Exception class' syntax to 'raise Exception instance'
  (see the sketch after this list)
- Use eventlet.Timeout directly instead of the Timeout re-imported
  through swift.exceptions
- Instantiate Timeout without arguments (the first argument is the
  timeout in seconds, so there is no need to pass None when no limit
  is wanted)
- Add a message argument to some Exception instances
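A sketch of the first and third items (the exception name below is made up
for illustration):

    from eventlet import Timeout

    # First item -- raise an instance (with a message), not the bare class:
    #   old:  raise FragmentDecodeError
    #   new:  raise FragmentDecodeError('could not decode fragment payload')

    # Third item -- Timeout() takes the timeout seconds as its first argument,
    # so there is no need to pass None explicitly when no limit is wanted:
    with Timeout():          # instead of Timeout(None)
        pass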
Change-Id: Iab608cd8a1f4d3f5b4963c26b94ab0501837ffe1
In the _transfer_data method, we translate all (Exception, Timeout)
into a 499, whereas in certain error-return scenarios we should consider
translating them to a 500.
This affects both ReplicatedObjectController and ECObjectController.
Change-Id: I571bbc5b1451243907b094a5718c8735fd824268
Closes-Bug: 1504299
assertEquals is deprecated in py3; replace it with assertEqual.
Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
Co-Authored-By: Ondřej Nový <ondrej.novy@firma.seznam.cz>
The next() method of Python 2 generators was renamed to __next__().
Call the builtin next() function instead which works on Python 2 and
Python 3.
The patch was generated by the next operation of the sixer tool.
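For illustration (the iterator name is arbitrary):

    resp_iter = iter([b'first chunk', b'second chunk'])
    chunk = resp_iter.next()    # Python 2 only
    chunk = next(resp_iter)     # works on both Python 2 and Python 3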
Change-Id: Id12bc16cba7d9b8a283af0d392188a185abe439d
... which ensures no Timeouts remain pending after the parent generator
is closed when a client disconnects before being able to read the entire
body.
Also tighten up a few tests that may have left some open ECAppIter
generators lying about after the tests themselves had finished. This
has the side effect of preventing the extraneous printing of the Timeout
errors being raised by the eventlet hub in the background while our
unittests are running.
Change-Id: I156d873c72c19623bcfbf39bf120c98800b3cada
Increase the number of nodes from which we require final successful
HTTP responses before we return success to the client on a write, to
match the number of nodes we'll require successful responses from to
service a client read.
Change-Id: Ifd36790faa0a5d00ec79c23d1f96a332a0ca0f0b
Related-Bug: #1469094
On an object PUT request, the proxy server builds backend headers (e.g.
X-Container-Partition) which help the object-servers determine which
container-server they should update. These backend headers are created
once per container replica (i.e. with 3 replicas in the container ring,
3 sets of backend headers are created).
In the EC case, Swift fans out fragment archives to the backend
object-servers. The number of fragment archives is usually larger than
the number of container replicas, and the proxy-server treats a request
as successful once a quorum of object-servers has stored its fragment.
That can produce an orphaned object: one that is stored, but whose
container is never updated.
For example, assume k=10, m=4 and 3 container replicas:
The proxy-server attempts to open 14 backend streams, but the first 3
nodes return 507 (disk failure) and Swift has no other disks to fall
back to.
In that case, the proxy is left with 11 backend streams, and current
Swift treats that as sufficient because it meets the quorum (right now
k+1 is sufficient, i.e. 11 backend streams are enough to store).
However, none of those 11 streams carries the container update headers,
so the request succeeds but the container is never updated.
This patch spreads the container update headers across up to object
quorum_size + 1 nodes to ensure the updates land. This costs a little
extra in duplicate container updates, but quorum_size + 1 (even in the
replicated case) seems a reasonable price to pay for that guarantee,
compared to putting the update headers on every backend request.
With this patch, Swift works as follows.
For example:
k=10, m=4, quorum_size=11 (k+1), 3 container replicas.
CU: container update
CA: commit ack
The responses then look like:
CU CU CU CU CU CU CU CU CU CU CU CU
[507, 507, 507, 201, 201, 201, 201, 201, 201, 201, 201, 201, 201, 201]
CA CA CA CA CA
In this case, at least 3 container updates are saved.
For another example:
7 object replicas, quorum_size=4 (7//2+1), 3 container replicas.
CU: container update
CA: commit ack (201s for a successful PUT on replicated)
CU CU CU CU CU
[507, 507, 507, 201, 201, 201, 201]
CA CA CA CA
In this replicated case, at least 2 container updates are saved.
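A hedged sketch of the idea (function and variable names here are
illustrative, not Swift's actual implementation): container-update headers
are handed to the first quorum_size + 1 backend requests, cycling over the
container replicas so every container server is named at least once.

    import itertools

    def spread_container_updates(backend_headers, container_nodes,
                                 container_partition, quorum_size):
        """Attach container-update headers to quorum_size + 1 backend requests."""
        nodes = itertools.cycle(container_nodes)
        for headers in backend_headers[:quorum_size + 1]:
            node = next(nodes)
            headers['X-Container-Partition'] = container_partition
            headers['X-Container-Host'] = '%(ip)s:%(port)s' % node
            headers['X-Container-Device'] = node['device']
        return backend_headers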
Cleaned up some unit tests so that modifying policies doesn't leak
between tests.
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Sam Merritt <sam@swiftstack.com>
Closes-Bug: #1460920
Change-Id: I04132858f44b42ee7ecf3b7994cb22a19d001d70
And if they are not, exhaust the node iter to go get more. The
problem without this implementation is a simple overwrite where
a GET arrives before the handoff has pushed the newer object back
onto the 'alive again' node, so the proxy gets n-1 fragments
of the newest set and 1 of the older.
This patch bucketizes the fragments by etag and, if it doesn't
have enough, continues to exhaust the node iterator until it
has a large enough matching set.
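A hedged sketch of the bucketing (helper names are hypothetical; the etag
header and ec_ndata attribute follow Swift's conventions):

    from collections import defaultdict

    def collect_matching_fragments(node_iter, policy):
        buckets = defaultdict(list)
        for getter in fragment_getters(node_iter):     # hypothetical helper
            etag = getter.resp.getheader('X-Object-Sysmeta-Ec-Etag')
            buckets[etag].append(getter)
            if len(buckets[etag]) >= policy.ec_ndata:   # enough fragments to decode
                return buckets[etag]
        return None   # node iter exhausted without a large enough matching set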
Change-Id: Ib710a133ce1be278365067fd0d6610d80f1f7372
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Closes-Bug: 1457691
Increase the .durable quorum from 2 to "parity + 1" to guarantee
that we will never fail to rebuild an object. Otherwise, with too
few durable responses back (< parity + 1), the putter objects
return with their failed attribute set to true, thereby failing the
rebuild of fragments for an object.
Change-Id: I80d666f61273e589d0990baa78fd657b3470785d
Closes-Bug: 1484565
When a range GET (or COPY) is requested for an EC object, if the requested
range starts beyond the last segment alignment (i.e.
ceil(object_size/segment_size) * segment_size), the proxy server returns
the original content length with no body, whereas Swift should return an
error message as the body and the length of that message as the content
length.
The current behavior causes some clients (e.g. curl) to get stuck.
This patch fixes the proxy so that it returns a correct response even for
such an out-of-range request.
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I21f81c842f563ac4dddc69011ed759b744bb20bd
Closes-Bug: #1475499
There was some debug logging mixed in with some error handling on PUTs
that relied on the assumption that a very specific edge case would only
ever encounter backend responses containing the expected set of headers
needed to diagnose the failure.
But the backend responses may not always have the expected headers.
The proxy debug logging should be more robust to missing headers.
It's a little hard to follow, but if you look at `_connect_put_node` in
swift.proxy.controllers.obj you'll see that only a few connections can
make their way out of the initial put connection handling with a "resp"
attribute that is not None. In the happy path (e.g. 100-Continue) it's
explicitly set to None, and on most errors (Timeout, 503, 413, etc.) a new
connection will be established to the next node in the node iter.
Some status codes will, however, allow a conn to be returned for validation
in `_check_failure_put_connections`, i.e.
* 2XX (e.g. a 0-byte PUT would not send Expect: 100-Continue)
* 409 - Conflict with another timestamp
* 412 - If-None-Match that encounters another object
... so I added tests for those, fixing a TypeError along the way.
Change-Id: Ibdad5a90fa14ce62d081e6aaf40aacfca31b94d2
Currently, a COPY request for an EC object can fail with 499 Client
Disconnected because of the difference between the destination request's
content length and the bytes actually transferred.
That is because the conditional response status and content length for
an EC object range GET are only resolved when the response instance is
called in the proxy server. Calling the response instance (resp()) changes
the conditional status from 200 (HTTP_OK) to 206 (Partial Content) and
adjusts the content length for the range GET.
In the EC case, Swift sometimes needs the whole stored contents to decode
a segment. The object-server then returns a 200 HTTP OK response, and the
proxy-server unfortunately sets the whole content length as the
destination content length, which causes bug 1467677.
This patch introduces a new method, "fix_conditional_response", on
swift.common.swob.Response that calls _response_iter() and caches the
iterator in the Response instance. By calling it, Swift can set a correct
conditional response at any time after setting the whole content_length
on the response instance, as in the EC case.
Change-Id: If85826243f955d2f03c6ad395215c73daab509b1
Closes-Bug: #1467677
py2 zip() is eager, but py3 zip() and six.moves.zip() are lazy; this
changes the call sites that require eager evaluation.
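For example, a call site that later indexes or re-iterates the result needs
the eager form:

    keys, values = ['a', 'b'], [1, 2]
    pairs = list(zip(keys, values))    # eager on both py2 and py3
    assert pairs[0] == ('a', 1)        # indexing/re-use would break on a lazy zip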
Change-Id: Ic9f6bccd7f57772158581905794f8d23b05f4223
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.
Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
This commit lets clients receive multipart/byteranges responses (see
RFC 7233, Appendix A) for erasure-coded objects. Clients can already
do this for replicated objects, so this brings EC closer to feature
parity (ha!).
GetOrHeadHandler got a base class extracted from it that treats an
HTTP response as a sequence of byte-range responses. This way, it can
continue to yield whole fragments, not just N-byte pieces of the raw
HTTP response, since an N-byte piece of a multipart/byteranges
response is pretty much useless.
There are a couple of bonus fixes in here, too. For starters, download
resuming now works on multipart/byteranges responses. Before, it only
worked on 200 responses or 206 responses for a single byte
range. Also, BufferedHTTPResponse grew a readline() method.
Also, the MIME response for replicated objects got tightened up a
little. Before, it had some leading and trailing CRLFs which, while
allowed by RFC 7233, provide no benefit. Now, both replicated and EC
multipart/byteranges avoid extraneous bytes. This let me re-use the
Content-Length calculation in swob instead of having to either hack
around it or add extraneous whitespace to match.
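For reference, a multipart/byteranges response has roughly this shape
(after RFC 7233, Appendix A; the boundary and byte values are illustrative):

    HTTP/1.1 206 Partial Content
    Content-Type: multipart/byteranges; boundary=d74496d66958873e

    --d74496d66958873e
    Content-Type: application/octet-stream
    Content-Range: bytes 0-99/3000

    <first 100 bytes of the object>
    --d74496d66958873e
    Content-Type: application/octet-stream
    Content-Range: bytes 500-999/3000

    <bytes 500 through 999>
    --d74496d66958873e--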
Change-Id: I16fc65e0ec4e356706d327bdb02a3741e36330a0
On EC PUT in an M+K scheme, we require M+1 fragment archives to
durably land on disk. If we get that, then we go ahead and ask the
object servers to "commit" the object by writing out .durable
files. We only require 2 of those.
When we got exactly M+1 fragment archives on disk, and then one
connection timed out while writing .durable files, we should still be
okay (provided M is at least 3). However, we'd take our M > 2
remaining successful responses and pass that off to best_response()
with a quorum size of M+1, thus getting a 503 even though everything
worked well enough.
Now we pass 2 to best_response() to avoid that false negative.
There was also a spot where we were getting the quorum size wrong. If
we wrote out 3 fragment archives for a 2+1 policy, we were only
requiring 2 successful backend PUTs. That's wrong; the right number is
3, which is what the policy's .quorum() method says. There was a spot
where the right number wasn't getting plumbed through, but it is now.
Change-Id: Ic658a199e952558db329268f4d7b4009f47c6d03
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Closes-Bug: 1452468
The current post-as-copy routine (i.e. POST object with the post_as_copy
option turned on) in the Object Controller uses a "multipart-manifest"
query string, which is fed to env['copy_hook'] to decide which data (the
manifest or the object pointed to by the manifest) should be copied.
However, using the query string confuses operators looking at the logging
system (or analyzing the logs) because every such POST object request
carries 'multipart-manifest=get', like this:
POST /v1/AUTH_test/d4c816b24d38489082f5118599a67920/manifest-abcde%3Fmultipart-manifest%3Dget
We cannot tell whether the query string was added by hand (by the user)
or not. In addition, the query isn't needed in the backend conversation
between the proxy-server and the object-server; it is only needed by the
"copy_hook" on the proxy controller.
To remove the confusing query string and keep the logs clean, this patch
introduces a new environment variable, "swift.post_as_copy", and changes
the proxy controller and the copy_hook to use the new env.
This item was originally discussed at
https://review.openstack.org/#/c/177132/
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Change-Id: I0cd37520eea1825a10ebd27ccdc7e9162647233e
This commit makes it possible to PUT an object into Swift and have it
stored using erasure coding instead of replication, and also to GET
the object back from Swift at a later time.
This works by splitting the incoming object into a number of segments,
erasure-coding each segment in turn to get fragments, then
concatenating the fragments into fragment archives. Segments are 1 MiB
in size, except the last, which is between 1 B and 1 MiB.
    +====================================================================+
    |                            object data                             |
    +====================================================================+
                                       |
                 +--------------------+--------------------+
                 |                    |                     |
                 v                    v                     v
     +===================+  +===================+     +==============+
     |     segment 1     |  |     segment 2     | ... |  segment N   |
     +===================+  +===================+     +==============+
               |                      |
               |                      |
               v                      v
          /=========\            /=========\
          | pyeclib |            | pyeclib |   ...
          \=========/            \=========/
               |                      |
               |                      |
               +--> fragment A-1      +--> fragment A-2
               |                      |
               |                      |
               +--> fragment B-1      +--> fragment B-2
               |                      |
               |                      |
                   ...                    ...
Then, object server A gets the concatenation of fragment A-1, A-2,
..., A-N, so its .data file looks like this (called a "fragment archive"):
+=====================================================================+
| fragment A-1 | fragment A-2 | ... | fragment A-N |
+=====================================================================+
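A hedged sketch of that flow using pyeclib directly (the ec_type, the
parameters, and the read_segments helper are illustrative; Swift drives
this through its storage policy configuration):

    from pyeclib.ec_iface import ECDriver

    driver = ECDriver(k=10, m=4, ec_type='liberasurecode_rs_vand')

    archives = [b''] * (10 + 4)    # one growing fragment archive per object server
    for segment in read_segments(obj_body, 1024 * 1024):   # hypothetical helper
        fragments = driver.encode(segment)                  # k + m fragments
        for index, fragment in enumerate(fragments):
            archives[index] += fragment   # server `index` stores archives[index]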
Since this means that the object server never sees the object data as
the client sent it, we have to do a few things to ensure data
integrity.
First, the proxy has to check the Etag if the client provided it; the
object server can't do it since the object server doesn't see the raw
data.
Second, if the client does not provide an Etag, the proxy computes it
and uses the MIME-PUT mechanism to provide it to the object servers
after the object body. Otherwise, the object would not have an Etag at
all.
Third, the proxy computes the MD5 of each fragment archive and sends
it to the object server using the MIME-PUT mechanism. With replicated
objects, the proxy checks that the Etags from all the object servers
match, and if they don't, returns a 500 to the client. This mitigates
the risk of data corruption in one of the proxy --> object connections,
and signals to the client when it happens. With EC objects, we can't
use that same mechanism, so we must send the checksum with each
fragment archive to get comparable protection.
On the GET path, the inverse happens: the proxy connects to a bunch of
object servers (M of them, for an M+K scheme), reads one fragment at a
time from each fragment archive, decodes those fragments into a
segment, and serves the segment to the client.
When an object server dies partway through a GET response, any
partially-fetched fragment is discarded, the resumption point is wound
back to the nearest fragment boundary, and the GET is retried with the
next object server.
GET requests for a single byterange work; GET requests for multiple
byteranges do not.
There are a number of things _not_ included in this commit. Some of
them are listed here:
* multi-range GET
* deferred cleanup of old .data files
* durability (daemon to reconstruct missing archives)
Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
There is a standard LBYL race that can better be addressed by making the
EAFP case safer.
Capture 409 Conflict during expect on PUT
Similarly to how the proxy handles 412 on PUT, we will gather 409
responses in the proxy during _connect_put_node. Rather than skipping
backend servers that already have a synced copy of an object we will
accept their response and return 202 immediately.
This is particularly useful to internal clients who are using
X-Timestamp to sync transfers (e.g. container-sync and
container-reconciler).
No observable change in client-facing behavior, except that Swift is
faster to respond Accepted when it already has the data the client is
proposing to send.
Change-Id: Ie400d5bfd9ab28b290abce2e790889d78726095f
All GET or HEAD requests consistently error limit nodes that return 507
and increment errors for nodes responding with any other 5XX.
There were two places in the object PUT path where the proxy was error
limiting nodes and their behavior was inconsistent. During expect-100
connect we would only error_limit nodes on 507, and during response we
would increment errors for all 5XX series responses. This was pretty
hard to reason about, and the divergence in behavior was of questionable
value.
An audit of base controller highlighted where make_requests would apply
error_limit's on 507 but not increment errors on other 5XX responses.
Now anywhere we track errors on nodes we use error_limit on 507 and
error_occurred on any other 5XX series request. Additionally a Timeout
or Exception that is logged through exception_occurred will bump errors -
which is consistent with the approach in "Add Error Limiting to slow
nodes" [1].
1. https://review.openstack.org/#/c/112424/
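A hedged sketch of the now-uniform handling at each place a backend
response is accounted against a node (error_limit/error_occurred are the
existing proxy hooks; the surrounding names are illustrative):

    from swift.common.http import HTTP_INSUFFICIENT_STORAGE, is_server_error

    if response.status == HTTP_INSUFFICIENT_STORAGE:   # 507: error limit the node
        self.app.error_limit(node, 'ERROR Insufficient Storage')
    elif is_server_error(response.status):             # any other 5XX: bump errors
        self.app.error_occurred(
            node, 'ERROR %d from %s' % (response.status, node['ip']))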
Change-Id: I67e489d18afd6bdfc730bfdba76f85a2e3ca74f0
When an X-Backend-Timestamp is available, it is generally preferred
over a less specific value, and it sorts correctly against any
X-Timestamp values anyway.
Change-Id: I08b7eb37ab8bd6eb3afbb7dee44ed07a8c69b57e
This change adds an optional overrides map to the make_requests method
in the base Controller class:
    def make_requests(self, req, ring, part, method, path, headers,
                      query_string='', overrides=None)
The overrides map is passed on to the best_response method. If it is set
and no quorum is reached, the override map is used to attempt to find a
quorum.
The overrides map is in the form:
{ <response>: <override response>, .. }
The ObjectController's DELETE method now passes an override map to
make_requests in the base Controller class, in the form of:
{ 404: 204 }
Statuses/responses that have been overridden are used in the quorum
calculation but never returned to the user. They are replaced by:
(STATUS, '', '', '')
and left out of the search for the best response.
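A simplified sketch of the status mapping only (not the actual
best_response code):

    def override_statuses(statuses, overrides):
        # e.g. overrides={404: 204}: backend 404s count toward quorum as 204s,
        # but their bodies/headers are never returned to the user
        return [overrides.get(status, status) for status in statuses]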
Change-Id: Ibf969eac3a09d67668d5275e808ed626152dd7eb
Closes-Bug: 1318375
This adds a sanity check on x-delete headers as
part of the check_object_creation method.
Change-Id: If5069469e433189235b1178ea203b5c8a926f553
Signed-off-by: Thiago da Silva <thiago@redhat.com>
If upgrading from a non-storage-policy-enabled version of
swift to a storage-policy-enabled version, it's possible that
memcached will have an info structure that does not contain
the 'storage_policy' key, resulting in an unhandled exception
during the lookup. The fix is simply to make sure we never
return the dict without a storage_policy key defined; if it
doesn't exist it's safe to set it to '0', since this means you're
in the upgrade scenario and there's no other possibility.
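A minimal sketch of the guard (the cache lookup names are illustrative):

    cached_info = memcache.get(cache_key)
    if cached_info:
        # entries cached by a pre-policies proxy lack the key; policy 0 is
        # the only possibility in that upgrade scenario
        cached_info.setdefault('storage_policy', '0')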
Change-Id: If8e8f66d32819c5bfb2d1308e14643f3600ea6e9
In the proxy, container_info can return a 'storage_policy' of None. When
you set a header value on a swob.Request to None, that effectively just
deletes the key. One path through the proxy during container sync was
counting on 'X-Backend-Storage-Policy-Index' being set, which isn't
the case if the cached container_info is for a pre-policies container.
Also clean up some test cruft, tighten up the interface on FakeConn, and add
some object controller tests to exercise more interesting failure and handoff
code paths.
Change-Id: Ic379fa62634c226cc8a5a4c049b154dad70696b3
Objects now have a storage policy index associated with them as well;
this is determined by their filesystem path. Like before, objects in
policy 0 are in /srv/node/$disk/objects; this provides compatibility
on upgrade. (Recall that policy 0 is given to all existing data when a
cluster is upgraded.) Objects in policy 1 are in
/srv/node/$disk/objects-1, objects in policy 2 are in
/srv/node/$disk/objects-2, and so on.
* 'quarantined' dir already created 'objects' subdir so now there
will also be objects-N created at the same level
This commit does not address replicators, auditors, or updaters except
where method signatures changed. They'll still work if your cluster
has only one storage policy, though.
DocImpact
Implements: blueprint storage-policies
Change-Id: I459f3ed97df516cb0c9294477c28729c30f48e09
FakeLogger gets better log level handling.
Parameterize the logger on some daemons which were previously
unparameterized, and try to use the interface in tests.
FakeRing uses more real code.
The existing FakeRing mock's implementation bit me with a pretty subtle
character encoding issue by bypassing the hash_path code that is normally
part of get_part_nodes. This change tries to exercise more of the real
ring code paths where it makes sense and to provide a better Fake for use
in testing.
Add write_fake_ring helper to test.unit for when you need a real ring.
DocImpact
Implements: blueprint storage-policies
Change-Id: Id2e3740b1dd569050f4e083617e7dd6a4249027e
It seems that the test_connect_put_timeout() test does not always fail
when it is expected to. Sometimes, though not very often, the attempt to
connect succeeds, resulting in a failed test.
This might be because the fake-connection infrastructure uses a
sleep(0.1) and the test uses a connect timeout of 0.1. The two values can
land at exactly the same time, with the timer entries added in the wrong
order, so that the sleep() completes before the connect timeout fires and
the connect succeeds.
Closes bug 1302781
Change-Id: Ie23e40cf294170eccdf0713e313f9a31a92f9071
A common pattern that we see clients do is send a HEAD request before a
PUT to see if it exists. This can slow things down quite a bit
especially since 404s on HEAD are currently a bit expensive.
This change will allow a client to include a "If-None-Match: *" header
with a PUT request. In combination with "Expect: 100-Continue" this
allows the server to return that it already has a copy of the object
before any data is sent.
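An illustrative client-side use (the URL and token are placeholders; note
that the requests library does not implement Expect: 100-Continue, so this
only demonstrates the status-code behavior):

    import requests

    token = '<auth token>'
    resp = requests.put(
        'https://swift.example.com/v1/AUTH_test/c/o',
        headers={'X-Auth-Token': token, 'If-None-Match': '*'},
        data=open('local-file', 'rb'))
    if resp.status_code == 412:
        print('the cluster already has an object at this name')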
I attempted to also include etag support with the If-None-Match header,
but that turned out to have too many hairy edge cases, so it was left as
a future exercise.
DocImpact
Change-Id: I94e3754923dbe5faba065719c7a9afa9969652dd
The copy source must be container/object.
This patch prevents the server from returning
an internal server error when the user provides
a path without a container.
Fixes: bug #1255049
Change-Id: I1a85c98d9b3a78bad40b8ceba9088cf323042412
On GET, the proxy will go search the primary nodes plus some number of
handoffs for the account/container/object before giving up and
returning a 404. That number is, by default, twice the ring's replica
count. This was fine if your ring had an integral number of replicas,
but could lead to some slightly-odd behavior if you have fractional
replicas.
For example, imagine that you have 3.49 replicas in your object ring;
perhaps you're migrating a cluster from 3 replicas to 4, and you're
being smart and doing it a bit at a time.
On an object GET where all the primary nodes 404ed, the proxy would
then compute 2 * 3.49 = 6.98, round it up to 7, and go look at 7
handoff nodes. This is sort of weird; the intent was to look at 6
handoffs for objects with 3 replicas, and 8 handoffs for objects with
4, but the effect is 7 for everybody.
You also get little latency cliffs as you scale up replica counts. If,
instead of 3.49, you had 3.51 replicas, then the proxy would look at 8
handoff nodes in every case [ceil(2 * 3.51) = 8], so there'd be a
small-but-noticeable jump in the time it takes to produce a 404.
The fix is to compute the number of handoffs based on the number of
primary nodes for the partition, not the ring's replica count. This
gets rid of the little latency cliffs and makes the behavior more like
what you get with integral replica counts.
If your ring has an integral number of replicas, there's no behavior
change here.
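Roughly, the computation changes like this (names follow the ring
interface, but this is only a sketch):

    import math

    # old: based on the ring's (possibly fractional) replica count
    nodes_to_try = int(math.ceil(2 * object_ring.replica_count))     # 3.49 -> 7

    # new: based on this partition's actual number of primary nodes
    nodes_to_try = 2 * len(object_ring.get_part_nodes(partition))    # 3 or 4 -> 6 or 8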
Change-Id: I50538941e571135299fd6b86ecd9dc780cf649f5