* Add a new config option, proxy_base_url
* Support HTTPS as well as HTTP connections
* Monkey-patch eventlet early so we never import an unpatched version
from swiftclient
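A minimal sketch of the early-patching pattern (the guarded import is
illustrative):

    import eventlet
    eventlet.monkey_patch()  # patch the stdlib before anything else imports it

    import swiftclient  # now guaranteed to see the patched socket module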
Change-Id: I4945d512966d3666f2738058f15a916c65ad4a6b
The repo is Python using both Python 2 and 3 now, so update hacking to
version 2.0 which supports Python 2 and 3. Note that the latest hacking
release, 3.0, only supports Python 3.
Fix problems found.
Remove hacking and friends from lower-constraints; they are not needed
for installation.
Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
This patch adds a new object versioning mode. This new mode provides
a new set of APIs for users to interact with older versions of an
object. It also changes the naming scheme of older versions and adds
a version-id to each object.
This new mode is not backwards compatible or interchangeable with the
other two modes (i.e., stack and history), especially due to the changes
in the naming scheme of older versions. This new mode will also serve
as a foundation for adding S3 versioning compatibility in the s3api
middleware.
Note that this does not (yet) support using a versioned container as
a source in container-sync. Container sync should be enhanced to sync
previous versions of objects.
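A hedged sketch of the new API surface (exact header and parameter
names should be checked against the middleware documentation):

    import requests

    base = 'http://saio:8080/v1/AUTH_test'
    hdrs = {'X-Auth-Token': 'AUTH_tk...'}

    # enable the new versioning mode on a container
    requests.post(base + '/c',
                  headers=dict(hdrs, **{'X-Versions-Enabled': 'true'}))

    # every write now gets a version-id back
    resp = requests.put(base + '/c/o', data=b'v2', headers=hdrs)
    version_id = resp.headers.get('X-Object-Version-Id')

    # older versions can be read (or deleted) by version-id
    requests.get(base + '/c/o', params={'version-id': version_id},
                 headers=hdrs)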
Change-Id: Ic7d39ba425ca324eeb4543a2ce8d03428e2225a1
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
This allows static symlinks to be synced before their target. Dynamic
symlinks could already be synced even if the target object had not been
synced, but static links previously required that the target object
exist before they could be PUT. Now, have the container_sync middleware
plumb in an override like it does for SLO.
Change-Id: I3bfc62b77b247003adcee6bd4d374168bfd6707d
This works fine; we continue processing the other rows in the DB. But it
*does* take longer than it really ought to. See the related bug; we
ought to be able to shave some 17s off the test time by not retrying
on the 404.
Change-Id: I9ca2511651e9b2bc0045894baa4062d20bc15369
Related-Bug: #1849841
If we move it to constraints, it's more globally accessible in our code,
but more importantly it's more obvious to ops that everything breaks if
you try to mis-configure different values per-service.
Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
Previously, if you were on Python 2.7.10+ [0], a newline in the
container path would cause the sharder to fail, complaining about
invalid header values when trying to create the shard containers. On
older versions of Python, it would most likely cause a parsing error in
the container-server that was trying to handle the PUT.
Now, quote all places that we pass around container paths. This includes:
* The X-Container-Sysmeta-Shard-(Quoted-)Root sent when creating the (empty)
remote shards
* The X-Container-Sysmeta-Shard-(Quoted-)Root included when initializing the
local handoff for cleaving
* The X-Backend-(Quoted-)Container-Path the proxy sends to the object-server
for container updates
* The Location header the container-server sends to the object-updater
Note that a new header was required in requests so that servers would
know whether the value should be unquoted or not. We can get away with
reusing Location in responses by having clients opt-in to quoting with
a new X-Backend-Accept-Quoted-Location header.
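A minimal sketch of the round-trip using stdlib quoting (Swift's own
helpers may differ in detail):

    from urllib.parse import quote, unquote

    path = '.shards_AUTH_test/container\nwith-newline'
    quoted = quote(path)  # newline-free, safe as a header value
    assert unquote(quoted) == path  # receiver recovers the original path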
During a rolling upgrade,
* old object-servers servicing requests from new proxy-servers will
not know about the container path override and so will try to update
the root container,
* in general, object updates are more likely to land in the root
container; the sharder will deal with them as misplaced objects, and
* shard containers created by new code on servers running old code
will think they are root containers until the server is running new
code, too; during this time they'll fail the sharder audit and report
stats to their account, but both of these should get cleared up upon
upgrade.
Drive-by: fix a "conainer_name" typo that prevented us from testing that
we can shard a container with unicode in its name. Also, add more UTF8
probe tests.
[0] See https://bugs.python.org/issue22928
Change-Id: Ie08f36e31a448a547468dd85911c3a3bc30e89f1
Closes-Bug: 1856894
Starting with Python 3.4, newly-created file descriptors are non-inheritable
[0], which causes trouble when we try to use a pipe for IPC. Fortunately, the
same PEP that implemented this change *also* provided a new API to mark file
descriptors as being inheritable -- so just do that.
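A minimal sketch of the fix, using the PEP 446 API:

    import os

    r, w = os.pipe()             # on py3.4+, both ends are non-inheritable
    os.set_inheritable(r, True)  # explicitly mark them inheritable again
    os.set_inheritable(w, True)  # so they survive a re-exec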
While we're at it,
* Fix up the probe tests to work on py3
* Fix up the probe tests to work when policy-0 is erasure-coded
* Decode the bytes read so py3 doesn't log a b'pid'
* Log a warning if the read() is empty; something surely went wrong
in the re-exec
[0] https://www.python.org/dev/peps/pep-0446/
Change-Id: I2a8a9f3dc78abb99bf9cbcf6b44c32ca644bb07b
Related-Change: I3e5229d2fb04be67e53533ff65b0870038accbb7
Reserve the namespace starting with the NULL byte for internal
use-cases. Backend services will allow path names to include the NULL
byte in urls and validate names in the reserved namespace. Database
services will filter all names starting with the NULL byte from
responses unless the request includes the header:
X-Backend-Allow-Reserved-Names: true
The proxy server will not allow path names to include the NULL byte in
urls unless a middleware has set the X-Backend-Allow-Reserved-Names
header. Middlewares can use the reserved namespace to create objects
and containers that cannot be directly manipulated by clients. Any
objects and bytes created in the reserved namespace will be aggregated
into the user's account totals.
When deploying internal proxies, developers and operators may configure
the gatekeeper middleware to translate the X-Allow-Reserved-Names header
to the Backend header so they can manipulate the reserved namespace
directly through the normal API.
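A hedged sketch of how a middleware might use the namespace (the
container name here is illustrative):

    RESERVED = '\x00'  # names starting with NULL are reserved

    # a middleware-managed container clients can't address directly
    hidden = RESERVED + 'versions' + RESERVED + 'user-container'
    headers = {'X-Backend-Allow-Reserved-Names': 'true'}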
UpgradeImpact: it's not safe to roll back from this change
Change-Id: If912f71d8b0d03369680374e8233da85d8d38f85
Swift servers can now be seamlessly reloaded by sending them a SIGUSR1
(instead of a SIGHUP). The server forks off a synchronized child to
wait to close the old listen socket(s) until the new server has started
up and bound its listen socket(s). The new server is exec'ed from the
old one so its PID doesn't change. This makes Systemd happier, so a
ReloadExec= stanza can now be used.
The seamless part means that incoming connections will always get
accepted either by the old server or the new one. This eliminates
client-perceived "downtime" during server reloads, while allowing the
server to fully reload, re-reading configuration, becoming a fresh
Python interpreter instance, etc. The SO_REUSEPORT socket option was
already in use, so nothing had to change there.
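A hedged sketch of the hand-off; the synchronization channel and helper
details are illustrative, not Swift's actual implementation:

    import os
    import sys

    def seamless_reload(listen_socks):
        rfd, wfd = os.pipe()
        if os.fork() == 0:       # child: keep the old socks accepting
            os.close(wfd)
            os.read(rfd, 1)      # block until the new server is bound
            for sock in listen_socks:
                sock.close()     # hand over to the new server's socks
            os._exit(0)
        os.close(rfd)
        os.set_inheritable(wfd, True)        # survives the exec below
        os.environ['NOTIFY_FD'] = str(wfd)   # hypothetical readiness channel
        os.execv(sys.executable, [sys.executable] + sys.argv)  # same PID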
This patch also includes a non-invasive fix for a current eventlet bug;
see https://github.com/eventlet/eventlet/pull/590
That bug prevents a SIGHUP "reload" from properly servicing existing
requests before old worker processes close sockets and exit. The
existing probe tests missed this, but the new ones, in this patch, caught
it.
New probe tests cover both old SIGHUP "reload" behavior as well as the
new SIGUSR1 seamless reload behavior.
Change-Id: I3e5229d2fb04be67e53533ff65b0870038accbb7
These are known not to work until https://bugs.python.org/issue37093
is addressed in CPython upstream.
Change-Id: I4a6877907d14b632a9a477c887913488427b62b7
AWS seems to support this, so let's allow s3api to do it, too.
Previously, S3 clients trying to use multi-character delimiters would
get 500s back, because s3api didn't know how to handle the 412s that the
container server would send.
As long as we're adding support for container listings, may as well do
it for accounts, too.
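A hedged example of what now works (endpoint and credentials are
illustrative):

    import boto3

    s3 = boto3.client('s3', endpoint_url='http://saio:8080',
                      aws_access_key_id='test:tester',
                      aws_secret_access_key='testing')
    # multi-character delimiters previously drew a 500 from s3api
    resp = s3.list_objects(Bucket='bucket', Delimiter='-segments/')
    prefixes = [p['Prefix'] for p in resp.get('CommonPrefixes', [])]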
Change-Id: I62032ddd50a3493b8b99a40fb48d840ac763d0e7
Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
Closes-Bug: #1797305
There's still one problem, though: since swiftclient on py3 doesn't
support non-ASCII characters in metadata names, none of the tests in
TestReconstructorRebuildUTF8 will pass.
Change-Id: I4ec879ade534e09c3a625414d8aa1f16fd600fa4
We previously realized we needed to do that for accounts and containers
where the consequences of treating the 404 as authoritative were more
obvious: we'd cache the non-existence, which prevented writes until it
fell out of cache.
The same basic logic applies for objects, though: if we see
(Timeout, Timeout, Timeout, 404, 404, 404)
on a triple-replica policy, we don't really have any reason to think
that a 404 is appropriate. In fact, it seems reasonably likely that
there's a thundering-herd problem where there are too many concurrent
requests for data that *definitely is there*. By responding with a 503,
we apply some back-pressure to clients, who hopefully have some
exponential backoff in their retries.
The situation gets a bit more complicated with erasure-coded data, but
the same basic principle applies. We're just more likely to have
confirmation that there *is* data out there; we just can't reconstruct
it (right now).
Note that we *still want to check* those handoffs, of course. Our
fail-in-place strategy has us replicate (and, more recently,
reconstruct) to handoffs to maintain durability; it'd be silly *not* to
look.
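A hedged sketch of the decision (the quorum math is simplified and the
function is not the proxy's actual code):

    def best_status(primary_statuses, replicas):
        # e.g. (Timeout, Timeout, Timeout) from primaries plus
        # (404, 404, 404) from handoffs on a 3-replica policy
        confirmed_404s = primary_statuses.count(404)
        if confirmed_404s < (replicas // 2 + 1):
            return 503  # back-pressure instead of an unfounded 404
        return 404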
UpgradeImpact:
--------------
Be aware that this may cause an increase in 503 Service Unavailable
responses served by proxy-servers. However, this should more accurately
reflect the state of the system.
Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
Change-Id: Ia832e9bab13167948f01bc50aa8a61974ce189fb
Closes-Bug: #1837819
Related-Bug: #1833612
Related-Change: I53ed04b5de20c261ddd79c98c629580472e09961
Related-Change: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
Previously, we issued a GET to the root container for every object PUT,
POST, and DELETE. This puts load on the container server, potentially
leading to timeouts, error limiting, and erroneous 404s (!).
Now, cache the complete set of 'updating' shards, and find the shard for
this particular update in the proxy. Add a new config option,
recheck_updating_shard_ranges, to control the cache time; it defaults to
one hour. Set to 0 to fall back to previous behavior.
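A hedged sketch of the lookup (the cache key and helper names are
illustrative):

    def get_update_shard(memcache, account, container, obj):
        key = 'shard-updating/%s/%s' % (account, container)
        shards = memcache.get(key)
        if shards is None:
            shards = fetch_updating_shards(account, container)  # GET root
            memcache.set(key, shards, time=3600)  # recheck_updating_shard_ranges
        return find_shard_range(obj, shards)  # bisect on shard bounds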
Note that we should be able to tolerate stale shard data just fine; we
already have to worry about async pendings that got written down with
one shard but may not get processed until that shard has itself sharded
or shrunk into another shard.
Also note that memcache has a default value limit of 1MiB, which may be
exceeded if a container has thousands of shards. In that case, set()
will act like a delete(), causing increased memcache churn but otherwise
preserving existing behavior. In the future, we may want to add support
for gzipping the cached shard ranges as they should compress well.
Change-Id: Ic7a732146ea19a47669114ad5dbee0bacbe66919
Closes-Bug: 1781291
Give storage nodes more time to complete requests for multi-node upgrade
and probe tests.
Also slightly decouple probetests from default configs.
Change-Id: I334ef517d833916a3b7be3151a812d4f9c66a6e1
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.
Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left and right nodes, we now select a third
node from the far side of the ring. If any of these partners responds
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.
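A hedged sketch of the expanded partner selection (indices wrap modulo
the primary count; not necessarily the reconstructor's exact code):

    def get_sync_partners(frag_index, part_nodes):
        n = len(part_nodes)
        return [
            part_nodes[(frag_index - 1) % n],       # left neighbor
            part_nodes[(frag_index + 1) % n],       # right neighbor
            part_nodes[(frag_index + n // 2) % n],  # far side of the ring
        ]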
To prevent ssync (which is uninterruptible) from receiving a 409
(Conflict), we must give the remote handoff node the correct
backend_index for the fragments it will receive. In the common case we
will use deterministically different handoffs for each fragment index to
prevent multiple unmounted primary disks from forcing a single handoff
node to hold more than one rebuilt fragment.
Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until that primary is remounted or
rebalanced away. After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
Closes-Bug: #1510342
Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
In test_replication_servers_working, we delete a bunch of directories
without deleting hashes.pkl, then verify that nothing at that level is a
directory.
This would be trivially true except that throughout the test, we have
the replicators running constantly. However, we never verified that the
replicators actually *have* run and had a chance to re-create the
missing directories.
Now, stop the replicators before doing the deletes, run them
synchronously between doing the deletes and verifying that there are no
directories, and start them again before the final set of assertions.
Change-Id: I841f8250eb7abfb0fcdfca5c106f65e6e94dce0c
When we abort the replication process because we've got shard ranges and
the sharder is now responsible for ensuring object-row durability, we
log a warning like "refusing to replicate objects" which sounds scary.
That's because it *is*, of course -- if the sharder isn't running,
whatever rows that DB has may only exist in that DB, meaning we're one
drive failure away from losing track of them entirely.
However, when the sharder *is* running and everything's happy, we reach
a steady-state where the root containers are all sharded and none of
them have any object rows to lose. At that point, the warning does more
harm than good.
Only print the scary "refusing to replicate" warning if we're still
responsible for some object rows, whether deleted or not.
Change-Id: I35de08d6c1617b2e446e969a54b79b42e8cfafef
Note that eventlet 0.22.0+ closes connections between requests when
it stops accepting connections.
Partial-Bug: #1792615
Change-Id: Ia8d9ab95e2aad40e8d797acc3423a917e809ffdb
Otherwise, a sharded container AUTH_test/sharded will have its stats
included in the totals for both AUTH_test *and* .shards_AUTH_test
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I7fa74e13347601c5f44fd7e6cf65656cc3ebc2c5
Resolve outstanding TODOs. One TODO is removed because there isn't an
easy way to arrange for an async pending to be targeted at a shard
container.
Change-Id: I0b003904f73461ddb995b2e6a01e92f14283278d
Previously test_misplaced_object_movement() deleted objects from both
shards and then relied on override-partitions option to selectively
run the sharder on root or shard containers and thereby control when
each shard range was identified for shrinking. This approach is flawed
when the second shard container lands in the same partition as the
root: running the sharder on the empty second shard's partition would
also cause the sharder to process the root and identify the second
shard for shrinking, resulting in premature shrinking of the second
shard.
Now, objects are only deleted from each shard range at the point when
we want that shard to shrink.
Change-Id: I9f51621e8414e446e4d3f3b5027f6c40e01192c3
Drive-by: use the run_sharders() helper more often.
Before, merge_objects() always used storage policy index of 0 when
inserting a fake misplaced object into a shard container. If the shard
broker had a different policy index then the misplaced object would
not show up in listings, causing test_misplaced_object_movement() to
fail. This test bug might be exposed by having policy index 0 be an EC
policy, since the probe test requires a replication policy and would
therefore choose a non-zero policy index.
The fix is simply to specify the shard's policy index when inserting
the fake object.
Change-Id: Iec3f8ec29950220bb1b2ead9abfdfb1a261517d6
The sharder daemon visits container dbs and, when necessary, executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them into a container in order to trigger
sharding. This is currently the recommended way to shard a container.
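A hedged example of that manual workflow (the db path and row count are
illustrative; check the tool's help for exact subcommands):

    swift-manage-shard-ranges <container-db> find 500000 > shard_ranges.json
    swift-manage-shard-ranges <container-db> replace shard_ranges.json
    swift-manage-shard-ranges <container-db> enable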
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
...in preparation for the container sharding feature.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I4455677abb114a645cff93cd41b394d227e805de
We've already got it in the response, so we may as well apply it now rather
than wait for the other end to get around to running its replicators.
Change-Id: Ie36a6dd075beda04b9726dfa2bba9ffed025c9ef
When PUTting an object with `If-None-Match: *`, we rely on 100-continue
support: the proxy checks the responses from all object-servers, and if
any of them respond 412, it closes down the connections. When there's
actual data for the object, this ensures that even nodes that *don't*
respond 412 will hit a ChunkReadTimeout and abort the PUT.
However, if the client does a PUT with a Content-Length of 0, that would
get sent all the way to the object server, which had all the information
it needed to respond 201. After replication, the PUT propagates to the
other nodes and the old object is lost, despite the client receiving a
412 indicating the operation failed.
Now, when PUTting a zero-byte object, switch to a chunked transfer so
the object-server still gets a ChunkReadTimeout.
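A hedged sketch of the proxy-side switch (attribute spellings follow
swob conventions but are illustrative here):

    if req.headers.get('If-None-Match') == '*' and req.content_length == 0:
        # send it chunked so object-servers that ought to fail still
        # hit a ChunkReadTimeout when we tear down the connections
        req.headers['Transfer-Encoding'] = 'chunked'
        req.headers.pop('Content-Length', None)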
Change-Id: Ie88e41aca2d59246c3134d743c1531c8e996f9e4
After deleting an object, the object expirer deletes the corresponding
row from the expirer queue by making DELETE requests directly to the
container servers. The same thing happens after attempting to delete
an object, but failing because the object has already been deleted. If
the DELETE requests fail, then the expirer will encounter that row
again on its next pass and retry the DELETE at that time. Therefore,
it is not necessary for the object server to write an async_pending
for that queue row's deletion.
Currently, however, two of the object servers do write such
async_pendings. Given Rc container replicas, that's 2 * Rc updates
from async_pendings and another Rc from the object expirer
directly. Given a typical Rc of 3, that's 9 container updates per
expiring object.
This commit makes the object server write no async_pendings for DELETE
requests coming from the object expirer. This reduces the number of
container server requests to Rc (typically 3), all issued directly
from the object expirer.
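A hedged sketch of the object-server check (the header spelling is
purely illustrative of the opt-out mechanism):

    def should_write_async_pending(req):
        # the expirer retries failed queue-row DELETEs on its next
        # pass, so its DELETEs need no async_pending fallback
        return req.headers.get('X-Backend-From-Expirer') != 'true'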
Closes-Bug: 1076202
Change-Id: Icd63c80c73f864d2561e745c3154fbfda02bd0cc