255 Commits

Author SHA1 Message Date
Tim Burke
99947150dd func tests: work with etag-quoter on by default
Also, run the in-process encryption func tests like that.

Change-Id: I984ab8d1304d23b89589973950b10dda4aea0db3
2020-06-01 18:38:23 -05:00
Zuul
948289151b Merge "probe tests: Work when fronted by a TLS terminator" 2020-05-05 00:45:51 +00:00
Tim Burke
630c9ef809 probe tests: Work when fronted by a TLS terminator
* Add a new config option, proxy_base_url
* Support HTTPS as well as HTTP connections
* Monkey-patch eventlet early so we never import an unpatched version
  from swiftclient

Change-Id: I4945d512966d3666f2738058f15a916c65ad4a6b
2020-05-04 10:54:01 -07:00
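The early monkey-patching mentioned above boils down to the standard eventlet pattern; a minimal sketch (nothing Swift-specific is assumed beyond the eventlet and swiftclient packages):

    # Patch the standard library before anything imports swiftclient, so the
    # probe tests never end up holding an unpatched copy of socket/ssl/etc.
    import eventlet
    eventlet.monkey_patch()

    import swiftclient  # noqa: E402 -- must happen after monkey_patch()
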
Tim Burke
d0f0d1d4f3 sharding: Add probe test that exercises swift-manage-shard-ranges
Change-Id: Ic7c40589679c290e5565f9581f70b9a1c070f6ab
2020-04-20 18:46:31 -07:00
Tim Burke
668242c422 pep8: Turn on E305
Change-Id: Ia968ec7375ab346a2155769a46e74ce694a57fc2
2020-04-03 21:22:38 +02:00
Andreas Jaeger
96b56519bf Update hacking for Python3
The repo now uses both Python 2 and 3, so update hacking to version
2.0, which supports both. Note that the latest hacking release, 3.0,
only supports Python 3.

Fix problems found.

Remove hacking and friends from lower-constraints; they are not needed
for installation.

Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
2020-04-03 21:21:07 +02:00
Tim Burke
ff885d30e4 py3: Fix up probe tests
Change-Id: Ic0f54f393002e2170e7f1459625ee5a2b37df900
2020-02-03 19:26:17 -08:00
Zuul
17e57e38cd Merge "probe: Add test for syncing a delete when the remote 404s" 2020-01-31 22:45:24 +00:00
Clay Gerrard
2759d5d51c New Object Versioning mode
This patch adds a new object versioning mode. This new mode provides
a new set of APIs for users to interact with older versions of an
object. It also changes the naming scheme of older versions and adds
a version-id to each object.

This new mode is not backwards compatible or interchangeable with the
other two modes (i.e., stack and history), especially due to the changes
in the naming scheme of older versions. This new mode will also serve
as a foundation for adding S3 versioning compatibility in the s3api
middleware.

Note that this does not (yet) support using a versioned container as
a source in container-sync. Container sync should be enhanced to sync
previous versions of objects.

Change-Id: Ic7d39ba425ca324eeb4543a2ce8d03428e2225a1
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
2020-01-24 17:39:56 -08:00
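A hedged sketch of how a client might use the new mode, assuming versioning is toggled per container with an X-Versions-Enabled flag and old copies are addressed by a version-id query parameter (the endpoint, token, and names below are placeholders):

    import requests

    base = 'https://swift.example.com/v1/AUTH_test'  # placeholder endpoint
    hdrs = {'X-Auth-Token': 'tk-placeholder'}

    # Turn versioning on for a container.
    requests.post(base + '/docs',
                  headers=dict(hdrs, **{'X-Versions-Enabled': 'true'}))

    # Overwriting an object now keeps the old copy; the response is assumed
    # to carry the new copy's version-id.
    resp = requests.put(base + '/docs/report.txt', data=b'draft 2',
                        headers=hdrs)
    version_id = resp.headers.get('X-Object-Version-Id')

    # Older versions stay reachable via their version-id.
    requests.get(base + '/docs/report.txt',
                 params={'version-id': version_id}, headers=hdrs)
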
Thiago da Silva
26ff2eb1cb container-sync: Sync static links similar to how we sync SLOs
This allows static symlinks to be synced before their target. Dynamic
symlinks could already be synced even if the target object had not been
synced, but static links previously required that the target object
exist before they could be PUT. Now, have the container_sync middleware
plumb in an override like it does for SLO.

Change-Id: I3bfc62b77b247003adcee6bd4d374168bfd6707d
2020-01-24 17:15:57 -08:00
Tim Burke
73a0e8e9cf probe: Add test for syncing a delete when the remote 404s
This works fine; we continue processing the other rows in the DB. But
it *does* take longer than it really ought to. See the related bug; we
should be able to shave some 17s off the test time by not retrying on
the 404.

Change-Id: I9ca2511651e9b2bc0045894baa4062d20bc15369
Related-Bug: #1849841
2020-01-20 22:06:43 -08:00
Zuul
a4f1078864 Merge "Allow reconciler to handle reserved names" 2020-01-07 03:01:36 +00:00
Zuul
e32689a96d Merge "Deprecate per-service auto_create_account_prefix" 2020-01-07 01:30:20 +00:00
Clay Gerrard
b1178b4a96 Allow reconciler to handle reserved names
Change-Id: Ib918f10e95970b9f562b88e923c25608b826b83f
2020-01-05 10:04:05 -06:00
Clay Gerrard
4601548dab Deprecate per-service auto_create_account_prefix
If we move it to constraints, it's more globally accessible in our
code; more importantly, it's more obvious to ops that everything breaks
if you try to mis-configure different values per service.

Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
2020-01-05 09:53:30 -06:00
Tim Burke
3f88907012 sharding: Better-handle newlines in container names
Previously, if you were on Python 2.7.10+ [0], such a newline would cause the
sharder to fail, complaining about invalid header values when trying to create
the shard containers. On older versions of Python, it would most likely cause a
parsing error in the container-server that was trying to handle the PUT.

Now, quote all places that we pass around container paths. This includes:

  * The X-Container-Sysmeta-Shard-(Quoted-)Root sent when creating the (empty)
    remote shards
  * The X-Container-Sysmeta-Shard-(Quoted-)Root included when initializing the
    local handoff for cleaving
  * The X-Backend-(Quoted-)Container-Path the proxy sends to the object-server
    for container updates
  * The Location header the container-server sends to the object-updater

Note that a new header was required in requests so that servers would
know whether the value should be unquoted or not. We can get away with
reusing Location in responses by having clients opt-in to quoting with
a new X-Backend-Accept-Quoted-Location header.

During a rolling upgrade,

  * old object-servers servicing requests from new proxy-servers will
    not know about the container path override and so will try to update
    the root container,
  * in general, object updates are more likely to land in the root
    container; the sharder will deal with them as misplaced objects, and
  * shard containers created by new code on servers running old code
    will think they are root containers until the server is running new
    code, too; during this time they'll fail the sharder audit and report
    stats to their account, but both of these should get cleared up upon
    upgrade.

Drive-by: fix a "conainer_name" typo that prevented us from testing that
we can shard a container with unicode in its name. Also, add more UTF8
probe tests.

[0] See https://bugs.python.org/issue22928

Change-Id: Ie08f36e31a448a547468dd85911c3a3bc30e89f1
Closes-Bug: 1856894
2020-01-03 16:04:57 -08:00
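A minimal sketch of the quoting idea using only the standard library; the helper names are illustrative rather than Swift's actual internals, and only the header names come from the description above:

    from urllib.parse import quote, unquote

    def quoted_container_path(account, container):
        # Newlines (and anything else header-hostile) become %XX escapes,
        # suitable for X-Backend-Quoted-Container-Path and friends.
        return quote('%s/%s' % (account, container))

    def read_location(resp_headers, we_sent_accept_quoted_location):
        # Responses reuse the Location header; only unquote it if we opted
        # in with X-Backend-Accept-Quoted-Location.
        loc = resp_headers['Location']
        return unquote(loc) if we_sent_accept_quoted_location else loc
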
Tim Burke
8c0fd3f138 py3: Make seamless reloads work
Starting with Python 3.4, newly-created file descriptors are non-inheritable
[0], which causes trouble when we try to use a pipe for IPC. Fortunately, the
same PEP that implemented this change *also* provided a new API to mark file
descriptors as being inheritable -- so just do that.

While we're at it,

* Fix up the probe tests to work on py3
* Fix up the probe tests to work when policy-0 is erasure-coded
* Decode the bytes read so py3 doesn't log a b'pid'
* Log a warning if the read() is empty; something surely went wrong
  in the re-exec

[0] https://www.python.org/dev/peps/pep-0446/

Change-Id: I2a8a9f3dc78abb99bf9cbcf6b44c32ca644bb07b
Related-Change: I3e5229d2fb04be67e53533ff65b0870038accbb7
2019-12-11 01:07:19 +00:00
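The core of the fix is the os.set_inheritable() API that PEP 446 introduced; a minimal sketch of the pattern (not Swift's actual reload code):

    import os

    # On Python 3.4+ new descriptors are non-inheritable by default, so a
    # pipe used for IPC would vanish across an exec unless we opt back in.
    read_fd, write_fd = os.pipe()
    os.set_inheritable(read_fd, True)
    os.set_inheritable(write_fd, True)
    # ... the re-exec'ed server can now keep using these descriptors; the
    # parent reads the child's pid (as bytes, hence the decode mentioned
    # above) to confirm the handoff worked.
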
Clay Gerrard
698717d886 Allow internal clients to use reserved namespace
Reserve the namespace starting with the NULL byte for internal
use-cases.  Backend services will allow path names to include the NULL
byte in urls and validate names in the reserved namespace.  Database
services will filter all names starting with the NULL byte from
responses unless the request includes the header:

    X-Backend-Allow-Reserved-Names: true

The proxy server will not allow path names to include the NULL byte in
urls unless a middleware has set the X-Backend-Allow-Reserved-Names
header.  Middlewares can use the reserved namespace to create objects
and containers that cannot be directly manipulated by clients.  Any
objects and bytes created in the reserved namespace will be aggregated
to the user's account totals.

When deploying internal proxies, developers and operators may configure
the gatekeeper middleware to translate the X-Allow-Reserved-Names header
to the Backend header so they can manipulate the reserved namespace
directly through the normal API.

UpgradeImpact: it's not safe to roll back from this change

Change-Id: If912f71d8b0d03369680374e8233da85d8d38f85
2019-11-27 11:22:00 -06:00
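A rough sketch of how a middleware might build names in the reserved namespace; the prefix constant and helpers are illustrative, while the header name comes from the description above:

    RESERVED = '\x00'

    def reserved_name(*parts):
        # e.g. reserved_name('versions', 'docs') -> '\x00versions\x00docs';
        # clients can never address such a name directly through the proxy.
        return RESERVED + RESERVED.join(parts)

    def backend_headers():
        # Internal requests must opt in explicitly to see reserved names.
        return {'X-Backend-Allow-Reserved-Names': 'true'}
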
Darrell Bishop
1107f24179 Seamlessly reload servers with SIGUSR1
Swift servers can now be seamlessly reloaded by sending them a SIGUSR1
(instead of a SIGHUP).  The server forks off a synchronized child that
waits to close the old listen socket(s) until the new server has started
up and bound its listen socket(s).  The new server is exec'ed from the
old one so its PID doesn't change.  This makes systemd happier, so an
ExecReload= stanza can now be used.

The seamless part means that incoming connections will always get
accepted either by the old server or the new one.  This eliminates
client-perceived "downtime" during server reloads, while allowing the
server to fully reload, re-reading configuration, becoming a fresh
Python interpreter instance, etc.  The SO_REUSEPORT socket option was
already in use, so nothing had to change there.

This patch also includes a non-invasive fix for a current eventlet bug
(see https://github.com/eventlet/eventlet/pull/590).  That bug prevents
a SIGHUP "reload" from properly servicing existing requests before old
worker processes close sockets and exit.  The existing probe tests
missed this, but the new ones, in this patch, caught it.

New probe tests cover both old SIGHUP "reload" behavior as well as the
new SIGUSR1 seamless reload behavior.

Change-Id: I3e5229d2fb04be67e53533ff65b0870038accbb7
2019-11-07 10:15:26 -08:00
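Triggering the new behavior from the outside is just a signal; a minimal sketch (the pid-file path is a placeholder, and this mirrors what a systemd ExecReload= line would do):

    import os
    import signal

    # SIGUSR1 asks for a seamless reload; SIGHUP keeps the old behavior.
    with open('/var/run/swift/proxy-server.pid') as f:  # placeholder path
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGUSR1)
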
John Dickinson
0c1b485ad6 exclude utf8 tests under py3
These are known to not work until https://bugs.python.org/issue37093
is addressed in CPython upstream.

Change-Id: I4a6877907d14b632a9a477c887913488427b62b7
2019-10-29 20:12:05 +00:00
Zuul
d059505aba Merge "sharding: Update probe test to verify CleavingContext cleanup" 2019-09-25 23:11:33 +00:00
Tim Burke
9495bc0003 sharding: Update probe test to verify CleavingContext cleanup
Change-Id: I219bbbfd6a3c7adcaf73f3ee14d71aadd183633b
Related-Change: I1e502c328be16fca5f1cca2186b27a0545fecc16
2019-09-23 16:03:58 -07:00
Tim Burke
1ded0d6c87 Allow arbitrary UTF-8 strings as delimiters in listings
AWS seems to support this, so let's allow s3api to do it, too.

Previously, S3 clients trying to use multi-character delimiters would
get 500s back, because s3api didn't know how to handle the 412s that the
container server would send.

As long as we're adding support for container listings, may as well do
it for accounts, too.

Change-Id: I62032ddd50a3493b8b99a40fb48d840ac763d0e7
Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
Closes-Bug: #1797305
2019-09-12 10:44:00 -07:00
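For illustration, a listing request may now use any UTF-8 string as the delimiter (endpoint, token, and names are placeholders):

    import requests

    # 'delimiter' is no longer restricted to a single character.
    requests.get('https://swift.example.com/v1/AUTH_test/photos',
                 params={'delimiter': '--', 'format': 'json'},
                 headers={'X-Auth-Token': 'tk-placeholder'})
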
Tim Burke
1d7e1558b3 py3: (mostly) port probe tests
There's still one problem, though: since swiftclient on py3 doesn't
support non-ASCII characters in metadata names, none of the tests in
TestReconstructorRebuildUTF8 will pass.

Change-Id: I4ec879ade534e09c3a625414d8aa1f16fd600fa4
2019-09-04 10:17:45 -07:00
Tim Burke
3189410f9d Ignore 404s from handoffs for objects when calculating quorum
We previously realized we needed to do that for accounts and
containers, where the consequences of treating the 404 as authoritative
were more obvious: we'd cache the non-existence, which prevented writes
until it fell out of cache.

The same basic logic applies for objects, though: if we see

    (Timeout, Timeout, Timeout, 404, 404, 404)

on a triple-replica policy, we don't really have any reason to think
that a 404 is appropriate. In fact, it seems reasonably likely that
there's a thundering-herd problem where there are too many concurrent
requests for data that *definitely is there*. By responding with a 503,
we apply some back-pressure to clients, who hopefully have some
exponential backoff in their retries.

The situation gets a bit more complicated with erasure-coded data, but
the same basic principle applies. We're just more likely to have
confirmation that there *is* data out there; we just can't reconstruct
it (right now).

Note that we *still want to check* those handoffs, of course. Our
fail-in-place strategy has us replicate (and, more recently,
reconstruct) to handoffs to maintain durability; it'd be silly *not* to
look.

UpgradeImpact:
--------------
Be aware that this may cause an increase in 503 Service Unavailable
responses served by proxy-servers. However, this should more accurately
reflect the state of the system.

Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
Change-Id: Ia832e9bab13167948f01bc50aa8a61974ce189fb
Closes-Bug: #1837819
Related-Bug: #1833612
Related-Change: I53ed04b5de20c261ddd79c98c629580472e09961
Related-Change: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
2019-08-01 14:07:39 -07:00
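An illustrative (non-Swift) sketch of the quorum idea: 404s that came back from handoffs are not treated as authoritative, so without a real quorum of primary 404s the client gets a 503 instead:

    def best_response_status(responses, replica_count):
        # responses: list of (status, was_handoff); status None means timeout.
        successes = [s for s, _ in responses if s is not None and s < 300]
        if successes:
            return min(successes)
        primary_404s = sum(1 for s, handoff in responses
                           if s == 404 and not handoff)
        if primary_404s > replica_count // 2:
            return 404   # enough primaries really said "not found"
        return 503       # timeouts / handoff 404s: apply back-pressure
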
Tim Burke
a1af3811a7 sharding: Cache shard ranges for object writes
Previously, we issued a GET to the root container for every object PUT,
POST, and DELETE. This puts load on the container server, potentially
leading to timeouts, error limiting, and erroneous 404s (!).

Now, cache the complete set of 'updating' shards, and find the shard for
this particular update in the proxy. Add a new config option,
recheck_updating_shard_ranges, to control the cache time; it defaults to
one hour. Set to 0 to fall back to previous behavior.

Note that we should be able to tolerate stale shard data just fine; we
already have to worry about async pendings that got written down with
one shard but may not get processed until that shard has itself sharded
or shrunk into another shard.

Also note that memcache has a default value limit of 1MiB, which may be
exceeded if a container has thousands of shards. In that case, set()
will act like a delete(), causing increased memcache churn but otherwise
preserving existing behavior. In the future, we may want to add support
for gzipping the cached shard ranges as they should compress well.

Change-Id: Ic7a732146ea19a47669114ad5dbee0bacbe66919
Closes-Bug: 1781291
2019-07-11 10:40:38 -07:00
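The new knob is a proxy-server setting; a hedged config excerpt (the value shown is just the one-hour default described above):

    [app:proxy-server]
    use = egg:swift#proxy
    # Seconds to cache the root container's 'updating' shard ranges;
    # 0 falls back to issuing a GET to the root for every object update.
    recheck_updating_shard_ranges = 3600
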
Zuul
f55167a735 Merge "Increase node_timeout in gate" 2019-04-30 22:12:01 +00:00
Kota Tsuyuzaki
a30a477755 Stop overwriting reserved term
`dir` is a built-in name in Python, so this patch avoids assigning a
value to it.

Change-Id: If780c4ffb72808b834e25a396665f17bd8383870
2019-03-12 08:53:18 +00:00
Zuul
736e76d764 Merge "probe tests: wait to start replicators until after verifying initial state" 2019-02-27 06:35:54 +00:00
Tim Burke
f4689dd22f probe tests: wait to start replicators until after verifying initial state
Change-Id: Ida7c776201a068d44572d1e94472c975c4bc8e36
2019-02-19 12:11:47 -08:00
Zuul
89b3adc9fb Merge "probetests: make negative assertion more meaningful" 2019-02-18 18:50:47 +00:00
Clay Gerrard
771963c926 Increase node_timeout in gate
Give storage nodes more time to complete requests for the multi-node
upgrade and probe tests.

Also slightly decouple the probe tests from the default configs.

Change-Id: I334ef517d833916a3b7be3151a812d4f9c66a6e1
2019-02-12 10:39:17 -06:00
Clay Gerrard
ea8e545a27 Rebuild frags for unmounted disks
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.

Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left and right nodes, we now select a third
node from the far side of the ring.  If any of these partners responds
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.

To prevent ssync (which is uninterruptible) from receiving a 409
(Conflict), we must give the remote handoff node the correct
backend_index for the fragments it will receive.  In the common case we
will use deterministically different handoffs for each fragment index
to prevent multiple unmounted primary disks from forcing a single
handoff node to hold more than one rebuilt fragment.

Handoff nodes will continue trying to revert rebuilt handoff fragments
to the appropriate primary until that primary is remounted or
rebalanced away.  After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO-efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.

Closes-Bug: #1510342

Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
2019-02-08 18:04:55 +00:00
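An illustrative sketch of the "deterministically different handoffs" idea, not the reconstructor's actual node-selection code:

    def handoff_for_frag(handoff_nodes, frag_index):
        # The same frag_index always lands on the same handoff, and distinct
        # frag_indexes spread across distinct handoffs while enough exist,
        # so one handoff never has to hold two rebuilt fragments.
        return handoff_nodes[frag_index % len(handoff_nodes)]
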
Tim Burke
61c9aa4bf3 probetests: make negative assertion more meaningful
In test_replication_servers_working, we delete a bunch of directories
without deleting hashes.pkl, then verify that nothing at that level is a
directory.

This would be trivially true except that throughout the test, we have
the replicators running constantly. However, we never verified that the
replicators actually *have* run and had a chance to re-create the
missing directories.

Now, stop the replicators before doing the deletes, run them
synchronously between doing the deletes and verifying that there are no
directories, and start them again before the final set of assertions.

Change-Id: I841f8250eb7abfb0fcdfca5c106f65e6e94dce0c
2019-02-01 01:17:56 +00:00
Tim Burke
c0dbf5b885 sharding: Make replicator logging less scary
When we abort the replication process because we've got shard ranges and
the sharder is now responsible for ensuring object-row durability, we
log a warning like "refusing to replicate objects" which sounds scary.

That's because it *is*, of course -- if the sharder isn't running,
whatever rows that DB has may only exist in that DB, meaning we're one
drive failure away from losing track of them entirely.

However, when the sharder *is* running and everything's happy, we reach
a steady-state where the root containers are all sharded and none of
them have any object rows to lose. At that point, the warning does more
harm than good.

Only print the scary "refusing to replicate" warning if we're still
responsible for some object rows, whether deleted or not.

Change-Id: I35de08d6c1617b2e446e969a54b79b42e8cfafef
2019-01-31 15:20:12 -08:00
Tim Burke
050f8799ca Use latest eventlet in probe tests
Note that eventlet 0.22.0+ closes connections between requests when
it stops accepting connections.

Partial-Bug: #1792615
Change-Id: Ia8d9ab95e2aad40e8d797acc3423a917e809ffdb
2018-09-19 14:59:32 -07:00
Tim Burke
5652dec43b container-updater: Always report zero objects/bytes used for shards
Otherwise, a sharded container AUTH_test/sharded will have its stats
included in the totals for both AUTH_test *and* .shards_AUTH_test

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I7fa74e13347601c5f44fd7e6cf65656cc3ebc2c5
2018-06-22 10:35:36 +01:00
Zuul
c568b4b100 Merge "Resolve TODO's in test/probe/test_sharder.py" 2018-06-20 23:48:23 +00:00
Alistair Coles
a59c5e3bae Resolve TODO's in test/probe/test_sharder.py
Resolve outstanding TODO's. One TODO is removed because there isn't an
easy way to arrange for an async pending to be targeted at a shard
container.

Change-Id: I0b003904f73461ddb995b2e6a01e92f14283278d
2018-06-20 19:42:32 +00:00
Zuul
ec066392b5 Merge "Make If-None-Match:* work properly with 0-byte PUTs" 2018-06-05 02:45:06 +00:00
Zuul
9d2a1a1d14 Merge "Make the decision between primary/handoff sets more obvious" 2018-05-23 14:16:24 +00:00
Tim Burke
8c386fff40 Make the decision between primary/handoff sets more obvious
Change-Id: I419de59df3317d67c594fe768f5696de24148280
2018-05-22 12:12:42 -07:00
Alistair Coles
37ee89e47a Avoid premature shrinking in sharder probe test
Previously test_misplaced_object_movement() deleted objects from both
shards and then relied on override-partitions option to selectively
run the sharder on root or shard containers and thereby control when
each shard range was identified for shrinking. This approach is flawed
when the second shard container lands in the same partition as the
root: running the sharder on the empty second shard's partition would
also cause the sharder to process the root and identify the second
shard for shrinking, resulting in premature shrinking of the second
shard.

Now, objects are only deleted from each shard range once that shard is
meant to shrink.

Change-Id: I9f51621e8414e446e4d3f3b5027f6c40e01192c3
Drive-by: use the run_sharders() helper more often.
2018-05-22 13:35:19 +01:00
Alistair Coles
c35285f14b Use correct policy when faking misplaced objects in probe test
Before, merge_objects() always used a storage policy index of 0 when
inserting a fake misplaced object into a shard container. If the shard
broker had a different policy index, then the misplaced object would
not show up in listings, causing test_misplaced_object_movement() to
fail. This test bug might be exposed by having policy index 0 be an EC
policy, since the probe test requires a replication policy and would
therefore choose a non-zero policy index.

The fix is simply to specify the shard's policy index when inserting
the fake object.

Change-Id: Iec3f8ec29950220bb1b2ead9abfdfb1a261517d6
2018-05-21 08:36:14 +01:00
Matthew Oliver
2641814010 Add sharder daemon, manage_shard_ranges tool and probe tests
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.

The workflow is, in overview:

- perform an audit of the container for sharding purposes.

- move any misplaced objects that do not belong in the container
  to their correct shard.

- move shard ranges from FOUND state to CREATED state by creating
  shard containers.

- move shard ranges from CREATED to CLEAVED state by cleaving objects
  to shard dbs and replicating those dbs. By default this is done in
  batches of 2 shard ranges per visit.

Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.

The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>

Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
2018-05-18 18:48:13 +01:00
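A hedged sketch of the manual workflow with swift-manage-shard-ranges; the database path and row count are placeholders, and the exact sub-command names are an assumption that may differ between releases:

    # Find candidate shard ranges of roughly 500000 rows each, merge them
    # into the container DB, then mark the container ready for the sharder.
    swift-manage-shard-ranges /path/to/container.db find 500000 > /tmp/ranges.json
    swift-manage-shard-ranges /path/to/container.db replace /tmp/ranges.json
    swift-manage-shard-ranges /path/to/container.db enable
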
Alistair Coles
9d742b85ad Refactoring, test infrastructure changes and cleanup
...in preparation for the container sharding feature.

Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>

Change-Id: I4455677abb114a645cff93cd41b394d227e805de
2018-05-15 18:18:25 +01:00
Tim Burke
b640631daf Apply remote metadata in _handle_sync_response
We've already got it in the response, so we may as well apply it now
rather than wait for the other end to get around to running its
replicators.

Change-Id: Ie36a6dd075beda04b9726dfa2bba9ffed025c9ef
2018-03-06 19:52:59 +00:00
Tim Burke
748b29ef80 Make If-None-Match:* work properly with 0-byte PUTs
When PUTting an object with `If-None-Match: *`, we rely on 100-continue
support: the proxy checks the responses from all object-servers, and if
any of them respond 412, it closes down the connections. When there's
actual data for the object, this ensures that even nodes that *don't*
respond 412 will hit a ChunkReadTimeout and abort the PUT.

However, if the client does a PUT with a Content-Length of 0, that would
get sent all the way to the object server, which had all the information
it needed to respond 201. After replication, the PUT propagates to the
other nodes and the old object is lost, despite the client receiving a
412 indicating the operation failed.

Now, when PUTting a zero-byte object, switch to a chunked transfer so
the object-server still gets a ChunkReadTimeout.

Change-Id: Ie88e41aca2d59246c3134d743c1531c8e996f9e4
2018-02-26 13:12:44 +00:00
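From the client's point of view the behavior is now consistent regardless of body size; a hedged sketch (endpoint, token, and names are placeholders):

    import requests

    # A zero-byte PUT with If-None-Match: * now reliably yields 412 when the
    # object already exists, instead of quietly replacing it.
    resp = requests.put('https://swift.example.com/v1/AUTH_test/c/obj',
                        data=b'',
                        headers={'X-Auth-Token': 'tk-placeholder',
                                 'If-None-Match': '*'})
    assert resp.status_code in (201, 412)
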
Alistair Coles
1f4ebbc990 kill orphans during probe test setup
Orphan processes sometimes cause probe test failures, so get rid of
them before each test.

Change-Id: I4ba6748d30fbb28371f13aa95387c49bc8223402
2018-02-08 16:43:18 -08:00
Samuel Merritt
745581ff2f Don't make async_pendings during object expiration
After deleting an object, the object expirer deletes the corresponding
row from the expirer queue by making DELETE requests directly to the
container servers. The same thing happens after attempting to delete
an object, but failing because the object has already been deleted. If
the DELETE requests fail, then the expirer will encounter that row
again on its next pass and retry the DELETE at that time. Therefore,
it is not necessary for the object server to write an async_pending
for that queue row's deletion.

Currently, however, two of the object servers do write such
async_pendings. Given Rc container replicas, that's 2 * Rc updates
from async_pendings and another Rc from the object expirer
directly. Given a typical Rc of 3, that's 9 container updates per
expiring object.

This commit makes the object server write no async_pendings for DELETE
requests coming from the object expirer. This reduces the number of
container server requests to Rc (typically 3), all issued directly
from the object expirer.

Closes-Bug: 1076202
Change-Id: Icd63c80c73f864d2561e745c3154fbfda02bd0cc
2018-01-17 10:39:11 -08:00