...unless the client requests it specifically using a new flag:
X-Backend-Auto-Create: true
Previously, you could get real jittery listings during a rebalance:
* Partition with a shard DB gets reassigned, so one primary has no DB.
* Proxy makes a listing, gets a 404, tries another node. Likely, one of
the other shard replicas responds. Things are fine.
* Update comes in. Since we use the auto_create_account_prefix
namespace for shards, container DB gets created and we write the row.
* Proxy makes another listing. There's a one-in-three chance that we
claim there's only one object in that whole range.
Note that unsharded databases would respond to the update with a 404 and
wait for one of the other primaries (or the old primary that's now a
hand-off) to rsync a whole DB over, keeping us in the happy state.
Now, if the account is in the shards namespace, 404 the object update if
we have no DB. Wait for replication like in the unsharded case.
Continue to be willing to create the DB when the sharder is seeding all
the CREATED databases before it starts cleaving, though.
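Very roughly, the decision looks like this hedged sketch (the handler
shape and the broker_exists flag are illustrative, and '.shards_' assumes
the default auto_create_account_prefix; this is not the container
server's actual code):

    def handle_object_update(req_headers, account, broker_exists):
        auto_create = req_headers.get(
            'X-Backend-Auto-Create', '').lower() == 'true'
        if (not broker_exists and account.startswith('.shards_')
                and not auto_create):
            # No local DB for this shard: answer 404 and let replication
            # restore the DB, just as an unsharded container would.
            return 404
        # ...create the DB if allowed/needed and apply the update...
        return 201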
Change-Id: I15052f3f17999e6f432951ba7c0731dcdc9475bb
Closes-Bug: #1881210
The idea is, if none of
- timestamp,
- object_count,
- bytes_used,
- state, or
- epoch
has changed, we shouldn't need to send an update back to the root
container.
This is more-or-less comparable to what the container-updater does to
avoid unnecessary writes to the account.
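Roughly, the check amounts to something like this hedged sketch (the
helper and the attribute access are illustrative, not the sharder's
actual code):

    def root_update_needed(old_own_shard_range, new_own_shard_range):
        watched = ('timestamp', 'object_count', 'bytes_used', 'state', 'epoch')
        return any(getattr(old_own_shard_range, name) !=
                   getattr(new_own_shard_range, name)
                   for name in watched)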
Closes-Bug: #1834097
Change-Id: I1ee7ba5eae3c508064714c4deb4f7c6bbbfa32af
Previously, the lack of container ACLs on the reserved container would
mean that attempting to grant access to the user-visible container would
not work; the user could not access the backing object.
Now, have symlinks with the allow-reserved-names sysmeta set be
pre-authed. Note that the user still has to be authorized to read the
symlink, and if the backing object was *itself* a symlink, that will be
authed separately.
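Roughly, the fix amounts to marking the sub-request to the backing
object as pre-authorized when the symlink carries that sysmeta. The
sketch below is illustrative only: the helper and the exact sysmeta
header name are assumptions, while 'swift.authorize_override' is the
env key Swift middlewares use to skip the auth callback.

    # assumed header name, for illustration only
    ALLOW_RESERVED_SYSMETA = 'X-Object-Sysmeta-Allow-Reserved-Names'

    def preauth_symlink_target(subreq_environ, symlink_sysmeta):
        if symlink_sysmeta.get(ALLOW_RESERVED_SYSMETA):
            # The cluster wrote this symlink itself (e.g. object
            # versioning), so don't require container ACLs on the
            # reserved container.
            subreq_environ['swift.authorize_override'] = True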
Change-Id: Ifd744044421ef2ca917ce9502b155a6514ce8ecf
Closes-Bug: #1880013
We want to do the table scan without locking and group the locking
deletes into small indexed operations to minimize the impact of
background processes calling reclaim each cycle.
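A minimal sketch of that pattern against a container DB's object table,
assuming a plain sqlite3 connection (the real reclaim code differs in
detail):

    import sqlite3

    def reclaim(db_path, age_timestamp, batch_size=50):
        conn = sqlite3.connect(db_path)
        # unlocked table scan to find reclaimable rows
        rowids = [row[0] for row in conn.execute(
            'SELECT ROWID FROM object WHERE deleted = 1 AND created_at < ?',
            (age_timestamp,))]
        # small, ROWID-indexed deletes, so each write lock is held briefly
        for i in range(0, len(rowids), batch_size):
            batch = rowids[i:i + batch_size]
            with conn:
                conn.execute('DELETE FROM object WHERE ROWID IN (%s)'
                             % ','.join('?' * len(batch)), batch)
        conn.close()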
Change-Id: I3ccd145c14a9b68ff8a9da61f79034549c9bc127
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Closes-Bug: #1877651
We usually want to have ratelimit fairly far left in the pipeline -- the
assumption is that something like an auth check will be fairly expensive
and we should try to shield the auth system so it doesn't melt under the
load of a misbehaved swift client.
But with S3 requests, we can't know the account/container that a request
is destined for until *after* auth. Fortunately, we've already got some
code to make s3api play well with ratelimit.
So, let's have our cake and eat it, too: allow operators to place
ratelimit once, before auth, for swift requests and again, after auth,
for s3api. They'll both use the same memcached keys (so users can't
switch APIs to effectively double their limit), but still only have each
S3 request counted against the limit once.
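For instance, a pipeline along these lines lists ratelimit twice; the
filter names and ordering here are purely illustrative, not a
recommended deployment:

    [pipeline:main]
    pipeline = catch_errors proxy-logging cache listing_formats ratelimit s3api s3token authtoken keystoneauth ratelimit copy slo proxy-logging proxy-server

The first instance covers native Swift requests before auth; the second,
placed after auth, covers requests translated by s3api.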
Change-Id: If003bb43f39427fe47a0f5a01dbcc19e1b3b67ef
New flake8 came out with new & improved rules. Ignore E741; it would be
too much churn. Fix the rest.
Change-Id: I9125c8c53423232309a75cbcc5b695b378864c1b
Previously, we'd see warnings like
UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal
when setting lower/upper bounds with non-ascii byte strings.
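The warning can be reproduced on Python 2 with something like this
(on Python 3 the comparison is simply False, with no warning):

    lower = b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82'   # non-ascii byte string
    print(lower == u'anything')  # py2 emits the UnicodeWarning, prints False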
Change-Id: I328f297a5403d7e59db95bc726428a3f92df88e1
Previously, attempting to GET, HEAD, or DELETE an object with a non-null
version-id would cause 500s, with logs complaining about how
version-aware operations require that the container is versioned
Now, we'll early-return with a 404 (on GET or HEAD) or 204 (on DELETE).
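A hedged sketch of the early return (the handler shape and names are
illustrative, not the middleware's actual code):

    def handle_version_aware_request(method, version_id,
                                     container_is_versioned):
        if version_id not in (None, 'null') and not container_is_versioned:
            # No versioning history, so the requested version can't exist.
            return 204 if method == 'DELETE' else 404
        # ...normal version-aware handling...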
Change-Id: I46bfd4ae7d49657a94734962c087f350e758fead
Closes-Bug: 1874295
Ordinarily, an ENOENT in _finalize_durable should mean something's gone
off the rails -- we expected to be able to mark data durable, but
couldn't!
If there are concurrent writers, though, it might actually be OK:
Client A writes .data
Client B writes .data
Client B finalizes .data *and cleans up on-disk files*
Client A tries to finalize but .data is gone
Previously, the above would cause the object server to 500, and if
enough of them did this, the client may see a 503. Now, call it good so
clients get 201s.
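A hedged sketch of the shape of the fix (helper names are illustrative;
the real logic lives in the object server's diskfile code):

    import errno

    def finalize_durable(do_finalize, concurrent_writer_won):
        try:
            do_finalize()
        except OSError as err:
            if err.errno == errno.ENOENT and concurrent_writer_won():
                # A concurrent writer already finalized an equivalent .data
                # and cleaned up ours -- call it good so the client gets a
                # 201 instead of a 5xx.
                return
            raise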
Change-Id: I4e322a7be23870a62aaa6acee8435598a056c544
Closes-Bug: #1719860
When an operator does a `find_and_replace` on a DB that already has
shard ranges, they get a prompt like:
This will delete existing 58 shard ranges.
Do you want to show the existing ranges [s], delete the existing
ranges [yes] or quit without deleting [q]?
Previously, if they selected `q`, we would skip the delete but still do
the merge (!) and immediately warn about how there are now invalid shard
ranges. Now, quit without merging.
Change-Id: I7d869b137a6fbade59bb8ba16e4f3e9663e18822
Use swift.backend_path entry in wsgi environment to propagate
backend PATH_INFO.
Needed by ceilometermiddleware to extract account/container info
from PATH_INFO, patch: https://review.opendev.org/#/c/718085/
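A hedged sketch of a consumer (split_path is Swift's real helper; the
wrapper itself is illustrative):

    from swift.common.utils import split_path

    def account_and_container(environ):
        # prefer the backend path recorded by the proxy, if any
        path = environ.get('swift.backend_path', environ['PATH_INFO'])
        version, account, container, obj = split_path(path, 2, 4, True)
        return account, container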
Change-Id: Ifb3c6c30835d912c5ba4b2e03f2e0b5cb392671a
The repo now supports both Python 2 and 3, so update hacking to
version 2.0, which supports both. Note that the latest hacking
release, 3.0, only supports Python 3.
Fix problems found.
Remove hacking and friends from lower-constraints, they are not needed
for installation.
Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
MemcacheRing provides set_multi/get_multi to set/get a list of keys on a
single memcached server, selected in the ring by the server_key value. But
the current API doesn't allow deleting a value saved with set_multi.
This change adds an optional server_key keyword to the delete method so
that an entry can be deleted from the memcached server selected by
server_key instead of key.
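Hedged usage sketch (the server address and keys are just examples;
set_multi's signature is unchanged, delete gains the new keyword):

    from swift.common.memcached import MemcacheRing

    ring = MemcacheRing(['127.0.0.1:11211'])
    # both keys land on the server picked by 'stats'
    ring.set_multi({'shard-a': 1, 'shard-b': 2}, 'stats')
    # the delete now targets that same server
    ring.delete('shard-a', server_key='stats')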
Change-Id: I24c29540ee4b91adeb7b9f44fe84bc4d46f89218
The contextmanager eventlet.timeout.Timeout schedules a call to
throw an exception every time it is entered. The swift-proxy uses
Chunk(Read|Write)Timeout for every chunk read/written from the client or
object-server. For a single upload/download of a big object, that means
tens of thousands of scheduling operations in eventlet, which is very costly.
This patch replaces the usage of these context managers with a watchdog
greenthread that will schedule itself by sleeping until the next timeout
expiration. Then, only if a timeout expired, it will schedule a call to
throw the appropriate exception.
The gain on bandwidth and CPU usage is significant. On a benchmark
environment, it gave this result for an upload of 6 Gbps on a replica
policy (average of 3 runs):
master: 5.66 Gbps / 849 jiffies consumed by the proxy-server
this patch: 7.56 Gbps / 618 jiffies consumed by the proxy-server
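The general shape of the idea, as a hedged sketch (the real watchdog in
swift/common/utils.py differs in names and details; greenthread.kill is
used here just to raise the exception inside the waiting greenthread):

    import time

    import eventlet
    from eventlet import greenthread

    class ChunkReadTimeout(Exception):
        pass  # stand-in for swift.common.exceptions.ChunkReadTimeout

    class Watchdog(object):
        def __init__(self):
            self._timeouts = {}  # key -> (deadline, exc_class, greenthread)
            self._next_key = 0
            eventlet.spawn(self._run)

        def start(self, timeout, exc_class):
            self._next_key += 1
            self._timeouts[self._next_key] = (
                time.time() + timeout, exc_class, greenthread.getcurrent())
            return self._next_key

        def stop(self, key):
            self._timeouts.pop(key, None)

        def _run(self):
            while True:
                now = time.time()
                for key, (deadline, exc_class, victim) in \
                        list(self._timeouts.items()):
                    if deadline <= now:
                        del self._timeouts[key]
                        # raise the timeout inside the waiting greenthread
                        greenthread.kill(victim, exc_class())
                pending = [d for d, _, _ in self._timeouts.values()]
                # sleep until the nearest deadline; idle briefly if none
                eventlet.sleep(
                    max(0.0, min(pending) - now) if pending else 0.05)

    # At a call site, instead of ``with ChunkReadTimeout(t): read()``:
    #     key = watchdog.start(node_timeout, ChunkReadTimeout)
    #     try:
    #         chunk = read()
    #     finally:
    #         watchdog.stop(key)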
Change-Id: I19fd42908be5a6ac5905ba193967cd860cb27a0b
The DaemonStrategy class calls the Daemon.is_healthy() method every 0.1
seconds to ensure that all workers are running as wanted.
On the object replicator/reconstructor daemons, is_healthy() checks whether
the rings have changed to decide if workers must be created/killed. With
large rings, this operation can be CPU intensive, especially on low-end CPUs.
This patch:
- increases the check interval to 5 seconds by default, because none of
these daemons are critical for performance (they are not in the data path),
while still allowing each daemon to override this value if necessary
- ensures that, before recomputing all devices in the ring, the object
replicator/reconstructor checks that the ring really changed
(by checking the mtime of the ring.gz files)
On an Atom N2800 processor, this patch reduced the CPU usage of the main
object replicator/reconstructor from 70% of a core to 0%.
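A hedged sketch of the mtime check (class and method names here are
assumptions, not the daemons' real attributes):

    import os

    class RingChangeCheck(object):
        def __init__(self, ring_paths):
            self._mtimes = {path: os.path.getmtime(path)
                            for path in ring_paths}

        def rings_changed(self):
            # cheap stat() calls; only if an mtime moved is it worth
            # recomputing the full device list from the ring
            changed = False
            for path, recorded in list(self._mtimes.items()):
                mtime = os.path.getmtime(path)
                if mtime != recorded:
                    self._mtimes[path] = mtime
                    changed = True
            return changed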
Change-Id: I2867e2be539f325778e2f044a151fd0773a7c390
This commit reduces the amount of I/O done by the swift-object-relinker.
First, it saves a progress state of the relink and cleanup phases in case
the process is interrupted during the operation. This allows the operation
to resume without rescanning all partitions.
Secondly, it prevents relink and cleanup from scanning all the partitions
that are bigger than 2^part_power (or (2^next_part_power)/2).
These partitions did not exist before the part_power increase began,
so there is nothing to relink or clean up in them.
Thirdly, it reverse-orders the partitions to scan so that useless work is
avoided. If a device contains partitions 1 and 3, relinking partition 1
will create "new" objects in partition 3 that will then have to be scanned
again when the relinker reaches partition 3. If partition 3 is done first,
it will only contain the objects that need to be relinked.
Fourthly, it allows a single device to be specified to work on.
To do that, some hooks were added in audit_location_generator to allow
custom code to be executed before/after iterating over a
device/partition/suffix/hash.
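A hedged sketch of the filtering and ordering from the second and third
points above (the helper is illustrative):

    def partitions_to_process(partitions, part_power):
        # partitions >= 2**part_power only appear as a result of the
        # ongoing part-power increase, so there is nothing in them to relink
        candidates = [p for p in partitions if int(p) < 2 ** part_power]
        # highest partitions first, so relinking a low partition doesn't add
        # new entries to a partition we still have to scan later
        return sorted(candidates, key=int, reverse=True)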
Change-Id: If1bf8ed9036fb0ec619b0d4f16061a81a1af2082
They effectively already *were*, but if you used the RingBuilder API
directly (rather than the CLI) you could previously write down builders
that would hit KeyErrors on load.
Change-Id: I1de895d4571f7464be920345881789d47659729f