The idea is, if none of
- timestamp,
- object_count,
- bytes_used,
- state, or
- epoch
has changed, we shouldn't need to send an update back to the root
container.
This is more-or-less comparable to what the container-updater does to
avoid unnecessary writes to the account.
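
A minimal sketch of the comparison described above. The helper name and the use of plain dicts for the last-reported and current state are assumptions made for illustration; this is not the sharder's actual code.

```python
# Fields whose change warrants an update to the root container
# (taken from the list in the message above).
TRACKED_FIELDS = ('timestamp', 'object_count', 'bytes_used', 'state', 'epoch')


def needs_root_update(last_reported, current):
    """Return True if any tracked field differs between two snapshots.

    Both arguments are plain dicts here; that representation is an
    assumption made for this sketch only.
    """
    if last_reported is None:
        # Nothing has been reported yet, so an update is needed.
        return True
    return any(last_reported.get(field) != current.get(field)
               for field in TRACKED_FIELDS)


previous = {'timestamp': '1560000000.00000', 'object_count': 10,
            'bytes_used': 4096, 'state': 'active', 'epoch': None}
unchanged = dict(previous)
grown = dict(previous, object_count=11)
```

Here needs_root_update(previous, unchanged) is False, so the update can be skipped, while needs_root_update(previous, grown) is True.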
Closes-Bug: #1834097
Change-Id: I1ee7ba5eae3c508064714c4deb4f7c6bbbfa32af
Previously, we'd see warnings like
UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal
when setting lower/upper bounds with non-ascii byte strings.
Change-Id: I328f297a5403d7e59db95bc726428a3f92df88e1
The repo now supports both Python 2 and 3, so update hacking to
version 2.0, which supports both. Note that the latest hacking
release, 3.0, only supports Python 3.
Fix problems found.
Remove hacking and friends from lower-constraints; they are not needed
for installation.
Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
The context manager eventlet.timeout.Timeout schedules a call to
throw an exception every time it is entered. The swift-proxy uses
Chunk(Read|Write)Timeout for every chunk read/written from the client or
object-server. For a single upload/download of a big object, this means
tens of thousands of scheduling operations in eventlet, which is very costly.
This patch replaces the usage of these context managers with a watchdog
greenthread that schedules itself by sleeping until the next timeout
expiration. Then, only if a timeout has expired, it schedules a call to
throw the appropriate exception.
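
The bookkeeping behind such a watchdog can be sketched without eventlet. The class below is hypothetical (names and structure are assumptions, not the patch's implementation): registering a timeout only pushes onto a heap, and a single sleeper would wake at next_deadline() and raise whatever expired() returns.

```python
import heapq
import itertools


class Watchdog(object):
    """Track many deadlines while needing only one scheduled wake-up."""

    def __init__(self):
        self._heap = []      # (deadline, key) pairs, earliest first
        self._active = {}    # key -> exception to raise on expiry
        self._keys = itertools.count()

    def start(self, now, timeout, exc):
        """Register a timeout; no scheduler call is made here."""
        key = next(self._keys)
        self._active[key] = exc
        heapq.heappush(self._heap, (now + timeout, key))
        return key

    def stop(self, key):
        """Cancel a timeout; its heap entry is discarded lazily."""
        self._active.pop(key, None)

    def next_deadline(self):
        """Return when the watchdog should wake up next, or None."""
        while self._heap and self._heap[0][1] not in self._active:
            heapq.heappop(self._heap)  # drop cancelled entries
        return self._heap[0][0] if self._heap else None

    def expired(self, now):
        """Pop and return the exceptions of timeouts that have expired."""
        fired = []
        while True:
            deadline = self.next_deadline()
            if deadline is None or deadline > now:
                return fired
            _, key = heapq.heappop(self._heap)
            fired.append(self._active.pop(key))


wd = Watchdog()
read_key = wd.start(now=0.0, timeout=5.0, exc=TimeoutError('read'))
write_key = wd.start(now=1.0, timeout=2.0, exc=TimeoutError('write'))
wd.stop(write_key)  # the chunk was written in time
```

In the common case where every chunk completes in time, start() and stop() are the only per-chunk costs, and the sleeper never has to throw anything.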
The gain in bandwidth and CPU usage is significant. In a benchmark
environment, it gave this result for an upload of 6 Gbps on a replica
policy (average of 3 runs):
master: 5.66 Gbps / 849 jiffies consumed by the proxy-server
this patch: 7.56 Gbps / 618 jiffies consumed by the proxy-server
Change-Id: I19fd42908be5a6ac5905ba193967cd860cb27a0b
This commit reduces the amount of I/O done by the swift-object-relinker.
First, it saves the progress state of relinking and cleanup in case the
process is interrupted during the operation. This allows the operation
to resume without rescanning all partitions.
Secondly, it prevents relink and cleanup from scanning any
partitions that are bigger than 2^part_power (or (2^next_part_power)/2).
These partitions did not exist before the beginning of the part_power
increase, so there is nothing to relink or clean up.
Thirdly, it reverse-orders the partitions to scan so that some useless
work is avoided. If a device contains partitions 1 and 3, relinking
partition 1 will create "new" objects in partition 3, which would then
need to be scanned when the relinker works on partition 3. That is
wasted work. If partition 3 is done first, it will only contain the
objects that need to be relinked.
Fourthly, it allows specifying a single device to work on.
To do that, some hooks were added to audit_location_generator to allow
executing custom code before/after iterating a
device/partition/suffix/hash.
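
The second and third points can be sketched as a single filtering and ordering step. The function name is hypothetical, and treating partitions numbered at or above 2^part_power as "new" is this sketch's reading of the rule above.

```python
def partitions_to_scan(partitions, part_power):
    """Return the partitions worth scanning, highest number first.

    Assumption: partitions numbered at or above 2**part_power did not
    exist before the part_power increase, so they are dropped.
    """
    limit = 2 ** part_power
    return sorted((p for p in partitions if p < limit), reverse=True)


# With part_power=2, partition 6 is new and skipped; partition 3 is
# handled before partition 1, so it is scanned before relinking
# partition 1 adds "new" objects to it.
order = partitions_to_scan([1, 3, 6], part_power=2)
```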
Change-Id: If1bf8ed9036fb0ec619b0d4f16061a81a1af2082
This patch adds a new object versioning mode. This new mode provides
a new set of APIs for users to interact with older versions of an
object. It also changes the naming scheme of older versions and adds
a version-id to each object.
This new mode is not backwards compatible or interchangeable with the
other two modes (i.e., stack and history), especially due to the changes
in the naming scheme of older versions. This new mode will also serve
as a foundation for adding S3 versioning compatibility in the s3api
middleware.
Note that this does not (yet) support using a versioned container as
a source in container-sync. Container sync should be enhanced to sync
previous versions of objects.
Change-Id: Ic7d39ba425ca324eeb4543a2ce8d03428e2225a1
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Thiago da Silva <thiagodasilva@gmail.com>
Prior to the related change, clients may have written down X-Delete-At headers
that are outside of the Timestamp range, for example.
Change-Id: Ib8ae7ebcbdb32e0aa58446bd1ef949e5e2f63e74
Related-Change: I23666ec8a067d829eaf9bfe54bd086c320b3429e
Related-Bug: 1821204
Partial-Bug: 1860149
Swift servers can now be seamlessly reloaded by sending them a SIGUSR1
(instead of a SIGHUP). The server forks off a synchronized child to
wait to close the old listen socket(s) until the new server has started
up and bound its listen socket(s). The new server is exec'ed from the
old one so its PID doesn't change. This makes systemd happier, so an
ExecReload= stanza can now be used.
The seamless part means that incoming connections will always get
accepted either by the old server or the new one. This eliminates
client-perceived "downtime" during server reloads, while allowing the
server to fully reload, re-reading configuration, becoming a fresh
Python interpreter instance, etc. The SO_REUSEPORT socket option has
already been getting used, so nothing had to change there.
This patch also includes a non-invasive fix for a current eventlet bug;
see https://github.com/eventlet/eventlet/pull/590
That bug prevents a SIGHUP "reload" from properly servicing existing
requests before old worker processes close sockets and exit. The
existing probe tests missed this, but the new ones, in this patch, caught
it.
New probe tests cover both old SIGHUP "reload" behavior as well as the
new SIGUSR1 seamless reload behavior.
Change-Id: I3e5229d2fb04be67e53533ff65b0870038accbb7
Several tools are returning a misleading error message if swift.conf is
missing or not readable by the user, stating that the hash pre-/suffixes
are missing. Let's fix this by catching the real issue down below.
Change-Id: I7a47e6260ed51a3b7d9665b3a4510520429ae158
It's probably weird that StreamingPile has this interface that swallows
exceptions, but this seems better than hanging.
Change-Id: I8fe45c0f0d291efc84f3edf5d6b7cd116b5c7835
"self.assertTrue(policies[1].is_deprecated, True)" and
"self.assertTrue(crashy_calls[0], 1)" are not correct, this is
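to fix them: assertTrue treats its second argument as a failure message,
so both calls pass whenever the first argument is truthy.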
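
A small self-contained illustration of why the original assertions could never catch a wrong value. The test class and the value 5 are made up for the example.

```python
import unittest


class AssertTrueMisuse(unittest.TestCase):
    def test_second_argument_is_only_a_message(self):
        crashy_calls = [5]
        # Passes even though 5 != 1: assertTrue only checks that the
        # first argument is truthy; the second is the failure message.
        self.assertTrue(crashy_calls[0], 1)

    def test_assert_equal_compares_values(self):
        crashy_calls = [5]
        # assertEqual actually compares, so the mismatch is detected.
        with self.assertRaises(AssertionError):
            self.assertEqual(crashy_calls[0], 1)


def run_suite():
    """Run both tests and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(AssertTrueMisuse)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```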
Change-Id: I7b07f0833d675d2939c910f679b54da2b8cda482
Add the log_msg_template option to proxy-server.conf and log_format to
a/c/o-server.conf. Each is a template string parsable by Python's
str.format(). Some fields containing user data can be anonymized by using
log_anonymization_method and log_anonymization_salt.
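
A rough sketch of how a format()-style template and salted anonymization fit together. The field names, the template, the choice of sha256 and the bare hex digest output are all assumptions for illustration; the fields and output format Swift actually uses are defined by the patch, not by this example.

```python
import hashlib

# Hypothetical template; these field names are assumptions.
template = '{client_ip} {method} {path} {status}'


def anonymize(value, method='sha256', salt=''):
    """Hash a value with the given hashlib algorithm and salt."""
    data = (salt + value).encode('utf-8')
    return hashlib.new(method, data).hexdigest()


def format_log_line(template, fields, anonymized=(), method='sha256',
                    salt=''):
    """Fill the template, hashing the fields listed in `anonymized`."""
    values = dict(fields)
    for name in anonymized:
        values[name] = anonymize(str(values[name]), method, salt)
    return template.format(**values)


line = format_log_line(
    template,
    {'client_ip': '203.0.113.7', 'method': 'GET',
     'path': '/v1/AUTH_test/c/o', 'status': 200},
    anonymized=('client_ip',), salt='s3cr3t')
```

The resulting line keeps the method, path and status readable while the client address appears only as a salted digest.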
Change-Id: I29e30ef45fe3f8a026e7897127ffae08a6a80cd9
It has not been necessary since we dropped support for Python 2.6.
See https://github.com/python/cpython/commit/8c6d9d7 and
https://bugs.python.org/issue2987.
Be sure to keep a `urlparse` name in utils, though; swauth (at least)
still expects there to be a swift.common.utils.urlparse.
Change-Id: If2502868f251b8a83aa929ee22b10046e708d111
This started with ShardRanges and its CLI. The sharder is at the
bottom of the dependency chain. Even container backend needs it.
Once we started tinkering with the sharder, it all snowballed to
include the rest of the container services.
Beware, this does affect some Python 2 code. Mostly it's trivial
and obviously correct, but it needs checking by reviewers.
About killing the stray "from __future__ import unicode_literals":
we do not do it in general. The specific problem it caused was
a failure of functional tests because unicode leaked into a field
that was supposed to be encoded. It is just too hard to track the
types when rules change from file to file, so off with its head.
Change-Id: Iba4e65d0e46d8c1f5a91feb96c2c07f99ca7c666
Previously, o_tmpfile support was detected by checking the
kernel version, as it was officially introduced in XFS in 3.15.
The problem is that RHEL has backported the support to at least
RHEL 7.6, but the kernel version is not updated.
This patch changes how o_tmpfile is detected: it actually attempts to
open a file with the O_TMPFILE flag and keeps the result cached
in DiskFileManager so that the check only happens once while the process
is running.
Change-Id: I3599e2ab257bcd99467aee83b747939afac639d8
Commit e199192caefef068b5bf57da8b878e0bc82e3453 introduced the ability
to have multiple SSYNC requests running on a single device. It lacks a
safeguard to ensure that only one SSYNC request can be running on a
partition. This commit updates replication_lock to take one of N locks
on the device, then a single lock on the partition related to an SSYNC
request.
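
The two-level locking can be sketched in memory as follows. This is only an illustration under assumptions: the class is hypothetical, it uses threading rather than the on-disk locks a real object server needs, and RuntimeError stands in for whatever exception the real code raises.

```python
import contextlib
import threading


class ReplicationLocks(object):
    """Allow N concurrent holders per device, one per partition."""

    def __init__(self, per_device):
        self._per_device = per_device
        self._mutex = threading.Lock()
        self._device_slots = {}   # device -> number of slots in use
        self._partitions = set()  # (device, partition) currently held

    @contextlib.contextmanager
    def replication_lock(self, device, partition):
        with self._mutex:
            in_use = self._device_slots.get(device, 0)
            if in_use >= self._per_device:
                raise RuntimeError('device %s is busy' % device)
            if (device, partition) in self._partitions:
                raise RuntimeError('partition %s is busy' % partition)
            self._device_slots[device] = in_use + 1
            self._partitions.add((device, partition))
        try:
            yield
        finally:
            # Release both the device slot and the partition.
            with self._mutex:
                self._device_slots[device] -= 1
                self._partitions.discard((device, partition))


def is_refused(locks, device, partition):
    """Return True if the lock cannot be taken right now."""
    try:
        with locks.replication_lock(device, partition):
            return False
    except RuntimeError:
        return True
```

With per_device=2, two requests for different partitions of one device proceed together, a third is refused, and a second request for a partition that is already held is refused even when a device slot is free.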
Change-Id: Id053ed7dd355d414d7920dda79a968a1c6677c14
The get_hub function was added in commit b155da42 with the idea of bypassing
eventlet's automatic hub selection, which prefers epoll by default if available.
Since version 0.20.0, eventlet has removed the select.poll() function from its
patched select module (eventlet.green.select), see:
- https://github.com/eventlet/eventlet/commit/614a20462
So if eventlet monkey patching is done before a get_hub() call (as is now the
case in wsgi.py since commit c9410c7d), 'import select' returns the eventlet
version, which doesn't have a poll attribute.
To prevent that, we use the eventlet.patcher.original function to get the Python
select module and test whether poll() is available on the current platform.
Change-Id: I69b3db3951b3d3b6583845978deb2883492e7f0f
Closes-Bug: 1804627
Currently, the reconstructor will not remove empty object and suffix
directories after processing a revert job. This will only happen during
its next run.
This patch will attempt to remove these empty directories immediately,
while we have the inodes cached.
Change-Id: I5dfc145b919b70ab7dae34fb124c8a25ba77222f
We kept hitting an intermittent error in the test, where the first ismount
in the test succeeds, while it should fail. As it turned out,
the return of gettempdir was the plain /tmp. So, a previous test
created /tmp/.ismount and the subsequent runs failed on it.
Re-generating the root filesystem (e.g. by a container) fixes
the problem, but still, there's no need to do this. This change
tightens the test up by placing the .ismount into a subdirectory
of the test directory instead of the global /tmp.
Change-Id: I006ba1f69982ef7513db3508d691723656f576c9
This is useful for deallocating disk blocks as part of an alternate disk
file implementation.
Additionally, add an offset argument to the existing fallocate utility
function; this allows you to grow an existing file.
Sam always had the best descriptions:
utils.fallocate(fd, size) allocates <size> bytes for the file referred
to by <fd>. It allows for keeping a reserve of an additional N bytes
or X% of the filesystem free. If neither the fallocate() nor the
posix_fallocate() C function is available, utils.fallocate() will
log a warning (once only) and not actually allocate space.
utils.punch_hole(fd, offset, length) deallocates <length> bytes
starting at <offset> from the file referred to by <fd>. It uses the C
function fallocate(). If fallocate() is not available, calls to
utils.punch_hole() will raise an exception.
Since these both use the fallocate syscall, refactor that a bit and get
rid of FallocateWrapper. We add a new _LibcWrapper to do some
lazy-loading of a C function and expose whether the function is actually
available in Python, though. This allows utils.fallocate and
utils.punch_hole to keep their fancy logic pretty well-contained.
Modernized the tests for utils.fallocate() and utils.punch_hole().
Co-Authored-By: Samuel Merritt <sam@swiftstack.com>
Change-Id: Ieac30a477d784905c94742ee3d0898d7e0194b39
With this commit, each storage policy can define the diskfile to use to
access objects. Selection of the diskfile is done in swift.conf.
Example:
[storage-policy:0]
name = gold
policy_type = replication
default = yes
diskfile = egg:swift#replication.fs
The diskfile configuration item accepts the same format as middleware
declarations: [[scheme:]egg_name#]entry_point
The egg_name is optional and defaults to "swift". The scheme is optional
and defaults to the only valid value, "egg". The upstream entry points are
"replication.fs" and "erasure_coding.fs".
Co-Authored-By: Alexandre Lécuyer <alexandre.lecuyer@corp.ovh.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I070c21bc1eaf1c71ac0652cec9e813cadcc14851
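
For illustration, the [[scheme:]egg_name#]entry_point format and its defaults from the message above can be parsed as follows. The function name is hypothetical and this is a sketch, not the loader Swift uses.

```python
def parse_diskfile_spec(spec):
    """Split '[[scheme:]egg_name#]entry_point' into its three parts."""
    scheme, egg_name = 'egg', 'swift'  # defaults stated in the message
    if '#' in spec:
        prefix, entry_point = spec.split('#', 1)
        if ':' in prefix:
            scheme, egg_name = prefix.split(':', 1)
        else:
            egg_name = prefix
    else:
        entry_point = spec
    if scheme != 'egg':
        # "egg" is described as the only valid scheme.
        raise ValueError('unsupported scheme: %r' % scheme)
    if not egg_name or not entry_point:
        raise ValueError('invalid diskfile spec: %r' % spec)
    return scheme, egg_name, entry_point


full = parse_diskfile_spec('egg:swift#replication.fs')
short = parse_diskfile_spec('erasure_coding.fs')
```

Both the fully qualified and the bare entry point forms resolve to the "egg" scheme and the "swift" egg.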
The object server can be configured to leave a certain amount of disk
space free; default is 1%. This is useful in avoiding 100%-full
filesystems, as those can get Swift in a state where the filesystem is
too full to write tombstones, so you can't delete objects to free up
space.
When a cluster has accounts/containers and objects on the same disks,
then you can wind up with a 100%-full disk since account and container
servers don't respect fallocate_reserve. This commit makes account and
container servers respect fallocate_reserve so that disks shared
between account/container and object rings won't get 100% full.
When a disk's free space falls below the configured reserve, account
and container PUT, POST, and REPLICATE requests will fail with a 507
status code. These are the operations that can significantly increase
the disk space used by a given database.
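
The check described above can be sketched like this. The helper names and the (reserve, is_percent) representation of the setting are assumptions for the example; only the 507 status and the list of affected methods come from the message.

```python
import os

HTTP_INSUFFICIENT_STORAGE = 507
# Request methods named above as subject to the check.
GUARDED_METHODS = ('PUT', 'POST', 'REPLICATE')


def below_reserve(free_bytes, total_bytes, reserve, is_percent):
    """Return True if free space has fallen below the reserve."""
    if is_percent:
        if total_bytes <= 0:
            return True
        return free_bytes * 100.0 / total_bytes < reserve
    return free_bytes < reserve


def disk_usage(path):
    """Return (free_bytes, total_bytes) for the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_bavail, st.f_frsize * st.f_blocks


def precheck(method, free_bytes, total_bytes, reserve=1, is_percent=True):
    """Return 507 if the request must be refused, else None."""
    if method in GUARDED_METHODS and below_reserve(
            free_bytes, total_bytes, reserve, is_percent):
        return HTTP_INSUFFICIENT_STORAGE
    return None
```

A server would feed the result of disk_usage() for the database's device into precheck() before doing any work for the request; reads are never refused.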
I called the parameter "fallocate_reserve" for consistency with the
object server. No actual fallocate() call happens under Swift's
control in the account or container servers (sqlite3 might make such a
call, but it's out of our hands).
Change-Id: I083442eef14bf83c0ea717b1decb3e6b56dbf1d0
This patch adds an optional parameter to tempurl
which restricts the IPs from which a temp URL can be used.
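
A minimal sketch of the address check only, using the standard ipaddress module. The function name and the choice to accept either a single address or a CIDR range are assumptions; how the restriction is carried in the URL and signed is left to the patch itself.

```python
import ipaddress


def ip_allowed(client_ip, ip_range):
    """Return True if client_ip may use a temp URL limited to ip_range.

    ip_range may be a single address or a CIDR network; None means the
    temp URL carries no IP restriction.
    """
    if ip_range is None:
        return True
    network = ipaddress.ip_network(ip_range, strict=False)
    return ipaddress.ip_address(client_ip) in network


inside = ip_allowed('192.0.2.10', '192.0.2.0/24')
outside = ip_allowed('198.51.100.1', '192.0.2.0/24')
```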
Change-Id: I23fe998a980960d4a32df042b3f6a21f096c36af
In CPython commit e59af55c2, instantiating a logging.SysLogHandler
stopped raising an exception if the syslog server was
unavailable. This commit first appears in CPython
3.5.4. utils.get_logger() catches that error and retries the
instantiation, and there is a test asserting that. The test fails on
Python 3.5.4 or greater, so now it has been corrected to only assert
things about the first instantiation of logging.SysLogHandler and
passes on Python 3.5.4 and 3.5.5.
This was noticed by running "tox -e py35" on an Ubuntu 18.04 system,
which ships with Python 3.5.5.
Change-Id: I43f231bd7d3566b9849a48f46ec9e2af4cd23be4
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
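
The state progression in the workflow above can be sketched with plain dicts. This toy model is an assumption-laden illustration: it ignores auditing, misplaced objects and replication, and the function name is hypothetical.

```python
FOUND, CREATED, CLEAVED = 'found', 'created', 'cleaved'


def process_visit(shard_ranges, cleave_batch_size=2):
    """Simulate one sharder visit; return how many ranges were cleaved."""
    # FOUND -> CREATED: a shard container is created for each range.
    for shard_range in shard_ranges:
        if shard_range['state'] == FOUND:
            shard_range['state'] = CREATED
    # CREATED -> CLEAVED: at most one batch of ranges per visit.
    cleaved = 0
    for shard_range in shard_ranges:
        if cleaved >= cleave_batch_size:
            break
        if shard_range['state'] == CREATED:
            shard_range['state'] = CLEAVED
            cleaved += 1
    return cleaved


ranges = [{'state': FOUND} for _ in range(5)]
per_visit = [process_visit(ranges) for _ in range(4)]
```

With five ranges and the default batch size of 2, three visits are needed before every range reaches the CLEAVED state.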
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
Enable the proxy to fetch a shard container location from the
container server in order to redirect an object update to the shard.
Enable the container server to redirect object updates to shard
containers.
Enable object updater to accept redirection of an object update.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I6ff85827eecdea746b3626c0d401f68139cce19d