Commit e199192caefef068b5bf57da8b878e0bc82e3453 introduced the ability
to run multiple SSYNC requests on a single device, but it lacks a
safeguard to ensure that only one SSYNC request can be running on a
partition. This commit updates replication_lock to allow up to N
concurrent locks on the device, then take a single lock on the
partition related to the SSYNC request.
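A minimal sketch of the two-level scheme, using flock on lock files
(the layout and names here are illustrative, not the actual
replication_lock implementation):

import errno
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def replication_lock(device_dir, partition_dir, slots):
    # Take one of <slots> non-blocking locks on the device...
    dev_fd = None
    for i in range(slots):
        fd = os.open(os.path.join(device_dir, '.replication_lock-%d' % i),
                     os.O_CREAT | os.O_WRONLY)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            dev_fd = fd
            break
        except IOError:
            os.close(fd)
    if dev_fd is None:
        raise IOError(errno.EAGAIN, 'all device slots are busy')
    try:
        # ...then exactly one lock on the partition.
        part_fd = os.open(os.path.join(partition_dir, '.partition_lock'),
                          os.O_CREAT | os.O_WRONLY)
        try:
            fcntl.flock(part_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            yield
        finally:
            os.close(part_fd)
    finally:
        os.close(dev_fd)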
Change-Id: Id053ed7dd355d414d7920dda79a968a1c6677c14
The get_hub function was added in commit b155da42 to bypass eventlet's
automatic hub selection, which prefers epoll when it is available.
Since version 0.20.0, eventlet has removed the select.poll() function
from its patched select module (eventlet.green.select); see:
- https://github.com/eventlet/eventlet/commit/614a20462
So if eventlet monkey patching is done before a get_hub() call (as now
happens in wsgi.py since commit c9410c7d), 'import select' yields the
eventlet version, which has no poll attribute.
To prevent that, we use the eventlet.patcher.original function to get
the unpatched Python select module and test whether poll() is available
on the current platform.
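The check looks roughly like this (a sketch of the approach, not
necessarily the exact code):

import eventlet.patcher

# Get the unpatched stdlib select module, even after monkey patching;
# eventlet.green.select may lack poll().
select = eventlet.patcher.original('select')
hub = 'poll' if hasattr(select, 'poll') else 'selects'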
Change-Id: I69b3db3951b3d3b6583845978deb2883492e7f0f
Closes-Bug: 1804627
Currently, the reconstructor does not remove empty object and suffix
directories after processing a revert job; they are only removed during
its next run.
This patch will attempt to remove these empty directories immediately,
while we have the inodes cached.
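A sketch of the immediate cleanup, assuming the usual "rmdir and
ignore non-empty" pattern (this is not the actual reconstructor code):

import errno
import os

def remove_if_empty(path):
    # rmdir only succeeds on empty directories, so it is safe to
    # attempt opportunistically while the inodes are still cached.
    try:
        os.rmdir(path)
        return True
    except OSError as err:
        if err.errno not in (errno.ENOTEMPTY, errno.ENOENT):
            raise
        return False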
Change-Id: I5dfc145b919b70ab7dae34fb124c8a25ba77222f
We kept hitting a flaky failure in the test, where the first ismount
call in the test succeeds while it should fail. As it turned out,
the return of gettempdir() was the plain /tmp. So, a previous test
created /tmp/.ismount and the subsequent runs failed on it.
Re-generating the root filesystem (e.g. by a container) fixes
the problem, but still, there's no need to do this. This change
tightens the test up by placing the .ismount into a subdirectory
of the test directory instead of the global /tmp.
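A sketch of the tightened setup (names are illustrative):

import os
import tempfile

# Use a per-test directory instead of the shared system /tmp, so a
# stray .ismount marker can't leak between test runs.
testdir = tempfile.mkdtemp()
fake_mount = os.path.join(testdir, 'mnt')
os.mkdir(fake_mount)
open(os.path.join(fake_mount, '.ismount'), 'w').close()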
Change-Id: I006ba1f69982ef7513db3508d691723656f576c9
This commit adds utils.punch_hole(), which is useful for deallocating
disk blocks as part of an alternate disk file implementation.
Additionally, add an offset argument to the existing fallocate utility
function; this allows you to grow an existing file.
Sam always had the best descriptions:
utils.fallocate(fd, size) allocates <size> bytes for the file referred
to by <fd>. It allows for keeping a reserve of an additional N bytes
or X% of the filesystem free. If neither the fallocate() nor
posix_fallocate() C function is available, utils.fallocate() will
log a warning (once only) and not actually allocate space.
utils.punch_hole(fd, offset, length) deallocates <length> bytes
starting at <offset> from the file referred to by <fd>. It uses the C
function fallocate(). If fallocate() is not available, calls to
utils.punch_hole() will raise an exception.
Since these both use the fallocate syscall, refactor that a bit and get
rid of FallocateWrapper. We add a new _LibcWrapper that lazy-loads a C
function and exposes whether that function is actually available in
Python. This keeps the fancy logic in utils.fallocate and
utils.punch_hole well-contained.
Modernized the tests for utils.fallocate() and utils.punch_hole().
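For reference, hole punching boils down to a C call like this (a
hedged sketch of the underlying syscall usage, not Swift's actual
wrapper; it assumes a 64-bit off_t):

import ctypes
import ctypes.util
import os

# Constants from linux/falloc.h
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_int64, ctypes.c_int64]

def punch_hole(fd, offset, length):
    # Deallocate <length> bytes at <offset> without changing file size.
    ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, length)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))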
Co-Authored-By: Samuel Merritt <sam@swiftstack.com>
Change-Id: Ieac30a477d784905c94742ee3d0898d7e0194b39
With this commit, each storage policy can define the diskfile to use to
access objects. Selection of the diskfile is done in swift.conf.
Example:
[storage-policy:0]
name = gold
policy_type = replication
default = yes
diskfile = egg:swift#replication.fs
The diskfile configuration option accepts the same format as middleware
declarations: [[scheme:]egg_name#]entry_point
The egg_name is optional and defaults to "swift". The scheme is optional
and defaults to "egg", the only valid value. The upstream entry points
are "replication.fs" and "erasure_coding.fs".
Co-Authored-By: Alexandre Lécuyer <alexandre.lecuyer@corp.ovh.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I070c21bc1eaf1c71ac0652cec9e813cadcc14851
The object server can be configured to leave a certain amount of disk
space free; default is 1%. This is useful in avoiding 100%-full
filesystems, as those can get Swift in a state where the filesystem is
too full to write tombstones, so you can't delete objects to free up
space.
When a cluster has accounts/containers and objects on the same disks,
then you can wind up with a 100%-full disk since account and container
servers don't respect fallocate_reserve. This commit makes account and
container servers respect fallocate_reserve so that disks shared
between account/container and object rings won't get 100% full.
When a disk's free space falls below the configured reserve, account
and container PUT, POST, and REPLICATE requests will fail with a 507
status code. These are the operations that can significantly increase
the disk space used by a given database.
I called the parameter "fallocate_reserve" for consistency with the
object server. No actual fallocate() call happens under Swift's
control in the account or container servers (sqlite3 might make such a
call, but it's out of our hands).
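The free-space check behind a 507 amounts to something like this (a
sketch, not the servers' actual helper; it assumes the reserve is given
either in bytes or as a percentage, as fallocate_reserve allows):

import os

def has_enough_space(path, reserve, is_percent=False):
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    if is_percent:
        total = st.f_blocks * st.f_frsize
        return free >= total * reserve / 100.0
    return free >= reserve

# If this returns False, respond 507 Insufficient Storage.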
Change-Id: I083442eef14bf83c0ea717b1decb3e6b56dbf1d0
This patch adds an optional parameter to tempurl that restricts
the IPs from which a temp URL can be used.
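Signing such a URL might look roughly like this; the exact
signed-message format (here, an 'ip=<range>' line prepended to the
usual method/expires/path) and the query parameter name are
assumptions, so check the middleware documentation:

import hmac
from hashlib import sha1
from time import time

key = b'mysecretkey'
method, path = 'GET', '/v1/AUTH_test/c/o'
expires = int(time() + 3600)
message = 'ip=203.0.113.10\n%s\n%d\n%s' % (method, expires, path)
sig = hmac.new(key, message.encode('utf8'), sha1).hexdigest()
url = '%s?temp_url_sig=%s&temp_url_expires=%d' \
      '&temp_url_ip_range=203.0.113.10' % (path, sig, expires)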
Change-Id: I23fe998a980960d4a32df042b3f6a21f096c36af
In CPython commit e59af55c2, instantiating a logging.SysLogHandler
stopped raising an exception if the syslog server was
unavailable. This commit first appears in CPython
3.5.4. utils.get_logger() catches that error and retries the
instantiation, and there is a test asserting that. The test fails on
Python 3.5.4 or greater, so it has now been corrected to only assert
things about the first instantiation of logging.SysLogHandler and
passes on Python 3.5.4 and 3.5.5.
This was noticed by running "tox -e py35" on an Ubuntu 18.04 system,
which ships with Python 3.5.5.
Change-Id: I43f231bd7d3566b9849a48f46ec9e2af4cd23be4
The sharder daemon visits container dbs and, when necessary, executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.
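For example (db path and row count are illustrative):

swift-manage-shard-ranges <container_db_path> find_and_replace 500000 --enable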
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
Enable the proxy to fetch a shard container location from the
container server in order to redirect an object update to the shard.
Enable the container server to redirect object updates to shard
containers.
Enable object updater to accept redirection of an object update.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I6ff85827eecdea746b3626c0d401f68139cce19d
With this patch the ContainerBroker gains several new features:
1. A shard_ranges table to persist ShardRange data, along with
methods to merge and access ShardRange instances to that table,
and to remove expired shard ranges.
2. The ability to create a fresh db file to replace the existing db
file. Fresh db files are named using the hash of the container path
plus an epoch which is a serialized Timestamp value, in the form:
<hash>_<epoch>.db
During sharding both the fresh and retiring db files co-exist on
disk. The ContainerBroker is now able to choose the newest on-disk db
file when instantiated. It also provides a method (get_brokers()) to
gain access to a broker instance for each on-disk file. (A sketch of
the fresh db naming follows this list.)
3. Methods to access the current state of the on-disk db files, i.e.
UNSHARDED (old file only), SHARDING (fresh and retiring files), or
SHARDED (fresh file only with shard ranges).
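The naming scheme from item 2, roughly (the helper name is
illustrative, not the broker's real API):

def fresh_db_name(container_hash, epoch_timestamp):
    # epoch_timestamp is a serialized Timestamp, e.g. '1525354800.00000'
    return '%s_%s.db' % (container_hash, epoch_timestamp)

# fresh_db_name('a83de...', '1525354800.00000')
# -> 'a83de..._1525354800.00000.db'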
Container replication is also modified:
1. shard ranges are replicated between container db peers. Unlike
objects, shard ranges are both pushed and pulled during a REPLICATE
event.
2. If a container db is capable of being sharded (i.e. it has a set of
shard ranges) then it will no longer attempt to replicate objects to
its peers. Object record durability is achieved by sharding rather than
peer-to-peer replication.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Ie4d2816259e6c25c346976e181fb9d350f947190
A ShardRange represents the part of the object namespace that
is managed by a container. It encapsulates:
- the namespace range, from an excluded lower bound to an included upper bound
- the object count and bytes used in the range
- the current state of the range, including whether it is deleted or not
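The bounds semantics in a nutshell (a toy sketch, not the real
ShardRange class):

class ShardRangeSketch(object):
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def __contains__(self, name):
        # lower bound excluded, upper bound included
        return self.lower < name <= self.upper

# 'm' in ShardRangeSketch('a', 'm')  -> True
# 'a' in ShardRangeSketch('a', 'm')  -> False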
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kazuhiro MIYAHARA <miyahara.kazuhiro@lab.ntt.co.jp>
Change-Id: Iae090dc170843f15fd2a3ea8f167bec2848e928d
...in preparation for the container sharding feature.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I4455677abb114a645cff93cd41b394d227e805de
The object reconstructor will now fork all available worker processes
when operating on a subset of local devices.
Example:
A system has 24 disks, named "d1" through "d24"
reconstructor_workers = 8
invoked with --override-devices=d1,d2,d3,d4,d5,d6
In this case, the reconstructor will now use 6 worker processes, one
per disk. The old behavior was to use 2 worker processes, one for d1,
d3, and d5 and the other for d2, d4, and d6 (because 24 / 8 = 3, so we
assigned 3 disks per worker before creating another).
I think the new behavior better matches operators' expectations. If I
give a concurrent program six tasks to do and tell it to operate on up
to eight at a time, I'd expect it to do all six tasks at once, not run
two concurrent batches of three tasks apiece.
This has no effect when --override-devices is not specified. When
operating on all local devices instead of a subset, the new and old
code produce the same result.
The reconstructor's behavior now matches the object replicator's
behavior.
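The new worker-count logic is essentially (a sketch, not the actual
reconstructor code):

def num_workers(reconstructor_workers, devices):
    # One worker per device, capped by the configured maximum.
    return min(reconstructor_workers, len(devices))

# num_workers(8, ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']) -> 6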
Change-Id: Ib308c156c77b9b92541a12dd7e9b1a8ea8307a30
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
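Example (assuming the standard object server config layout):

[object-replicator]
replicator_workers = 4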
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
Seen during a restart-storm:
Traceback (most recent call last):
File ".../swift/common/db_replicator.py", line 134, in replicate
{'Content-Type': 'application/json'})
File ".../httplib.py", line 1057, in request
self._send_request(method, url, body, headers)
File ".../httplib.py", line 1097, in _send_request
self.endheaders(body)
File ".../httplib.py", line 1053, in endheaders
self._send_output(message_body)
File ".../httplib.py", line 897, in _send_output
self.send(msg)
File ".../httplib.py", line 859, in send
self.connect()
File ".../swift/common/bufferedhttp.py", line 108, in connect
return HTTPConnection.connect(self)
File ".../httplib.py", line 836, in connect
self.timeout, self.source_address)
File ".../eventlet/green/socket.py", line 72, in create_connection
raise err
error: [Errno 104] ECONNRESET
Traceback (most recent call last):
File ".../swift/obj/replicator.py", line 282, in update
'', headers=self.headers).getresponse()
File ".../swift/common/bufferedhttp.py", line 157, in http_connect
ipaddr, port, method, path, headers, query_string, ssl)
File ".../swift/common/bufferedhttp.py", line 189, in http_connect_raw
conn.endheaders()
File ".../httplib.py", line 1053, in endheaders
self._send_output(message_body)
File ".../httplib.py", line 897, in _send_output
self.send(msg)
File ".../httplib.py", line 859, in send
self.connect()
File ".../swift/common/bufferedhttp.py", line 108, in connect
return HTTPConnection.connect(self)
File ".../httplib.py", line 836, in connect
self.timeout, self.source_address)
File ".../eventlet/green/socket.py", line 72, in create_connection
raise err
error: [Errno 101] ENETUNREACH
Traceback (most recent call last):
File ".../swift/obj/replicator.py", line 282, in update
'', headers=self.headers).getresponse()
File ".../swift/common/bufferedhttp.py", line 123, in getresponse
response = HTTPConnection.getresponse(self)
File ".../httplib.py", line 1136, in getresponse
response.begin()
File ".../httplib.py", line 453, in begin
version, status, reason = self._read_status()
File ".../httplib.py", line 417, in _read_status
raise BadStatusLine(line)
BadStatusLine: ''
(Different transactions, of course.)
Change-Id: I07192b8d2ece2d2ee04fe0d877ead6fbfc321d86
Unit tests using O_TMPFILE rely only on the kernel version to check
for the feature. This is wrong, as some filesystems, like tmpfs, don't
support O_TMPFILE.
So, instead of checking the kernel version, this patch actually attempts
to open a file using O_TMPFILE to see whether that's supported. If not,
the test is skipped.
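The probe looks roughly like this, assuming Python 3's os.O_TMPFILE
(on Python 2 the flag value would need defining by hand):

import errno
import os

def supports_o_tmpfile(dirpath):
    try:
        fd = os.open(dirpath, os.O_TMPFILE | os.O_WRONLY)
    except OSError as e:
        # EOPNOTSUPP: filesystem lacks support; EINVAL/EISDIR: old kernel
        if e.errno in (errno.EOPNOTSUPP, errno.EINVAL, errno.EISDIR):
            return False
        raise
    os.close(fd)
    return True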
Change-Id: I5d652f1634b1ef940838573cfdd799ea17b8b572
Reviewer, beware: we determined that the test was using the
facilities improperly. This patch adjusts the test but does
not fix the code under test.
The time.time() output looks like this:
[zaitcev@lembas swift-tsrep]$ python2
Python 2.7.14 (default, Dec 11 2017, 14:52:53)
[GCC 7.2.1 20170915 (Red Hat 7.2.1-2)] on linux2
>>> import time
>>> time.time()
1519861559.96239
>>> time.time()
1519861561.046204
>>> time.time()
1519861561.732341
>>>
(it's never beyond 6 digits on py2)
[zaitcev@lembas swift-tsrep]$ python3
Python 3.6.3 (default, Oct 9 2017, 12:07:10)
[GCC 7.2.1 20170915 (Red Hat 7.2.1-2)] on linux
>>> import time
>>> time.time()
1519861541.7662468
>>> time.time()
1519861542.893482
>>> time.time()
1519861546.56222
>>> time.time()
1519861547.3297756
>>>
(can go beyond 6 digits on py3)
When fraction is too long on py3, you get:
>>> now = 1519830570.6949349
>>> now
1519830570.6949348
>>> timestamp = Timestamp(now, offset=1)
>>> timestamp
1519830570.69493_0000000000000001
>>> value = '%f' % now
>>> value
'1519830570.694935'
>>> timestamp > value
False
>>>
Note that the test fails in exactly the same way on py2, if time.time()
returns enough digits. Therefore, rounding changes are not the culprit.
The real problem is the assumption that you can take a float T, print
it with '%f' into S, then do arithmetic on T to get O, convert S, T,
and O into Timestamp, then make comparisons. This does not work,
because rounding happens twice: once when you interpolate %f, and
then when you construct a Timestamp. The only valid operation is
to accept a timestamp (e.g. from X-Delete-At) as a floating point
number or a decimal string, and convert it once. Only then can you
do arithmetic to find the expiration.
Change-Id: Ie3b002abbd4734c675ee48a7535b8b846032f9d1
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port:
* cli/ringbuilder.py
* common/linkat.py
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
The goal is to make the successful statsd buckets
(e.g. "object-server.GET.timing") have timing information for all the
requests that the server handled correctly, while the error buckets
(e.g. "object-server.GET.errors.timing") have the rest.
Currently, we don't do that great a job of it. We special-case a few
4xx status codes (404, 412, 416) to not count as errors, but we leave
some pretty large holes. If you're graphing errors, you'll see spikes
when clients send bogus requests (400) or fail to
re-authenticate (403). You'll also see spikes when your drives are
unmounted (507) and when there's bugs that need fixing (500).
This commit makes .errors.timing be just 5xx in the hope that its
graph will be more useful.
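In other words, bucket selection becomes roughly (a sketch, not the
actual code):

def timing_bucket(server, method, status_int):
    # Only 5xx responses land in the errors bucket now.
    if status_int // 100 == 5:
        return '%s.%s.errors.timing' % (server, method)
    return '%s.%s.timing' % (server, method)

# timing_bucket('object-server', 'GET', 404)
# -> 'object-server.GET.timing'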
Change-Id: I92b41bcbb880c0688c37ab231c19ebe984b18215