With this patch the ContainerBroker gains several new features:
1. A shard_ranges table to persist ShardRange data, along with
methods to merge ShardRange instances into that table, to access
them, and to remove expired shard ranges.
2. The ability to create a fresh db file to replace the existing db
file. Fresh db files are named using the hash of the container path
plus an epoch, which is a serialized Timestamp value, in the form:
<hash>_<epoch>.db
During sharding both the fresh and retiring db files co-exist on
disk. The ContainerBroker is now able to choose the newest on-disk db
file when instantiated. It also provides a method (get_brokers()) to
gain access to a broker instance for each on-disk file.
3. Methods to access the current state of the on-disk db files, i.e.
UNSHARDED (old file only), SHARDING (fresh and retiring files), or
SHARDED (fresh file only with shard ranges); see the sketch below.
Container replication is also modified:
1. Shard ranges are replicated between container db peers. Unlike
objects, shard ranges are both pushed and pulled during a REPLICATE
event.
2. If a container db is capable of being sharded (i.e. it has a set of
shard ranges) then it will no longer attempt to replicate objects to
its peers. Object record durability is achieved by sharding rather than
peer-to-peer replication.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Ie4d2816259e6c25c346976e181fb9d350f947190
A ShardRange represents the part of the object namespace that
is managed by a container. It encapsulates:
- the namespace range, from an excluded lower bound to an included upper bound
- the object count and bytes used in the range
- the current state of the range, including whether it is deleted or not
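A minimal sketch of the data a ShardRange carries, per the bullets above.
The class, field names and values here are illustrative assumptions, not
Swift's actual ShardRange API.

from collections import namedtuple

ShardRangeSketch = namedtuple(
    'ShardRangeSketch',
    ['name', 'lower', 'upper', 'object_count', 'bytes_used',
     'state', 'deleted'])

# An object name belongs to the range when lower < name <= upper
# (lower bound excluded, upper bound included).
sr = ShardRangeSketch(name='.shards_a/c-1', lower='cat', upper='dog',
                      object_count=42, bytes_used=1024 * 1024,
                      state='active', deleted=False)
assert not ('cat' > sr.lower and 'cat' <= sr.upper)  # lower bound excluded
assert 'dog' > sr.lower and 'dog' <= sr.upper        # upper bound included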
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kazuhiro MIYAHARA <miyahara.kazuhiro@lab.ntt.co.jp>
Change-Id: Iae090dc170843f15fd2a3ea8f167bec2848e928d
...in preparation for the container sharding feature.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I4455677abb114a645cff93cd41b394d227e805de
The object reconstructor will now fork all available worker processes
when operating on a subset of local devices.
Example:
A system has 24 disks, named "d1" through "d24"
reconstructor_workers = 8
invoked with --override-devices=d1,d2,d3,d4,d5,d6
In this case, the reconstructor will now use 6 worker processes, one
per disk. The old behavior was to use 2 worker processes, one for d1,
d3, and d5 and the other for d2, d4, and d6 (because 24 / 8 = 3, so we
assigned 3 disks per worker before creating another).
I think the new behavior better matches operators' expectations. If I
give a concurrent program six tasks to do and tell it to operate on up
to eight at a time, I'd expect it to do all six tasks at once, not run
two concurrent batches of three tasks apiece.
This has no effect when --override-devices is not specified. When
operating on all local devices instead of a subset, the new and old
code produce the same result.
The reconstructor's behavior now matches the object replicator's
behavior.
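A hedged sketch contrasting the old and new worker-count calculations
described above. Function and variable names are illustrative, not the
reconstructor's real code.

def old_worker_count(total_disks, configured_workers, override_devices):
    # old: disks-per-worker derived from *all* local disks, then applied
    # to the override subset (24 / 8 = 3 disks per worker -> 2 workers)
    per_worker = max(1, total_disks // configured_workers)
    return max(1, (len(override_devices) + per_worker - 1) // per_worker)

def new_worker_count(total_disks, configured_workers, override_devices):
    # new: never more workers than devices actually being worked on,
    # capped by the configured limit
    return min(configured_workers, len(override_devices))

devices = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']
print(old_worker_count(24, 8, devices))  # 2
print(new_worker_count(24, 8, devices))  # 6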
Change-Id: Ib308c156c77b9b92541a12dd7e9b1a8ea8307a30
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
[worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
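A sketch of one way to build the "[worker M/N pid=P] " prefix with a
stdlib LoggerAdapter. This only illustrates the message format; it is not
the replicator's actual logging plumbing.

import logging
import os

class WorkerPrefixAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        # prepend "[worker M/N pid=P] " to every message from this logger
        prefix = '[worker %d/%d pid=%d] ' % (
            self.extra['index'], self.extra['total'], os.getpid())
        return prefix + msg, kwargs

logging.basicConfig(level=logging.INFO)
log = WorkerPrefixAdapter(logging.getLogger('object-replicator'),
                          {'index': 1, 'total': 2})
log.info('154/154 partitions replicated in 1.02s')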
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
Seen during a restart storm:
Traceback (most recent call last):
File ".../swift/common/db_replicator.py", line 134, in replicate
{'Content-Type': 'application/json'})
File ".../httplib.py", line 1057, in request
self._send_request(method, url, body, headers)
File ".../httplib.py", line 1097, in _send_request
self.endheaders(body)
File ".../httplib.py", line 1053, in endheaders
self._send_output(message_body)
File ".../httplib.py", line 897, in _send_output
self.send(msg)
File ".../httplib.py", line 859, in send
self.connect()
File ".../swift/common/bufferedhttp.py", line 108, in connect
return HTTPConnection.connect(self)
File ".../httplib.py", line 836, in connect
self.timeout, self.source_address)
File ".../eventlet/green/socket.py", line 72, in create_connection
raise err
error: [Errno 104] ECONNRESET
Traceback (most recent call last):
File ".../swift/obj/replicator.py", line 282, in update
'', headers=self.headers).getresponse()
File ".../swift/common/bufferedhttp.py", line 157, in http_connect
ipaddr, port, method, path, headers, query_string, ssl)
File ".../swift/common/bufferedhttp.py", line 189, in http_connect_raw
conn.endheaders()
File ".../httplib.py", line 1053, in endheaders
self._send_output(message_body)
File ".../httplib.py", line 897, in _send_output
self.send(msg)
File ".../httplib.py", line 859, in send
self.connect()
File ".../swift/common/bufferedhttp.py", line 108, in connect
return HTTPConnection.connect(self)
File ".../httplib.py", line 836, in connect
self.timeout, self.source_address)
File ".../eventlet/green/socket.py", line 72, in create_connection
raise err
error: [Errno 101] ENETUNREACH
Traceback (most recent call last):
File ".../swift/obj/replicator.py", line 282, in update
'', headers=self.headers).getresponse()
File ".../swift/common/bufferedhttp.py", line 123, in getresponse
response = HTTPConnection.getresponse(self)
File ".../httplib.py", line 1136, in getresponse
response.begin()
File ".../httplib.py", line 453, in begin
version, status, reason = self._read_status()
File ".../httplib.py", line 417, in _read_status
raise BadStatusLine(line)
BadStatusLine: ''
(Different transactions, of course.)
Change-Id: I07192b8d2ece2d2ee04fe0d877ead6fbfc321d86
Unit tests using O_TMPFILE rely only on the kernel version to check
for the feature. This is wrong, as some filesystems, like tmpfs, don't
support O_TMPFILE.
So, instead of checking the kernel version, this patch actually attempts
to open a file using O_TMPFILE to see whether it is supported. If not,
the test is skipped.
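A sketch of the kind of probe this describes: attempt an O_TMPFILE open
and report whether the filesystem accepts it. Details (errno handling,
helper name) are assumptions, not the exact test helper.

import errno
import os
import tempfile

def supports_o_tmpfile(dirpath=None):
    o_tmpfile = getattr(os, 'O_TMPFILE', None)
    if o_tmpfile is None:
        return False  # Python build / platform without O_TMPFILE
    dirpath = dirpath or tempfile.gettempdir()
    try:
        fd = os.open(dirpath, o_tmpfile | os.O_WRONLY)
    except OSError as e:
        if e.errno in (errno.EINVAL, errno.EISDIR, errno.EOPNOTSUPP):
            return False  # kernel or filesystem refuses O_TMPFILE
        raise
    os.close(fd)
    return True

print(supports_o_tmpfile())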
Change-Id: I5d652f1634b1ef940838573cfdd799ea17b8b572
Reviewer, beware: we determined that the test was using the
facilities improperly. This patch adjusts the test but does
not fix the code under test.
The time.time() output looks like this:
[zaitcev@lembas swift-tsrep]$ python2
Python 2.7.14 (default, Dec 11 2017, 14:52:53)
[GCC 7.2.1 20170915 (Red Hat 7.2.1-2)] on linux2
>>> import time
>>> time.time()
1519861559.96239
>>> time.time()
1519861561.046204
>>> time.time()
1519861561.732341
>>>
(it's never beyond 6 digits on py2)
[zaitcev@lembas swift-tsrep]$ python3
Python 3.6.3 (default, Oct 9 2017, 12:07:10)
[GCC 7.2.1 20170915 (Red Hat 7.2.1-2)] on linux
>>> import time
>>> time.time()
1519861541.7662468
>>> time.time()
1519861542.893482
>>> time.time()
1519861546.56222
>>> time.time()
1519861547.3297756
>>>
(can go beyond 6 digits on py3)
When fraction is too long on py3, you get:
>>> now = 1519830570.6949349
>>> now
1519830570.6949348
>>> timestamp = Timestamp(now, offset=1)
>>> timestamp
1519830570.69493_0000000000000001
>>> value = '%f' % now
>>> value
'1519830570.694935'
>>> timestamp > value
False
>>>
Note that the test fails in exactly the same way on py2, if time.time()
returns enough digits. Therefore, rounding changes are not the culprit.
The real problem is the assumption that you can take a float T, print
it with '%f' into S, then do arithmetic on T to get O, convert S, T,
and O into Timestamp, then make comparisons. This does not work,
because rounding happens twice: once when you interpolate %f, and
again when you construct a Timestamp. The only valid operation is
to accept a timestamp (e.g. from X-Delete-At) as a floating point
number or a decimal string, and convert it once. Only then can you
do arithmetic to find the expiration.
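A hedged illustration of the convert-once rule using plain Decimal
arithmetic; the real code deals in Swift Timestamp objects, so the values
and the 30-second delay below are only for illustration.

from decimal import Decimal

raw = '1519830570.6949349'          # e.g. an X-Delete-At style value

# Bad: format the float with %f (rounds to 6 digits), then compare it
# against values derived from the unrounded float -- two roundings.
as_float = float(raw)
reformatted = '%f' % as_float       # '1519830570.694935'

# Good: convert the client-supplied value exactly once, then do all
# arithmetic and comparisons on that single converted value.
ts = Decimal(raw)
expiry = ts + 30
print(reformatted, ts, expiry)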
Change-Id: Ie3b002abbd4734c675ee48a7535b8b846032f9d1
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port
* cli/ringbuilder.py,
* common/linkat.py, and
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
The goal is to make the successful statsd buckets
(e.g. "object-server.GET.timing") have timing information for all the
requests that the server handled correctly, while the error buckets
(e.g. "object-server.GET.errors.timing") have the rest.
Currently, we don't do that great a job of it. We special-case a few
4xx status codes (404, 412, 416) to not count as errors, but we leave
some pretty large holes. If you're graphing errors, you'll see spikes
when a client is sending bogus requests (400) or failing to
re-authenticate (403). You'll also see spikes when your drives are
unmounted (507) and when there's bugs that need fixing (500).
This commit makes .errors.timing be just 5xx in the hope that its
graph will be more useful.
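A minimal sketch of the new classification: only 5xx responses feed the
.errors.timing buckets, everything else counts toward the success timing
bucket. Bucket names follow the examples above; the function is
illustrative.

def is_error_status(status_int):
    return 500 <= status_int <= 599

for status in (200, 404, 412, 416, 400, 403, 500, 507):
    bucket = ('object-server.GET.errors.timing' if is_error_status(status)
              else 'object-server.GET.timing')
    print(status, '->', bucket)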
Change-Id: I92b41bcbb880c0688c37ab231c19ebe984b18215
Since Python 2.7, unittest in the standard library has included multiple
facilities for skipping tests, via decorators as well as an exception.
Switch to those directly, rather than importing nose.
Change-Id: I4009033473ea24f0d0faed3670db844f40051f30
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
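A hedged sketch of the idea: store a checksum of the serialized metadata
alongside it and verify it on read, so corrupted xattrs can be detected
and the object quarantined. The function names, key and checksum choice
here are assumptions, not the patch's exact on-disk format.

import hashlib
import pickle

def pack_metadata(metadata):
    blob = pickle.dumps(metadata, protocol=2)
    return blob, hashlib.md5(blob).hexdigest()

def unpack_metadata(blob, stored_checksum):
    if hashlib.md5(blob).hexdigest() != stored_checksum:
        raise ValueError('metadata checksum mismatch; quarantine candidate')
    return pickle.loads(blob)

blob, checksum = pack_metadata({'ETag': 'd41d8cd98f00b204e9800998ecf8427e'})
print(unpack_metadata(blob, checksum))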
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be on a
filesystem that supports at least 4k of extended attributes. Neither
ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
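A sketch of the sort of probe skip_if_no_xattrs() implies: try to set an
xattr on a temp file and skip the test if the filesystem refuses. This
uses the stdlib os.setxattr (Python 3 / Linux) for illustration; the real
helper may differ.

import os
import tempfile
import unittest

def xattrs_supported():
    fd, path = tempfile.mkstemp()
    try:
        os.setxattr(path, b'user.swift.test', b'1')
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(path)

def skip_if_no_xattrs():
    if not xattrs_supported():
        raise unittest.SkipTest('xattrs not supported in TMPDIR')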
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
Keymaster middleware does some nice input validation on
base64-encoded strings; pull that out somewhere common so
other things (like SLOs with inlined data) can use it, too.
Change-Id: I3436bf3724884fe252c6cb603243c1195f67b701
Currently if limit=0 is passed to lock_path then the method will time
out and never acquire a lock which is reasonable but not useful. This
is a potential pitfall given that in other contexts, for example
diskfile's replication_lock, a concurrency value of 0 has the meaning
'no limit'. It would be easy to erroneously assume that the same
semantic holds for lock_path.
To avoid that pitfall, this patch makes it an error to pass limit<1 to
lock_path.
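A minimal sketch of the guard this describes: reject limits below 1 up
front instead of silently never acquiring a lock. Illustrative only.

def validate_lock_limit(limit):
    # limit < 1 can never succeed, so fail fast with a clear error
    if limit < 1:
        raise ValueError('limit must be greater than or equal to 1')
    return limit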
Related-Change: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
Change-Id: I9ea7ee5b93e3d6924bff9790141b107b53f77883
use assertRaises where applicable; use the with_tempdir
decorator
Related-Change: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
Change-Id: Ie83584fc22a5ec6e0a568e39c90caf577da29497
This commit replaces boolean replication_one_per_device by an integer
replication_concurrency_per_device. The new configuration parameter is
passed to utils.lock_path(), which now accepts as an argument a limit on
the number of locks that can be acquired for a specific path.
Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
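A hedged sketch of that scheme: instead of one path/.lock, try
path/.lock-0 .. path/.lock-(N-1) and succeed as soon as any one of them
can be locked. Simplified (no retry/timeout loop), and not the actual
utils.lock_path implementation.

import fcntl
import os
from contextlib import contextmanager

@contextmanager
def lock_path_sketch(path, limit=1):
    fds = []
    acquired = None
    try:
        for i in range(limit):
            fd = os.open(os.path.join(path, '.lock-%d' % i),
                         os.O_WRONLY | os.O_CREAT)
            fds.append(fd)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                acquired = fd
                break
            except IOError:
                continue  # this slot is held; try the next one
        if acquired is None:
            raise RuntimeError('could not acquire any of %d locks' % limit)
        yield
    finally:
        for fd in fds:
            os.close(fd)  # closing the fd releases any flock it held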
Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
add more assertions about args that are passed to
os module functions
Related-Change: Ida15e72ae4ecdc2d6ce0d37bd99c2d86bd4e5ddc
Change-Id: Iee483274aff37fc9930cd54008533de2917157f4
Drop the group comparison from drop_privileges test as it isn't
valid since os.setgroups() is mocked.
Change-Id: Ida15e72ae4ecdc2d6ce0d37bd99c2d86bd4e5ddc
Closes-Bug: #1724342
The object server runs certain IO-intensive methods outside the main
pthread for performance. If one of those methods tries to log, this can
cause a crash that eventually leads to an object server with hundreds
or thousands of greenthreads, all deadlocked.
The short version of the story is that logging.SysLogHandler has a
mutex which Eventlet monkey-patches. However, the monkey-patched mutex
sometimes breaks if used across different pthreads, and it breaks in
such a way that it is still considered held. After that happens, any
attempt to emit a log message blocks the calling greenthread forever.
The fix is to use a mutex that works across different greenlets and
across different pthreads. This patch introduces such a lock based on
an anonymous pipe.
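A hedged sketch of the pipe idea: acquire by reading the single token byte
from the pipe, release by writing it back. This is simplified -- the real
fix also has to track ownership and cooperate with eventlet so a blocked
greenthread does not stall its whole pthread.

import os

class PipeMutexSketch(object):
    def __init__(self):
        self.rfd, self.wfd = os.pipe()
        os.write(self.wfd, b'-')  # one token in the pipe -> unlocked

    def acquire(self):
        os.read(self.rfd, 1)      # blocks until the token is available

    def release(self):
        os.write(self.wfd, b'-')  # put the token back

    def close(self):
        os.close(self.rfd)
        os.close(self.wfd)

m = PipeMutexSketch()
m.acquire()
m.release()
m.close()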
Change-Id: I57decefaf5bbed57b97a62d0df8518b112917480
Closes-Bug: 1710328
Calling dump_recon_cache with a key mapped to an empty dict value
causes the key to be removed from the cache entry. Doing the same
again causes the key to be added back, mapped to an empty dict, and
the key continues to toggle as calls are repeated. This behavior is
seen in the Related-Bug report.
This patch fixes dump_recon_cache to make deletion of a key
idempotent. This fix is needed for the Related-Change which makes use
of empty dicts with dump_recon_cache to clear unwanted keys from the
cache.
The only caller that currently set empty dict values is
obj/auditor.py where the current intended behavior would appear to be
as per this patch.
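A minimal sketch of the intended merge semantics: an empty dict value
always deletes the key, and deleting an already-absent key is a no-op.
Illustrative only; the real function also handles nesting and the recon
file I/O.

def merge_recon(cache, updates):
    for key, value in updates.items():
        if value == {}:
            cache.pop(key, None)   # idempotent delete
        else:
            cache[key] = value
    return cache

cache = {'object_auditor_stats_ALL': {'passes': 1}}
merge_recon(cache, {'object_auditor_stats_ALL': {}})  # key removed
merge_recon(cache, {'object_auditor_stats_ALL': {}})  # still absent, no toggle
print(cache)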
Related-Change: I28925a37f3985c9082b5a06e76af4dc3ec813abe
Related-Bug: #1704858
Change-Id: If9638b4e7dba0ec2c7bd95809cec6c5e18e9301e
If swift-recon/swift-get-nodes/swift-object-info is used with the
swiftdir option they will read rings from the given directory; however
they are still using /etc/swift/swift.conf to find the policies on the
current node.
This makes it impossible to maintain a local swift.conf copy (if you
don't have write access to /etc/swift) or check multiple clusters from
the same node.
Until now, swift-recon was also not usable with storage policy aliases;
this patch fixes that as well.
Closes-Bug: 1577582
Closes-Bug: 1604707
Closes-Bug: 1617951
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Change-Id: I13188d42ec19e32e4420739eacd1e5b454af2ae3
This patch adds methods to increase the partition power of an existing
object ring without downtime for users, using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then create a
hard link in the new location to the original DiskFile object (see the
sketch below). Already existing data will be relinked into the new
locations, again using hard links, by a new relinker tool.
2. The actual partition power itself will be increased. Servers will now
use the new partition power for both reads and writes. Hard links in the
old object locations that are no longer required are then removed by the
relinker tool, which reads the next_part_power setting to find object
locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
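A hedged sketch of why relinking is cheap: only the partition component
of the on-disk path changes, so the new location can be hard-linked from
the old one. The path layout and hashing below are simplified
illustrations, not Swift's exact object path computation.

import hashlib
import os
import struct

def partition(obj_hash, part_power):
    # the top part_power bits of the hash select the partition
    return struct.unpack_from('>I', obj_hash)[0] >> (32 - part_power)

def object_dir(device_path, part, hash_hex):
    return os.path.join(device_path, 'objects', str(part),
                        hash_hex[-3:], hash_hex)

name = b'/account/container/object'
digest = hashlib.md5(name).digest()
hash_hex = hashlib.md5(name).hexdigest()

old_dir = object_dir('/srv/node/d1', partition(digest, 10), hash_hex)
new_dir = object_dir('/srv/node/d1', partition(digest, 11), hash_hex)
print(old_dir)
print(new_dir)
# relinker, conceptually: os.makedirs(new_dir); os.link(old_file, new_file)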
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should simply be run and afterwards
the partition power will be increased. This shouldn't take that much
time (it's only walking the filesystem and hardlinking), so the impact
should be low. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can even be aborted manually and restarted later.
Auditors do not quarantine objects written to a path with a different
partition power and therefore work as before (though in the worst case
they read each object twice before the no longer needed hard links are
removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
Following OpenStack Style Guidelines:
[1] http://docs.openstack.org/developer/hacking/#unit-tests-and-assertraises
[H203] Unit test assertions tend to give better messages for more specific
assertions. As a result, assertIsNone(...) is preferred over
assertEqual(None, ...) and assertIs(..., None)
Change-Id: If4db8872c4f5705c1fff017c4891626e9ce4d1e4
The ismount_raw method does not work inside containers if disks are
mounted on the host system and only the mountpoints are exposed inside
the containers. In this case the inode and device checks fail, making
this option unusable.
Mounting devices into the containers would solve this. However, this
would require that all processes that need access to a device run
inside the same container, which counteracts the container concept.
This patch adds the possibility to place stub files named ".ismount" in
the root directory of any device; Swift then assumes a given device to
be mounted if that file exists. This should be transparent to existing
clusters.
clusters.
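A hedged sketch of the fallback described above: treat a device as
mounted if a ".ismount" stub file exists in its root, otherwise fall back
to a regular mountpoint check. Illustrative, not the exact ismount logic.

import os

def is_mounted(device_root):
    if os.path.isfile(os.path.join(device_root, '.ismount')):
        return True  # stub file placed by the operator inside the container
    return os.path.ismount(device_root)

print(is_mounted('/srv/node/d1'))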
Change-Id: I9d9fc0a4447a8c5dd39ca60b274c119af6b4c28f
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
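A minimal sketch of the ergonomic win: a class method that returns "now",
instead of every caller writing Timestamp(time.time()). The class here is
a stand-in, not Swift's actual Timestamp.

import time

class TimestampSketch(float):
    @classmethod
    def now(cls):
        return cls(time.time())

print(TimestampSketch.now())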
Change-Id: I3581c635c094a8c4339e9b770331a03eab704074