The DaemonStrategy class calls the Daemon.is_healthy() method every 0.1
seconds to ensure that all workers are running as intended.
On the object replicator/reconstructor daemons, is_healthy() checks whether
the rings have changed to decide if workers must be created/killed. With
large rings, this operation can be CPU intensive, especially on low-end CPUs.
This patch:
- increases the check interval to 5 seconds by default, because none of
these daemons is performance-critical (they are not in the data path),
while still allowing each daemon to override this value if necessary
- ensures that, before recomputing all devices in the ring, the object
replicator/reconstructor checks that the ring really changed (by comparing
the mtime of the ring.gz files)
On an Atom N2800 processor, this patch reduced the CPU usage of the main
object replicator/reconstructor from 70% of a core to 0%.
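A minimal sketch of the mtime-based check from the second bullet above
(hypothetical class; the daemons keep equivalent state internally):

    import os

    class RingChangeDetector(object):
        """Track ring.gz mtimes and report whether any ring file changed."""

        def __init__(self, ring_paths):
            self.mtimes = {path: self._mtime(path) for path in ring_paths}

        @staticmethod
        def _mtime(path):
            try:
                return os.path.getmtime(path)
            except OSError:
                return None

        def rings_changed(self):
            changed = False
            for path, old_mtime in self.mtimes.items():
                new_mtime = self._mtime(path)
                if new_mtime != old_mtime:
                    self.mtimes[path] = new_mtime
                    changed = True
            return changed

Only when rings_changed() returns True does the daemon need to do the
expensive per-device computation.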
Change-Id: I2867e2be539f325778e2f044a151fd0773a7c390
unittest2 was needed only for Python versions <= 2.6, so it hasn't been
needed for quite some time. See the unittest2 note at:
https://docs.python.org/2.7/library/unittest.html
This drops unittest2 in favor of the standard unittest module.
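The mechanical change is just dropping the conditional import; an
illustrative before/after (not a specific Swift file):

    # Before (only needed on Python <= 2.6):
    #     import unittest2 as unittest
    # After: the stdlib module has everything we use on 2.7+
    import unittest

    class TestExample(unittest.TestCase):
        def test_truth(self):
            self.assertTrue(True)

    if __name__ == '__main__':
        unittest.main()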
Change-Id: I2e787cfbf1709b7f9c889230a10c03689e032957
Signed-off-by: Sean McGinnis <sean.mcginnis@gmail.com>
Since we no longer use 404s from handoffs, we also need to keep errors on
handoffs from overwhelming primary responses.
Change-Id: I2624e113c9d945542f787e5f18f487bd7be3d32e
Closes-Bug: #1857909
Mostly this amounts to:
Exception.message -> Exception.args[0]
'...' -> b'...'
StringIO -> BytesIO
makefile() -> makefile('rwb')
iter.next() -> next(iter)
bytes[n] -> bytes[n:n + 1]
integer division
Note that the versioning tests are mostly untouched; they seemed to get
a little hairy.
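For reference, the patterns listed above look roughly like this under py3
(illustrative snippets, not code lifted from this change):

    import io

    # Exception.message -> Exception.args[0]
    try:
        raise ValueError('boom')
    except ValueError as err:
        msg = err.args[0]

    # '...' -> b'...' and StringIO -> BytesIO for wire data
    buf = io.BytesIO(b'payload')

    # iter.next() -> next(iter)
    it = iter([1, 2, 3])
    first = next(it)

    # bytes[n] -> bytes[n:n + 1]: indexing bytes yields an int on py3
    data = b'abc'
    first_byte = data[0:1]          # b'a', not 97

    # integer division: use // where an int result is required
    half = 7 // 2                   # 3, not 3.5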
Change-Id: I167b5375e7ed39d4abecf0653f84834ea7dac635
This started with ShardRanges and its CLI. The sharder is at the
bottom of the dependency chain. Even the container backend needs it.
Once we started tinkering with the sharder, it all snowballed to
include the rest of the container services.
Beware, this does affect some of the Python 2 code. Mostly it's trivial
and obviously correct, but it needs checking by reviewers.
About killing the stray "from __future__ import unicode_literals":
we do not use it in general. The specific problem it caused was
a failure of functional tests because unicode leaked into a field
that was supposed to be encoded. It is just too hard to track the
types when rules change from file to file, so off with its head.
Change-Id: Iba4e65d0e46d8c1f5a91feb96c2c07f99ca7c666
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.
Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left & right nodes we now select a third node
from the far side of the ring. If any of these partners responds
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.
To prevent ssync (which is uninterruptible) from receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive. In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.
Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until it is remounted or
rebalanced. After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
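A rough sketch of the partner selection described above, assuming
primary_nodes is the ordered list of primaries for a partition
(hypothetical helper; the real logic lives in the reconstructor):

    def sync_partners(primary_nodes, node_index):
        """Left and right neighbours plus one node from the far side of
        the ring; if any of them reports itself unmounted (507), the
        fragment for that index is rebuilt to a handoff instead."""
        length = len(primary_nodes)
        left = primary_nodes[(node_index - 1) % length]
        right = primary_nodes[(node_index + 1) % length]
        far = primary_nodes[(node_index + length // 2) % length]
        return [left, right, far]

    # e.g. with 8 primaries, node 0 syncs with nodes 7, 1 and 4
    nodes = [{'index': i} for i in range(8)]
    print([n['index'] for n in sync_partners(nodes, 0)])    # [7, 1, 4]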
Closes-Bug: #1510342
Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
This would cause some weird issues where get_more_nodes() would actually
yield something, despite us having only two drives.
Change-Id: Ibf658d69fce075c76c0870a542348f220376c87a
When looking at the tuple thing, remember that tuples are comparable
with ints in py2, but they do not compare like (100)[0]. Instead, they
are always greater, acting almost like MAX_INT. No wonder py3 banned
that practice.
We had a bunch of other noncomparables, but we dealt with them in
obvious ways, like adding key= to sorted().
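A small illustration of the py2/py3 difference and the kind of fix applied
(not code from this change):

    # py2: any tuple compares greater than any int, so sorted() "works"
    # but the ordering is surprising.  py3 raises TypeError for such
    # comparisons, so we supply an explicit key instead:
    mixed = [3, (1, 'a'), 2]
    ordered = sorted(mixed,
                     key=lambda item: item[0] if isinstance(item, tuple)
                     else item)
    # ordered == [(1, 'a'), 2, 3]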
Change-Id: I52e96406c3c1f39b98c1d81bdc057805cd1a6278
Consider a client that's downloading a large replicated object of size
N bytes. If the object server process dies (e.g. with a segfault)
partway through the download, the proxy will have read fewer than N
bytes, and then read(sockfd) will start returning 0 bytes. At this
point, the proxy believes the object download is complete, and so the
WSGI server waits for a new request to come in. Meanwhile, the client
is waiting for the rest of their bytes. Until the client times out,
that socket will be held open.
The fix is to look at the Content-Length and Content-Range headers in
the response from the object server, then retry with another object
server in case the original GET is truncated. This way, the client
gets all the bytes they should.
Note that ResumingGetter already had retry logic for when an
object-server is slow to send bytes -- this extends it to also cover
unexpected disconnects.
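A minimal sketch of the completeness check (hypothetical helper; the real
resuming logic also has to account for ranged GETs via Content-Range):

    def response_complete(headers, bytes_read):
        """Return True when the backend sent every byte it promised."""
        content_length = headers.get('Content-Length')
        if content_length is None:
            return True           # nothing to compare against
        return bytes_read >= int(content_length)

    # A dead object server that stops sending after 1024 of 4096 bytes:
    assert not response_complete({'Content-Length': '4096'}, 1024)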
Change-Id: Iab1e07706193ddc86832fd2cff0d7c2cb6d79ad9
Related-Change: I74d8c13eba2a4917b5a116875b51a781b33a7abf
Closes-Bug: 1568650
Previously, O_TMPFILE support was detected by checking the kernel
version, as the support was officially introduced in XFS in 3.15.
The problem is that RHEL has backported the support to at least
RHEL 7.6, but the kernel version was not updated.
This patch changes the detection so that O_TMPFILE support is determined
by actually attempting to open a file with the O_TMPFILE flag, and keeps
the result cached in DiskFileManager so that the check only happens once
while the process is running.
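A reduced sketch of the probe-and-cache idea, using a hypothetical class
(the real check lives in DiskFileManager):

    import errno
    import os

    class TmpFileProber(object):
        def __init__(self, device_dir):
            self.device_dir = device_dir
            self._supported = None          # cached after the first probe

        def o_tmpfile_supported(self):
            if self._supported is None:
                self._supported = self._probe()
            return self._supported

        def _probe(self):
            flag = getattr(os, 'O_TMPFILE', None)
            if flag is None:
                return False                # too-old Python or non-Linux
            try:
                fd = os.open(self.device_dir, os.O_WRONLY | flag)
            except OSError as err:
                if err.errno in (errno.EINVAL, errno.EISDIR,
                                 errno.EOPNOTSUPP):
                    return False            # kernel or filesystem says no
                raise
            os.close(fd)
            return True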
Change-Id: I3599e2ab257bcd99467aee83b747939afac639d8
This does not do anything about replicator or other daemons,
only ports the server.
- wsgi_to_bytes(something.get('X-Foo')) assumes that None is possible
  (see the sketch below)
- Dunno if del-in-for-in-dict calls for clone or list(), using list()
- Fixed the zero-copy with EventletPlungerString(bytes)
- Yet another reminder that Request.blank() takes a WSGI string path
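The first point above amounts to something like this (simplified stand-in,
not the real helper):

    def wsgi_to_bytes(wsgi_str):
        # WSGI "native" strings carry latin-1 code points on py3; tolerate
        # None so wsgi_to_bytes(headers.get('X-Foo')) is safe when the
        # header is absent.
        if wsgi_str is None:
            return None
        return wsgi_str.encode('latin-1')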
Curiously enough, the sleep(0) before checking for logging was
already present in the tests. The py3 scheduling merely forces us
to do it consistently throughout.
Change-Id: I203a54fddddbd4352be0e6ea476a628e3f747dc1
These are trivial, but need to be done sooner or later.
About the isEnabledFor, our FakeLogger causes this on py3:
AttributeError: 'FakeLogger' object has no attribute '_cache'
Adding isEnabledFor short-circuits the need for that private member.
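A minimal sketch of the short-cut (the real FakeLogger does far more than
this):

    import logging

    class FakeLogger(logging.Logger):
        def __init__(self):
            logging.Logger.__init__(self, 'fake')

        def isEnabledFor(self, level):
            # Bypass the level/_cache machinery entirely; in tests we
            # always want records emitted so they can be asserted on.
            return True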
Change-Id: I4d1df857a24801fe2a396dc003719f54d099f72c
Previously, _check_node() wouldn't catch the ValueError raised when
a drive was unmounted. The error would therefore bubble up, uncaught,
and stop the shard cycle. The practical effect was that an unmounted
drive on a node would prevent sharding from happening.
This patch updates _check_node() to properly use the check_drive()
method. Furthermore, the _check_node() return value has been modified
to be more similar to what check_drive() actually returns. This
should help prevent similar errors from being introduced in the future.
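A sketch of the fixed shape of the code, with stand-ins for check_drive()
and the logger (names follow the description; the real signatures differ):

    import os

    def check_drive(root, device, mount_check):
        """Stand-in: return the device path when usable, raise ValueError
        otherwise."""
        path = os.path.join(root, device)
        if mount_check and not os.path.ismount(path):
            raise ValueError('%s is not mounted' % device)
        if not os.path.isdir(path):
            raise ValueError('%s is not a directory' % device)
        return path

    def _check_node(root, device, mount_check, logger):
        """Catch the ValueError so one unmounted drive is skipped instead
        of aborting the whole shard cycle."""
        try:
            return check_drive(root, device, mount_check)
        except ValueError as err:
            logger.warning('Skipping %s: %s', device, err)
            return None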
Closes-Bug: #1806500
Change-Id: I3da9b5b120a5980e77ef5c4dc8fa1697e462ce0d
If swift-ring-builder is erroneously given a composite builder
file, which it will fail to load, it will now print a hint
that the file is a composite builder file.
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: If4517f3b61977a7f6ca3e08ed5deb182aa87a366
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f
...in preparation for the container sharding feature.
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I4455677abb114a645cff93cd41b394d227e805de
Unit tests using O_TMPFILE rely only on the kernel version to check
for the feature. This is wrong, as some filesystems, like tmpfs, don't
support O_TMPFILE.
So, instead of checking the kernel version, this patch actually attempts
to open a file using O_TMPFILE to see whether it's supported. If not,
the test is skipped.
Change-Id: I5d652f1634b1ef940838573cfdd799ea17b8b572
Bring under test
- test/unit/cli/test_dispersion_report.py
- test/unit/cli/test_info.py and
- test/unit/cli/test_relinker.py
I've verified that swift-*-info (at least) behave reasonably under
py3, even swift-object-info when there's non-utf8 metadata on the
data/meta file.
Change-Id: Ifed4b8059337c395e56f5e9f8d939c34fe4ff8dd
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port
* cli/ringbuilder.py and
* common/linkat.py
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.
This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.
Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.
As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.
The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?
Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.
Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or more than 4k of xattrs on ext4).
References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.
THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS
With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.
So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.
`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.
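A sketch of what such a guard can look like (the real helper lives in
test.unit and works somewhat differently; this version probes the temp
dir directly using py3's os.setxattr):

    import os
    import tempfile
    import unittest

    def skip_if_no_xattrs():
        """Raise SkipTest unless the filesystem behind TMPDIR accepts a
        large extended attribute."""
        fd, path = tempfile.mkstemp()
        try:
            os.setxattr(fd, 'user.swift.test', b'x' * 4096)
        except OSError:
            raise unittest.SkipTest(
                'xattrs not supported in %s' % tempfile.gettempdir())
        finally:
            os.close(fd)
            os.unlink(path)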
The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.
Co-Authored-By: John Dickinson <me@not.mn>
Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
For test purposes (e.g. saio probetests), even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.
This behavior is consistent with the object-server's checks in diskfile.
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
This change adds a new Strategy concept to the daemon module similar to
how we manage WSGI workers. We need to leverage multiple python
processes to get the concurrency properties we need. More workers will
rebalance much faster on dense chassis with many devices.
Currently the default is still only one process, and no workers. Set
reconstructor_workers in the [object-reconstructor] section to some
whole number <= the number of devices on a node to get that many
reconstructor workers.
Each worker will operate on a different subset of disks.
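A toy illustration of carving devices into per-worker subsets (not the
actual assignment logic):

    def assign_devices(devices, reconstructor_workers):
        """Round-robin the node's devices across the configured workers."""
        buckets = [[] for _ in range(reconstructor_workers)]
        for i, dev in enumerate(devices):
            buckets[i % reconstructor_workers].append(dev)
        return buckets

    # 5 devices, 2 workers -> [['d0', 'd2', 'd4'], ['d1', 'd3']]
    print(assign_devices(['d0', 'd1', 'd2', 'd3', 'd4'], 2))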
Run-once mode works as before, but tends to update its recon drops a
little bit more.
If you change the rings, the strategy will shutdown workers and spawn
new ones.
You can kill the worker pids and the daemon strategy will respawn them.
New per-disk reconstructor stats are dumped to recon under the
object_reconstruction_per_disk key. To maintain legacy compatibility
and replication monitoring based on cycle times they are aggregated
every stats_interval (default 5 mins).
Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe
This patch adds methods to increase the partition power of an existing
object ring without downtime for users, using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then also
create a hard link in the new location to the original DiskFile object.
Already existing data will be relinked into the new locations by a new
tool, using hard links.
2. The actual partition power itself will be increased. Servers will now
use the new partition power for reads and writes. Hard links in the old
object locations that are no longer required are then removed by the
relinker tool; the relinker reads the next_part_power setting to find
object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead, the object-replicator is now aware that a partition power
increase is in progress and will skip replication of data in that
storage policy; the relinker tool simply has to be run, and afterwards
the partition power is increased. This shouldn't take much time (it's
only walking the filesystem and hardlinking), so the impact should be
low. The relinker should be run on all storage nodes at the same time,
in parallel, to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can even be aborted manually and restarted later.
Auditors do not quarantine objects written to a path with a different
partition power and therefore work as before (though, in the worst case,
they read each object twice before the no-longer-needed hard links are
removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
We've got these lovely __enter__ and __exit__ methods; let's use them!
Note that this also changes how we patch classes' setUp methods so we
don't set self._orig_POLICIES when the class is already patched. I
hope this may fix some sporadic failures that include tracebacks
that look like
proxy ERROR: ERROR 500 Traceback (most recent call last):
File ".../swift/obj/server.py", line 1105, in __call__
res = getattr(self, req.method)(req)
File ".../swift/common/utils.py", line 1626, in _timing_stats
resp = func(ctrl, *args, **kwargs)
File ".../swift/obj/server.py", line 880, in GET
policy=policy, frag_prefs=frag_prefs)
File ".../swift/obj/server.py", line 211, in get_diskfile
return self._diskfile_router[policy].get_diskfile(
File ".../swift/obj/diskfile.py", line 555, in __getitem__
return self.policy_to_manager[policy]
KeyError: ECStoragePolicy(...)
... and try to unpatch more gracefully with TestCase.addCleanup
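The generic shape of the addCleanup pattern referred to above
(illustrative, not the patch_policies code itself):

    import unittest
    from unittest import mock

    class PatchedTestCase(unittest.TestCase):
        def setUp(self):
            patcher = mock.patch('time.time', return_value=1234.0)
            patcher.start()
            # Registered cleanups run even if setUp or the test body fails
            # part-way, so the patch can never leak into other tests.
            self.addCleanup(patcher.stop)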
Change-Id: Iaa3d42ec21758b0707155878a645e665aa36696c
Often, we want the current timestamp. May as well improve the ergonomics
a bit and provide a class method for it.
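A reduced sketch of the idea (the real Timestamp in swift.common.utils
has many more features):

    import time

    class Timestamp(object):
        def __init__(self, timestamp):
            self.timestamp = float(timestamp)

        @classmethod
        def now(cls):
            return cls(time.time())

    ts = Timestamp.now()    # instead of Timestamp(time.time()) everywhere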
Change-Id: I3581c635c094a8c4339e9b770331a03eab704074
In particular, TestObjectController.test_object_delete_at_async_update
would rarely (<0.1% of the time?) fail with:
======================================================================
FAIL: test_object_delete_at_async_update
(test.unit.obj.test_server.TestObjectController)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/vagrant/swift/test/unit/obj/test_server.py", line 4826, in
test_object_delete_at_async_update
resp = req.get_response(self.object_controller)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/vagrant/swift/test/unit/__init__.py", line 1075, in
mocked_http_conn
raise AssertionError('left over status %r' % left_over_status)
AssertionError: left over status [500, 500]
-------------------- >> begin captured stdout << ---------------------
test INFO: None - - [26/Apr/2017:22:32:13 +0000] "PUT /sda1/p/a/c/o" 400
19 "-" "-" "-" 0.0003 "-" 23801 0
--------------------- >> end captured stdout << ----------------------
>> raise AssertionError('left over status %r' % [500, 500])
----------------------------------------------------------------------
Related-Bug: 1514111
Change-Id: I1af4a291fb67cf4b1829f167998a08644117a800
SSYNC is designed to limit concurrent incoming connections in order to
prevent IO contention. The reconstructor should expect remote
replication servers to fail ssync_sender when the remote is too busy.
When the remote rejects SSYNC, the reconstructor should avoid forcing
additional IO against the remote with a REPLICATE request, which causes
suffix rehashing.
Suffix rehashing via REPLICATE verbs takes two forms:
1) an initial pre-flight call to REPLICATE /dev/part will cause a remote
primary to rehash any invalid suffixes and return a map for the local
sender to compare so that a sync can be performed on any mis-matched
suffixes.
2) a final call to REPLICATE /dev/part/suf1-suf2-suf3[-sufX[...]] will
cause the remote primary to rehash the *given* suffixes even if they are
*not* invalid. This is a requirement for rsync replication because
after a suffix is synced via rsync the contents of a suffix dir will
likely have changed and the remote server needs to update its hashes.pkl
to reflect the new data.
SSYNC does not *need* to send a post-sync REPLICATE request. Any
suffixes that are modified by the SSYNC protocol will call _finalize_put
under the hood as it syncs. It is, however, not harmful and potentially
useful to go ahead and refresh hashes after an SSYNC while the inodes of
those suffixes are warm in the cache.
However, that only makes sense if the SSYNC conversation actually synced
any suffixes - if SSYNC is rejected for concurrency before it ever got
started there is no value in the remote performing a rehash. It may be
that *another* reconstructor is pushing data into that same partition
and the suffixes will become immediately invalidated.
If an ssync_sender does not successfully finish a sync, the reconstructor
should skip the REPLICATE call entirely and move on to the next
partition without causing any useless remote IO.
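A sketch of the rule, with callables standing in for the real
reconstructor plumbing:

    def sync_then_maybe_rehash(ssync_sender, replicate):
        """Only issue the trailing REPLICATE when the SSYNC conversation
        actually synced some suffixes."""
        success, synced_suffixes = ssync_sender()
        if success and synced_suffixes:
            # Refresh hashes while those suffix inodes are warm in cache.
            replicate(synced_suffixes)
        return success

    # Rejected for concurrency: nothing synced, so no REPLICATE is sent.
    sync_then_maybe_rehash(lambda: (False, []), lambda suffixes: None)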
Closes-Bug: #1665141
Change-Id: Ia72c407247e4525ef071a1728750850807ae8231
Follow up for related change:
- fix typos
- use common helper methods
- refactor some tests to reduce duplicate code
Related-Change: Idd155401982a2c48110c30b480966a863f6bd305
Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429
This patch enables efficient PUT/GET for a globally distributed cluster[1].
Problem:
Erasure coding can decrease the amount of actually stored data compared
with the replicated model. For example, ec_k=6, ec_m=3 stores only 1.5x
the original data, which is smaller than 3x replicated. However, unlike
replication, erasure coding requires the availability of at least ec_k
fragments of the total ec_k + ec_m fragments to service a read (e.g. 6
of 9 in the case above). As such, if we store an EC object in a Swift
cluster spanning 2 geographically distributed data centers with the same
volume of disks, the fragments will likely be stored evenly (about 4 and
5), so we still need to access a faraway data center to decode the
original object. In addition, if one of the data centers is lost in a
disaster, the stored objects will be lost forever, and we will have to
cry a lot. To ensure highly durable storage, you might think of adding
*more* parity fragments (e.g. ec_k=6, ec_m=10); unfortunately this
causes *significant* performance degradation due to the cost of the
mathematical calculation for erasure coding encode/decode.
How this resolves the problem:
EC Fragment Duplication extends the initial solution by adding *more*
fragments from which to rebuild an object, similar to the approach
described above. The difference is that it makes *copies* of the encoded
fragments. Experimental results[1][2] show that employing small ec_k and
ec_m gives enough performance to store/retrieve objects.
On PUT:
- Encode the incoming object with small ec_k and ec_m <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in Swift Global EC Cluster
The duplicated fragments increase pressure on existing requirements
when decoding objects to service a read request. All fragments are
stored with their X-Object-Sysmeta-Ec-Frag-Index. In this change, the
X-Object-Sysmeta-Ec-Frag-Index represents the actual fragment index
encoded by PyECLib, so there *will* be duplicates. Any time we must
decode the original object data, we must only consider ec_k fragments
that are unique according to their X-Object-Sysmeta-Ec-Frag-Index: no
duplicate X-Object-Sysmeta-Ec-Frag-Index may be used when decoding an
object, so duplicates should be expected and avoided where possible.
On GET:
This patch includes the following changes:
- Change the GET path to sort primary nodes into grouped subsets, so
that each subset includes unique fragments
- Change the reconstructor to be more aware of possibly duplicate
fragments
For example, with this change, a policy could be configured such that
swift.conf:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)
At the object-server:
node index (from object ring):  0 1 2 3 4 5  <- keep the node index for
                                                the reconstruct decision
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2  <- each object keeps the
                                                actual fragment index for
                                                the backend (PyECLib)
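The mapping in the table above can be expressed as a one-liner
(illustrative; the real code also has to juggle node index vs. backend
index in the reconstructor):

    def backend_fragment_index(node_index, ec_k, ec_m):
        # With duplication, several ring positions carry the same backend
        # (PyECLib) fragment index.
        return node_index % (ec_k + ec_m)

    # ec_k=2, ec_m=1, ec_duplication_factor=2 -> a 6-replica object ring
    # -> [0, 1, 2, 0, 1, 2]
    print([backend_fragment_index(i, 2, 1) for i in range(6)])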
Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.
1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)
Doc-Impact
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
Despite a check to prevent zero values in the denominator, Python
integer division could result in a ZeroDivisionError in the compute_eta
helper function. Make sure we always have a non-zero value, even if it
is small.
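The guard amounts to clamping the denominator, along these lines
(hypothetical helper; the real compute_eta takes different arguments):

    def safe_ratio(numerator, denominator):
        # Integer arithmetic upstream can round the denominator down to
        # zero despite earlier checks; clamp it to a tiny positive value.
        return numerator / max(float(denominator), 0.00001)

    print(safe_ratio(10, 0))    # large but finite, no ZeroDivisionError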
NotImplemented:
* stats calculation is still not great, see lp bug #1488608
Closes-Bug: #1549110
Change-Id: I54f2081c92c2a0b8f02c31e82f44f4250043d837
This patch makes the ECDiskFileReader check the validity of EC
fragment metadata as it reads chunks from disk and quarantine a
diskfile with bad metadata. This in turn means that both the object
auditor and a proxy GET request will cause bad EC fragments to be
quarantined.
This change is motivated by bug 1631144, which may result in corrupt EC
fragments being written to disk yet appearing valid to the object
auditor's md5 hash and content-length checks.
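A heavily reduced sketch of the reader-side behaviour, with the actual
metadata validation injected as a callable (the real check inspects the
fragment metadata via the EC policy's PyECLib backend):

    class FragCheckingReader(object):
        """Yield chunks, quarantining the diskfile when a fragment fails
        its metadata check (hypothetical shape, not ECDiskFileReader)."""

        def __init__(self, chunks, frag_is_valid, quarantine):
            self.chunks = chunks
            self.frag_is_valid = frag_is_valid
            self.quarantine = quarantine

        def __iter__(self):
            for chunk in self.chunks:
                if not self.frag_is_valid(chunk):
                    self.quarantine('Invalid EC fragment metadata')
                    raise IOError('quarantined while reading')
                yield chunk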
NotImplemented:
* perform metadata check when a read starts on any frag_size
boundary, not just at zero
Related-Bug: #1631144
Closes-Bug: #1633647
Change-Id: Ifa6a7f8aaca94c7d39f4aeb9d4fa3f59c4f6ee13
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
Linux 3.11 introduced O_TMPFILE as a flag to the open() sys call. This
enables users to get an fd to an unnamed temporary file. As it's unnamed,
it does not require the caller to devise unique names. It is also not
accessible through any path. Hence, file creation is race-free.
This file is initially unreachable. It is then populated with data
(write), metadata (fsetxattr) and fsync'd before being atomically linked
into the filesystem in a fully formed state using the linkat() sys call.
Only after a successful linkat() will the object file be available for
reference.
Caveats
* Unlike os.rename(), linkat() cannot overwrite the destination path if
it already exists. If the path exists, we unlink it and try again.
* XFS support for O_TMPFILE was only added in Linux 3.15.
* If the client disconnects during an object upload, although there is
no incomplete/stale file on disk, the object directory will persist and
is not cleaned up immediately.
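A minimal sketch of the flow (Linux >= 3.11, Python 3 and filesystem
support assumed; Swift itself uses a ctypes linkat() wrapper and much
fuller error handling):

    import os

    def write_then_link(dirpath, name, data):
        fd = os.open(dirpath, os.O_WRONLY | os.O_TMPFILE)  # anonymous file
        try:
            os.write(fd, data)
            os.fsync(fd)
            # Materialize the fully formed file at its final path; linking
            # the /proc/self/fd entry is one way to reach linkat() from
            # Python.
            os.link('/proc/self/fd/%d' % fd,
                    os.path.join(dirpath, name),
                    follow_symlinks=True)
        finally:
            os.close(fd)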
Change-Id: I8402439fab3aba5d7af449b5e465f89332f606ec
Signed-off-by: Prashanth Pai <ppai@redhat.com>