Object metadata is stored as a pickled hash: first the data is
pickled, then split into strings of length <= 254, then stored in a
series of extended attributes named "user.swift.metadata",
"user.swift.metadata1", "user.swift.metadata2", and so forth.
The choice of length 254 is odd, undocumented, and dates back to the
initial commit of Swift. From talking to people, I believe this was an
attempt to fit the first xattr in the inode, thus avoiding a
seek. However, it doesn't work. XFS _either_ stores all the xattrs
together in the inode (local), _or_ it spills them all to blocks
located outside the inode (extents or btree). Using short xattrs
actually hurts us here; by splitting into more pieces, we end up with
more names to store, thus reducing the metadata size that'll fit in
the inode.
[Source: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Extended_Attributes.html]
I did some benchmarking of read_metadata with various xattr sizes
against an XFS filesystem on a spinning disk, no VMs involved.
Summary:
 name  | rank | runs |   mean    |    sd     | timesBaseline
-------|------|------|-----------|-----------|--------------
 32768 |    1 | 2500 | 0.0001195 | 3.75e-05  | 1.0
 16384 |    2 | 2500 | 0.0001348 | 1.869e-05 | 1.12809122912
  8192 |    3 | 2500 | 0.0001604 | 2.708e-05 | 1.34210998858
  4096 |    4 | 2500 | 0.0002326 | 0.0004816 | 1.94623473988
  2048 |    5 | 2500 | 0.0003414 | 0.0001409 | 2.85674781189
  1024 |    6 | 2500 | 0.0005457 | 0.0001741 | 4.56648611635
   254 |    7 | 2500 | 0.001848  | 0.001663  | 15.4616067887
Here, "name" is the chunk size for the pickled metadata. A total
metadata size of around 31.5 KiB was used, so the "32768" runs
represent storing everything in one single xattr, while the "254" runs
represent things as they are without this change.
Since bigger xattr chunks make things go faster, the new chunk size is
64 KiB. That's the biggest xattr that XFS allows.
Reading of metadata from existing files is unaffected; the
read_metadata() function already handles xattrs of any size.
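For reference, a sketch of the read side, which just keeps concatenating whatever chunks are present regardless of their size (illustrative only, not the exact Swift code):

    import pickle
    import xattr

    METADATA_KEY = 'user.swift.metadata'

    def read_metadata(fd):
        # Keep reading user.swift.metadata, user.swift.metadata1, ...
        # until a key is missing, then unpickle the concatenated value.
        serialized = b''
        key = 0
        try:
            while True:
                serialized += xattr.getxattr(
                    fd, '%s%s' % (METADATA_KEY, key or ''))
                key += 1
        except (IOError, OSError):
            pass
        return pickle.loads(serialized)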
On non-XFS filesystems, this is no worse than what came before:
ext4 has a limit of one block (typically 4 KiB) for all xattrs (names
and values) taken together [1], so by cutting the number of xattr
names that have to share that block, this change slightly increases
the amount of Swift metadata that can be stored on ext4.
ZFS let me store an xattr with an 8 MiB value, so that's plenty. It'll
probably go further, but I stopped there.
[1] https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Extended_Attributes
Change-Id: Ie22db08ac0050eda693de4c30d4bc0d620e7f7d4
Running swift-ring-builder list_parts before a rebalance failed
abnormally, so this patch fixes that behavior. After this patch is
applied, the command completes normally with the following message:
Specified builder file "<builder_file>" is not rebalanced yet.
Please rebalance first.
Closes-Bug: #1399529
Change-Id: I9e5db6da85de4188915c51bc401604733f0e1b77
Replace URLs for workflow documentation with links to the appropriate
parts of the OpenStack Project Infrastructure Manual.
Change-Id: I060e5f6869fd302a47a54556f31763b5ab668012
The common db replicator's code path for reclaiming deleted DBs beyond the
reclaim age was not covered by unit tests, and an AttributeError snuck in.
While writing a test to cover the common code for both accounts and
containers, I discovered another KeyError in the container conditional that
validates the container's fully reported status.
This fixes both of those issues and adds additional tests for the cleanup
of empty account and container partition and suffix directories.
Change-Id: I2a1bfaefebd05b01231bf71dd908fcc49adb4c36
Removing the _remaining_items method from the object controller class.
The only caller of this method was removed as part of the
work to move all DLO functionality to middleware:
https://review.openstack.org/63326
Change-Id: I7fbc208746bba8142ae51bf27cfa1979cae00301
Signed-off-by: Thiago da Silva <thiago@redhat.com>
Because we iterate over these directories on each replication run,
and they were not previously cleaned up, the time to start
replication increases incrementally for each stale directory
lying around. Thousands of directories across dozens of disks
on a single machine can make for non-trivial startup times.
Plus it just seems like good housekeeping.
Closes-Bug: #1396152
Change-Id: Iab607b03b7f011e87b799d1f9af7ab3b4ff30019
We can't correctly order a Timestamp whose offset is larger than 16 hex
digits can represent, so we raise a ValueError if you try to create one.
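A minimal sketch of the constraint (illustrative; the real Timestamp class in swift.common.utils does considerably more):

    class Timestamp(object):
        # Offsets are rendered as 16 hex digits, so anything that does
        # not fit in 64 bits cannot be ordered correctly.
        MAX_OFFSET = (16 ** 16) - 1

        def __init__(self, timestamp, offset=0):
            self.timestamp = float(timestamp)
            self.offset = offset
            if self.offset > self.MAX_OFFSET:
                raise ValueError('offset must be smaller than 2**64')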
Change-Id: I8c8d4cf13785a1a8eb7416392263eae5242aa407
Commit 6978275 changed the xprofile middleware's usage of mktemp
and moved it to using tempfile. But it was clearly never tested,
because the os.close() calls never worked. This patch updates
that previous patch to use a context manager to open and close the file.
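Roughly, the pattern becomes something like this (a sketch under the assumption that the middleware only needs a safely created temporary file; profile_data is a stand-in name):

    import os
    import tempfile

    profile_data = b'...'  # stands in for the profiling output being dumped

    fd, tmp_path = tempfile.mkstemp()
    # Wrap the descriptor in a file object and let the context manager
    # close it, instead of relying on a later os.close() call.
    with os.fdopen(fd, 'wb') as tmp_file:
        tmp_file.write(profile_data)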
Change-Id: I40ee42e8539551fd8e4dfb353f50146ab40a7847
Sysmeta included with an object PUT persists with the PUT data; if an
internal operation, such as POST-as-copy during a partial failure, or ssync
with fast-POST (not supported), causes that data to be lost, then the
associated sysmeta will also be lost.
Since object sys-meta persistence across a POST when the original .data
file is unavailable requires fast-POST with .meta files, the probe test
that validates object sys-meta persistence across a POST when the most
up-to-date copy of the object with sys-meta is unavailable configures an
InternalClient with object_post_as_copy = false.
This non-default configuration option is not supported by ssync and
results in a loss of sys-meta very similar to the object sys-meta
failure you would see with object_post_as_copy = true when the COPY part
of the POST is unable to retrieve the most recently written object with
sys-meta.
Until we can fix the default POST behavior to make metadata updates
without stomping on newer data file timestamps we should expect object
sys-meta to be "very very best possible but not really guaranteed
effort".
Until we can fix ssync to replicate metadata updates without stomping on
newer data file timestamps, we should expect this test to fail.
When ssync replication of fast-POST metadata updates is fixed, this test
will fail, signaling that the expected-failure cruft should be removed;
but other parts of ssync replication will still work, and some other bugs
can be fixed while we wait.
Change-Id: Ifc5d49514de79b78f7715408e0fe0908357771d3
The container quota middleware is not currently checking the
Destination-Account header, which could cause quotas not to be enforced
in the case of cross-account copies.
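A rough sketch of the kind of lookup involved (a hypothetical helper, not the middleware's actual code): the account whose quota applies is the copy destination, which is the request's own account unless Destination-Account says otherwise.

    def copy_destination_account(headers, path):
        # For a COPY, the quota that matters is the destination
        # account's: the Destination-Account header if present,
        # otherwise the account from the request path
        # /v1/<account>/<container>/<object>.
        dest_account = headers.get('Destination-Account')
        if dest_account:
            return dest_account.lstrip('/')
        return path.split('/', 3)[2]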
Change-Id: I43adb0d7d2fc14ba6c0ca419a52a5c3f138f799a
Signed-off-by: Thiago da Silva <thiago@redhat.com>
The account and container replicators check connection generation and
timeout for the HTTP REPLICATE request in _repl_to_node, but this does
not actually check the connection; it only checks the construction of
the ReplConnection class.
This patch removes that invalid check.
Change-Id: Ie6b4062123d998e69c15638b741e7d1ba8a08b62
Closes-Bug: #1359018
While investigating bug 1375348, I discovered that the problem
reported there was not limited to the object-auditor; the
object-updater has similar bugs.
This patch catches the unhandled exception that can be thrown
by os.listdir if the self.devices directory is inaccessible.
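The defensive pattern is roughly the following (a sketch, not the updater's exact code):

    import os

    def _listdir(path, logger):
        # Treat an inaccessible directory as empty instead of letting
        # the OSError from os.listdir() kill the daemon.
        try:
            return os.listdir(path)
        except OSError as err:
            logger.error('ERROR: Unable to access %(path)s: %(error)s',
                         {'path': path, 'error': err})
            return []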
Change-Id: I6293b840916bb63cf9eebbc05068d9a3c871bdc3
Related-bug: 1375348
os.listdir returns a list of items. The test case had been
written to return a single item, which, though it did not really change
the result of the test, was not the best approach.
This patch updates the test case to return a list instead of a single
item.
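For illustration (assuming the test stubs os.listdir with mock; the names here are only examples):

    from unittest import mock

    # os.listdir really returns a list, so the stub should too;
    # returning a bare string only appears to work because strings
    # happen to be iterable.
    with mock.patch('os.listdir', return_value=['sda1']):
        pass  # exercise the code under test here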
Change-Id: I793e0636440c0de0ca339c6592adec3e8b4ee1b4
Better isolation and consistency for in-process functional tests by always
using FakeMemcache. If you want to test real memcache, you have the real
functional tests.
Change-Id: Ic483f794e122130bd7694c9a5f9a2b1cd0b9a653
Following the discussion at https://review.openstack.org/#/c/129384/,
move this content to the doc directory in the swift repo.
This lets us eliminate the object-api repo along with all the <service>-
api repos and move content to audience-centric locations.
Change-Id: Ia0d9973847f7409a02dcc1a0e19400a3c3ecdf32
Noticed that the slo and dlo middleware were placed before
tempauth; they should be placed after it.
DocImpact
Change-Id: Ia931e2280125d846f248b23e219aebad14c66210
Signed-off-by: Thiago da Silva <thiago@redhat.com>
After the release of Swift 2.0.0, some recon responses do not
show each policy's information yet. To make things worse, some recon
results only count policy-0, so the total across policies is not
shown in the recon results.
With this patch, the async_pending count in recon results becomes
policy-aware. Suppose the number of async_pending files for policy-0 is 2
and the number for policy-1 is 3; recon sums up every policy's amount
as follows:
$ curl http://<host>:<port>/recon/async
{"async_pending": 5} # It showed 2 before this commit
Related-Bug: 1375332
Change-Id: Ifc88b8c9e06b9f022a926a87ed807e938e1e0412
This patch was first motivated by noticing that the proxy
server pipeline used for in-process functional tests was
out of date with respect to the pipeline in
/etc/proxy-server.conf.sample. Rather than cut and paste
the current pipeline into the in-process setup, it seemed
like a better idea to have the in-process tests
always use the sample config.
A further benefit is that in-process functional tests will
pick up changes to the sample config introduced by patches;
previously, test/functional/__init__.py would have needed to be
manually modified to run in-process functional tests
against new middleware, for example.
Note: because the pipeline is now loaded using entry points,
'python setup.py [develop|install]' will now be needed
before running the tests.
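Conceptually, the in-process setup now does something along these lines (a sketch using paste.deploy directly; the real test setup differs in its details):

    from paste.deploy import loadapp

    # Build the proxy app straight from the sample config so the test
    # pipeline always matches etc/proxy-server.conf-sample; the
    # egg:swift#... entry points in that file are resolved via
    # setup.py's entry_points, hence the install/develop requirement.
    app = loadapp('config:etc/proxy-server.conf-sample',
                  relative_to='.')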
Obvious next steps would be to do the same for the backend
servers, and to allow alternative config files and dirs
to be specified, but this patch is the first step.
Also, drive-by fixes for some typos in proxy-server.conf.sample.
Change-Id: If442bd7c2b1721ec92839c4490924ba33e1545d8