The sharder's find_compactible_shard_sequences function was vulnerable
to looping indefinitely with some combinations of the shrink_threshold and
merge_size parameters. The inner loop might not consume a shard range, resulting
in the same shard range being submitted to the inner loop again.
This patch simplifies the function in an attempt to make it more
obvious that the loops are always making progress towards termination
by consuming shard ranges from the list.
Change-Id: Ia87ab6feaf5172d91f1c60c2e0f72e03182e3c9b
Calls to process_compactible_shard_sequences were always followed by
calls to finalize_shrinking, so simplify the sharder module interface
by making process_compactible_shard_sequences call finalize_shrinking.
Change-Id: I22b8d23f32a5e776c37f711a913e4b40425d5e54
This patch adds shrinking candidates to the sharding recon dump.
Shrinking candidates will always be SHARDED root containers, and
get added to the candidate list if they have any ranges that are
compactible, that is to say ranges that can be compacted into
an upper neighbour.
The shrinking_candidates data comes out something like:
    {
        'found': 1,
        'top': [
            {
                'object_count': <some number>,
                'account': 'a',
                'meta_timestamp': <ts1>,
                'container': 'c',
                'file_size': <something>,
                'path': <something>,
                'root': <something>,
                'node_index': 0,
                'compactible_ranges': 2
            }]
    }
In this case 'compactible_ranges' is the number of donors that can be shrunk
in a single command.
Change-Id: I63fc9ae39e164c2ce82865d055527b52c86b5b2a
If a sequence of shard ranges is already shrinking then in some
circumstances we do not want to report it as a candidate for
shrinking. For backwards compatibility, allow already-shrinking
sequences to be optionally included in the return value of
find_compactible_shard_sequences.
Also refactor to add an is_shrinking_candidate() function.
Change-Id: Ifa20b7c08aba7254185918dfcee69e8206f51cea
Fix the find_compactible_shard_sequences function to prevent skipping
a shard range after finding a sequence of compactible shard ranges
that approaches the merge size.
Previously a compactible sequence would correctly terminate on the nth
shard range if the n+1th shard range would take the object count over
the merge_size, but the n+1th shard range would then be skipped and
not considered for the start of the next sequence.
Change-Id: I670441e7426b28ab2247563c7fa854d1cd502316
This method was previously only indirectly covered by
test_manage_shard_ranges. Add unit tests in test_sharder.
Change-Id: I9d0403f6dfa7a988e79f79a38ff713d05476cb84
This patch adds a 'compact' command to swift-manage-shard-ranges that
enables sequences of contiguous shards with low object counts to be
compacted into another existing shard, or into the root container.
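An illustrative invocation (the db path is a placeholder; see the command's
--help for the full set of options):

    swift-manage-shard-ranges <path_to_container_db> compact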
Change-Id: Ia8f3297d610b5a5cf5598d076fdaf30211832366
Shard containers learn about their own shard range by fetching shard
ranges from the root container during the sharder audit phase. Since
[1], if the shard is shrinking, it may also learn about acceptor
shards in the shard ranges fetched from the root. However, the
fetched shard ranges do not currently include the root's own shard
range, even when the root is to be the acceptor for a shrinking shard.
This prevents the mechanism from being used to perform shrinking to the root.
This patch modifies the root container behaviour to include its own
shard range in responses to shard containers when the container GET
request param 'states' has value 'auditing'. This parameter is used to
indicate that a particular GET request is from the sharder during
shard audit; the root does not otherwise include its own shard range
in GET responses.
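A minimal sketch of the audit-time fetch from the root (only the
'states=auditing' parameter is defined by this change; the record-type
header and the commented request call are assumed for illustration):

    params = {'format': 'json', 'states': 'auditing'}
    headers = {'X-Backend-Record-Type': 'shard'}
    # e.g. internal_client.make_request('GET', root_path, headers,
    #                                   acceptable_statuses=(2,), params=params)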
When the 'states=auditing' parameter is used with a container GET
request the response includes all shard ranges except those in the
FOUND state. The shard ranges of relevance to a shard are its own
shard range and any overlapping shard ranges that may be acceptors if
the shard is shrinking. None of these relevant shard ranges should be
in state FOUND: the shard itself cannot be in FOUND state since it has
been created; acceptor ranges should not be in FOUND state. The FOUND
state is therefore excluded from the 'auditing' states to prevent an
unintended overlapping FOUND shard range that has not yet been
resolved at the root container being fetched by a shrinking shard,
which might then proceed to create and cleave to it.
The shard only merges the root's shard range (and any other shard
ranges) when the shard is shrinking. If the root shard range is ACTIVE
then it is the acceptor and will be used when the shard cleaves. If
the root shard range is in any other state then it will be ignored
when the shard cleaves to other acceptors.
The sharder cleave loop is modified to break as soon as cleaving is
done, i.e. once cleaving has been completed up to the shard's upper bound.
This prevents misleading logging that cleaving has stopped when
in fact cleaving to a non-root acceptor has completed but the shard
range list still contains an irrelevant root shard range in SHARDED
state. This also prevents cleaving to more than one acceptor in the
unexpected case that multiple active acceptors overlap the shrinking
shard - cleaving will now complete once the first acceptor has
cleaved.
[1] Related-Change: I9034a5715406b310c7282f1bec9625fe7acd57b6
Change-Id: I5d48b67217f705ac30bb427ef8d969a90eaad2e5
We don't normally issue any DELETEs to shards when an empty root accepts
a DELETE from the client. If we allow root dbs to reclaim while they
still have shards we risk letting undeleted shards get orphaned.
Partial-Bug: 1911232
Change-Id: I4f591e393a526bb74675874ba81bf743936633c1
Use predictable timestamp iterators to avoid some sharder audit tests
intermittently failing due to timestamps not advancing as assumed.
Change-Id: I1caea0925a6e9a853c7d6d7ad23c27bd37c5056f
Shard shrinking can be instigated by a third party modifying shard
ranges, moving one shard to shrinking state and expanding the
namespace of one or more other shard(s) to act as acceptors. These
state and namespace changes must propagate to the shrinking and
acceptor shards. The shrinking shard must also discover the acceptor
shard(s) into which it will shard itself.
The sharder audit function already updates shards with their own state
and namespace changes from the root. However, there is currently no
mechanism for the shrinking shard to learn about the acceptor(s) other
than by a PUT request being made to the shrinking shard container.
This patch modifies the shard container audit function so that other
overlapping shards discovered from the root are merged into the
audited shard's db. In this way, the audited shard will have acceptor
shards to cleave to if shrinking.
This new behavior is restricted to when the shard is shrinking. In
general, a shard is responsible for processing its own sub-shard
ranges (if any) and reporting them to root. Replicas of a shard
container synchronise their sub-shard ranges via replication, and do
not rely on the root to propagate sub-shard ranges between shard
replicas. The exception to this is when a third party (or
auto-sharding) wishes to instigate shrinking by modifying the shard
and other acceptor shards in the root container. In other
circumstances, merging overlapping shard ranges discovered from the
root is undesirable because it risks shards inheriting other unrelated
shard ranges. For example, if the root has become polluted by
split-brain shard range management, a sharding shard may have its
sub-shards polluted by an undesired shard from the root.
During the shrinking process a shard's own shard range state may
be either shrinking or, prior to this patch, sharded. The sharded
state could occur when one replica of a shrinking shard completed
shrinking and moved the own shard range state to sharded before other
replica(s) had completed shrinking. This makes it impossible to
distinguish a shrinking shard (with sharded state), which we do want
to inherit shard ranges, from a sharding shard (with sharded state),
which we do not want to inherit shard ranges.
This patch therefore introduces a new shard range state, 'SHRUNK', and
applies this state to shard ranges that have completed shrinking.
Shards are now restricted to inherit shard ranges from the root only
when their own shard range state is either SHRINKING or SHRUNK.
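A minimal sketch of that guard (the helper name is invented for
illustration; the real check lives in the sharder's audit code):

    from swift.common.utils import ShardRange

    def may_inherit_from_root(own_shard_range):
        # only a shrinking (or shrunk) shard merges overlapping ranges
        # discovered from the root
        return own_shard_range.state in (ShardRange.SHRINKING, ShardRange.SHRUNK)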
This patch also:
- Stops overlapping shrinking shards from generating audit warnings:
overlaps are cured by shrinking and we therefore expect shrinking
shards to sometimes overlap.
- Extends an existing probe test to verify that overlapping shard
ranges may be resolved by shrinking a subset of the shard ranges.
- Adds a --no-auto-shard option to swift-container-sharder to enable the
probe tests to disable auto-sharding.
- Improves sharder logging, in particular by decrementing ranges_todo
when a shrinking shard is skipped during cleaving.
- Adds a ShardRange.sort_key class method to provide a single definition
of ShardRange sort ordering.
- Improves unit test coverage for sharder shard auditing.
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I9034a5715406b310c7282f1bec9625fe7acd57b6
md5 is not an approved algorithm in FIPS mode, and trying to
instantiate a hashlib.md5() will fail when the system is running in
FIPS mode.
md5 is allowed when in a non-security context. There is a plan to
add a keyword parameter (usedforsecurity) to hashlib.md5() to annotate
whether or not the instance is being used in a security context.
In the case where it is not, the instantiation of md5 will be allowed.
See https://bugs.python.org/issue9216 for more details.
Some downstream python versions already support this parameter. To
support these versions, a new encapsulation of md5() is added to
swift/common/utils.py. This encapsulation is identical to the one being
added to oslo.utils, but is recreated here to avoid adding a dependency.
This patch is to replace the instances of hashlib.md5() with this new
encapsulation, adding an annotation indicating whether the usage is
a security context or not.
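For example, a typical non-security usage after this change looks something
like the following (the data being hashed is illustrative):

    from swift.common.utils import md5

    # ETag-style checksumming is not a security use, so annotate it as such
    etag = md5(b'object data', usedforsecurity=False).hexdigest()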
While this patch seems large, it is really just the same change over and
over again. Reviewers need to pay particular attention to whether the
keyword parameter (usedforsecurity) is set correctly. Right now, none
of them appear to be used in a security context.
Now that all the instances have been converted, we can update the bandit
run to look for these instances and ensure that new invocations do not
creep in.
With this latest patch, the functional and unit tests all pass
on a FIPS-enabled system.
Co-Authored-By: Pete Zaitcev
Change-Id: Ibb4917da4c083e1e094156d748708b87387f2d87
A new header `X-Backend-Use-Replication-Network` is added; if true, use
the replication network instead of the client-data-path network.
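A hedged illustration of how a request opts in (only the header name and
value come from this change; the surrounding usage is assumed):

    # any backend request that should avoid the client-data-path network
    # simply includes the new header
    headers = {'X-Backend-Use-Replication-Network': 'true'}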
Several background daemons are updated to use the replication network:
* account-reaper
* container-reconciler
* container-sharder
* container-sync
* object-expirer
Note that if container-sync is being used to sync data within the same
cluster, the replication network will only be used when communicating
with the "source" container; the "destination" traffic will continue to
use the configured realm endpoint.
The direct and internal client APIs still default to using the
client-data-path network; this maintains backwards compatibility for
external tools written against them.
UpgradeImpact
=============
Until recently, servers configured with
replication_server = true
would only handle REPLICATE (and, in the case of object servers, SSYNC)
requests, and would respond 405 Method Not Allowed to other requests.
Since the daemons listed above now send ordinary requests over the
replication network, servers restricted in this way would reject that
traffic. When upgrading from Swift 2.25.0 or earlier, remove the
replication_server option and restart services prior to upgrading to avoid
a flood of background daemon errors in logs.
Note that some background daemons find work by querying Swift rather
than walking local drives, so they need access to the replication
network:
* container-reconciler
* object-expirer
Previously these may have been configured without access to the
replication network; ensure they have access before upgrading.
Closes-Bug: #1883302
Related-Bug: #1446873
Related-Change: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87
Change-Id: Ieef534bf5d5fb53602e875b51c15ef565882fbff
The existing tests cover a lot of behaviors and carry around a lot of
state, which makes them hard to extend in a descriptive manner to cover
new or changed behaviors.
Change-Id: Ie52932d8d4a66b11c295d5568aa3a60895b84f3b
In the previous patch, we could clean up all container DBs, but only if
the daemons went in a specific order (which cannot be guaranteed in a
production system).
Once a reclaim age passes, there's a race: If the container-replicator
processes the root container before the container-sharder processes the
shards, the deleted shards would get reaped from the root so they won't
be available for the sharder. The shard containers then hang around
indefinitely.
Now, be willing to mark shard DBs as deleted even when we can't find our
own shard range in the root. Fortunately, the shard already knows that
its range has been deleted; we don't need to get that info from the root.
Change-Id: If08bccf753490157f27c95b4038f3dd33d3d7f8c
Related-Change: Icba98f1c9e17e8ade3f0e1b9a23360cf5ab8c86b
The idea is, if none of
- timestamp,
- object_count,
- bytes_used,
- state, or
- epoch
has changed, we shouldn't need to send an update back to the root
container.
This is more-or-less comparable to what the container-updater does to
avoid unnecessary writes to the account.
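A minimal sketch of the comparison being described (the helper name is
invented; the attribute names are those listed above):

    def shard_range_update_needed(old, new):
        # only report back to the root if something meaningful changed
        return any(getattr(old, attr) != getattr(new, attr)
                   for attr in ('timestamp', 'object_count', 'bytes_used',
                                'state', 'epoch'))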
Closes-Bug: #1834097
Change-Id: I1ee7ba5eae3c508064714c4deb4f7c6bbbfa32af
The repo now supports both Python 2 and 3, so update hacking to
version 2.0, which supports both. Note that the latest hacking
release, 3.0, only supports Python 3.
Fix problems found.
Remove hacking and friends from lower-constraints, they are not needed
for installation.
Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
If we move it to constraints it's more globally accessible in our code,
but more importantly it's more obvious to ops that everything breaks if
you try to mis-configure different values per-service.
Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
Previously, if you were on Python 2.7.10+ [0], a newline in a container
name would cause the sharder to fail, complaining about invalid header
values when trying to create
the shard containers. On older versions of Python, it would most likely cause a
parsing error in the container-server that was trying to handle the PUT.
Now, quote all places that we pass around container paths. This includes:
* The X-Container-Sysmeta-Shard-(Quoted-)Root sent when creating the (empty)
remote shards
* The X-Container-Sysmeta-Shard-(Quoted-)Root included when initializing the
local handoff for cleaving
* The X-Backend-(Quoted-)Container-Path the proxy sends to the object-server
for container updates
* The Location header the container-server sends to the object-updater
Note that a new header was required in requests so that servers would
know whether the value should be unquoted or not. We can get away with
reusing Location in responses by having clients opt-in to quoting with
a new X-Backend-Accept-Quoted-Location header.
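A rough sketch of the quoting involved (the container name is illustrative
and the header plumbing in the real code differs):

    from swift.common.utils import quote

    # a newline (or other unsafe byte) in the path gets %-encoded before
    # being placed in a header such as X-Backend-Quoted-Container-Path
    quoted = quote('.shards_AUTH_test/bad\nname')  # -> '.shards_AUTH_test/bad%0Aname'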
During a rolling upgrade,
* old object-servers servicing requests from new proxy-servers will
not know about the container path override and so will try to update
the root container,
* in general, object updates are more likely to land in the root
container; the sharder will deal with them as misplaced objects, and
* shard containers created by new code on servers running old code
will think they are root containers until the server is running new
code, too; during this time they'll fail the sharder audit and report
stats to their account, but both of these should get cleared up upon
upgrade.
Drive-by: fix a "conainer_name" typo that prevented us from testing that
we can shard a container with unicode in its name. Also, add more UTF8
probe tests.
[0] See https://bugs.python.org/issue22928
Change-Id: Ie08f36e31a448a547468dd85911c3a3bc30e89f1
Closes-Bug: 1856894
When a container is being cleaved there is a possibility that we're
dealing with an empty or near-empty container created on a handoff node.
These containers may have a valid list of shard ranges, so would need
to cleave to the new shards.
Currently, when using a `cleave_batch_size` that is smaller than the
number of shard ranges on the cleaving container, these containers will
take multiple sharder passes to shard, even though there may be
nothing in them.
This is worse when a really large container is sharding and, because it is
slow, a node gets error-limited, causing a new container DB to be created
in a handoff location. This empty container would have a large number of
shard ranges and could take a _very_ long time to shard away, slowing the
process down.
This patch eliminates the issue by detecting when no objects are
returned for a shard range. The `_cleave_shard_range` method now
returns 3 possible results:
- CLEAVE_SUCCESS
- CLEAVE_FAILED
- CLEAVE_EMPTY
They are all pretty self-explanatory (a short sketch follows below). When
`CLEAVE_EMPTY` is returned
the code will:
- Log
- Not replicate the empty temp shard container sitting in a
handoff location
- Not count the shard range in the `cleave_batch_size` count
- Update the cleaving context so sharding can move forward
If there is already an existing shard range DB on a handoff node to use,
then the sharder won't skip it even if there are no objects; it'll
replicate it and treat it as normal, including using a `cleave_batch_size`
slot.
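To make the three results concrete, a hedged sketch (the constant values
and every name other than the three results are assumed for illustration):

    CLEAVE_SUCCESS, CLEAVE_FAILED, CLEAVE_EMPTY = 0, 1, 2

    def handle_cleave_result(result, batch_count):
        if result == CLEAVE_EMPTY:
            # logged, context advanced, but no replication and no batch slot
            return batch_count
        if result == CLEAVE_SUCCESS:
            return batch_count + 1  # consumes one cleave_batch_size slot
        return None  # CLEAVE_FAILED: stop and retry on a later pass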
Change-Id: Id338f6c3187f93454bcdf025a32a073284a4a159
Closes-Bug: #1839355
This is a follow-up to the patch that cleans up cleave contexts
(patch 681970). Instead of tracking a last_modified timestamp and storing
it in the context metadata, use the timestamp we already record when
storing any metadata.
Reducing duplication is nice, but there's a more significant reason to
do this: affected container DBs can start getting cleaned up as soon as
they're running the new code rather than needing to wait for an
additional reclaim_age.
Change-Id: I2cdbe11f06ffb5574e573c4a60ba4e5d41a00c50
There is a sharding edge case where more and more CleaveContexts are
generated and stored in the sharding container DB. If enough CleaveContexts
build up in the DB, as in the linked bug, this can lead to 503s when
attempting to list the container, due to all the
`X-Container-Sysmeta-Shard-Context-*` headers.
This patch resolves this by tracking each CleaveContext's last-modified
time. During the sharding audit, any context that hasn't been touched
within reclaim_age is deleted.
This, together with the skip-empty-ranges patch, should improve the
situation for these handoff shards.
Change-Id: I1e502c328be16fca5f1cca2186b27a0545fecc16
Closes-Bug: #1843313
This started with ShardRanges and its CLI. The sharder is at the
bottom of the dependency chain. Even container backend needs it.
Once we started tinkering with the sharder, it all snowballed to
include the rest of the container services.
Beware, this does affect some of the Python 2 code. Mostly it's trivial
and obviously correct, but it needs checking by reviewers.
About killing the stray "from __future__ import unicode_literals":
we do not do it in general. The specific problem it caused was
a failure of functional tests because unicode leaked into a field
that was supposed to be encoded. It is just too hard to track the
types when rules change from file to file, so off with its head.
Change-Id: Iba4e65d0e46d8c1f5a91feb96c2c07f99ca7c666
Previously, _check_node() wouldn't catch the ValueError raised when
a drive was unmounted. Therefore the error would bubble up, uncaught,
and stop the shard cycle. The practical effect is that an unmounted
drive on a node would prevent sharding from happening.
This patch updates _check_node() to properly use the check_drive()
method. Furthermore, the _check_node() return value has been modified
to be more similar to what check_drive() actually returns. This
should help prevent similar errors from being introduced in the future.
Closes-Bug: #1806500
Change-Id: I3da9b5b120a5980e77ef5c4dc8fa1697e462ce0d
Most daemons have a "go as fast as you can then sleep for 30 seconds"
strategy towards resource utilization; the object-updater and
object-auditor however have some "X_per_second" options that allow
operators much better control over how they spend their I/O budget.
This change extends that pattern into the account-replicator,
container-replicator, and container-sharder which have been known to peg
CPUs when they're not IO limited.
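For example, something like the following would cap the sharder's db
visiting rate (the option name and value here are assumed for illustration;
see the sample configs for the exact names):

    [container-sharder]
    databases_per_second = 50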
Partial-Bug: #1784753
Change-Id: Ib7f2497794fa2f384a1a6ab500b657c624426384
...instead of 10,000,000. The sample configs were already using one
million, all of our testing with non-SAIO containers was done with
one million, and the resulting container DBs were around 100MB which
seems like a comfortable size. Pretty sure this was just a typo during
some code cleanup.
Change-Id: Icd31f9d8efaac2d5dc0f021cad550687859558b9
The sharder daemon visits container dbs and when necessary executes
the sharding workflow on the db.
The workflow is, in overview:
- perform an audit of the container for sharding purposes.
- move any misplaced objects that do not belong in the container
to their correct shard.
- move shard ranges from FOUND state to CREATED state by creating
shard containers.
- move shard ranges from CREATED to CLEAVED state by cleaving objects
to shard dbs and replicating those dbs. By default this is done in
batches of 2 shard ranges per visit.
Additionally, when the auto_shard option is True (NOT yet recommended
in production), the sharder will identify shard ranges for containers
that have exceeded the threshold for sharding, and will also manage
the sharding and shrinking of shard containers.
The manage_shard_ranges tool provides a means to manually identify
shard ranges and merge them to a container in order to trigger
sharding. This is currently the recommended way to shard a container.
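An illustrative sequence (the db path, threshold and shard-ranges file are
placeholders; consult the tool's --help for the exact subcommands and
options):

    swift-manage-shard-ranges <path_to_container_db> find 500000
    swift-manage-shard-ranges <path_to_container_db> replace <shard_ranges_file>
    swift-manage-shard-ranges <path_to_container_db> enact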
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I7f192209d4d5580f5a0aa6838f9f04e436cf6b1f