29 Commits

Tim Burke
3f88907012 sharding: Better-handle newlines in container names
Previously, if you were on Python 2.7.10+ [0], such a newline would cause the
sharder to fail, complaining about invalid header values when trying to create
the shard containers. On older versions of Python, it would most likely cause a
parsing error in the container-server that was trying to handle the PUT.

Now, quote all places that we pass around container paths. This includes:

  * The X-Container-Sysmeta-Shard-(Quoted-)Root sent when creating the (empty)
    remote shards
  * The X-Container-Sysmeta-Shard-(Quoted-)Root included when initializing the
    local handoff for cleaving
  * The X-Backend-(Quoted-)Container-Path the proxy sends to the object-server
    for container updates
  * The Location header the container-server sends to the object-updater

Note that a new header was required in requests so that servers would
know whether the value should be unquoted or not. We can get away with
reusing Location in responses by having clients opt-in to quoting with
a new X-Backend-Accept-Quoted-Location header.
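
Roughly, the scheme looks like this (a minimal sketch using Python 3's
urllib.parse; the real patch uses Swift's own helpers, and the function
names here are illustrative):

  from urllib.parse import quote, unquote

  def add_container_update_override(headers, account, container):
      # New proxies send the quoted header; new object-servers know to
      # unquote it, while old ones ignore the unfamiliar header and
      # fall back to updating the root container.
      headers['X-Backend-Quoted-Container-Path'] = quote(
          '%s/%s' % (account, container))

  def read_location(resp_headers, accepts_quoted):
      # Responses reuse Location; only unquote it if we opted in with
      # X-Backend-Accept-Quoted-Location.
      loc = resp_headers.get('Location', '')
      return unquote(loc) if accepts_quoted else loc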

During a rolling upgrade,

  * old object-servers servicing requests from new proxy-servers will
    not know about the container path override and so will try to update
    the root container,
  * in general, object updates are more likely to land in the root
    container; the sharder will deal with them as misplaced objects, and
  * shard containers created by new code on servers running old code
    will think they are root containers until the server is running new
    code, too; during this time they'll fail the sharder audit and report
    stats to their account, but both of these should get cleared up upon
    upgrade.

Drive-by: fix a "conainer_name" typo that prevented us from testing that
we can shard a container with unicode in its name. Also, add more UTF8
probe tests.

[0] See https://bugs.python.org/issue22928

Change-Id: Ie08f36e31a448a547468dd85911c3a3bc30e89f1
Closes-Bug: 1856894
2020-01-03 16:04:57 -08:00
Tim Burke
f4bb1bea28 reconciler: Enqueue right work for shard containers
This fixes newly-enqueued work going forward, but doesn't offer anything
to try to parse existing bad work items or even to kick shards so they
reset their reconciler high-water marks.

Change-Id: I79d20209cea70a6447c4e94941e5e854889cbec5
Closes-Bug: 1836082
2019-07-10 23:48:39 -07:00
Tim Burke
4c4bd778ea container-replicator: Add a timeout for get_shard_ranges
Previously this had no timeout, which meant that the replicator might
hang and fail to make progress indefinitely while trying to receive
shard ranges.

While we're at it, only call get_shard_ranges when the remote indicates
that it has shard ranges for us to sync -- this reduces the number of
requests necessary to bring unsharded replicas in sync.
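
A minimal sketch of the shape of the fix (assuming eventlet's Timeout,
which Swift's replicator uses elsewhere; the rinfo key checked below is
a stand-in, not the actual sync-response field):

  from eventlet import Timeout

  def fetch_shard_ranges(http, node_timeout):
      try:
          with Timeout(node_timeout):
              # Previously this call could block indefinitely on a
              # stuck remote.
              return http.replicate('get_shard_ranges')
      except Timeout:
          return None  # treat a slow remote as nothing to sync

  def maybe_fetch_shard_ranges(http, rinfo, node_timeout):
      # Skip the extra round trip when the remote reports no shard
      # ranges for us to sync.
      if rinfo.get('has_shard_ranges'):
          return fetch_shard_ranges(http, node_timeout)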

Change-Id: I32f51f42d76db38271442a261600089404a00f91
Closes-Bug: #1835260
2019-07-03 22:29:47 -07:00
Tim Burke
bc4494f24d py3: port account/container replicators
Change-Id: Ia2662d8f75883e1cc41b9277c65f8b771f56f902
2018-11-06 16:54:20 -08:00
Tim Burke
f6a436aeda Only try to fetch or sync shard ranges if the remote supports sharding
Change-Id: I7231e8af310e268484f2075f0194b7783cf1c3ea
2018-06-14 16:58:29 -07:00
Alistair Coles
e748ef4637 Verify diff stat is unchanged when syncing only shard ranges
Add test assertions to verify that the related change fixes usync diff
stats being erroneously incremented as a side-effect of syncing shard
ranges when the object tables are in sync.

Related-Change: I2630bb127841837b35e7786b59895fa50090719b
Change-Id: Idffe93c63d16e74ea9ca42b33a636c0c0d9e35b5
2018-06-01 08:49:52 +01:00
Alistair Coles
e4045fb475 Add unit tests for replicator sync_shard_ranges
Related-Change: Ie4d2816259e6c25c346976e181fb9d350f947190

Change-Id: Icd558c1f92c24724a76931f1d281a9a20122b683
2018-05-23 09:57:41 +01:00
Alistair Coles
14af38a899 Add support for sharding in ContainerBroker
With this patch the ContainerBroker gains several new features:

1. A shard_ranges table to persist ShardRange data, along with
methods to merge ShardRange instances into that table, to access them,
and to remove expired shard ranges.

2. The ability to create a fresh db file to replace the existing db
file. Fresh db files are named using the hash of the container path
plus an epoch which is a serialized Timestamp value, in the form:

  <hash>_<epoch>.db

During sharding, both the fresh and retiring db files co-exist on
disk. The ContainerBroker is now able to choose the newest on-disk db
file when instantiated. It also provides a method (get_brokers()) to
gain access to a broker instance for either on-disk file.

3. Methods to access the current state of the on-disk db files, i.e.
UNSHARDED (old file only), SHARDING (fresh and retiring files), or
SHARDED (fresh file only, with shard ranges).
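
A minimal sketch of how that state can be derived from which files
exist on disk (constant and parameter names here are illustrative, not
necessarily the broker's actual API):

  UNSHARDED, SHARDING, SHARDED = 'unsharded', 'sharding', 'sharded'

  def db_state(retiring_exists, fresh_exists, has_shard_ranges):
      if retiring_exists and fresh_exists:
          # both files co-exist while cleaving is in progress
          return SHARDING
      if fresh_exists and has_shard_ranges:
          # only the epoch-named fresh file remains
          return SHARDED
      # only the original <hash>.db
      return UNSHARDED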

Container replication is also modified:

1. Shard ranges are replicated between container db peers. Unlike
objects, shard ranges are both pushed and pulled during a REPLICATE
event.

2. If a container db is capable of being sharded (i.e. it has a set of
shard ranges) then it will no longer attempt to replicate objects to
its peers. Object record durability is achieved by sharding rather than
peer-to-peer replication.

Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>

Change-Id: Ie4d2816259e6c25c346976e181fb9d350f947190
2018-05-18 18:42:38 +01:00
Clay Gerrard
feee399840 Use check_drive consistently
We added check_drive to the account/container servers to unify how all
the storage wsgi servers treat device dirs/mounts. This pushes that
unification down into the consistency engine.
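
The unified pattern is roughly (a sketch; see swift.common.constraints
for the real check_drive, whose exact return behavior may differ):

  import os

  def check_drive(root, device, mount_check):
      # Reject anything that could escape the devices root, then
      # verify the device is mounted (or at least a directory).
      if not device or '/' in device:
          return None
      path = os.path.join(root, device)
      if mount_check:
          return path if os.path.ismount(path) else None
      return path if os.path.isdir(path) else None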

Drive-by:
 * use FakeLogger less
 * clean up some repetition in the probe utility for device re-"mounting"

Related-Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
2017-11-01 16:33:40 +00:00
Alistair Coles
e91de49d68 Update container on fast-POST
This patch makes a number of changes to enable content-type
metadata to be updated when using the fast-POST mode of
operation, as proposed in the associated spec [1].

* the object server and diskfile are modified to allow
  content-type to be updated by a POST and the updated value
  to be stored in .meta files.

* the object server accepts PUTs and DELETEs with older
  timestamps than existing .meta files. This is to be
  consistent with replication that will leave a later .meta
  file in place when replicating a .data file.

* the diskfile interface is modified to provide accessor
  methods for the content-type and its timestamp.

* the naming of .meta files is modified to encode two
  timestamps when the .meta file contains a content-type value
  that was set prior to the latest metadata update; this
  enables consistency to be achieved when rsync is used for
  replication (see the sketch after this list).

* ssync is modified to sync meta files when content-type
  differs between local and remote copies of objects.

* the object server issues container updates when handling
  POST requests, notifying the container server of the current
  immutable metadata (etag, size, hash, swift_bytes),
  content-type with their respective timestamps, and the
  mutable metadata timestamp.

* the container server maintains the most recently reported
  values for immutable metadata, content-type and mutable
  metadata, each with their respective timestamps, in a single
  db row.

* new probe tests verify that replication achieves eventual
  consistency of containers and objects after discrete updates
  to content-type and mutable metadata, and that container-sync
  syncs objects after fast-post updates.
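
As referenced above, a hypothetical sketch of the two-timestamp .meta
naming (NOT the exact on-disk encoding; see the diskfile module for
that):

  def meta_filename(meta_ts, ctype_ts=None):
      # If content-type was set at an older timestamp than the latest
      # metadata update, encode both so a file copied verbatim by
      # rsync still carries both timestamps.
      if ctype_ts is None or ctype_ts == meta_ts:
          return '%s.meta' % meta_ts               # single-timestamp form
      return '%s-%s.meta' % (meta_ts, ctype_ts)    # two-timestamp form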

[1] spec change-id: I60688efc3df692d3a39557114dca8c5490f7837e

Change-Id: Ia597cd460bb5fd40aa92e886e3e18a7542603d01
2016-03-03 14:25:10 +00:00
Eran Rom
85a0a6a28e Container-Sync to iterate only over synced containers
This change introduces a sync_store which holds only containers that
are enabled for sync. The store is implemented using a directory
structure that resembles that of the containers directory, but has
entries only for containers enabled for sync.
The store is maintained in two ways:
1. Preemptively by the container server when processing
PUT/POST/DELETE operations targeted at containers with
x-container-sync-key / x-container-sync-to
2. In the background by the container replicator
whenever it processes a container set up for sync
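
A minimal sketch of the maintenance logic (paths and helper names are
assumptions, not the actual sync_store API):

  import os

  def update_sync_store(sync_store_root, broker):
      # Called by the container server on PUT/POST/DELETE and by the
      # replicator in the background. The flat layout here is a
      # simplification; the real store mirrors the containers tree.
      os.makedirs(sync_store_root, exist_ok=True)
      entry = os.path.join(
          sync_store_root, os.path.basename(broker.db_file))
      md = broker.metadata
      if 'X-Container-Sync-To' in md or 'X-Container-Sync-Key' in md:
          open(entry, 'a').close()    # add/refresh the entry
      elif os.path.exists(entry):
          os.unlink(entry)            # sync disabled; drop the entry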

The change updates [1]
[1] http://docs.openstack.org/developer/swift/overview_container_sync.html

Change-Id: I9ae4d4c7ff6336611df4122b7c753cc4fa46c0ff
Closes-Bug: #1476623
2016-01-06 16:46:31 +02:00
Matthew Oliver
4a13dcc4a8 Make db_replicator usync smaller containers
The current rule inside the db_replicator is to rsync+merge
containers during replication if the difference between rowids
differ by more than 50%:

  # if the difference in rowids between the two differs by
  # more than 50%, rsync then do a remote merge.
  if rinfo['max_row'] / float(info['max_row']) < 0.5:

This means that smaller containers, which have only a few rows and
differ by a small number, still rsync+merge rather than copying rows.

This change adds a new condition: the difference in the rowids must
be greater than the defined per_diff, otherwise usync will be used:

  # if the difference in rowids between the two differs by
  # more than 50% and the difference is greater than per_diff,
  # rsync then do a remote merge.
  # NOTE: difference > per_diff stops us from dropping to rsync
  # on smaller containers, who have only a few rows to sync.
  if rinfo['max_row'] / float(info['max_row']) < 0.5 and \
          info['max_row'] - rinfo['max_row'] > self.per_diff:

Change-Id: I9e779f71bf37714919a525404565dd075762b0d4
Closes-bug: #1019712
2015-10-19 15:26:12 +01:00
janonymous
cd7b2db550 unit tests: Replace "self.assert_" by "self.assertTrue"
The assert_() method is deprecated and can be safely replaced by assertTrue().
This patch makes sure that running the tests does not create undesired
warnings.

Change-Id: I0602ba39ef93263386644ee68088d5f65fcb4a71
2015-07-21 19:23:00 +05:30
janonymous
09e7477a39 Replace it.next() with next(it) for py3 compat
The Python 2 next() method of iterators was renamed to __next__() on
Python 3. Use the builtin next() function instead which works on Python
2 and Python 3.
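
The portable pattern in miniature:

  it = iter([1, 2, 3])
  assert next(it) == 1  # works on py2 and py3; it.next() is py2-only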

Change-Id: Ic948bc574b58f1d28c5c58e3985906dee17fa51d
2015-06-15 22:10:45 +05:30
Alistair Coles
bfbc94c3cb Fix intermittent container replicator test failure
Intermittent failure of this test could be due to
insufficient time elapsing either between the local
and remote dbs being created or during the
debug_timing calls. This patch forces greater timestamp
separation and forces debug_timing to always log timings.

Also add a message to the failing assertion so if this does
fail again we get some clue as to why.

Closes-Bug: 1369663

Change-Id: I4b69b2e759d586a14abd0931a68dbdf256d57c32
2015-04-28 10:53:05 +01:00
Clay Gerrard
5ef2e9ec00 fixup random test failure
Change-Id: I053cec441623e7556d6cdfbc1d271c5cdf21aa8b
2015-04-14 16:06:24 -07:00
Clay Gerrard
404ac092d1 Fix large out of sync out of date containers
As I understand it, db replication starts with a preflight sync request
to the remote container server, whose response will include the last
synced row_id that it has on file for the sending node's database id.

If the difference in the last sync point returned is more than 50% of
the local sending db's rows, it'll fall back to sending the whole db
over rsync and let the remote end merge items locally - but generally
there's just a few rows missing and they're shipped over the wire as
json and stuffed into some rather normal looking merge_items calls.

The one thing that's a bit different with these remote merge_items calls
(compared to your average run-of-the-mill eating of entries out of a
.pending file) is the source kwarg. When this optional kwarg comes into
merge_items it's the remote sending db's uuid, and after we eat all the
rows it sent us we update our local incoming_sync table for that uuid so
that next time, when it makes its pre-flight sync request, we can tell
it where it left off.

Now normally the sending db is going to push out its rows up from the
returned sync_point in 1000-item diffs, up to 10 batches total (per_diff
and max_diffs options) - 10K rows. If that goes well then everything is
in sync up to at least the point it started, and the sending db will
*also* ship over *its* incoming_sync rows to merge_syncs on the remote
end. Since the sending db is in sync with these other dbs up to those
points, so is the remote db now, by way of the transitive property. Also
note, through some weird artifact that I'm not entirely convinced isn't
an unrelated and possibly benign bug, the incoming_sync table on the
sending db will often also happen to include its own uuid - maybe it
got pushed back to it from another node?

Anyway, that seemed to work well enough until a sending db got diff
capped (i.e. sent its 10K rows and wasn't finished); when this happened
the final merge_syncs call never got sent because the remote end is
definitely *not* up to date with the other databases that the sending db
is - it's not even up-to-date with the sending db yet! But the hope is
certainly that on the next pass it'll be able to finish sending the
remaining items. But since the remote end is who decides what the last
successfully synced row with this local sending db was, it's super
important that the incoming_sync table is getting updated in merge_items
when that source kwarg is there.

I observed this simple and straightforward process wasn't working well
in one case - which is weird considering it didn't have much in the way
of tests. After I had the test and started looking into it, it seemed
maybe the source kwarg handling got over-indented a bit in the bulk
insert merge_items refactor. I think this is correct - maybe we could
send someone up to the mountain temple to seek out gholt?
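
A minimal sketch of the shape of the bug (helper names are made up; the
point is only where the source handling sits):

  def merge_items(conn, items, source=None):
      max_rowid = -1
      for rec in items:
          merge_record(conn, rec)          # per-row insert/merge
          max_rowid = max(max_rowid, rec['ROWID'])
      # This bookkeeping must run once per batch, OUTSIDE the loop and
      # any per-row branch; over-indenting it meant the remote's
      # incoming_sync row never advanced, so diff-capped dbs resent
      # the same rows forever.
      if source:
          record_sync_point(conn, source, max_rowid)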

Change-Id: I4137388a97925814748ecc36b3ab5f1ac3309659
2015-01-07 17:20:35 -08:00
Clay Gerrard
233e0aebf7 Fix reclaim on deleted containers
The common db replicator's code path for reclaiming deleted dbs beyond the
reclaim age was not covered by unittests, and an AttributeError snuck in. In
writing the test that would cover the common code both for accounts and
containers I discovered another KeyError in the container conditional for
validating the container's fully reported status.

This fixes both those issues and adds additional tests for cleaning up
empty account and container partition and suffix directories.

Change-Id: I2a1bfaefebd05b01231bf71dd908fcc49adb4c36
2014-12-03 17:10:15 -08:00
Alistair Coles
b9ae377eab Check for change before container replicator updates db
As described in the related bug report, unnecessary updates
to the container db during replication can impact object
GET performance in certain circumstances.

This patch changes swift/container/replicator.py so that
calls to merge_timestamps and update_reconciler_sync
are conditional on values having actually changed.
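
A minimal sketch of the guard (get_info and merge_timestamps are the
broker's real method names, though the surrounding code is
illustrative):

  def maybe_merge_timestamps(broker, created_at, put_ts, delete_ts):
      info = broker.get_info()
      if (info['created_at'], info['put_timestamp'],
              info['delete_timestamp']) != (created_at, put_ts, delete_ts):
          # only touch the db when something actually changed
          broker.merge_timestamps(created_at, put_ts, delete_ts)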

Related-Bug: 1332025
Change-Id: If498251656500ed7a3d7ca4b109ea1079b8513c2
2014-09-02 17:08:58 +01:00
Clay Gerrard
c1dc2fa624 Add two vector timestamps
The normalized form of the X-Timestamp header looks like a float with a fixed
width to ensure stable string sorting - normalized timestamps look like
"1402464677.04188"

To support overwrites of existing data without modifying the original
timestamp but still maintain consistency, a second internal offset
vector is appended to the normalized timestamp form, which compares and
sorts greater than the fixed-width float format but less than a newer
timestamp. The internalized format of timestamps looks like
"1402464677.04188_0000000000000000" - the portion after the underscore
is the offset and is a formatted hexadecimal integer.

The internalized form is not exposed to clients in responses from Swift.
Normal client operations will not create a timestamp with an offset.

The Timestamp class in common.utils supports internalized and normalized
formatting of timestamps and also comparison of timestamp values.  When the
offset value of a Timestamp is 0 - it's considered insignificant and need not
be represented in the string format; to support backwards compatibility during
a Swift upgrade the internalized and normalized form of a Timestamp with an
insignificant offset are identical.  When a timestamp includes an offset it
will always be represented in the internalized form, but is still excluded
from the normalized form.  Timestamps with an equivalent timestamp portion
(the float part) will compare and order by their offset.  Timestamps with a
greater timestamp portion will always compare and order greater than a
Timestamp with a lesser timestamp regardless of its offset.  String
comparison and ordering is guaranteed for the internalized string format, and
is backwards compatible for normalized timestamps which do not include an
offset.
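
For example, using the Timestamp class described above (values follow
the formats given in this message):

  from swift.common.utils import Timestamp

  t = Timestamp(1402464677.04188)
  assert t.normal == '1402464677.04188'
  assert t.internal == '1402464677.04188'  # offset 0 is insignificant

  t2 = Timestamp(1402464677.04188, offset=1)
  assert t2.internal == '1402464677.04188_0000000000000001'
  assert t2 > t                      # same float part, greater offset
  assert t2.internal > t.internal    # string ordering agrees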

The reconciler currently uses an offset bump to ensure that objects can
move to the wrong storage policy and be moved back. This use-case is
valid because the content represented by the user-facing timestamp is
not modified in any way. Future consumers of the offset vector of
timestamps should be mindful of the HTTP semantics of If-Modified and
take care to avoid deviation in the response from the object server
without an accompanying change to the user-facing timestamp.

DocImpact
Implements: blueprint storage-policies
Change-Id: Id85c960b126ec919a481dc62469bf172b7fb8549
2014-06-19 10:18:06 -07:00
Clay Gerrard
a14d2c857c Enqueue misplaced objects during container replication
After a container database is replicated, a _post_replicate_hook will enqueue
misplaced objects for the container-reconciler into the .misplaced_objects
containers. Items to be reconciled are "batch loaded" into the reconciler
queue at the end of a container replication cycle by leveraging container
replication itself.

DocImpact
Implements: blueprint storage-policies
Change-Id: I3627efcdea75403586dffee46537a60add08bfda
2014-06-18 21:09:50 -07:00
Clay Gerrard
81bc31e6ec Merge container storage_policy_index
Keep status_changed_at in container databases current with status changes that
occur as a result of container creation, deletion, or re-creation.

Merge container put/delete/created timestamps when handling replicate
responses from remote servers, as well as during the handling of the
REPLICATE request.

When storage policies are configured on a cluster send status_changed_at,
object_count and storage_policy_index as part of container replication sync
args.

Use status_changed_at during replication to determine the oldest active
container and merge storage_policy_index.

DocImpact
Implements: blueprint storage-policies
Change-Id: Ib9a0dd42c271145e641437dc04d0ebea1e11fc47
2014-06-18 20:57:09 -07:00
Peter Portante
9411a24ba7 Revert "Refactor common/utils methods to common/ondisk"
This reverts commit 7760f41c3ce436cb23b4b8425db3749a3da33d32

Change-Id: I95e57a2563784a8cd5e995cc826afeac0eadbe62
Signed-off-by: Peter Portante <peter.portante@redhat.com>
2013-10-07 17:18:09 -04:00
ZhiQiang Fan
f72704fc82 Change OpenStack LLC to Foundation
Change-Id: I7c3df47c31759dbeb3105f8883e2688ada848d58
Closes-bug: #1214176
2013-09-20 01:02:31 +08:00
Peter Portante
7760f41c3c Refactor common/utils methods to common/ondisk
Place all the methods related to on-disk layout and/or configuration
into a new common module that can be shared by the various modules
using the same on-disk layout.

Change-Id: I27ffd4665d5115ffdde649c48a4d18e12017e6a9
Signed-off-by: Peter Portante <peter.portante@redhat.com>
2013-09-17 17:32:04 -04:00
gholt
213f385348 Fixed bug with container reclaim/report race
Before, a really lagged cluster might not get its final report for a
deleted container database sent to its corresponding account
database. In such a case, the container database file would be
permanently deleted while still leaving the container listed in the
account database, never to be updated since the actual container
database file was gone. The only way to fix such a situation before
was to recreate and redelete the container.

Now, the container database file will not be permanently deleted
until it has sent its final report successfully to its corresponding
account database.

Change-Id: I1f42202455e7ecb0533b84ce7f45fcc7b98aeaa3
2012-06-03 03:23:51 +00:00
John Dickinson
1ecf5ebba1 updated copyright date for all files
Change-Id: Ifd909d3561c2647770a7e0caa3cd91acd1b4f298
2012-03-19 13:45:34 -05:00
Anne Gentle
8823427161 Changed copyright notices on py files and the single rst file with a copyright notice 2011-01-04 17:34:43 -06:00
gholt
e6e354c483 Added some missing test stubs so we can better see coverage (and get a little syntax-level "testing"). 2010-10-07 08:23:17 -07:00