49 Commits

Author SHA1 Message Date
Jenkins
7f953ce0b9 Merge "Follow up the reconstructor test coverage" 2017-03-07 15:10:03 +00:00
Jenkins
30524392f8 Merge "Follow up on reconstructor handoffs_only" 2017-03-07 06:45:09 +00:00
Kota Tsuyuzaki
0e44770991 Follow up the reconstructor test coverage
This is follow up for https://review.openstack.org/#/c/436522/

I'd like to use the same assertion when the code goes down the same
path. Both Exception and Timeout end up in the exception log, which
starts with "Trying to GET"; "Timeout" is an extra word that appears
in the log.

In addition, this adds assertions that the return value from
get_response is None in the error cases.

Change-Id: Iba86b495a14c15fc6eca8bf8a7df7d110256b0af
2017-03-05 18:02:09 -08:00
Jenkins
cf1c44dff0 Merge "Fixups for EC frag duplication tests" 2017-03-03 23:08:34 +00:00
Jenkins
3891721d59 Merge "Cleanup reconstructor tests" 2017-03-03 21:28:32 +00:00
Kota Tsuyuzaki
54347f92ed Cleanup reconstructor tests
Fixes:

* assertTrue(xxxx in yyyy) -> assertIn(xxxx, yyyy)
* assertTrue(xxxx > yyyy) -> assertGreater(xxxx, yyyy)
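
For illustration, the pattern looks like this (a sketch; the variable
names are hypothetical):

    # before: a failure only reports "False is not true"
    self.assertTrue(local_dev in found_devs)
    self.assertTrue(len(jobs) > 0)

    # after: a failure reports both operands
    self.assertIn(local_dev, found_devs)
    self.assertGreater(len(jobs), 0)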

Change-Id: I353ec389f9abed3427951cd473d7c3ebcbca1669
2017-03-03 00:57:13 -08:00
Mahati Chamarthy
96f8b957ee Increase test coverage for reconstructor
Some of the test coverage was omitted in the related change and some
was missing. This change fixes that.

Change-Id: I403b493bd8e59f6bcb586b4263a8e8c267728505
Related-Change-Id: I69e4c4baee64fd2192cbf5836b0803db1cc71705
2017-02-28 00:54:11 +05:30
Jenkins
1f36b5dd16 Merge "EC Fragment Duplication - Foundational Global EC Cluster Support" 2017-02-26 06:26:08 +00:00
Alistair Coles
e4972f5ac7 Fixups for EC frag duplication tests
Follow up for related change:
- fix typos
- use common helper methods
- refactor some tests to reduce duplicate code

Related-Change: Idd155401982a2c48110c30b480966a863f6bd305

Change-Id: I2f91a2f31e4c1b11f3d685fa8166c1a25eb87429
2017-02-25 20:40:04 -08:00
Kota Tsuyuzaki
40ba7f6172 EC Fragment Duplication - Foundational Global EC Cluster Support
This patch enables efficient PUT/GET in a globally distributed cluster[1].

Problem:
Erasure coding has the capability to decrease the amount of data
actually stored relative to the replicated model. For example, the
parameters ec_k=6, ec_m=3 can store at 1.5x the original data, which
is smaller than 3x replicated. However, unlike replication, erasure
coding requires at least ec_k fragments of the total ec_k + ec_m
fragments to be available to service a read (e.g. 6 of 9 in the case
above). As such, if we store an EC object in a swift cluster spanning
2 geographically distributed data centers which have the same volume
of disks, the fragments will likely be stored evenly (about 4 and 5),
so we still need to access a faraway data center to decode the
original object. In addition, if one of the data centers were lost in
a disaster, the stored objects would be lost forever, and we would
have to cry a lot. To ensure highly durable storage, you might think
of making *more* parity fragments (e.g. ec_k=6, ec_m=10);
unfortunately, this causes *significant* performance degradation due
to the cost of the mathematical calculation for erasure coding
encode/decode.
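
For reference, the overhead arithmetic works out as follows (a quick
sketch, not part of the patch):

    k, m = 6, 3
    overhead = float(k + m) / k      # 1.5x of the original data
    replicated = 3.0                 # vs. 3x for triple replication
    # more parities for durability, e.g. ec_k=6, ec_m=10:
    overhead = float(6 + 10) / 6     # ~2.67x, and much slower encode/decode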

How this resolves the problem:
EC Fragment Duplication extends the initial solution to add *more*
fragments from which to rebuild an object, similar to the solution
described above. The difference is making *copies* of the encoded
fragments. Experimental results[1][2] show that employing small ec_k
and ec_m yields enough performance to store/retrieve objects.

On PUT:

- Encode the incoming object with small ec_k and ec_m  <- faster!
- Make duplicated copies of the encoded fragments. The # of copies
  is determined by 'ec_duplication_factor' in swift.conf
- Store all fragments in Swift Global EC Cluster

The duplicated fragments increase the pressure on existing
requirements when decoding objects in service of a read request.  All
fragments are stored with their X-Object-Sysmeta-Ec-Frag-Index.  In
this change, the X-Object-Sysmeta-Ec-Frag-Index represents the actual
fragment index encoded by PyECLib, so there *will* be duplicates.
Anytime we must decode the original object data, we may only consider
ec_k fragments that are unique according to their
X-Object-Sysmeta-Ec-Frag-Index.  No duplicate
X-Object-Sysmeta-Ec-Frag-Index may be used when decoding an object;
duplicates should be expected and avoided where possible.
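
Conceptually, the decode-side uniqueness constraint looks like this
(a simplified sketch; the names are illustrative, not the actual
proxy code):

    def pick_frags_for_decode(responses, ec_k):
        # keep at most one response per actual fragment index; a
        # duplicate X-Object-Sysmeta-Ec-Frag-Index is useless to decode
        unique = {}
        for resp in responses:
            frag_index = resp.headers['X-Object-Sysmeta-Ec-Frag-Index']
            unique.setdefault(frag_index, resp)
            if len(unique) == ec_k:
                break
        return list(unique.values())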

On GET:

This patch includes the following changes:
- Change the GET path to sort primary nodes into grouped subsets, so
  that each subset includes unique fragments
- Change the reconstructor to be more aware of possibly duplicate
  fragments

For example, with this change, a policy could be configured such that

swift.conf:
ec_num_data_fragments = 2
ec_num_parity_fragments = 1
ec_duplication_factor = 2
(object ring must have 6 replicas)

At Object-Server:
node index (from object ring):  0 1 2 3 4 5 <- keep node index for
                                               reconstruct decision
X-Object-Sysmeta-Ec-Frag-Index: 0 1 2 0 1 2 <- each object keeps actual
                                               fragment index for
                                               backend (PyEClib)
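
In other words, under duplication the backend fragment index follows
from the ring node index modulo the number of unique fragments (a
sketch of the mapping above, not the exact helper in the patch):

    ec_k, ec_m, factor = 2, 1, 2
    ec_n_unique_fragments = ec_k + ec_m          # 3
    replicas = ec_n_unique_fragments * factor    # 6 in the object ring

    for node_index in range(replicas):
        # prints 0->0, 1->1, 2->2, 3->0, 4->1, 5->2
        print(node_index, '->', node_index % ec_n_unique_fragments)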

Additional improvements to Global EC Cluster Support will require
features such as Composite Rings, and more efficient fragment
rebalance/reconstruction.

1: http://goo.gl/IYiNPk (Swift Design Spec Repository)
2: http://goo.gl/frgj6w (Slide Share for OpenStack Summit Tokyo)

Doc-Impact

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idd155401982a2c48110c30b480966a863f6bd305
2017-02-22 10:56:13 -08:00
Jenkins
1f3dd83f41 Merge "Remove per-device reconstruction stats" 2017-02-20 21:55:27 +00:00
Tim Burke
8973ceb31a Remove per-device reconstruction stats
Now that we're shuffling parts before going through them, those stats no
longer make sense -- device completion would always be 100%.

Also, always use delete_partition for cleanup, so we have one place to
make improvements. This means we'll properly clean up non-numeric
directories.

Also, move more of delete_partition's I/O into the tpool.

Change-Id: Ie06bb16c130d46ccf887c8fcb252b8d018072d68
Related-Change: I69e4c4baee64fd2192cbf5836b0803db1cc71705
2017-02-20 16:22:45 +00:00
Jenkins
cdd72dd34f Merge "Deprecate broken handoffs_first in favor of handoffs_only" 2017-02-15 03:54:49 +00:00
Kota Tsuyuzaki
600db4841e Follow up on reconstructor handoffs_only
This is a follow-up for https://review.openstack.org/#/c/425493
This patch includes:

- Add more tests on the configuration of handoffs_first and
  handoffs_only
- Remove an unnecessary space in a warning log line (2 places)
- Change the test conf values from True/False to "True"/"False"
  (strings), because values in the conf dict should be strings
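
This matches how the daemons actually read the conf dict - values
arrive as strings and are interpreted with config_true_value, e.g.:

    from swift.common.utils import config_true_value

    conf = {'handoffs_first': 'True'}   # conf values are strings
    config_true_value(conf.get('handoffs_first', 'false'))  # True
    config_true_value('False')                              # False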

Co-Authored-By: Janie Richling <jrichli@us.ibm.com>

Change-Id: Ida90c32d16481a15fa68c9fdb380932526c366f6
2017-02-14 18:21:58 -08:00
Clay Gerrard
da557011ec Deprecate broken handoffs_first in favor of handoffs_only
The handoffs_first mode in the replicator has the useful behavior of
processing all handoff parts across all disks until there aren't any
handoffs anymore on the node [1], and then it seemingly tries to drop
back into normal operation.  In practice I've only ever heard of
handoffs_first used while rebalancing and turned off as soon as the
rebalance finishes - it's not recommended to run with handoffs_first
mode turned on, and it emits a warning on startup if the option is
enabled.

The handoffs_first mode on the reconstructor doesn't work - it was
prioritizing handoffs *per-part* [2] - which is really unfortunate
because in the reconstructor during a rebalance it's often *much* more
attractive, from a disk/network efficiency perspective, to revert a
partition from a handoff than it is to rebuild an entire partition from
another primary using the other EC fragments in the cluster.

This change deprecates handoffs_first in favor of handoffs_only in the
reconstructor, which is far more useful - and, just like the
handoffs_first mode in the replicator, it gives the operator the option
of forcing the consistency engine to focus on rebalance.  The
handoffs_only behavior is somewhat consistent with the replicator's
handoffs_first option (any error on any handoff in the replicator will
make it essentially handoff only forever), but the option does what you
want and is named correctly in the reconstructor.

For consistency with the replicator the reconstructor will mostly honor
the handoffs_first option, but if you set handoffs_only in the config it
always takes precedence.  Having handoffs_first in your config always
results in a warning, but if handoffs_only is not set and handoffs_first
is true the reconstructor will assume you want handoffs_only and behave
as such.

When running in handoffs_only mode the reconstructor will start to log a
warning every cycle if you leave it enabled after it finishes reverting
handoffs.  You should be monitoring on-disk partitions and disable the
option as soon as the cluster finishes the full rebalance cycle.
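
For example, an operator might enable the mode for the duration of a
rebalance with something like this (an object-server.conf sketch):

    [object-reconstructor]
    # remove (or set to False) as soon as the rebalance completes
    handoffs_only = True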

1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator
handoffs_first "mode"

2. Unlike replication, each partition in an EC policy can have a
different kind of job per frag_index, but the cardinality of jobs is
typically only one (either sync or revert) unless there have been a
bunch of errors during write, in which case handoff partitions may hold
a number of different fragments.

Known-Issues:

handoffs_only is not documented outside of the example config, see lp
bug #1626290

Closes-Bug: #1653018

Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
2017-02-13 21:13:29 -08:00
Jenkins
65744c8448 Merge "Shuffle disks and parts in reconstructor" 2017-02-01 07:15:50 +00:00
Clay Gerrard
eadb01b8af Do not revert fragments to handoffs
We're already a handoff - just wait until we can ship it to the right
primary location.

If we timeout talking to a couple of nodes (or more likely get rejected
for connection limits because of contention during a rebalance) we can
actually end up making *more* work if we move the part to another node.
I've seen clusters get stuck on rebalance just passing parts around
handoffs for *days*.

Known-Issues:

If we see a 507 from a primary and we're not in the handoff list (we're
an old primary post rebalance) it'd probably not be so terrible to try
to revert it to the first handoff if it's not already holding a part.
But that's more work and sounds more like lp bug #1510342.

Closes-Bug: #1653169

Change-Id: Ie351d8342fc8e589b143f981e95ce74e70e52784
2017-01-31 02:37:31 +00:00
Clay Gerrard
2f0ab78f9f Shuffle disks and parts in reconstructor
The main problem with going disk by disk is that it means all of your
I/O is only on one spindle at a time and no matter how high you set
concurrency it doesn't go any faster.
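
The fix is conceptually just a shuffle of the (device, partition)
pairs before processing - a simplified sketch of the idea, not the
exact reconstructor code:

    import random

    def build_jobs(local_devices, partitions_of):
        # collect (device, partition) pairs across *all* disks first...
        jobs = [(dev, part)
                for dev in local_devices
                for part in partitions_of(dev)]
        # ...then shuffle, so concurrent workers hit many spindles at
        # once instead of queueing up behind a single disk
        random.shuffle(jobs)
        return jobs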

Closes-Bug: #1491605

Change-Id: I69e4c4baee64fd2192cbf5836b0803db1cc71705
2017-01-25 18:30:17 -08:00
Mahati Chamarthy
69f7be99a6 Move documented reclaim_age option to correct location
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.

As a drive-by, I also clean up the reclaim_age plumbing from get_hashes
down to cleanup_ondisk_files, since it's a method on the Manager and has
access to the configured reclaim_age.  This fixes a bug where
finalize_put wouldn't use the [DEFAULT]/object-server configured
reclaim_age - which is normally benign but leads to weird behavior on
DELETE requests with a really small reclaim_age.

There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age, and the
method could just as reasonably have been implemented on the Manager.

UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup.  All object
services must use the same reclaim_age.  If you require a non-default
reclaim age it should be set in the [DEFAULT] section.  If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.

If you specify a reclaim_age value in any object related config you
should move it to *only* the [DEFAULT] section before you upgrade.  If
you configure a reclaim_age less than your consistency window you are
likely to be eaten by a Grue.
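
That is, after upgrading the setting should live in exactly one place
(an object-server.conf sketch; 604800 is the default of one week):

    [DEFAULT]
    # one value shared by the object-server, replicator, reconstructor
    reclaim_age = 604800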

Closes-Bug: #1626296

Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
2017-01-13 03:10:47 +00:00
Kota Tsuyuzaki
b09360d447 Fix stats calculation in object-reconstructor
This patch fixes the object-reconstructor to calculate device_count
as the total number of local devices across all policies. Previously
Swift counted devices per policy, so reconstruction_device_count -
which is meant to be the number of devices Swift actually needs to
reconstruct - was the sum of the per-policy counts.

With this patch, Swift will first gather all local devices across all
policies, and then collect parts for each device as it does currently.
This way, we can see the status of the remaining jobs/disks
percentage via the stats_line output.

To enable this change, this patch also touches the object replicator
to get a DiskFileManager via the DiskFileRouter class so that
DiskFileManager instances are policy specific. Currently the same
replication policy DiskFileManager class is always used, but this
change future-proofs the replicator for possible other DiskFileManager
implementations.

The change also gives the ObjectReplicator a _df_router variable,
making it consistent with the ObjectReconstructor, and allowing a
common way for ssync.Sender to access DiskFileManager instances via
its daemon's _df_router instance.

Also, remove the use of FakeReplicator from the ssync test suite. It
was not necessary and risked masking divergence between ssync and the
replicator and reconstructor daemon implementations.

Co-Author: Alistair Coles <alistair.coles@hpe.com>

Closes-Bug: #1488608
Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068
2016-12-12 21:26:54 -08:00
Clay Gerrard
f4adb2f28f Fix ZeroDivisionError in reconstructor.stats_line
Despite a check to prevent zero values in the denominator, Python
integer division could result in a ZeroDivisionError in the
compute_eta helper function.  Make sure we always have a non-zero
value, even if it is small.
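
A minimal sketch of the guard (illustrative, not the exact helper):

    import time

    def compute_eta(start_time, current_value, final_value):
        elapsed = time.time() - start_time
        # float division, and fall back to a small non-zero
        # completion so the denominator can never be zero
        completion = (float(current_value) / final_value) or 0.00001
        return elapsed / completion - elapsed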

NotImplemented:

 * stats calculation is still not great, see lp bug #1488608

Closes-Bug: #1549110
Change-Id: I54f2081c92c2a0b8f02c31e82f44f4250043d837
2016-11-07 18:19:20 -08:00
Alistair Coles
6574ce31ee EC: reconstruct using non-durable fragments
Previously the reconstructor would only reconstruct a missing fragment
when a set of ec_ndata other fragments was available, *all* of which
were durable. Since change [1] it has been possible to retrieve
non-durable fragments from object servers. This patch changes the
reconstructor to take advantage of [1] and use non-durable fragments.

A new probe test is added to test scenarios with a mix of failed and
non-durable nodes. The existing probe tests in
test_reconstructor_rebuild.py and test_reconstructor_durable.py were
broken. These were intended to simulate cases where combinations of
nodes were either failed or had non-durable fragments, but the test
scenarios defined were not actually created - every test scenario
broke only one node instead of the intent of breaking multiple
nodes. The existing tests have been refactored to re-use most of their
setup and assertion code, and merged with the new test into a single
class in test_reconstructor_rebuild.py.

test_reconstructor_durable.py is removed.

[1] Related-Change: I2310981fd1c4622ff5d1a739cbcc59637ffe3fc3

Change-Id: Ic0cdbc7cee657cea0330c2eb1edabe8eb52c0567
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Closes-Bug: #1624088
2016-11-03 16:54:09 +00:00
Ondřej Nový
33c18c579e Remove executable flag from some test modules
Change-Id: I36560c2b54c43d1674b007b8105200869b5f7987
2016-10-31 21:22:10 +00:00
Jenkins
264e728364 Merge "Prevent ssync writing bad fragment data to diskfile" 2016-10-14 23:29:29 +00:00
Alistair Coles
3218f8b064 Prevent ssync writing bad fragment data to diskfile
Previously, if a reconstructor sync type job failed to provide
sufficient bytes from a reconstructed fragment body iterator to match
the content-length that the ssync sender had already sent to the ssync
receiver, the sender would still proceed to send the next
subrequest. The ssync receiver might then write the start of the next
subrequest to the partially complete diskfile for the previous
subrequest (including writing subrequest headers to that diskfile)
until it has received content-length bytes.

Since a reconstructor ssync job does not send an ETag header (it
cannot, because it does not know the ETag of a reconstructed fragment
until it has been sent), the receiving object server does not detect
the "bad" data written to the fragment diskfile, and worse, will
label it with an ETag that matches the md5 sum of the bad data. The
bad fragment file will therefore appear good to the auditor.

There is no easy way for the ssync sender to communicate a lack of
source data to the receiver other than by disconnecting the
session. So this patch adds a check in the ssync sender that the sent
byte count is equal to the sent Content-Length header value for each
subrequest, and disconnects if a mismatch is detected.
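
A sketch of the sender-side check (simplified; the names are
illustrative):

    def send_subrequest_body(sock, body_iter, content_length):
        bytes_sent = 0
        for chunk in body_iter:
            sock.sendall(chunk)
            bytes_sent += len(chunk)
        if bytes_sent != content_length:
            # a short reconstructed fragment: disconnect so the
            # receiver cannot mistake the next subrequest for the
            # remainder of this body
            raise ValueError('sent %d bytes, expected %d'
                             % (bytes_sent, content_length))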

The disconnect prevents the receiver finalizing the bad diskfile, but
also prevents subsequent fragments in the ssync job being sync'd until
the next cycle.

Closes-Bug: #1631144
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>

Change-Id: I54068906efdb9cd58fcdc6eae7c2163ea92afb9d
2016-10-13 17:15:10 +01:00
Alistair Coles
b13b49a27c EC - eliminate .durable files
Instead of using a separate .durable file to indicate
the durable status of a .data file, rename the .data
to include a durable marker in the filename. This saves
one inode for every EC fragment archive.

An EC policy PUT will, as before, first rename a temp
file to:

   <timestamp>#<frag_index>.data

but now, when the object is committed, that file will be
renamed:

   <timestamp>#<frag_index>#d.data

with the '#d' suffix marking the data file as durable.
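
A quick sketch of how the new on-disk names decompose (an
illustrative parse, not the DiskFile code itself):

    fname = '1475241603.12345#3#d.data'
    stem = fname[:-len('.data')]
    parts = stem.split('#')            # ['1475241603.12345', '3', 'd']
    timestamp, frag_index = parts[0], parts[1]
    is_durable = len(parts) > 2 and parts[2] == 'd'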

Diskfile suffix hashing returns the same result when the
new durable-data filename or the legacy durable file is
found in an object directory. A fragment archive that has
been created on an upgraded object server will therefore
appear to be in the same state, as far as the consistency
engine is concerned, as the same fragment archive created
on an older object server.

Since legacy .durable files will still exist in deployed
clusters, many of the unit tests scenarios have been
duplicated for both new durable-data filenames and legacy
durable files.

Change-Id: I6f1f62d47be0b0ac7919888c77480a636f11f607
2016-10-10 18:11:02 +01:00
Tim Burke
ad16e2c77b Stop complaining about auditor_status files
Following fd86d5a, the object-auditor would leave status files so it
could resume where it left off if restarted. However, this would also
cause the object-reconstructor to print warnings like:

  Unexpected entity in data dir: u'/srv/node4/sdb8/objects/auditor_status_ZBF.json'

...which isn't actually terribly useful or actionable. The auditor will
clean it up (eventually); the operator doesn't have to do anything.

Now, the reconstructor will specifically ignore those status files.
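
The ignore rule amounts to a filename check while walking the data
dir (a sketch):

    import os

    def is_auditor_status(path):
        # e.g. 'auditor_status_ZBF.json' or 'auditor_status_ALL.json'
        name = os.path.basename(path)
        return name.startswith('auditor_status_') and name.endswith('.json')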

Change-Id: I2f3d0bd2f1e242db6eb263c7755f1363d1430048
2016-05-11 20:13:46 -07:00
Shashirekha Gundur
cf48e75c25 change default ports for servers
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202,
so they do not conflict with X-Windows or other services.

Updated SAIO docs.

DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
2016-04-29 14:47:38 -04:00
Samuel Merritt
9430f4c9f5 Move HeaderKeyDict to avoid an inline import
There was a function in swift.common.utils that was importing
swob.HeaderKeyDict at call time. It couldn't import it at compilation
time since utils can't import from swob or else it blows up with a
circular import error.

This commit just moves HeaderKeyDict into swift.common.header_key_dict
so that we can remove the inline import.

Change-Id: I656fde8cc2e125327c26c589cf1045cb81ffc7e5
2016-03-07 12:26:48 -08:00
Ondřej Nový
f53cf1043d Fixed a few misspellings in comments
Change-Id: I8479c85cb8821c48b5da197cac37c80e5c1c7f05
2016-01-05 20:20:15 +01:00
Tushar Gohad
2d85a3f699 EC: Use best available ec_type in unittests
To minimize external library dependencies for Swift unit
tests and SAIO, PyECLib 1.1.1 introduces a native backend
'liberasurecode_rs_vand.'  This patch migrates the unit tests over to
the new ec_type when it is available.

This change will work with current pyeclib requirements
(==1.0.7) and also future requirements (>=1.0.7).

When we're able to raise *our* requirements to >=1.1.1 we
should remove jerasure from the list of preferred backends.
Related SAIO doc and example config changes should be
included with that patch.

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: Idf657f0acf0479bc8158972e568a29dbc08eaf3b
2015-11-10 12:18:50 -08:00
Samuel Merritt
e31ecb24b6 Get rid of contextlib.nested() for py3
contextlib.nested() is missing completely in Python 3.

Since 2.7, we can use multiple context managers in a 'with' statement,
like so:

    with thing1() as t1, thing2() as t2:
        do_stuff()

Now, if we had some code that needed to nest an arbitrary number of
context managers, there's stuff we could do with contextlib.ExitStack
and such... but we don't. We only use contextlib.nested() in tests to
set up bunches of mocks without crazy-deep indentation, and all that
stuff fits perfectly into multiple-context-manager 'with' statements.
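
For reference, the arbitrary-nesting case would look like this with
contextlib.ExitStack (py3; a sketch of the thing we don't actually
need):

    import contextlib
    from unittest import mock

    with contextlib.ExitStack() as stack:
        mocks = [stack.enter_context(mock.patch('os.listdir'))
                 for _ in range(3)]
        # all three patches are active here and are unwound in
        # reverse order on exit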

Change-Id: Id472958b007948f05dbd4c7fb8cf3ffab58e2681
2015-10-23 11:44:54 -07:00
Alistair Coles
29c10db0cb Add POST capability to ssync for .meta files
ssync currently does the wrong thing when replicating object dirs
containing both a .data and a .meta file. The ssync sender uses a
single PUT to send both object content and metadata to the receiver,
using the metadata (.meta file) timestamp. This results in the object
content timestamp being advanced to the metadata timestamp,
potentially overwriting newer object data on the receiver and causing
an inconsistency with the container server record for the object.

For example, replicating an object dir with {t0.data(etag=x), t2.meta}
to a receiver with t1.data(etag=y) will result in the creation of
t2.data(etag=x) on the receiver. However, the container server will
continue to list the object as t1(etag=y).

This patch modifies ssync to replicate the content of .data and .meta
separately using a PUT request for the data (no change) and a POST
request for the metadata. In effect, ssync replication replicates the
client operations that generated the .data and .meta files so that
the result of replication is the same as if the original client requests
had persisted on all object servers.

Apart from maintaining correct timestamps across sync'd nodes, this has
the added benefit of not needing to PUT objects when only the metadata
has changed and a POST will suffice.

Taking the same example, ssync sender will no longer PUT t0.data but will
POST t2.meta resulting in the receiver having t1.data and t2.meta.

The changes are backwards compatible: an upgraded sender will only sync
data files to a legacy receiver and will not sync meta files (fixing the
erroneous behavior described above); a legacy sender will operate as
before when sync'ing to an upgraded receiver.

Changes:
- diskfile API provides methods to get the data file timestamp
  as distinct from the diskfile timestamp.

- diskfile yield_hashes return tuple now passes a dict mapping data and
  meta (if any) timestamps to their respective values in the timestamp
  field.

- ssync_sender will encode data and meta timestamps in the
  (hash_path, timestamp) tuple sent to the receiver during
  missing_checks.

- ssync_receiver compares sender's data and meta timestamps to any
  local diskfile and may specify that only data or meta parts are sent
  during updates phase by appending a qualifier to the hash returned
  in its 'wanted' list.

- ssync_sender now sends POST subrequests when a meta file
  exists and its content needs to be replicated.

- ssync_sender may send *only* a POST if the receiver indicates that
  is the only part required to be sync'd.

- object server will allow PUT and DELETE with earlier timestamp than
  a POST

- Fixed TODO related to replicated objects with fast-POST and ssync

Related spec change-id: I60688efc3df692d3a39557114dca8c5490f7837e

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Closes-Bug: 1501528
Change-Id: I97552d194e5cc342b0a3f4b9800de8aa6b9cb85b
2015-10-02 11:24:19 +00:00
Jenkins
f1b5a1f4c5 Merge "Reconstructor logging to omit 404 warnings" 2015-09-19 18:45:31 +00:00
Jenkins
227e1f8235 Merge "Fix purge for tombstone only REVERT job" 2015-09-19 18:42:30 +00:00
Minwoo Bae
a63f70c17d Reconstructor logging to omit 404 warnings
Currently, the replicator does not log warning messages
for 404 responses. We would like the reconstructor to
do the same, as 404s are not considered unusual, and
are already handled by the object server.

Change-Id: Ia927bf30362548832e9f451923ff94053e11b758
Closes-Bug: #1491883
2015-09-18 15:25:32 -05:00
Bill Huber
9324ce83c6 Reconstructor GET excludes user_agent in log
To make it easier for users to deduce in the log to find out
where the request originates from, it is necessary to include
the user_agent field in the reconstructor for a GET method
and to have this particular log consistent with other servers'
methods.

Change-Id: I0ca7443436e97c2db64c966ab4d73c5c12a1f059
Closes-Bug: 1491871
Co-Authored-By: Kota Tsuyuzaki <tsuyuzaki.kota@lab.ntt.co.jp>
2015-09-11 14:49:41 -05:00
Clay Gerrard
369447ec47 Fix purge for tombstone only REVERT job
When we revert a partition we normally push it off to the specific
primary node for the index of the data files in the partition.  However,
when a partition is devoid of any data files (only tombstones) we build
a REVERT job with a frag_index of None.

This change updates the ECDiskFile's purge method to be robust to
purging tombstones when the frag_index is None.
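
That is, roughly (a sketch assuming ECDiskFile's (timestamp,
frag_index) purge signature):

    def purge_reverted(df, timestamp):
        # tombstone-only revert job: there are no data files, hence no
        # frag index - purge must tolerate None and still remove the
        # .ts file
        df.purge(timestamp, frag_index=None)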

Add a probe test to validate that tombstone-only revert jobs will
clean themselves up if they can validate they're in sync with
part-replica count nodes - even if one of the primaries is down (in
which case they sync tombstones with other handoffs to fill in for
the primaries).

Change-Id: Ib9a42f412fb90d51959efce886c0f8952aba8d85
2015-09-10 11:07:04 +01:00
janonymous
9456af35a2 pep8 fix: assertEquals -> assertEqual
assertEquals is deprecated in py3; changes in dirs:

* test/unit/obj/*
* test/unit/test_locale/*

Change-Id: I3dd0c1107165ac529f1cd967363e5cf408a1d02b
2015-08-07 19:28:35 +05:30
Jenkins
e7205fd7d6 Merge "cPickle is deprecated in py3, replacing it with six.moves" 2015-07-28 12:33:24 +00:00
Charles Hsu
39b6ef6e4f Fix reconstructor stats mssage.
Calculate reconstruction job count and remaining time that
would be inappropriate for user. Use real partition count would
be suitable for user.

Change-Id: I6b025854baf4757dddf9d7fe7bc2cece58a49157
Closes-Bug: #1468298
2015-07-08 12:52:30 +08:00
Jenkins
131668f359 Merge "EC Reconstructor: Do not reconstruct existing fragments." 2015-07-07 22:24:16 +00:00
janonymous
c907107fe4 cPickle is deprecated in py3, replacing it with six.moves
cPickle is deprecated and should be replaced with six.moves
to provide py2 and py3 compatibility.
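
The replacement is mechanical:

    from six.moves import cPickle as pickle   # was: import cPickle as pickle

    data = pickle.loads(pickle.dumps({'x': 1}))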

Change-Id: Ibad990708722360d188c641e61444d50a16a1e93
2015-07-07 22:46:37 +05:30
Minwoo Bae
44b76a1b1b EC Reconstructor: Do not reconstruct existing fragments.
The EC reconstructor needs to verify that the fragment needing to
be reconstructed does not reside in the collection of node responses.
Otherwise, resources will be spent unnecessarily reconstructing
the fragment. Moreover, this could cause a segfault on some backends.

This change adds the necessary verification steps to make sure
that a fragment will only be rebuilt in the case it is missing from
the other fragment archives.
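
The verification step amounts to checking the peer responses for the
index being rebuilt before invoking the backend (a simplified sketch;
names are illustrative):

    def needs_rebuild(frag_index, responses):
        # if some node already returned our fragment index, there is
        # nothing to reconstruct (and feeding it back into the backend
        # decode could even segfault on some schemes)
        have = set(int(r.headers['X-Object-Sysmeta-Ec-Frag-Index'])
                   for r in responses)
        return frag_index not in have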

Added some tests to provide coverage for these scenarios.

Change-Id: I91f3d4af52cbc66c9f7ce00726f247b5462e66f9
Closes-Bug: #1452553
2015-06-26 16:46:58 -05:00
Darrell Bishop
df134df901 Allow 1+ object-servers-per-disk deployment
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs.  The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring.  In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket.  The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk.  The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring.  The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
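
For example (an object-server.conf sketch):

    [DEFAULT]
    # fork 3 object-server workers per unique ring port on this host
    servers_per_port = 3
    ring_check_interval = 15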

Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM.  Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.

The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True.  This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.

A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses).  The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.

This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server".  The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks.  Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None.  Then,
object-replicator and object-reconstructor both set self.bind_port to
None if server_per_port is enabled.  Bonus improvement for IPv6
addresses in is_local_device().

This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/

Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").

However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.

This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.

In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.

Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.

For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md

Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md

If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md

DocImpact

Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
2015-06-18 12:43:50 -07:00
Clay Gerrard
a3559edc23 Exclude local_dev from sync partners on failure
If the primary left or right hand partners are down, the next best thing
is to validate the rest of the primary nodes.  Where the rest should
exclude not just the left and right hand partners - but ourself as well.

This fixes a accidental noop when partner node is unavailable and
another node is missing data.

Validation:

Add probetests to cover ssync failures for the primary sync_to nodes for
sync jobs.

Drive-by:

Plumb the check_mount and check_dir constraints into the remaining
daemons.

Change-Id: I4d1c047106c242bca85c94b569d98fd59bb255f4
2015-05-26 12:50:31 -07:00
Kota Tsuyuzaki
27f6fba5c3 Use reconstruct instead of decode/encode
With PyECLib bumped up to 1.0.7 in global requirements,
we can use the "reconstruct" function directly instead
of the current hack of doing decode/encode in the reconstructor.
The hack was there to work around a PyECLib < 1.0.7
(strictly, the jerasure scheme) reconstruction bug, so we
no longer have to do decode/encode.
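
With 1.0.7 the reconstructor can call the backend directly; a sketch
against the public PyECLib interface:

    from pyeclib.ec_iface import ECDriver

    ec_driver = ECDriver(k=6, m=3, ec_type='jerasure_rs_vand')
    frags = ec_driver.encode(b'some object data' * 1024)

    # rebuild fragment #4 from any k available fragments in one call,
    # instead of decode()ing the whole object and re-encode()ing it
    available = frags[:4] + frags[5:]
    rebuilt = ec_driver.reconstruct(available, [4])[0]
    assert rebuilt == frags[4]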

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I69aae495670e3d0bdebe665f73915547a4d56f99
2015-04-20 16:48:10 -07:00
Clay Gerrard
52b102163e Don't apply the wrong Etag validation to rebuilt fragments
Because of the object-server's interaction with the ssync sender's
X-Backend-Replication-Headers, when an object (or fragment archive) is
pushed unmodified to another node its ETag value is duped into the
receiving end's metadata as Etag.  This interacts poorly with the
reconstructor's RebuildingECDiskFileStream, which cannot know ahead of
time the ETag of the fragment archive being rebuilt.

Don't send the Etag from the local source fragment archive being used as
the basis for the rebuilt fragment archive's metadata along to ssync.

Change-Id: Ie59ad93a67a7f439c9a84cd9cff31540f97f334a
2015-04-15 23:33:32 +01:00
paul luse
647b66a2ce Erasure Code Reconstructor
This patch adds the erasure code reconstructor. It follows the
design of the replicator but:
  - There is no notion of update() or update_deleted().
  - There is a single job processor
  - Jobs are processed partition by partition.
  - At the end of processing a rebalanced or handoff partition, the
    reconstructor will remove successfully reverted objects if any.

It also makes various ssync changes, such as the addition of the
reconstruct_fa() function, called from ssync_sender, which performs
the actual reconstruction while sending the object to the receiver.

Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Samuel Merritt <sam@swiftstack.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
blueprint ec-reconstructor
Change-Id: I7d15620dc66ee646b223bb9fff700796cd6bef51
2015-04-14 00:52:17 -07:00