Add global ec cluster improvement
Add new spec "Improve Erasure Coding Efficiency for Global Cluster". Change-Id: I9d743ae45ab983d0d151c7a6790afb5f082f0d39
This commit is contained in:
parent
565c5a9fc5
commit
8545ec7d7e
179
specs/in_progress/global_ec_cluster.rst
Normal file
179
specs/in_progress/global_ec_cluster.rst
Normal file
@ -0,0 +1,179 @@
|
|||||||
|
::
|
||||||
|
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0
|
||||||
|
Unported License.
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||||
|
|
||||||
|
=====================================================
|
||||||
|
Improve Erasure Coding Efficiency for Global Cluster
|
||||||
|
=====================================================
|
||||||
|
|
||||||
|
This SPEC describes an improvement of efficiency for Global Cluster with
|
||||||
|
Erasure Coding. It proposes a way to improve the PUT/GET performance
|
||||||
|
in the case of Erasure Coding with more than 1 regions ensuring original
|
||||||
|
data even if a region is lost.
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
Swift now supports Erasure Codes (EC) which ensures higher durability and lower
|
||||||
|
disk cost than the replicated case for a one region cluster. However, currently
|
||||||
|
if Swift were running EC over 2 regions, using < 2x data redundancy
|
||||||
|
(e.g. ec_k=10, ec_m=4) and then one of the regions is gone due to some unfortunate
|
||||||
|
reasons (e.g. huge earthquake, fire, tsunami), there is a chance data would be lost.
|
||||||
|
That is because, assuming each region has an even available volume of disks, each
|
||||||
|
region should have around 7 fragments, less than ec_k, which is not enough data
|
||||||
|
for the EC scheme to rebuild the original data.
|
||||||
|
|
||||||
|
To protect stored data and to ensure higher durability, Swift has to keep >= 1
|
||||||
|
data size for each region (i.e. >= 2x for 2 regions) by employing larger ec_m
|
||||||
|
like ec_m=14 for ec_k=10. However, this increase sacrifices encode performance.
|
||||||
|
In my measurements running PyECLib encode/decode on an Intel Xeon E5-2630v3 [1], the
|
||||||
|
benchmark result was as follows:
|
||||||
|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
|scheme |ec_k|ec_m|encode |decode |
|
||||||
|
+================+====+====+=========+=========+
|
||||||
|
|jerasure_rs_vand|10 |4 |7.6Gbps |12.21Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
| |10 |14 |2.67Gbps |12.27Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
| |20 |4 |7.6Gbps |12.87Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
| |20 |24 |1.6Gbps |12.37Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
|isa_lrs_vand |10 |4 |14.27Gbps|18.4Gbps |
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
| |10 |14 |6.53Gbps |18.46Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
| |20 |4 |15.33Gbps|18.12Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
| |20 |24 |4.8Gbps |18.66Gbps|
|
||||||
|
+----------------+----+----+---------+---------+
|
||||||
|
|
||||||
|
Note that "decode" uses (ec_k + ec_m) - 2 fragments so performance will
|
||||||
|
decrease less than when encoding as is shown in the results above.
|
||||||
|
|
||||||
|
In the results above, comparing ec_k=10, ec_m=4 vs ec_k=10, ec_m=14, the encode
|
||||||
|
performance falls down about 1/3 and other encodings follow a similar trend.
|
||||||
|
This demonstrates that there is a problem when building a 2+ region EC cluster.
|
||||||
|
|
||||||
|
1: http://ark.intel.com/ja/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
Add an option like "duplication_factor". Which will create duplicated (copied)
|
||||||
|
fragments instead of employing a larger ec_m.
|
||||||
|
|
||||||
|
For example, with a duplication_factor=2, Swift will encode ec_k=10, ec_m=4 and
|
||||||
|
store 28 fragments (10x2 data fragments and 4x2 parity fragments) in Swift.
|
||||||
|
|
||||||
|
This requires a change to PUT/GET and the reconstruct sequence to map from the
|
||||||
|
fragment index in Swift to actual fragment index for PyECLib but IMHO we don't
|
||||||
|
need to make an effort to build much modification for the conversation among
|
||||||
|
proxy-server <-> object-server <-> disks.
|
||||||
|
|
||||||
|
I don't want describe the implementation in detail in the first patch of the spec
|
||||||
|
because it should be an idea to improve Swift. More discussion on the implementation
|
||||||
|
side will following in subsequent patches.
|
||||||
|
|
||||||
|
Considerations of acutal placement
|
||||||
|
----------------------------------
|
||||||
|
Placement of these doubled fragments are important. If the same fragments,
|
||||||
|
original and copied, appear in the same region and the second region fails,
|
||||||
|
then we would be in the same situation where we couldn't rebuild the original
|
||||||
|
object as we were in the smaller parity fragments case.`
|
||||||
|
|
||||||
|
e.g:
|
||||||
|
|
||||||
|
- duplication_factor=2, k=4, m=2
|
||||||
|
- 1st Region: [0, 1, 2, 6, 7, 8]
|
||||||
|
- 2nd Region: [3, 4, 5, 9, 10, 11]
|
||||||
|
- (Assuming actual indices to rebuild mapped as index // (k+m))
|
||||||
|
|
||||||
|
In this case, 1st region has only fragments consisting of fragment index 0, 1, 2
|
||||||
|
and 2nd has only 3, 4, 5. Therefore, it is not able to rebuild the original object
|
||||||
|
from the fragments in only one region because the fragment uniqueness in the
|
||||||
|
region is less than k. The worst case scenario, like this, will cause significant data
|
||||||
|
loss as would happen with no duplication factor.
|
||||||
|
|
||||||
|
i.e. In fact, data durability will be
|
||||||
|
|
||||||
|
- "no duplication" < "with duplication" < "more unique parities"
|
||||||
|
|
||||||
|
In future work, we can find a way to tie a fragment index to a region,
|
||||||
|
something like "1st subset should be in 1st Region and 2nd subset
|
||||||
|
should be ..." but so far this is beyond this spec.
|
||||||
|
|
||||||
|
Alternatives
|
||||||
|
------------
|
||||||
|
|
||||||
|
We can find a way to use container-sync as a solution to the problem rather
|
||||||
|
then employing my proposed change.
|
||||||
|
This section will describe the pros/cons for my "proposed change" and "container-sync".
|
||||||
|
|
||||||
|
Proposed Change
|
||||||
|
^^^^^^^^^^^^^^^
|
||||||
|
Pros:
|
||||||
|
|
||||||
|
- Higher performance way to spread objects across regions (No need to re-decode/encode for transferring across regions)
|
||||||
|
- No extra configuration other than storage policy is needed for users to turn on the global replication. (strictly global erasure coding?)
|
||||||
|
- Able to use other global cluster efficiency improvements (affinity control)
|
||||||
|
|
||||||
|
Cons:
|
||||||
|
|
||||||
|
- Need to employ more complex handling around ECObjecController
|
||||||
|
|
||||||
|
Container-Sync
|
||||||
|
^^^^^^^^^^^^^^
|
||||||
|
Pros:
|
||||||
|
|
||||||
|
- Simple and able to reuse existing swift mechanisms
|
||||||
|
- Less data transfer between regions
|
||||||
|
|
||||||
|
Cons:
|
||||||
|
|
||||||
|
- Re-decode/encode is required when transferring objects to another region
|
||||||
|
- Need to set the sync option for each container
|
||||||
|
- Impossible to retrieve/reconstruct an object when > ec_m disks unavailable (includes ip unreachable)
|
||||||
|
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
- Proxy-Server PUT/GET path
|
||||||
|
- Object-Reconstructor
|
||||||
|
- (Optional) Ring placement strategy
|
||||||
|
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
Primary assignee:
|
||||||
|
kota\_ (Kota Tsuyuzaki)
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
Develop codes around proxy-server and object-reconstructor
|
||||||
|
|
||||||
|
Repositories
|
||||||
|
------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Servers
|
||||||
|
-------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
DNS Entries
|
||||||
|
-----------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
None
|
Loading…
Reference in New Issue
Block a user