Merge "Initial Erasure Code Docs" into feature/ec

This commit is contained in:
Jenkins 2014-07-22 17:35:06 +00:00 committed by Gerrit Code Review
commit ae82c61bfe
4 changed files with 293 additions and 1 deletions

View File

@ -86,3 +86,5 @@ Other
* `Swiftsync <https://github.com/stackforge/swiftsync>`_ - A massive syncer between two swift clusters.
* `Swiftbrowser <https://github.com/cschwede/django-swiftbrowser>`_ - Simple web app to access Openstack Swift.
* `Swift-account-stats <https://github.com/enovance/swift-account-stats>`_ - Swift-account-stats is a tool to report statistics on Swift usage at tenant and global levels.
* `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_ - Erasure Code library used by Swift
* TODO: tsg add liberasurecode reference here

View File

@ -56,6 +56,7 @@ Overview and Concepts
overview_expiring_objects
cors
crossdomain
overview_erasure_code
associated_projects
Developer Documentation

View File

@ -69,7 +69,8 @@ implementing a particular differentiation.
For example, one might have the default policy with 3x replication, and create
a second policy which, when applied to new containers only uses 2x replication.
Another might add SSDs to a set of storage nodes and create a performance tier
storage policy for certain containers to have their objects stored there.
storage policy for certain containers to have their objects stored there. Yet
another might be the use of Erasure Coding to define a cold-storage tier.
This mapping is then exposed on a per-container basis, where each container
can be assigned a specific storage policy when it is created, which remains in

View File

@ -0,0 +1,288 @@
NOTE: EC related docs are a WIP and may or may not reflect the current state
of the codebase on the feature/ec branch. They will remain a WIP until we get
closer to code complete.
====================
Erasure Code Support
====================
-------------------------------
History and Theory of Operation
-------------------------------
There's a lot of good material out there on Erasure Code (EC) theory; this short
introduction is just meant to provide some basic context to help the reader
better understand the implementation in Swift.
Erasure Coding for storage applications grew out of Coding Theory as far back as
the 1960s with the Reed-Solomon codes. These codes have been used for years in
applications ranging from CDs to DVDs to general communications and, yes, even in
the space program starting with Voyager! The basic idea is that some amount of data
is broken up into smaller pieces called fragments and coded in such a way that it
can be transmitted with the ability to tolerate the loss of some number of the
coded fragments. That's where the word "erasure" comes in: if you transmit 14
fragments and only 13 are received, then one of them is said to be "erased".
The word "erasure" provides an important distinction with EC; it isn't about
detecting errors, it's about dealing with failures. Another important element of
EC is that the number of erasures that can be tolerated can be adjusted to meet
the needs of the application.
At a high level, EC works by using a specific scheme to break a single data buffer
into several smaller data buffers and then, depending on the scheme, performing some
encoding operation on that data in order to generate additional information. You end up
with more data than you started with, and that extra data is often called "parity". Note
that there are many, many different encoding techniques that vary both in how they
organize and manipulate the data and in the means they use to calculate parity. For
example, one scheme might rely on `Galois Field Arithmetic <http://www.ssrc.ucsc.edu/Papers/plank-fast13.pdf>`_ while others may work with only XOR.
The number of variations and the details of their differences are well beyond the scope
of this introduction, but we'll talk more about a few of them when we get into the
implementation of EC in Swift.
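
As a purely illustrative sketch of the idea (this is not the scheme Swift or PyECLib
actually uses), a single XOR parity fragment already shows how one extra fragment lets
you survive the erasure of any one fragment::

    # Three data fragments (k = 3) and one XOR parity fragment (m = 1).
    data_fragments = [b'abcd', b'efgh', b'ijkl']
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data_fragments))

    # Pretend fragment 1 was "erased" in transit; rebuild it from the survivors.
    rebuilt = bytes(a ^ b ^ c for a, b, c in
                    zip(data_fragments[0], data_fragments[2], parity))
    assert rebuilt == data_fragments[1]
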
--------------------------------
Overview of EC Support in Swift
--------------------------------
First and foremost, from an application perspective EC support is totally transparent.
There is no EC-specific external API; a container is simply created using a Storage Policy
defined to use EC and then interaction with the cluster is the same as any other durability
policy.
EC is implemented in Swift as a Storage Policy, see :doc:`overview_policies` for complete
details on Storage Policies. Because support is implemented as a Storage Policy, all of
the storage devices associated with your cluster's EC capability can be isolated. It is
entirely possible to share devices between storage policies, but for EC it may make more
sense to not only use separate devices but possibly even entire nodes dedicated for EC.
Which direction one chooses depends on why the EC policy is being deployed. If, for
example, there is a production replication policy in place already and the goal is to add
a cold storage tier such that the existing nodes performing replication are impacted as
little as possible, adding a new set of nodes dedicated to EC might make the most sense
but also incurs the most cost. On the other hand, if EC is being added as a capability
to provide additional durability for a specific set of applications and the existing
infrastructure is well suited for EC (sufficient number of nodes, zones for the EC scheme
that is chosen) then leveraging the existing infrastructure such that the EC ring shares
nodes with the replication ring makes the most sense. These are some of the main
considerations:
* Layout of existing infrastructure
* Cost of adding dedicated EC nodes (or just dedicated EC devices)
* Intended usage model(s)
EC support touches many of the code paths and background operations for data stored in a
container that was created with an EC policy; however, as mentioned above, this is all
transparent to users of the cluster. Note that it is possible to have multiple EC containers,
each using a different EC policy with different EC parameters, just as it is possible to
have a cluster with multiple replication policies that use different replication levels.
The Swift code base doesn't include any of the algorithms necessary to perform the actual
encoding and decoding of data; that is left to an external library. The Storage Policies
architecture is leveraged to allow EC on a per container basis and the object rings still
provide for the placement of EC data fragments. Although several code paths are unique to
operations associated with an EC policy, Swift relies on an external Erasure Code library
to perform the low-level EC functions. Using an external library allows for maximum
flexibility, as there are a significant number of options out there, each with its own pros
and cons that can vary greatly from one use case to another.
---------------------------------------
PyECLib: External Erasure Code Library
---------------------------------------
PyECLib is a Python Erasure Coding Library originally designed and written as part of the
effort to add EC support to the Swift project; however, it is an independent project. The
library provides a well-defined and simple Python interface and internally implements a
plug-in architecture allowing it to take advantage of many well-known C libraries such as:
* Jerasure at https://bitbucket.org/jimplank/jerasure
* GFComplete at https://bitbucket.org/jimplank/gf-complete
* Intel(R) ISA-L at https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version
* Or write your own!
PyECLib itself therefore allows for not only choice, but further extensibility as well. PyECLib also
comes with a handy utility to help determine the best algorithm to use based on the equipment that
will be used (processors and server configurations may vary in performance per algorithm). More on
this will be covered in the configuration section. PyECLib is included as a Swift requirement.
For complete details see `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_
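
As a minimal sketch of that Python interface (assuming the 'rs_vand' backend used in the
configuration example later in this document is available; the data and parameters here
are only examples)::

    from pyeclib.ec_iface import ECDriver

    # 10 data fragments + 4 parity fragments, Vandermonde Reed-Solomon.
    ec_driver = ECDriver(k=10, m=4, ec_type='rs_vand')

    original = b'some object data' * 1024
    fragments = ec_driver.encode(original)    # returns k + m = 14 fragments

    # Any 4 of the 14 fragments can be "erased" and the data still decodes.
    decoded = ec_driver.decode(fragments[4:])
    assert decoded == original
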
------------------------------
Storing and Retrieving Objects
------------------------------
We will discuss the details of how PUT and GET work in the "Under the Hood" section later on.
The key point here is that all of the erasure code work goes on behind the scenes; this
summary is a high-level overview only.
The PUT flow looks like this:
#. The proxy server streams in an object and buffers up "a segment" of data (size is configurable)
#. The proxy server calls on PyECLib to encode the data into fragments
#. The proxy streams the encoded fragments out to the storage nodes based on ring locations
#. Repeat until the client is done sending data
#. The client is notified of completion when a quorum is met (configurable).
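
In very rough terms, and with hypothetical stand-ins (``wsgi_input``, ``storage_connections``)
for Swift's real internals, the proxy-side PUT loop might be sketched like this::

    from pyeclib.ec_iface import ECDriver

    def ec_put(wsgi_input, storage_connections, segment_size=1048576):
        """Sketch only: buffer a segment, encode it, fan the fragments out."""
        ec_driver = ECDriver(k=10, m=4, ec_type='rs_vand')
        while True:
            segment = wsgi_input.read(segment_size)   # buffer up "a segment"
            if not segment:
                break
            fragments = ec_driver.encode(segment)     # k + m fragments
            for conn, fragment in zip(storage_connections, fragments):
                conn.send(fragment)                   # fragment i -> node i
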
The GET flow looks like this:
#. The proxy server makes simultaneous requests to some number of participating nodes
#. As soon as the proxy has the fragments it needs, it calls on PyECLib to decode the data
#. The proxy streams the decoded data it has back to the client
#. Repeat until the proxy is done sending data back to the client
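
Again as a rough sketch, with hypothetical stand-ins (``fragment_responses``, ``client``)
rather than Swift's real internals::

    from pyeclib.ec_iface import ECDriver

    def ec_get(fragment_responses, client):
        """Sketch only: gather fragments per segment, decode, stream back."""
        ec_driver = ECDriver(k=10, m=4, ec_type='rs_vand')
        while True:
            # One fragment per participating node for the current segment;
            # only k of them are strictly required to decode it.
            fragments = [resp.read_fragment() for resp in fragment_responses]
            fragments = [f for f in fragments if f]
            if not fragments:
                break
            client.write(ec_driver.decode(fragments))
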
.. note::
There are multiple options for how many and which nodes are used to retrieve fragments
on a GET request with varying performance tradeoffs. TODO: once we characterize this
we need to include what the exact options are and explain some of the details on
the tradeoffs.
From this high level overview, it may sound like using EC is going to cause an explosion
in the number of actual files stored in each node's local file system. Although it is true
that more files will be stored (because an object is broken into pieces), the implementation
works to minimize this where possible; more details are available in the Under the Hood
section.
-------------
Handoff Nodes
-------------
TODO
--------------
Reconstruction
--------------
For an EC policy, reconstruction is analogous to the process of replication for a replication
type policy -- essentially "the reconstructor" replaces "the replicator" for EC policy types.
The basic framework of reconstruction is very similar to that of replication with a
few notable exceptions:
* Because EC does not actually replicate partitions, it needs to operate at a finer granularity than what is provided with rsync, therefore EC leverages much of ssync.
* Once a pair of nodes has determined the need to replace a missing object fragment, instead of pushing over a copy like replication would do, the reconstructor has to read in enough surviving fragments from other nodes and perform a local reconstruction before it has the correct data to push to the other node.
.. note::
EC work (encode and decode) takes place both on the proxy nodes, for GET operations, and
on the storage nodes, for reconstruction. As with replication, reconstruction can
be the result of rebalancing, bit-rot, drive failure or reverting data from a hand-off
node back to its primary.
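
The low-level rebuild step leans on PyECLib's reconstruct call; a minimal sketch, assuming
fragment index 3 of a 10+4 policy has been lost::

    from pyeclib.ec_iface import ECDriver

    ec_driver = ECDriver(k=10, m=4, ec_type='rs_vand')
    fragments = ec_driver.encode(b'an object segment')

    # Pretend fragment 3 was lost to a failed drive: rebuild it locally from
    # the surviving fragments, then push it to the node that should hold it.
    surviving = fragments[:3] + fragments[4:]
    rebuilt = ec_driver.reconstruct(surviving, [3])
    assert rebuilt[0] == fragments[3]
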
--------------------------
Performance Considerations
--------------------------
Big TODO here.
----------------------------
Using an Erasure Code Policy
----------------------------
To use an EC policy, the administrator simply needs to define an EC policy in `swift.conf`
and create/configure the associated object ring. An example of how an EC policy can be
setup is shown below::
[storage-policy:2]
name = deepfreeze10-4
type = erasure_coding
ec_type = rs_vand
ec_num_data_fragments = 10
ec_num_parity_fragments = 4
Let's take a closer look at each configuration parameter:
* name: this is a standard storage policy parameter. See :doc:`overview_policies` for details.
* type: set this to 'erasure_coding' to indicate that this is an EC policy
* ec_type: set this value according to the available options in the selected PyECLib back-end. This specifies the EC scheme that is to be used. For example the option shown here selects Vandermonde Reed-Solomon encoding while an option of 'flat_xor_3' would select Flat-XOR based HD combination codes. See the `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_ page for full details.
* ec_num_data_fragments: the number of fragments that will be made up of actual object data
* ec_num_parity_fragments: the number of fragments that will be made up of parity data
When PyECLib encodes an object, it will break it into N fragments; what matters during
configuration is how many of those are data and how many are parity. In the example above,
PyECLib will actually break an object into 14 different fragments: 10 of them will be made
up of actual object data and 4 of them will be made up of parity data (calculated according
to the selected ec_type).
When deciding which devices to use in the EC policy's object ring, be sure to carefully
consider the performance items mentioned earlier. Once you've made your changes to
`swift.conf` to configure your EC policy, and created your object ring, your application
is ready to start using EC simply by creating a container that uses the new policy and
interacting with it as usual.
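
For example, a client using python-swiftclient might create such a container like this
(the endpoint, credentials and container name below are only placeholders)::

    from swiftclient import client

    url, token = client.get_auth('http://saio:8080/auth/v1.0',
                                 'test:tester', 'testing')

    # 'deepfreeze10-4' is the policy name from the swift.conf example above.
    client.put_container(url, token, 'cold-assets',
                         headers={'X-Storage-Policy': 'deepfreeze10-4'})
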
Migrating Between Policies
--------------------------
A common usage of EC is to migrate less frequently accessed data out of a more expensive
but lower-latency policy such as replication. When an application determines that it wants
to move data from a replication policy to an EC policy, it simply needs to move the data
from its current container to a different container that was created with the target
durability policy.
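
A sketch of that client-side move with python-swiftclient (object and container names are
only placeholders; large objects would need to be streamed or segmented rather than read
fully into memory)::

    from swiftclient import client

    url, token = client.get_auth('http://saio:8080/auth/v1.0',
                                 'test:tester', 'testing')

    # Read the object out of its current, replication-backed container...
    headers, body = client.get_object(url, token, 'hot-assets', 'archive.tar')

    # ...and write it into the container created with the EC policy; the
    # erasure coding work is completely transparent on both requests.
    client.put_object(url, token, 'cold-assets', 'archive.tar', contents=body)
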
--------------
Under the Hood
--------------
Now that we've explained a little about EC support in Swift and how to configure/use it,
let's explore how EC fits in at the nuts-n-bolts level.
Terminology
-----------
The term 'fragment' has already been used to describe the output of the EC process (a series
of fragments); however, we need to define some other key terms here before going any deeper.
Without paying special attention to using the correct terms consistently, it is very easy to
get confused in a hurry!
* segment: not to be confused with the SLO/DLO use of the word; in EC, a segment is a series of consecutive HTTP chunks buffered up before performing an EC operation.
* fragment: data and parity 'fragments' are generated when the erasure coding transformation is applied to a segment.
* EC archive: a concatenation of EC fragments; to a storage node this looks like an object.
* ec_k: the number of EC data fragments (k is commonly used in the EC community for this purpose).
* ec_m: the number of EC parity fragments (m is commonly used in the EC community for this purpose).
* chunk: an HTTP chunk received over the wire (the term is not used to describe any EC-specific operation).
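
To tie those terms together, a conceptual sketch (the numbers and data are arbitrary)::

    from pyeclib.ec_iface import ECDriver

    ec_driver = ECDriver(k=10, m=4, ec_type='rs_vand')   # ec_k=10, ec_m=4

    # Pretend the proxy buffered incoming HTTP chunks into two segments.
    segments = [b'first segment of data', b'second segment of data']

    # Each segment encodes into ec_k + ec_m = 14 fragments; fragment i of
    # every segment is appended to EC archive i, which is what storage
    # node i eventually stores as "the object".
    ec_archives = [b''] * 14
    for segment in segments:
        for i, fragment in enumerate(ec_driver.encode(segment)):
            ec_archives[i] += fragment
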
Middleware
----------
TODO: some middleware, like SLO/DLO are OK. Others, like list endpoints are TBD
On Disk Storage
---------------
EC archives are stored on disk in their respective objects-N directory based on their policy
index. See :doc:`overview_policies` for details on the per-policy directory layout. There are
no special on-disk storage impacts for EC policies.
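
For example, assuming the EC policy has index 2 as in the configuration example above, an
EC archive would land under a path of the form::

    /srv/node/<device>/objects-2/<partition>/<suffix>/<hash>/<timestamp>.data
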
Proxy Server
------------
TODO
Object Server
-------------
TODO
Metadata
--------
TODO
Database Updates
----------------
TODO
The Reconstructor
-----------------
TODO
The Auditor
-----------
Because the auditor already operates on a per storage policy basis, there are no specific
auditor changes associated with EC. Each EC archive looks like, and is treated like, a
regular object from the perspective of the auditor. Therefore, if the auditor finds bit-rot
in an EC archive, it simply quarantines it and the EC reconstructor will take care of the rest
just as the replicator does for replication policies.
PyECLib
-------
TODO