====================
Erasure Code Support
====================

-------------------------------
History and Theory of Operation
-------------------------------

There's a lot of good material out there on Erasure Code (EC) theory; this
short introduction is just meant to provide some basic context to help the
reader better understand the implementation in Swift.

Erasure Coding for storage applications grew out of Coding Theory as far back
as the 1960s with the Reed-Solomon codes.  These codes have been used for years
in applications ranging from CDs to DVDs to general communications and, yes,
even in the space program starting with Voyager! The basic idea is that some
amount of data is broken up into smaller pieces called fragments and coded in
such a way that it can be transmitted with the ability to tolerate the loss of
some number of the coded fragments.  That's where the word "erasure" comes in:
if you transmit 14 fragments and only 13 are received then one of them is said
to be "erased".  The word "erasure" provides an important distinction with EC;
it isn't about detecting errors, it's about dealing with failures.  Another
important element of EC is that the number of erasures that can be tolerated
can be adjusted to meet the needs of the application.

At a high level EC works by using a specific scheme to break up a single data
buffer into several smaller data buffers then, depending on the scheme,
performing some encoding operation on that data in order to generate additional
information.  So you end up with more data than you started with and that extra
data is often called "parity".  Note that there are many, many different
encoding techniques that vary both in how they organize and manipulate the data
as well as by what means they use to calculate parity.  For example, one scheme
might rely on `Galois Field Arithmetic
<http://www.ssrc.ucsc.edu/Papers/plank-fast13.pdf>`_ while others may work with
only XOR. The number of variations and details about their differences are well
beyond the scope of this introduction, but we will talk more about a few of
them when we get into the implementation of EC in Swift.

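As a toy illustration of the parity idea (a far simpler scheme than the codes
named above, and not what Swift actually uses), a single XOR parity fragment
lets a set of data fragments survive the erasure of any one of them::

    # Toy example only: 3 data fragments plus 1 XOR parity fragment can
    # tolerate the erasure of any single fragment.
    data = [b'\x01\x02', b'\x03\x04', b'\x05\x06']
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data))

    # Suppose the second data fragment is "erased"; XOR the survivors with
    # the parity fragment to recover it.
    recovered = bytes(a ^ c ^ p for a, c, p in zip(data[0], data[2], parity))
    assert recovered == data[1]
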
--------------------------------
Overview of EC Support in Swift
--------------------------------

First and foremost, from an application perspective EC support is totally
transparent. There is no EC-related external API; a container is simply created
using a Storage Policy defined to use EC, and then interaction with the cluster
is the same as with any other durability policy.

EC is implemented in Swift as a Storage Policy; see :doc:`overview_policies` for
complete details on Storage Policies.  Because support is implemented as a
Storage Policy, all of the storage devices associated with your cluster's EC
capability can be isolated.  It is entirely possible to share devices between
storage policies, but for EC it may make more sense to not only use separate
devices but possibly even entire nodes dedicated for EC.

Which direction one chooses depends on why the EC policy is being deployed.  If,
for example, there is a production replication policy in place already and the
goal is to add a cold storage tier such that the existing nodes performing
replication are impacted as little as possible, adding a new set of nodes
dedicated to EC might make the most sense but also incurs the most cost.  On the
other hand, if EC is being added as a capability to provide additional
durability for a specific set of applications and the existing infrastructure is
well suited for EC (sufficient number of nodes, zones for the EC scheme that is
chosen) then leveraging the existing infrastructure such that the EC ring shares
nodes with the replication ring makes the most sense.  These are some of the
main considerations:

* Layout of existing infrastructure.
* Cost of adding dedicated EC nodes (or just dedicated EC devices).
* Intended usage model(s).

The Swift code base does not include any of the algorithms necessary to perform
the actual encoding and decoding of data; that is left to external libraries.
The Storage Policies architecture is leveraged to enable EC on a per container
basis -- the object rings are still used to determine the placement of EC data
fragments. Although there are several code paths that are unique to an operation
associated with an EC policy, an external dependency on an Erasure Code library
is what Swift relies on to perform the low level EC functions.  The use of an
external library allows for maximum flexibility as there are a significant
number of options out there, each with its own pros and cons that can vary
greatly from one use case to another.

---------------------------------------
PyECLib:  External Erasure Code Library
---------------------------------------

PyECLib is a Python Erasure Coding Library originally designed and written as
part of the effort to add EC support to the Swift project; however, it is an
independent project.  The library provides a well-defined and simple Python
interface and internally implements a plug-in architecture allowing it to take
advantage of many well-known C libraries such as:

* Jerasure and GFComplete at http://jerasure.org.
* Intel(R) ISA-L at http://01.org/intel%C2%AE-storage-acceleration-library-open-source-version.
* Or write your own!

PyECLib uses a C-based library called liberasurecode to implement the plug-in
infrastructure; liberasurecode is available at:

* liberasurecode: https://bitbucket.org/tsg-/liberasurecode

PyECLib itself therefore allows for not only choice but further extensibility as
well. PyECLib also comes with a handy utility to help determine the best
algorithm to use based on the equipment that will be used (processors and server
configurations may vary in performance per algorithm).  More on this will be
covered in the configuration section.  PyECLib is included as a Swift
requirement.

For complete details see `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_.

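To give a feel for the interface, here is a minimal sketch of how PyECLib is
typically driven. The parameters mirror the example policy used later in this
document; treat it as illustrative rather than as Swift's internal usage::

    # Illustrative sketch of the PyECLib interface; not Swift's internal code.
    from pyeclib.ec_iface import ECDriver

    driver = ECDriver(k=10, m=4, ec_type='jerasure_rs_vand')

    fragments = driver.encode(b'some object data' * 1000)
    assert len(fragments) == 14          # k data + m parity fragments

    # Any 10 (k) of the 14 fragments are sufficient to recover the data.
    original = driver.decode(fragments[2:12])
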
------------------------------
Storing and Retrieving Objects
------------------------------

We will discuss the details of how PUT and GET work in the "Under the Hood"
section later on. The key point here is that all of the erasure code work goes
on behind the scenes; this summary is a high level information overview only.

The PUT flow looks like this (a rough sketch of the encode step follows the
list):

#. The proxy server streams in an object and buffers up "a segment" of data
   (size is configurable).
#. The proxy server calls on PyECLib to encode the data into smaller fragments.
#. The proxy streams the encoded fragments out to the storage nodes based on
   ring locations.
#. Repeat until the client is done sending data.
#. The client is notified of completion when a quorum is met.

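A compressed sketch of steps 1-3, reusing the hypothetical ``driver`` from the
PyECLib example above (``read_chunk`` and ``send_fragment`` are stand-ins for
the proxy's real streaming machinery)::

    segment_size = 1048576                   # ec_object_segment_size
    buf = b''
    while True:
        chunk = read_chunk()                 # assumed: next HTTP chunk, b'' at end
        buf += chunk
        if buf and (len(buf) >= segment_size or not chunk):
            for index, frag in enumerate(driver.encode(buf)):
                send_fragment(index, frag)   # assumed: stream to node `index`
            buf = b''
        if not chunk:
            break
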
The GET flow looks like this:

#. The proxy server makes simultaneous requests to participating nodes.
#. As soon as the proxy has the fragments it needs, it calls on PyECLib to
   decode the data.
#. The proxy streams the decoded data it has back to the client.
#. Repeat until the proxy is done sending data back to the client.

It may sound from this high level overview like using EC is going to cause an
explosion in the number of actual files stored in each node's local file
system.  Although it is true that more files will be stored (because an object
is broken into pieces), the implementation works to minimize this where
possible; more details are available in the Under the Hood section.

-------------
Handoff Nodes
-------------

In EC policies, similarly to replication, handoff nodes are a set of storage
nodes used to augment the list of primary nodes responsible for storing an
erasure coded object. These handoff nodes are used in the event that one or more
of the primaries are unavailable.  Handoff nodes are still selected with an
attempt to achieve maximum separation of the data being placed.

--------------
Reconstruction
--------------

For an EC policy, reconstruction is analogous to the process of replication for
a replication type policy -- essentially "the reconstructor" replaces "the
replicator" for EC policy types. The basic framework of reconstruction is very
similar to that of replication with a few notable exceptions:

* Because EC does not actually replicate partitions, it needs to operate at a
  finer granularity than what is provided with rsync; therefore EC leverages
  much of ssync behind the scenes (you do not need to manually configure ssync).
* Once a pair of nodes has determined the need to replace a missing object
  fragment, instead of pushing over a copy like replication would do, the
  reconstructor has to read in enough surviving fragments from other nodes and
  perform a local reconstruction before it has the correct data to push to the
  other node.
* A reconstructor does not talk to all other reconstructors in the set of nodes
  responsible for an EC partition, as this would be far too chatty; instead,
  each reconstructor is responsible for syncing with the partition's two
  closest neighbors (closest meaning left and right on the ring).

.. note::

    EC work (encode and decode) takes place both on the proxy nodes, for PUT/GET
    operations, as well as on the storage nodes for reconstruction.  As with
    replication, reconstruction can be the result of rebalancing, bit-rot, drive
    failure or reverting data from a hand-off node back to its primary.

--------------------------
Performance Considerations
--------------------------

Efforts are underway to characterize the performance of various Erasure Code
schemes.  One of the main goals of the beta release is to perform this
characterization, to encourage others to do the same, and to gather meaningful
feedback for the development community.  There are many factors that will
affect the performance of EC, so it is vital that we have multiple
characterization activities happening.

In general, EC has different performance characteristics than replicated data.
EC requires substantially more CPU to read and write data, and is more suited
for larger objects that are not frequently accessed (e.g. backups).

----------------------------
Using an Erasure Code Policy
----------------------------

To use an EC policy, the administrator simply needs to define an EC policy in
`swift.conf` and create/configure the associated object ring.  An example of
how an EC policy can be set up is shown below::

        [storage-policy:2]
        name = ec104
        policy_type = erasure_coding
        ec_type = jerasure_rs_vand
        ec_num_data_fragments = 10
        ec_num_parity_fragments = 4
        ec_object_segment_size = 1048576

Let's take a closer look at each configuration parameter:

* ``name``: This is a standard storage policy parameter.
  See :doc:`overview_policies` for details.
* ``policy_type``: Set this to ``erasure_coding`` to indicate that this is an EC
  policy.
* ``ec_type``: Set this value according to the available options in the selected
  PyECLib back-end. This specifies the EC scheme that is to be used.  For
  example the option shown here selects Vandermonde Reed-Solomon encoding while
  an option of ``flat_xor_hd_3`` would select Flat-XOR based HD combination
  codes. See the `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_ page for
  full details.
* ``ec_num_data_fragments``: The number of fragments that will be made up of
  actual object data.
* ``ec_num_parity_fragments``: The number of fragments that will be made up of
  parity data.
* ``ec_object_segment_size``: The amount of data that will be buffered up before
  feeding a segment into the encoder/decoder. The default value is 1048576.

When PyECLib encodes an object, it will break it into N fragments. However, what
is important during configuration is how many of those are data and how many
are parity.  So in the example above, PyECLib will actually break an object into
14 different fragments: 10 of them will be made up of actual object data and 4
of them will be made up of parity data (calculations depending on ec_type).

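For capacity planning, a quick back-of-the-envelope calculation of the raw
storage overhead of the example policy (illustrative only; it ignores metadata
and filesystem overhead)::

    ec_ndata, ec_nparity = 10, 4
    overhead = (ec_ndata + ec_nparity) / float(ec_ndata)
    print(overhead)   # 1.4, i.e. ~1.4 bytes stored per byte of object data,
                      # versus 3.0 for a default 3-replica policy
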
When deciding which devices to use in the EC policy's object ring, be sure to
carefully consider the performance impacts.  Running some performance
benchmarking in a test environment for your configuration is highly recommended
before deployment.

To create the EC policy's object ring, the only difference in the usage of the
``swift-ring-builder create`` command is the ``replicas`` parameter.  The
``replicas`` value is the number of fragments spread across the object servers
associated with the ring; ``replicas`` must be equal to the sum of
``ec_num_data_fragments`` and ``ec_num_parity_fragments``. For example::

  swift-ring-builder object-1.builder create 10 14 1

Note that in this example the ``replicas`` value of 14 is based on the sum of
10 EC data fragments and 4 EC parity fragments.

Once you have configured your EC policy in `swift.conf` and created your object
ring, your application is ready to start using EC simply by creating a container
with the specified policy name and interacting as usual.

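For example, a container can be pointed at the policy defined above by setting
the ``X-Storage-Policy`` header when the container is created.  A minimal
sketch using python-swiftclient (the auth endpoint and credentials below are
placeholders)::

    from swiftclient import client

    # Placeholder auth endpoint and credentials.
    url, token = client.get_auth('http://proxy.example.com/auth/v1.0',
                                 'account:user', 'secret')

    # Create the container with the EC policy; everything after this point
    # works the same as with any other policy.
    client.put_container(url, token, 'cold-data',
                         headers={'X-Storage-Policy': 'ec104'})
    client.put_object(url, token, 'cold-data', 'backup.tar.gz',
                      contents=b'...object data...')
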
.. note::

    It's important to note that once you have deployed a policy and have created
    objects with that policy, these configuration options cannot be changed. In
    case a change in the configuration is desired, you must create a new policy
    and migrate the data to a new container.

Migrating Between Policies
--------------------------

A common usage of EC is to migrate less commonly accessed data from a more
expensive but lower latency policy such as replication.  When an application
determines that it wants to move data from a replication policy to an EC policy,
it simply needs to move the data from the replicated container to an EC
container that was created with the target durability policy.

Region Support
--------------

For at least the initial version of EC, it is not recommended that an EC scheme
span beyond a single region; neither performance nor functional validation has
been done in such a configuration.

--------------
Under the Hood
--------------

Now that we've explained a little about EC support in Swift and how to
configure/use it, let's explore how EC fits in at the nuts-n-bolts level.

Terminology
-----------

The term 'fragment' has been used already to describe the output of the EC
process (a series of fragments); however, we need to define some other key
terms here before going any deeper.  Without paying special attention to using
the correct terms consistently, it is very easy to get confused in a hurry!

* **chunk**: HTTP chunks received over wire (term not used to describe any EC
  specific operation).
* **segment**: Not to be confused with SLO/DLO use of the word, in EC we call a
  segment a series of consecutive HTTP chunks buffered up before performing an
  EC operation.
* **fragment**: Data and parity 'fragments' are generated when erasure coding
  transformation is applied to a segment.
* **EC archive**: A concatenation of EC fragments; to a storage node this looks
  like an object.
* **ec_ndata**: Number of EC data fragments.
* **ec_nparity**: Number of EC parity fragments.

Middleware
----------

Middleware remains unchanged.  For most middleware (e.g., SLO/DLO) the fact that
the proxy is fragmenting incoming objects is transparent.  For list endpoints,
however, it is a bit different. A caller of list endpoints will get back the
locations of all of the fragments.  The caller will be unable to re-assemble the
original object with this information; however, the node locations may still
prove to be useful information for some applications.

On Disk Storage
---------------

EC archives are stored on disk in their respective objects-N directory based on
their policy index.  See :doc:`overview_policies` for details on per policy
directory information.

The actual names on disk of EC archives also have one additional piece of data
encoded in the filename, the fragment archive index.

Each storage policy now must include a transformation function that diskfile
will use to build the filename to store on disk. The functions are implemented
in the diskfile module as policy-specific subclasses of ``DiskFileManager``.

This is required for a few reasons. For one, it allows us to store fragment
archives of different indexes on the same storage node, which is not typical
but is possible in many circumstances. Without unique filenames for the
different EC archive files in a set, we would be at risk of overwriting one
archive of index n with another of index m in some scenarios.

The transformation function for the replication policy is simply a NOP. For an
EC policy, the fragment index is appended to the filename just before the .data
extension. An example filename for a fragment archive storing the 5th fragment
would look like this::

    1418673556.92690#5.data

An additional file is also included for Erasure Code policies called the
``.durable`` file. Its meaning will be covered in detail later; however, its
on-disk format does not require the name transformation function that was just
covered.  The .durable for the example above would simply look like this::

    1418673556.92690.durable

And it would be found alongside every fragment specific .data file following a
100% successful PUT operation.

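A minimal sketch of the kind of name transformation described above
(illustrative only, not the actual ``DiskFileManager`` code)::

    def ec_on_disk_name(timestamp, ext, frag_index=None):
        # Fragment archives get the fragment index folded into the name;
        # .durable (and replication .data) names are left untouched.
        if ext == '.data' and frag_index is not None:
            return '%s#%d%s' % (timestamp, frag_index, ext)
        return '%s%s' % (timestamp, ext)

    print(ec_on_disk_name('1418673556.92690', '.data', frag_index=5))
    # -> 1418673556.92690#5.data
    print(ec_on_disk_name('1418673556.92690', '.durable'))
    # -> 1418673556.92690.durable
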
Proxy Server
------------

High Level
==========

The Proxy Server handles Erasure Coding in a different manner than replication;
therefore there are several code paths unique to EC policies, either through
subclassing or simple conditionals.  Taking a closer look at the PUT and the GET
paths will help make this clearer.  But first, a high level overview of how an
object flows through the system:

.. image:: images/ec_overview.png

Note how:

* Incoming objects are buffered into segments at the proxy.
* Segments are erasure coded into fragments at the proxy.
* The proxy stripes fragments across the participating nodes such that each
  node's on-disk file, which we call a fragment archive, is appended with a new
  fragment for every segment.

This scheme makes it possible to minimize the number of on-disk files given our
segmenting and fragmenting.

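Conceptually the striping looks something like the following sketch (reusing
the hypothetical ``driver`` from the PyECLib example; ``segments`` stands in
for the buffered segments of one object)::

    # One growing fragment archive per participating node (10 data + 4 parity).
    fragment_archives = [b''] * 14

    for segment in segments:
        fragments = driver.encode(segment)        # one fragment per node
        for node_index, fragment in enumerate(fragments):
            fragment_archives[node_index] += fragment

    # Each storage node ends up with a single .data file (its fragment
    # archive) per object, rather than one file per fragment.
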
Multi-Phase Conversation
========================

Multi-part MIME document support is used to allow the proxy to engage in a
handshake conversation with the storage node for processing PUT requests.  This
is required for a few different reasons.

#. From the perspective of the storage node, a fragment archive is really just
   another object, so we need a mechanism to send down the original object etag
   after all fragment archives have landed.
#. Without introducing strong consistency semantics, the proxy needs a mechanism
   to know when a quorum of fragment archives have actually made it to disk
   before it can inform the client of a successful PUT.

MIME supports a conversation between the proxy and the storage nodes for every
PUT. This provides us with the ability to handle a PUT in one connection and
ensures that we have the essence of a two-phase commit: the proxy communicates
back to the storage nodes once it has confirmation that a quorum of the
fragment archives in the set have been written.

For the first phase of the conversation the proxy requires a quorum of
`ec_ndata + 1` fragment archives to be successfully put to storage nodes.
This ensures that the object could still be reconstructed even if one of the
fragment archives becomes unavailable. During the second phase of the
conversation the proxy communicates a confirmation to storage nodes that the
fragment archive quorum has been achieved. This causes the storage node to
create a `ts.durable` file at timestamp `ts` which acts as an indicator of
the last known durable set of fragment archives for a given object. The
presence of a `ts.durable` file means, to the object server, `there is a set
of ts.data files that are durable at timestamp ts`.

For the second phase of the conversation the proxy requires a quorum of
`ec_ndata + 1` successful commits on storage nodes. This ensures that there are
sufficient committed fragment archives for the object to be reconstructed even
if one becomes unavailable. The reconstructor ensures that `.durable` files are
replicated on storage nodes where they may be missing.

Note that the completion of the commit phase of the conversation
is also a signal for the object server to go ahead and immediately delete older
timestamp files for this object. This is critical as we do not want to delete
the older object until the storage node has confirmation from the proxy, via the
multi-phase conversation, that the other nodes have landed enough for a quorum.

The basic flow looks like this:

 * The Proxy Server erasure codes and streams the object fragments
   (ec_ndata + ec_nparity) to the storage nodes.
 * The storage nodes store objects as EC archives and, upon finishing the
   object data/metadata write, send a first-phase response to the proxy.
 * Upon a quorum of storage node responses, the proxy initiates the second
   phase by sending commit confirmations to the object servers.
 * Upon receipt of the commit message, object servers store a 0-byte data file
   as `<timestamp>.durable` indicating a successful PUT, and send a final
   response to the proxy server.
 * The proxy waits for `ec_ndata + 1` object servers to respond with a
   success (2xx) status before responding to the client with a successful
   status.

Here is a high level example of what the conversation looks like::

    proxy: PUT /p/a/c/o
           Transfer-Encoding: chunked
           Expect: 100-continue
           X-Backend-Obj-Multiphase-Commit: yes
    obj:   100 Continue
           X-Obj-Multiphase-Commit: yes
    proxy: --MIMEboundary
           X-Document: object body
           <obj_data>
           --MIMEboundary
           X-Document: object metadata
           Content-MD5: <footer_meta_cksum>
           <footer_meta>
           --MIMEboundary
    <object server writes data, metadata>
    obj:   100 Continue
    <quorum>
    proxy: X-Document: put commit
           commit_confirmation
           --MIMEboundary--
    <object server writes ts.durable state>
    obj:   20x
    <proxy waits to receive a quorum of 2xx responses>
    proxy: 2xx -> client

A few key points on the .durable file:

* The .durable file means "the matching .data file for this has sufficient
  fragment archives somewhere, committed, to reconstruct the object".
* The Proxy Server will never have knowledge, either on GET or HEAD, of the
  existence of a .data file on an object server if it does not have a matching
  .durable file.
* The object server will never return a .data that does not have a matching
  .durable.
* When a proxy does a GET, it will only receive fragment archives that have
  enough present somewhere to be reconstructed.

Partial PUT Failures
====================

A partial PUT failure has a few different modes.  In one scenario the Proxy
Server is alive through the entire PUT conversation.  This is a very
straightforward case. The client will receive a good response if and only if a
quorum of fragment archives were successfully landed on their storage nodes.  In
this case the Reconstructor will discover the missing fragment archives, perform
a reconstruction and deliver fragment archives and their matching .durable files
to the nodes.

The more interesting case is what happens if the proxy dies in the middle of a
conversation.  If it turns out that a quorum had been met and the commit phase
of the conversation finished, it's as simple as the previous case in that the
reconstructor will repair things.  However, if the commit didn't get a chance to
happen then some number of the storage nodes have .data files on them (fragment
archives) but none of them knows whether there are enough elsewhere for the
entire object to be reconstructed.  In this case the client will not have
received a 2xx response so there is no issue there; however, it is left to the
storage nodes to clean up the stale fragment archives.  Work is ongoing in this
area to enable the proxy to play a role in reviving these fragment archives;
however, for the current release, a proxy failure after the start of a
conversation but before the commit message will simply result in a PUT failure.

GET
===

The GET for EC is different enough from replication that subclassing the
`BaseObjectController` to the `ECObjectController` enables an efficient way to
implement the high level steps described earlier:

#. The proxy server makes simultaneous requests to participating nodes.
#. As soon as the proxy has the fragments it needs, it calls on PyECLib to
   decode the data.
#. The proxy streams the decoded data it has back to the client.
#. Repeat until the proxy is done sending data back to the client.

The GET path will attempt to contact all nodes participating in the EC scheme;
if not enough primaries respond then handoffs will be contacted just as with
replication.  Etag and content length headers are updated for the client
response following reconstruction, as each individual fragment archive's
metadata is valid only for that fragment archive.

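The per-segment decode in step 2 can be sketched roughly as follows (reusing
the hypothetical ``driver`` from the PyECLib example; ``fragments_for_segment``
stands in for the proxy machinery that pairs up the corresponding fragment from
each responding node)::

    def decoded_segments(segment_count, driver, ec_ndata=10):
        for seg_index in range(segment_count):
            fragments = fragments_for_segment(seg_index)   # assumed helper
            # Any ec_ndata distinct fragments are enough to decode the segment.
            yield driver.decode(fragments[:ec_ndata])

    # The proxy streams each decoded segment on to the client as it is
    # produced rather than assembling the whole object in memory.
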
Object Server
-------------

The Object Server, like the Proxy Server, supports MIME conversations as
described in the proxy section earlier. This includes processing of the commit
message and decoding various sections of the MIME document to extract the footer
which includes things like the entire object etag.

DiskFile
========

Erasure code policies use the subclassed ``ECDiskFile``, ``ECDiskFileWriter``,
``ECDiskFileReader`` and ``ECDiskFileManager`` to implement EC specific
handling of on disk files.  This includes things like file name manipulation to
include the fragment index in the filename, determination of valid .data files
based on .durable presence, construction of an EC specific hashes.pkl file to
include fragment index information, and so on.

Metadata
--------

There are a few different categories of metadata that are associated with EC:

System Metadata: EC has a set of object level system metadata that it
attaches to each of the EC archives.  The metadata is for internal use only:

* ``X-Object-Sysmeta-EC-Etag``:  The Etag of the original object.
* ``X-Object-Sysmeta-EC-Content-Length``: The content length of the original
  object.
* ``X-Object-Sysmeta-EC-Frag-Index``: The fragment index for the object.
* ``X-Object-Sysmeta-EC-Scheme``: Description of the EC policy used to encode
  the object.
* ``X-Object-Sysmeta-EC-Segment-Size``: The segment size used for the object.

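Purely to illustrate the shape of this system metadata, a fragment archive
stored under the example 10+4 policy might carry headers along the following
lines (every value below is made up for illustration)::

    ec_sysmeta = {
        'X-Object-Sysmeta-EC-Etag': 'd41d8cd98f00b204e9800998ecf8427e',
        'X-Object-Sysmeta-EC-Content-Length': '10485760',
        'X-Object-Sysmeta-EC-Frag-Index': '5',
        'X-Object-Sysmeta-EC-Scheme': 'jerasure_rs_vand 10+4',
        'X-Object-Sysmeta-EC-Segment-Size': '1048576',
    }
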
User Metadata:  User metadata is unaffected by EC; however, a full copy of the
user metadata is stored with every EC archive.  This is required as the
reconstructor needs this information and each reconstructor only communicates
with its closest neighbors on the ring.

PyECLib Metadata:  PyECLib stores a small amount of metadata on a per fragment
basis.  This metadata is not documented here as it is opaque to Swift.

Database Updates
----------------

As account and container rings are not associated with a Storage Policy, there
is no change to how these database updates occur when using an EC policy.

The Reconstructor
-----------------

The Reconstructor performs analogous functions to the replicator:

#. Recovery from disk drive failure.
#. Moving data around because of a rebalance.
#. Reverting data back to a primary from a handoff.
#. Recovering fragment archives from bit rot discovered by the auditor.

However, under the hood it operates quite differently.  The following are some
of the key elements in understanding how the reconstructor operates.

Unlike the replicator, the work that the reconstructor does is not always as
easy to break down into the 2 basic tasks of synchronize or revert (move data
from handoff back to primary) because of the fact that one storage node can
house fragment archives of various indexes and each index really "belongs" to
a different node.  So, whereas when the replicator is reverting data from a
handoff it has just one node to send its data to, the reconstructor can have
several.  Additionally, it's not always the case that the processing of a
particular suffix directory means one or the other for the entire directory (as
it does for replication). The scenarios that create these mixed situations can
be pretty complex so we will just focus on what the reconstructor does here and
not a detailed explanation of why.

Job Construction and Processing
===============================

Because of the nature of the work it has to do as described above, the
reconstructor builds jobs for a single job processor.  The job itself contains
all of the information needed for the processor to execute the job, which may be
a synchronization or a data reversion, and there may be a mix of jobs that
perform both of these operations on the same suffix directory.

Jobs are constructed on a per partition basis and then on a per fragment index
basis.  That is, there will be one job for every fragment index in a partition.
Performing this construction "up front" like this helps minimize the
interaction between nodes collecting hashes.pkl information.

Once a set of jobs for a partition has been constructed, those jobs are sent off
to threads for execution. The single job processor then performs the necessary
actions working closely with ssync to carry out its instructions.  For data
reversion, the actual objects themselves are cleaned up via the ssync module and
once that partition's set of jobs is complete, the reconstructor will attempt to
remove the relevant directory structures.

The scenarios that job construction has to take into account include:

#. A partition directory with all fragment indexes matching the local node
   index.  This is the case where everything is where it belongs and we just
   need to compare hashes and sync if needed; here we sync with our partners.
#. A partition directory with one local fragment index and a mix of others.
   Here we need to sync with our partners for the fragment index that matches
   the local node index; all others are synced with their home nodes and then
   deleted.
#. A partition directory with no local fragment index and just one or more of
   others. Here we sync with just the home nodes for the fragment indexes that
   we have and then all the local archives are deleted.  This is the basic
   handoff reversion case.

.. note::
    A "home node" is the node where the fragment index encoded in the
    fragment archive's filename matches the node index of a node in the primary
    partition list.

Node Communication
==================

The replicators talk to all nodes that have a copy of their object, typically
just 2 other nodes.  For EC, having each reconstructor node talk to all nodes
would incur a large amount of overhead as there will typically be a much larger
number of nodes participating in the EC scheme.  Therefore, the reconstructor is
built to talk to its adjacent nodes on the ring only.  These nodes are typically
referred to as partners.

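A sketch of the "partners" idea (illustrative only, not the actual
reconstructor code)::

    def get_partners(node_index, primary_nodes):
        # A node only syncs with its left and right neighbours in the
        # partition's primary node list, wrapping around at the ends.
        return [primary_nodes[(node_index - 1) % len(primary_nodes)],
                primary_nodes[(node_index + 1) % len(primary_nodes)]]

    # With 14 primaries (10 data + 4 parity), node 0 partners with nodes 13 and 1.
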
Reconstruction
==============

Reconstruction can be thought of as being like replication, but with an extra
step in the middle. The reconstructor is hard-wired to use ssync to determine
what is missing and desired by the other side. However, before an object is
sent over the wire it needs to be reconstructed from the remaining fragments,
as the local fragment is just that: a fragment with a different index than the
one the other end is asking for.

Thus, there are hooks in ssync for EC based policies. One case would be for
basic reconstruction which, at a high level, looks like this:

* Determine which nodes need to be contacted to collect other EC archives needed
  to perform reconstruction.
* Update the etag and fragment index metadata elements of the newly constructed
  fragment archive.
* Establish a connection to the target nodes and give ssync a DiskFileLike class
  that it can stream data from.

The reader in this class gathers fragments from the nodes and uses PyECLib to
reconstruct each segment before yielding data back to ssync. Essentially what
this means is that data is buffered, in memory, on a per segment basis at the
node performing reconstruction and each segment is dynamically reconstructed and
delivered to `ssync_sender` where the `send_put()` method will ship them on
over.  The sender is then responsible for deleting the objects as they are sent
in the case of data reversion.

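A rough sketch of the per-segment reconstruction performed by that reader
(reusing the hypothetical ``driver``; ``gather_fragments`` is a stand-in for
collecting the corresponding fragment from enough of the other nodes, and real
code could equally rebuild the missing fragment directly rather than doing a
full decode/encode round trip)::

    def rebuilt_fragments(segment_count, target_frag_index, driver):
        for seg_index in range(segment_count):
            fragments = gather_fragments(seg_index)      # assumed helper
            segment = driver.decode(fragments)           # rebuild the segment
            # Re-encode and keep only the fragment the target node is missing;
            # ssync then streams these as if they were a local fragment archive.
            yield driver.encode(segment)[target_frag_index]
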
The Auditor
-----------

Because the auditor already operates on a per storage policy basis, there are no
specific auditor changes associated with EC.  Each EC archive looks like, and is
treated like, a regular object from the perspective of the auditor.  Therefore,
if the auditor finds bit-rot in an EC archive, it simply quarantines it and the
reconstructor will take care of the rest just as the replicator does for
replication policies.
 |