Liberasurecode docs haven't been updated in a while. Some new implementations have been added since then, so they are now covered in the code_organisation.md doc. I didn't add ALL files in the repo, just the more important implementation ones, namely:

- isa_l_rs_cauchy.c
- isa_l_rs_vand_inv.c
- liberasurecode_rs_vand

Tim provided an overview of erasure coding to a colleague; it makes a good additional doc for this repo, and I have his permission to add it as doc/erasure_coding.md.

Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: Ifd3e4aea4dbed664fb77a3e4a3106bd3b0d6f343
Signed-off-by: Matthew Oliver <matt@oliver.net.au>
Overview
Erasure coding allows the distribution of data across several independent
disks, improving data durability without requiring as much overhead as
high-replica replication. Data is broken into k data fragments, then
k + m fragments are calculated and stored. Given some n ∈ [k, k+m)
of these stored fragments, the original data can be reconstructed. Optimal
codes ensure that all subsets of k stored fragments can be used for
reconstruction.
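As a rough illustration of the overhead claim, the raw storage cost of a k + m scheme can be compared with plain replication. This is a minimal sketch: the function names are hypothetical, and it ignores padding and any per-fragment metadata a real backend adds.

```python
from fractions import Fraction


def erasure_code_overhead(k: int, m: int) -> Fraction:
    """Bytes stored per byte of user data when k + m fragments are kept."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return Fraction(k + m, k)


def replication_overhead(replicas: int) -> Fraction:
    """Bytes stored per byte of user data when whole copies are kept."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return Fraction(replicas, 1)


# A 10 + 4 scheme stores 14 fragments, each one tenth the size of the data.
ec = erasure_code_overhead(10, 4)
rep = replication_overhead(3)
print(f"10+4 erasure code: {float(ec):.2f}x raw storage")
print(f"3 replicas:        {float(rep):.2f}x raw storage")
```

With an optimal code, the 10 + 4 layout survives the loss of any 4 fragments while storing less than half of what three full replicas need.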
Theory
Any Reed-Solomon
code uses linear algebra over a Galois field.
The k data fragments are represented as a series of vectors and multiplied
by a k × (k + m) encoding matrix E to produce the k + m fragments for
storage. To decode a set of fragments [f₁, f₂, ..., fₙ], select the
corresponding columns of E to create a k × n matrix E′ then compute
the decoding matrix D as a left-inverse of E′ᵀ (i.e., D × E′ᵀ = Iₖ).
Multiply the fragments by D to recover the original data.
Note that for systematic encodings, the left-most k × k submatrix of E is Iₖ.
The encoding matrix E is typically based upon either
a Vandermonde matrix or
a Cauchy matrix.
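The encode and decode steps above can be sketched in a few lines. To keep the arithmetic readable, this sketch assumes the prime field GF(257) in place of the GF(2^8) field that real backends use, takes exactly n = k fragments so that D is an ordinary inverse, and uses a plain (non-systematic) Vandermonde matrix. All function names are illustrative and are not part of any library.

```python
from itertools import combinations

P = 257  # assumption: a small prime field stands in for GF(2^8)


def mat_inv(a, p=P):
    """Invert a square matrix modulo the prime p (Gauss-Jordan elimination)."""
    n = len(a)
    aug = [[x % p for x in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = pow(aug[col][col], -1, p)  # modular inverse, Python 3.8+
        aug[col] = [x * scale % p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(x - factor * y) % p for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vandermonde(k, m, p=P):
    """k x (k + m) matrix whose column j is [1, x, x^2, ...] for x = j + 1."""
    return [[pow(j + 1, i, p) for j in range(k + m)] for i in range(k)]


def encode(data_frags, E, p=P):
    """Multiply k data fragments (lists of symbols) by E, giving k + m fragments."""
    k, length = len(E), len(data_frags[0])
    return [[sum(data_frags[i][pos] * E[i][j] for i in range(k)) % p
             for pos in range(length)]
            for j in range(len(E[0]))]


def decode(available, E, p=P):
    """Recover the data from a dict {fragment index: fragment} with >= k entries."""
    k = len(E)
    chosen = sorted(available)[:k]
    if len(chosen) < k:
        raise ValueError("not enough fragments")
    # Rows of E'^T are the columns of E that match the chosen fragments.
    e_prime_t = [[E[i][j] for i in range(k)] for j in chosen]
    D = mat_inv(e_prime_t, p)  # square case: D x E'^T = I_k
    length = len(available[chosen[0]])
    return [[sum(D[i][n] * available[chosen[n]][pos] for n in range(k)) % p
             for pos in range(length)]
            for i in range(k)]


k, m = 3, 2
E = vandermonde(k, m)
data = [list(b"ab"), list(b"cd"), list(b"ef")]
stored = encode(data, E)

# Every choice of k surviving fragments reconstructs the original data.
for survivors in combinations(range(k + m), k):
    assert decode({j: stored[j] for j in survivors}, E) == data
```

Note that symbols in GF(257) can take the value 256, which does not fit in a byte; that is one reason production codes work in GF(2^8) instead.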
The flat XOR codes eschew matrix inversion and multiplication (which are both expensive) in favor of XOR-ing particular subsets of fragments together to create parity fragments. For more information, see "Flat XOR-based erasure codes in storage systems: Constructions, efficient recovery, and tradeoffs".
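The XOR idea can be shown with a toy layout. The parity groups below are a hypothetical choice made for illustration only; they are not the actual flat_xor_hd3 or flat_xor_hd4 constructions, which are described in the paper cited above.

```python
from functools import reduce


def xor_fragments(fragments):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*fragments))


data = [b"ABCD", b"EFGH", b"IJKL"]

# Hypothetical layout: each parity fragment covers a subset of data fragments.
parity_groups = [(0, 1), (1, 2), (0, 2)]
parity = [xor_fragments([data[i] for i in group]) for group in parity_groups]

# Suppose data fragment 1 is lost. Parity 0 covers fragments 0 and 1, so
# XOR-ing it with the surviving fragment 0 yields fragment 1 again.
recovered = xor_fragments([parity[0], data[0]])
assert recovered == data[1]
```

No matrix inversion is involved: recovery only needs a parity fragment whose group contains the missing fragment and whose other members are still available.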
Relevant Projects
- liberasurecode
  The primary entrypoint, offering a unifying interface for multiple possible backends.
- pyeclib
  Python bindings for liberasurecode.
- isa-l
  Collection of optimized low-level functions for storage applications. Uses multi-binary dispatch to offer optimized assembly to CPUs with a range of capabilities from a single binary. Notably, provides fast block Reed-Solomon type erasure codes for arbitrary encode/decode matrices as well as two functions for generating specific encoding matrices.
- jerasure
  First Reed-Solomon codes supported by liberasurecode. Requires gf-complete. Written by James Plank, who has since made the original website read-only and issued a notice regarding claims of patent infringement.
- gf-complete
  Galois field library used by jerasure; also written by James Plank, also potentially patent-encumbered.
- shss
  Proprietary; developed by NTT. Requires additional data to be stored with every fragment.
- libphazr
  Proprietary; developed by Phazr.io. Requires additional data to be stored with every fragment.
Supported Backends
Provided by liberasurecode
- liberasurecode_rs_vand (added in liberasurecode 1.0.8, pyeclib 1.0.8)
- flat_xor_hd3
- flat_xor_hd4
Provided by isa-l
- isa_l_rs_vand
  Uses the Reed-Solomon functions provided by isa-l with an encoding matrix also provided by isa-l. Since this matrix is constructed by extending Iₖ with a k × m Vandermonde matrix, a sufficient condition for optimality is that m ≤ 4; beyond that, some k × k submatrices may not be invertible.
  Prior to liberasurecode 1.3.0, it did not detect the failure to invert E′ᵀ, leading to incidents of data corruption. See bug #1639691 for more information.
- isa_l_rs_vand_inv (added in liberasurecode 1.7.0, pyeclib 1.7.0)
  Uses the Reed-Solomon functions provided by isa-l with an encoding matrix provided by liberasurecode. To construct the encoding matrix, start with a k × (k + m) Vandermonde matrix V, define V′ as the left-most k × k submatrix, then calculate E = inv(V′) × V. This makes a systematic code that is optimal for all k and m.
- isa_l_rs_cauchy (added in liberasurecode 1.4.0, pyeclib 1.4.0)
  Uses the Reed-Solomon functions provided by isa-l with an encoding matrix also provided by isa-l. Being a Cauchy matrix, it forms an optimal code for all k and m.
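The E = inv(V′) × V construction described for isa_l_rs_vand_inv can be checked by brute force on a small case. As a simplifying assumption, this sketch works in the prime field GF(257) instead of GF(2^8), so it illustrates the construction without reproducing liberasurecode's actual matrix. The helper names are hypothetical.

```python
from itertools import combinations

P = 257  # assumption: prime field used for readability, not GF(2^8)


def mat_inv(a, p=P):
    """Invert a square matrix modulo the prime p (Gauss-Jordan elimination)."""
    n = len(a)
    aug = [[x % p for x in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = pow(aug[col][col], -1, p)
        aug[col] = [x * scale % p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(x - factor * y) % p for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(a, b, p=P):
    """Multiply two matrices modulo p."""
    return [[sum(x * y for x, y in zip(row, col)) % p for col in zip(*b)]
            for row in a]


def systematic_vandermonde(k, m, p=P):
    """Build E = inv(V') x V from a k x (k + m) Vandermonde matrix V."""
    V = [[pow(j + 1, i, p) for j in range(k + m)] for i in range(k)]
    v_prime = [row[:k] for row in V]  # left-most k x k submatrix
    return mat_mul(mat_inv(v_prime, p), V, p)


def every_k_columns_invertible(E, p=P):
    """True if every k x k column submatrix of E is invertible."""
    k = len(E)
    for cols in combinations(range(len(E[0])), k):
        sub = [[E[i][j] for j in cols] for i in range(k)]
        try:
            mat_inv(sub, p)
        except ValueError:
            return False
    return True


k, m = 4, 3
E = systematic_vandermonde(k, m)
identity = [[int(i == j) for j in range(k)] for i in range(k)]
assert [row[:k] for row in E] == identity  # systematic
assert every_k_columns_invertible(E)       # any k fragments suffice
```

The brute-force check is exponential in k and m, so it is only practical for small parameters; it is meant to make the optimality claim concrete, not to validate a production matrix.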
Provided by jerasure
- jerasure_rs_vand
- jerasure_rs_cauchy
Proprietary
- shss (added in liberasurecode 1.0.0, pyeclib 1.0.1)
- libphazr (added in liberasurecode 1.5.0, pyeclib 1.5.0)
Classifications
Required Fragments
n ≡ k
Most supported backends are optimal erasure codes, where any k fragments
are sufficient to recover the original data.
n > k
The flat XOR codes require more than k fragments to decode in the general
case. In particular, flat_xor_hd3 requires at least n ≡ k + m - 2
fragments and flat_xor_hd4 requires at least n ≡ k + m - 3.
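The fragment counts above reduce to a small lookup. The helper below is hypothetical and not part of liberasurecode or pyeclib; it assumes that every backend other than the two flat XOR codes behaves as an optimal code.

```python
def fragments_needed(ec_type: str, k: int, m: int) -> int:
    """Fragments that guarantee a successful decode, per the figures above."""
    # Number of lost fragments each flat XOR code is guaranteed to tolerate.
    tolerated_losses = {"flat_xor_hd3": 2, "flat_xor_hd4": 3}
    if ec_type in tolerated_losses:
        return k + m - tolerated_losses[ec_type]
    return k  # assumption: all other backends are treated as optimal codes


# With k = 10 and m = 5, an optimal code needs 10 of the 15 fragments,
# while flat_xor_hd3 needs 13 and flat_xor_hd4 needs 12.
for name in ("isa_l_rs_cauchy", "flat_xor_hd3", "flat_xor_hd4"):
    print(name, fragments_needed(name, 10, 5))
```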
Systematic vs. Non-systematic
Systematic codes ensure that
the first k fragments for storage correspond to the initial k data
fragments. This can greatly speed up decoding when all k data fragments are
available as well as provide more recovery options in certain failure cases.
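The decoding speed-up for systematic codes comes from skipping the matrix work entirely when all k data fragments are present. A minimal sketch, assuming fragments are raw payload bytes with no headers or padding and that fragment i holds the i-th slice of the data (the function name is hypothetical):

```python
def systematic_fast_path(available, k):
    """Return the payload if data fragments 0..k-1 are all present, else None."""
    if all(i in available for i in range(k)):
        # No decoding needed: the data fragments are the original data.
        return b"".join(available[i] for i in range(k))
    return None  # caller must fall back to a full matrix decode


# All three data fragments survived, so the parity fragment (index 4) is unused.
print(systematic_fast_path({0: b"ab", 1: b"cd", 2: b"ef", 4: b"zz"}, 3))

# Data fragment 1 is missing, so the fast path does not apply.
print(systematic_fast_path({0: b"ab", 2: b"ef", 3: b"yy", 4: b"zz"}, 3))
```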
Non-systematic encodings do not ensure that. Rather, they often seek to
ensure that none of the original data is directly present in the storage
fragments, thus ensuring confidentiality of data when fewer than n fragments
are available. See also: secure secret sharing.
The following backends are non-systematic:
- shss
- libphazr
All other supported backends are systematic.