===========
Replication
===========

Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond
for an operation to be considered successful, transient failures like
network partitions can quickly cause replicas to diverge. These
differences are eventually reconciled by asynchronous, peer-to-peer
replicator processes. The replicator processes traverse their local
file systems and concurrently perform operations in a manner that
balances load across physical disks.

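The concurrent, disk-balanced traversal can be pictured roughly as one
replication job per local device, bounded by a concurrency setting. The
sketch below is illustrative only; ``replicate_device()`` is a
hypothetical helper, not Swift's replicator API:

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor

   # Illustrative only: run one replication job per local device so that
   # no single physical disk is saturated; replicate_device() is a
   # hypothetical helper, not part of Swift.
   def run_replication_pass(local_devices, replicate_device, concurrency=4):
       with ThreadPoolExecutor(max_workers=concurrency) as pool:
           for device in local_devices:
               pool.submit(replicate_device, device)
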
Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.

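The push model can be pictured as each node offering what it holds to
the other ring-assigned holders of a partition, never asking peers what
it might be missing. A minimal sketch, assuming hypothetical
``local_partitions()`` and ``push_to_node()`` helpers rather than
Swift's actual replicator classes:

.. code-block:: python

   # Hypothetical sketch of push-style replication; local_partitions()
   # and push_to_node() are illustrative placeholders, not Swift's real
   # replicator API.
   def replicate_once(ring, local_device, local_partitions, push_to_node):
       for partition in local_partitions(local_device):
           # The ring decides where every partition belongs.
           for node in ring.get_part_nodes(partition):
               if node['device'] == local_device:
                   continue  # skip ourselves
               # Push only: copy local records/files to the remote
               # replica; never pull data toward this node.
               push_to_node(node, partition)
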
To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the ``consistency
window``. This window defines the duration of the replication and how
long a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.

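As an illustration of why tombstones persist for the consistency
window, the sketch below ages out a tombstone file only after the
window has elapsed; the ``CONSISTENCY_WINDOW`` value and file layout
are illustrative, not Swift's actual reclaim logic:

.. code-block:: python

   import os
   import time

   # Illustrative only: Swift's real reclaim logic lives in the
   # replicators and auditors; this just shows the aging rule.
   CONSISTENCY_WINDOW = 7 * 24 * 60 * 60  # e.g. one week, in seconds

   def reclaim_tombstone(path, now=None):
       """Delete a tombstone (.ts) file once it is older than the window."""
       now = now or time.time()
       timestamp = float(os.path.basename(path)[:-len('.ts')])
       if now - timestamp > CONSISTENCY_WINDOW:
           os.unlink(path)  # safe: every replica has seen the deletion
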
If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.

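A sketch of how a replicator might fall back to alternate nodes when a
primary is unreachable; ``is_reachable()`` and ``sync_partition()`` are
hypothetical helpers, while ``get_part_nodes`` and ``get_more_nodes``
are the ring's placement interfaces mentioned above:

.. code-block:: python

   # Hypothetical fallback logic; is_reachable() and sync_partition()
   # are placeholders, not part of Swift.
   def sync_with_fallback(ring, partition, is_reachable, sync_partition):
       handoffs = ring.get_more_nodes(partition)  # alternates, in order
       for node in ring.get_part_nodes(partition):
           if is_reachable(node):
               sync_partition(node, partition)
           else:
               # Primary is down: push this replica to the next alternate
               # so the desired replica count is still reached.
               sync_partition(next(handoffs), partition)
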
.. note::

   The replicator does not maintain desired levels of replication when
   failures such as entire node failures occur; most failures are
   transient.

The main replication types are:

- Database replication

  Replicates containers and objects.

- Object replication

  Replicates object data.

Database replication
~~~~~~~~~~~~~~~~~~~~

Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.

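The comparison can be thought of as a cheap equality check followed by
an incremental catch-up. In the sketch below, ``db_hash()``,
``records_since()``, ``apply_records()``, and ``get_sync_point()`` are
illustrative placeholders, not Swift's db_replicator API:

.. code-block:: python

   # Illustrative placeholders, not Swift's actual db_replicator code.
   def replicate_db(local_db, remote_db, db_hash, records_since, apply_records):
       if db_hash(local_db) == db_hash(remote_db):
           return  # cheap check: replicas already match
       # Hashes differ: push only the records added since the last
       # known synchronization point for this remote database.
       sync_point = local_db.get_sync_point(remote_db.id)
       apply_records(remote_db, records_since(local_db, sync_point))
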
This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.

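A minimal sketch of such a synchronization table, assuming an in-memory
dict rather than the SQLite tables Swift actually uses:

.. code-block:: python

   # Minimal in-memory model of a synchronization table; Swift stores
   # this in the account/container SQLite databases instead.
   class SyncTable:
       def __init__(self):
           # remote database ID -> highest record ID known to be synced
           self.points = {}

       def update(self, remote_id, record_id):
           # Record IDs increase monotonically, so keep the high water mark.
           self.points[remote_id] = max(self.points.get(remote_id, -1),
                                        record_id)

       def merge(self, other_points):
           # Pushing the whole table lets the remote inherit everything
           # the local database was already known to be synced with.
           for remote_id, record_id in other_points.items():
               self.update(remote_id, record_id)
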
If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.

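Pushing a whole database file might look like the following subprocess
call; the destination path layout and helper name are illustrative, not
Swift's actual rsync configuration:

.. code-block:: python

   import subprocess

   def push_whole_db(local_path, remote_host, remote_path):
       # Transfer the complete database file with rsync(1); the remote
       # path shown here is illustrative, not Swift's actual layout.
       subprocess.check_call(
           ['rsync', '--whole-file', local_path,
            '%s:%s' % (remote_host, remote_path)])
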
In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.

Object replication
~~~~~~~~~~~~~~~~~~

The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents for each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.

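The per-partition hashes file can be pictured as a mapping from suffix
directory to a hash of its contents. The sketch below uses MD5 over
file names purely as an illustration; it is not Swift's exact hashing
scheme:

.. code-block:: python

   import hashlib
   import os

   def hash_suffix_dir(suffix_path):
       # Illustration only: hash the sorted file names under a suffix
       # directory. Swift's real scheme hashes object names/timestamps.
       md5 = hashlib.md5()
       for root, _dirs, files in sorted(os.walk(suffix_path)):
           for name in sorted(files):
               md5.update(os.path.join(root, name).encode('utf-8'))
       return md5.hexdigest()

   def partition_hashes(partition_path):
       # One entry per suffix directory, persisted in a per-partition
       # hashes file so unchanged suffixes need not be re-walked.
       return {suffix: hash_suffix_dir(os.path.join(partition_path, suffix))
               for suffix in sorted(os.listdir(partition_path))}
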
The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.

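Putting the pieces together, one pass over a single partition might
look roughly like this; ``get_remote_hashes()``, ``rsync_suffixes()``,
and ``ask_remote_to_rehash()`` stand in for the remote hash exchange,
rsync calls, and recalculation requests the real replicator performs:

.. code-block:: python

   # Rough sketch of one object-replication pass for a partition; the
   # helper callables are placeholders, not Swift's replicator API.
   def replicate_partition(partition, local_hashes, remote_nodes,
                           get_remote_hashes, rsync_suffixes,
                           ask_remote_to_rehash):
       for node in remote_nodes:
           remote_hashes = get_remote_hashes(node, partition)
           # Only suffix directories whose hashes differ are transferred.
           changed = [suffix for suffix, digest in local_hashes.items()
                      if remote_hashes.get(suffix) != digest]
           if changed:
               rsync_suffixes(node, partition, changed)
               # Tell the remote end to recalculate hashes for what we
               # just pushed.
               ask_remote_to_rehash(node, partition, changed)
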
The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.