openstack-manuals/doc/common/section_objectstorage-replication.xml

<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="section_objectstorage-replication">
    <title>Replication</title>
    <para>Because each replica in Object Storage functions
        independently and clients generally require only a simple
        majority of nodes to respond to consider an operation
        successful, transient failures like network partitions can
        quickly cause replicas to diverge. These differences are
        eventually reconciled by asynchronous, peer-to-peer replicator
        processes. The replicator processes traverse their local file
        systems and concurrently perform operations in a manner that
        balances load across physical disks.</para>
    <para>Replication uses a push model, with records and files
        generally only being copied from local to remote replicas.
        This is important because data on the node might not belong
        there (as in the case of hand offs and ring changes), and a
        replicator cannot know which data it should pull in from
        elsewhere in the cluster. Any node that contains data must
        ensure that data gets to where it belongs. The ring handles
        replica placement.</para>
    <para>To replicate deletions in addition to creations, every
        deleted record or file in the system is marked by a tombstone.
        The replication process cleans up tombstones after a time
        period known as the <emphasis role="italic">consistency
            window</emphasis>. This window defines the duration of the
        replication and how long transient failure can remove a node
        from the cluster. Tombstone cleanup must be tied to
        replication to reach replica convergence.</para>
    <para>If a replicator detects that a remote drive has failed, the
        replicator uses the <literal>get_more_nodes</literal>
        interface for the ring to choose an alternate node with which
        to synchronize. The replicator can maintain desired levels of
        replication during disk failures, though some replicas might
        not be in an immediately usable location.</para>
    <note>
        <para>The replicator does not maintain desired levels of
            replication when failures such as entire node failures
            occur; most failures are transient.</para>
    </note>
    <para>The main replication types are:</para>
    <itemizedlist>
        <listitem>
            <para><emphasis role="bold">Database
                    replication</emphasis>. Replicates containers and
                objects.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Object replication</emphasis>.
                Replicates object data.</para>
        </listitem>
    </itemizedlist>
    <section xml:id="section_database-replication">
        <title>Database replication</title>
        <para>Database replication completes a low-cost hash
            comparison to determine whether two replicas already
            match. Normally, this check can quickly verify that most
            databases in the system are already synchronized. If the
            hashes differ, the replicator synchronizes the databases
            by sharing records added since the last synchronization
            point.</para>
        <para>This synchronization point is a high water mark that
            notes the last record at which two databases were known to
            be synchronized, and is stored in each database as a tuple
            of the remote database ID and record ID. Database IDs are
            unique across all replicas of the database, and record IDs
            are monotonically increasing integers. After all new
            records are pushed to the remote database, the entire
            synchronization table of the local database is pushed, so
            the remote database can guarantee that it is synchronized
            with everything with which the local database was
            previously synchronized.</para>
        <para>If a replica is missing, the whole local database file
            is transmitted to the peer by using rsync(1) and is
            assigned a new unique ID.</para>
        <para>In practice, database replication can process hundreds
            of databases per concurrency setting per second (up to the
            number of available CPUs or disks) and is bound by the
            number of database transactions that must be
            performed.</para>
    </section>
    <section xml:id="section_object-replication">
        <title>Object replication</title>
        <para>The initial implementation of object replication
            performed an rsync to push data from a local partition to
            all remote servers where it was expected to reside. While
            this worked at small scale, replication times skyrocketed
            once directory structures could no longer be held in RAM.
            This scheme was modified to save a hash of the contents
            for each suffix directory to a per-partition hashes file.
            The hash for a suffix directory is no longer valid when
            the contents of that suffix directory is modified.</para>
        <para>The object replication process reads in hash files and
            calculates any invalidated hashes. Then, it transmits the
            hashes to each remote server that should hold the
            partition, and only suffix directories with differing
            hashes on the remote server are rsynced. After pushing
            files to the remote server, the replication process
            notifies it to recalculate hashes for the rsynced suffix
            directories.</para>
        <para>The number of uncached directories that object
            replication must traverse, usually as a result of
            invalidated suffix directory hashes, impedes performance.
            To provide acceptable replication speeds, object
            replication is designed to invalidate around 2 percent of
            the hash space on a normal node each day.</para>
    </section>
</section>