Restructure Object Storage chapter of Cloud Admin Guide
Restores Troubleshoot Object Storage. Removes Monitoring section, which was based on a blog.

backport: havana
Closes-Bug: #1251515
author: nermina miller
Change-Id: I580b077a0124d7cd54dced6c0d340e05d5d5f983
@@ -5,6 +5,13 @@
     xml:id="ch_admin-openstack-object-storage">
     <?dbhtml stop-chunking?>
     <title>Object Storage</title>
-    <xi:include href="../common/section_about-object-storage.xml"/>
+    <xi:include href="../common/section_objectstorage-intro.xml"/>
+    <xi:include href="../common/section_objectstorage-features.xml"/>
+    <xi:include href="../common/section_objectstorage-characteristics.xml"/>
+    <xi:include href="../common/section_objectstorage-components.xml"/>
+    <xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
+    <xi:include href="../common/section_objectstorage-arch.xml"/>
+    <xi:include href="../common/section_objectstorage-replication.xml"/>
     <xi:include href="section_object-storage-monitoring.xml"/>
+    <xi:include href="../common/section_objectstorage-troubleshoot.xml"/>
 </chapter>
@@ -3,6 +3,7 @@
     xmlns:xi="http://www.w3.org/2001/XInclude"
     xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
     xml:id="ch_introduction-to-openstack-object-storage-monitoring">
+    <!-- ... Based on a blog, should be replaced with original material... -->
     <title>Object Storage monitoring</title>
     <?dbhtml stop-chunking?>
     <para>Excerpted from a blog post by <link
BIN  doc/common/figures/objectstorage-accountscontainers.png (new file, 32 KiB)
BIN  doc/common/figures/objectstorage-arch.png (new file, 56 KiB)
BIN  doc/common/figures/objectstorage-buildingblocks.png (new file, 48 KiB)
BIN  doc/common/figures/objectstorage-nodes.png (new file, 58 KiB)
BIN  doc/common/figures/objectstorage-partitions.png (new file, 28 KiB)
BIN  doc/common/figures/objectstorage-replication.png (new file, 45 KiB)
BIN  doc/common/figures/objectstorage-ring.png (new file, 23 KiB)
BIN  doc/common/figures/objectstorage-usecase.png (new file, 61 KiB)
BIN  doc/common/figures/objectstorage-zones.png (new file, 10 KiB)
BIN  doc/common/figures/objectstorage.png (new file, 23 KiB)

doc/common/section_objectstorage-account-reaper.xml (new file)
@@ -0,0 +1,40 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-account-reaper">
    <!-- ... Old module003-ch008-account-reaper edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Account reaper</title>
    <para>In the background, the account reaper removes data from the deleted accounts.</para>
    <para>A reseller marks an account for deletion by issuing a <code>DELETE</code> request on the
        account's storage URL. This action sets the <code>status</code> column of the account_stat
        table in the account database and replicas to <code>DELETED</code>, marking the account's
        data for deletion.</para>
    <para>Typically, a specific retention time or undelete is not provided. However, you can set a
        <code>delay_reaping</code> value in the <code>[account-reaper]</code> section of the
        account-server.conf file to delay the actual deletion of data. At this time, to undelete you
        have to update the account database replicas directly, setting the status column to an empty
        string and updating the put_timestamp to be greater than the delete_timestamp.
        <note><para>It's on the developers' to-do list to write a utility that performs this task,
            preferably through a REST call.</para></note>
    </para>
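    <para>As an illustrative sketch only (the option names come from this section; the values, and
        the assumption that they are expressed in seconds, are examples rather than
        recommendations), the relevant part of account-server.conf might look like this:</para>
    <programlisting language="ini">[account-reaper]
# Assumed example: wait one day after an account is marked DELETED before reaping its data.
delay_reaping = 86400
# Assumed example: warn in the log if an account has still not been reaped after 30 days.
reap_warn_after = 2592000</programlisting>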
    <para>The account reaper runs on each account server and scans the server occasionally for
        account databases marked for deletion. It only fires up on the accounts for which the server
        is the primary node, so that multiple account servers aren't trying to do it simultaneously.
        Using multiple servers to delete one account might improve the deletion speed but requires
        coordination to avoid duplication. Speed really is not a big concern with data deletion, and
        large accounts aren't deleted often.</para>
    <para>Deleting an account is simple. For each account container, all objects are deleted and
        then the container is deleted. Deletion requests that fail will not stop the overall process
        but will cause the overall process to fail eventually (for example, if an object delete
        times out, you will not be able to delete the container or the account). The account reaper
        keeps trying to delete an account until it is empty, at which point the database reclaim
        process within the db_replicator will remove the database files.</para>
    <para>A persistent error state may prevent the deletion of an object or container. If this
        happens, you will see a message such as <code>"Account &lt;name&gt; has not been reaped
        since &lt;date&gt;"</code> in the log. You can control when this is logged with the
        <code>reap_warn_after</code> value in the <code>[account-reaper]</code> section of the
        account-server.conf file. The default value is 30 days.</para>
</section>

doc/common/section_objectstorage-arch.xml (new file)
@@ -0,0 +1,75 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-cluster-architecture">
    <!-- ... Old module003-ch007-swift-cluster-architecture edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Cluster architecture</title>
    <section xml:id="section_access-tier">
        <title>Access tier</title>
        <para>Large-scale deployments segment off an access tier, which is considered the Object
            Storage system's central hub. The access tier fields the incoming API requests from
            clients and moves data in and out of the system. This tier consists of front-end load
            balancers, ssl-terminators, and authentication services. It runs the (distributed) brain
            of the Object Storage system—the proxy server processes.</para>
        <figure>
            <title>Object Storage architecture</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-arch.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Because access servers are collocated in their own tier, you can scale out read/write
            access regardless of the storage capacity. For example, if a cluster is on the public
            Internet, requires SSL termination, and has a high demand for data access, you can
            provision many access servers. However, if the cluster is on a private network and used
            primarily for archival purposes, you need fewer access servers.</para>
        <para>Since this is an HTTP addressable storage service, you may incorporate a load balancer
            into the access tier.</para>
        <para>Typically, the tier consists of a collection of 1U servers. These machines use a
            moderate amount of RAM and are network I/O intensive. Since these systems field each
            incoming API request, you should provision them with two high-throughput (10GbE)
            interfaces: one for the incoming "front-end" requests and the other for the "back-end"
            access to the object storage nodes to put and fetch data.</para>
        <section xml:id="section_access-tier-considerations">
            <title>Factors to consider</title>
            <para>For most publicly facing deployments as well as private deployments available
                across a wide-reaching corporate network, you use SSL to encrypt traffic to the
                client. SSL adds significant processing load to establish sessions between clients,
                which is why you have to provision more capacity in the access layer. SSL may not be
                required for private deployments on trusted networks.</para>
        </section>
    </section>
    <section xml:id="section_storage-nodes">
        <title>Storage nodes</title>
        <para>In most configurations, each of the five zones should have an equal amount of storage
            capacity. Storage nodes use a reasonable amount of memory and CPU. Metadata needs to be
            readily available to return objects quickly. The object stores run services not only to
            field incoming requests from the access tier, but also to run replicators, auditors, and
            reapers. You can provision object stores with a single gigabit or 10 gigabit network
            interface depending on the expected workload and desired performance.</para>
        <figure>
            <title>Object Storage (Swift)</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-nodes.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Currently, 2TB or 3TB SATA disks deliver good price/performance value. You can use
            desktop-grade drives if you have responsive remote hands in the datacenter and
            enterprise-grade drives if you don't.</para>
        <section xml:id="section_storage-nodes-considerations">
            <title>Factors to consider</title>
            <para>You should keep in mind the desired I/O performance for single-threaded requests.
                This system does not use RAID, so a single disk handles each request for an object.
                Disk performance impacts single-threaded response rates.</para>
            <para>To achieve apparent higher throughput, the object storage system is designed to
                handle concurrent uploads/downloads. The network I/O capacity (1GbE, bonded 1GbE
                pair, or 10GbE) should match your desired concurrent throughput needs for reads and
                writes.</para>
        </section>
    </section>
</section>

doc/common/section_objectstorage-characteristics.xml (new file)
@@ -0,0 +1,59 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="objectstorage_characteristics">
    <!-- ... Old module003-ch003-obj-store-capabilities edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Object Storage characteristics</title>
    <para>The key characteristics of Object Storage are:</para>
    <itemizedlist>
        <listitem><para>All objects stored in Object Storage have a URL.</para></listitem>
        <listitem><para>All objects stored are replicated 3✕ in as-unique-as-possible zones, which
            can be defined as a group of drives, a node, a rack, and so on.</para></listitem>
        <listitem><para>All objects have their own metadata.</para></listitem>
        <listitem><para>Developers interact with the object storage system through a RESTful HTTP
            API.</para></listitem>
        <listitem><para>Object data can be located anywhere in the cluster.</para></listitem>
        <listitem><para>The cluster scales by adding additional nodes without sacrificing
            performance, which allows a more cost-effective linear storage expansion than fork-lift
            upgrades.</para></listitem>
        <listitem><para>Data doesn't have to be migrated to an entirely new storage
            system.</para></listitem>
        <listitem><para>New nodes can be added to the cluster without downtime.</para></listitem>
        <listitem><para>Failed nodes and disks can be swapped out without downtime.</para></listitem>
        <listitem><para>It runs on industry-standard hardware, such as Dell, HP, and
            Supermicro.</para></listitem>
    </itemizedlist>
    <figure>
        <title>Object Storage (Swift)</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="../common/figures/objectstorage.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>Developers can either write directly to the Swift API or use one of the many client
        libraries that exist for all of the popular programming languages, such as Java, Python,
        Ruby, and C#. Amazon S3 and RackSpace Cloud Files users should be very familiar with Object
        Storage. Users new to object storage systems will have to adjust to a different approach and
        mindset than those required for a traditional filesystem.</para>
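    <para>As an illustration only, the following minimal sketch uses the
        <literal>python-swiftclient</literal> library to store and retrieve an object; the
        authentication URL, credentials, container, and object names are placeholders, not values
        from this guide.</para>
    <programlisting language="python">import swiftclient

# Placeholder credentials for a TempAuth-style endpoint.
conn = swiftclient.Connection(authurl='http://127.0.0.1:8080/auth/v1.0',
                              user='account:user', key='secret')

conn.put_container('photos')                           # create (or reuse) a container
conn.put_object('photos', 'cat.jpg', contents=b'...')  # upload object data
headers, body = conn.get_object('photos', 'cat.jpg')   # download it again
print(headers['etag'], len(body))</programlisting>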
</section>

doc/common/section_objectstorage-components.xml (new file)
@@ -0,0 +1,236 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-components">
    <!-- ... Old module003-ch004-swift-building-blocks edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Components</title>
    <para>The components that enable Object Storage to deliver high availability, high durability,
        and high concurrency are:</para>
    <itemizedlist>
        <listitem><para><emphasis role="bold">Proxy servers—</emphasis>Handle all of the incoming
            API requests.</para></listitem>
        <listitem><para><emphasis role="bold">Rings—</emphasis>Map logical names of data to
            locations on particular disks.</para></listitem>
        <listitem><para><emphasis role="bold">Zones—</emphasis>Isolate data from other zones. A
            failure in one zone doesn't impact the rest of the cluster because data is replicated
            across zones.</para></listitem>
        <listitem><para><emphasis role="bold">Accounts and containers—</emphasis>Each account and
            container are individual databases that are distributed across the cluster. An account
            database contains the list of containers in that account. A container database contains
            the list of objects in that container.</para></listitem>
        <listitem><para><emphasis role="bold">Objects—</emphasis>The data itself.</para></listitem>
        <listitem><para><emphasis role="bold">Partitions—</emphasis>A partition stores objects,
            account databases, and container databases and helps manage locations where data lives
            in the cluster.</para></listitem>
    </itemizedlist>
    <figure>
        <title>Object Storage building blocks</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="../common/figures/objectstorage-buildingblocks.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <section xml:id="section_proxy-servers">
        <title>Proxy servers</title>
        <para>Proxy servers are the public face of Object Storage and handle all of the incoming API
            requests. Once a proxy server receives a request, it determines the storage node based
            on the object's URL, for example, https://swift.example.com/v1/account/container/object.
            Proxy servers also coordinate responses, handle failures, and coordinate
            timestamps.</para>
        <para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on
            projected workloads. A minimum of two proxy servers should be deployed for redundancy.
            If one proxy server fails, the others take over.</para>
    </section>
    <section xml:id="section_ring">
        <title>Rings</title>
        <para>A ring represents a mapping between the names of entities stored on disk and their
            physical locations. There are separate rings for accounts, containers, and objects. When
            other components need to perform any operation on an object, container, or account, they
            need to interact with the appropriate ring to determine their location in the
            cluster.</para>
        <para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each
            partition in the ring is replicated, by default, three times across the cluster, and
            partition locations are stored in the mapping maintained by the ring. The ring is also
            responsible for determining which devices are used for handoff in failure
            scenarios.</para>
        <para>Data can be isolated into zones in the ring. Each partition replica is guaranteed to
            reside in a different zone. A zone could represent a drive, a server, a cabinet, a
            switch, or even a data center.</para>
        <para>The partitions of the ring are equally divided among all of the devices in the Object
            Storage installation. When partitions need to be moved around (for example, if a device
            is added to the cluster), the ring ensures that a minimum number of partitions are moved
            at a time, and only one replica of a partition is moved at a time.</para>
        <para>Weights can be used to balance the distribution of partitions on drives across the
            cluster. This can be useful, for example, when differently sized drives are used in a
            cluster.</para>
        <para>The ring is used by the proxy server and several background processes (like
            replication).</para>
        <figure>
            <title>The <emphasis role="bold">ring</emphasis></title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-ring.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>These rings are externally managed, in that the server processes themselves do not
            modify the rings, they are instead given new rings modified by other tools.</para>
        <para>The ring uses a configurable number of bits from a path's MD5 hash as a partition
            index that designates a device. The number of bits kept from the hash is known as the
            partition power, and 2 to the partition power indicates the partition count.
            Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches
            of items at once, which ends up either more efficient or at least less complex than
            working with each item separately or the entire cluster all at once.</para>
        <para>Another configurable value is the replica count, which indicates how many of the
            partition-device assignments make up a single ring. For a given partition number, each
            replica's device will not be in the same zone as any other replica's device. Zones can
            be used to group devices based on physical locations, power separations, network
            separations, or any other attribute that would improve the availability of multiple
            replicas at the same time.</para>
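        <para>The following minimal Python sketch illustrates the idea of mapping a path to a
            partition with a partition power; the partition power value here is an arbitrary
            example, and real Swift additionally mixes a cluster-wide hash path prefix and suffix
            into the hashed path.</para>
        <programlisting language="python">import hashlib
from struct import unpack_from

PART_POWER = 18               # example value; chosen when the ring is built
PART_SHIFT = 32 - PART_POWER  # bits of the hash that are discarded

def partition_for(account, container=None, obj=None):
    # Keep only the top bits of the path's MD5 hash as the partition index.
    path = '/' + '/'.join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode('utf-8')).digest()
    return unpack_from('>I', digest)[0] >> PART_SHIFT

print(partition_for('AUTH_test', 'photos', 'cat.jpg'))  # an integer in [0, 2**PART_POWER)</programlisting>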
    </section>
    <section xml:id="section_zones">
        <title>Zones</title>
        <para>Object Storage allows configuring zones in order to isolate failure boundaries. Each
            data replica resides in a separate zone, if possible. At the smallest level, a zone
            could be a single drive or a grouping of a few drives. If there were five object storage
            servers, then each server would represent its own zone. Larger deployments would have an
            entire rack (or multiple racks) of object servers, each representing a zone. The goal of
            zones is to allow the cluster to tolerate significant outages of storage servers without
            losing all replicas of the data.</para>
        <para>As mentioned earlier, everything in Object Storage is stored, by default, three times.
            Swift will place each replica "as-uniquely-as-possible" to ensure both high availability
            and high durability. This means that when choosing a replica location, Object Storage
            chooses a server in an unused zone before an unused server in a zone that already has a
            replica of the data.</para>
        <figure>
            <title>Zones</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-zones.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>When a disk fails, replica data is automatically distributed to the other zones to
            ensure there are three copies of the data.</para>
    </section>
    <section xml:id="section_accounts-containers">
        <title>Accounts and containers</title>
        <para>Each account and container is an individual SQLite database that is distributed across
            the cluster. An account database contains the list of containers in that account. A
            container database contains the list of objects in that container.</para>
        <figure>
            <title>Accounts and containers</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-accountscontainers.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>To keep track of object data locations, each account in the system has a database that
            references all of its containers, and each container database references each
            object.</para>
    </section>
    <section xml:id="section_partitions">
        <title>Partitions</title>
        <para>A partition is a collection of stored data, including account databases, container
            databases, and objects. Partitions are core to the replication system.</para>
        <para>Think of a partition as a bin moving throughout a fulfillment center warehouse.
            Individual orders get thrown into the bin. The system treats that bin as a cohesive
            entity as it moves throughout the system. A bin is easier to deal with than many little
            things. It makes for fewer moving parts throughout the system.</para>
        <para>System replicators and object uploads/downloads operate on partitions. As the system
            scales up, its behavior continues to be predictable because the number of partitions is
            a fixed number.</para>
        <para>Implementing a partition is conceptually simple—a partition is just a directory
            sitting on a disk with a corresponding hash table of what it contains.</para>
        <figure>
            <title>Partitions</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-partitions.png"/>
                </imageobject>
            </mediaobject>
        </figure>
    </section>
    <section xml:id="section_replicators">
        <title>Replicators</title>
        <para>In order to ensure that there are three copies of the data everywhere, replicators
            continuously examine each partition. For each local partition, the replicator compares
            it against the replicated copies in the other zones to see if there are any
            differences.</para>
        <para>The replicator knows if replication needs to take place by examining hashes. A hash
            file is created for each partition, which contains hashes of each directory in the
            partition. For a given partition, the hash files for each of the partition's three
            copies are compared. If the hashes are different, then it is time to replicate, and the
            directory that needs to be replicated is copied over.</para>
        <para>This is where partitions come in handy. With fewer things in the system, larger chunks
            of data are transferred around (rather than lots of little TCP connections, which is
            inefficient) and there is a consistent number of hashes to compare.</para>
        <para>The cluster eventually has a consistent behavior where the newest data has a
            priority.</para>
        <figure>
            <title>Replication</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-replication.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>If a zone goes down, one of the nodes containing a replica notices and proactively
            copies data to a handoff location.</para>
    </section>
    <section xml:id="section_usecases">
        <title>Use cases</title>
        <para>The following sections show use cases for object uploads and downloads and introduce
            the components.</para>
        <section xml:id="upload">
            <title>Upload</title>
            <para>A client uses the REST API to make an HTTP request to PUT an object into an
                existing container. The cluster receives the request. First, the system must figure
                out where the data is going to go. To do this, the account name, container name, and
                object name are all used to determine the partition where this object should
                live.</para>
            <para>Then a lookup in the ring figures out which storage nodes contain the partitions
                in question.</para>
            <para>The data is then sent to each storage node where it is placed in the appropriate
                partition. At least two of the three writes must be successful before the client is
                notified that the upload was successful.</para>
            <para>Next, the container database is updated asynchronously to reflect that there is a
                new object in it.</para>
            <figure>
                <title>Object Storage in use</title>
                <mediaobject>
                    <imageobject>
                        <imagedata fileref="../common/figures/objectstorage-usecase.png"/>
                    </imageobject>
                </mediaobject>
            </figure>
        </section>
        <section xml:id="section_swift-component-download">
            <title>Download</title>
            <para>A request comes in for an account/container/object. Using the same consistent
                hashing, the partition name is generated. A lookup in the ring reveals which storage
                nodes contain that partition. A request is made to one of the storage nodes to fetch
                the object and, if that fails, requests are made to the other nodes.</para>
        </section>
    </section>
</section>

doc/common/section_objectstorage-features.xml (new file)
@@ -0,0 +1,180 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage_features">
    <!-- ... Old module003-ch002-features-benefits edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Features and benefits</title>
    <para>
        <informaltable class="c19">
            <tbody>
                <tr>
                    <th>Features</th>
                    <th>Benefits</th>
                </tr>
                <tr>
                    <td><emphasis role="bold">Leverages commodity hardware</emphasis></td>
                    <td>No lock-in, lower price/GB</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">HDD/node failure agnostic</emphasis></td>
                    <td>Self-healing, reliable, data redundancy protects from failures</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Unlimited storage</emphasis></td>
                    <td>Large and flat namespace, highly scalable read/write access, able to serve
                        content directly from storage system</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Multi-dimensional scalability</emphasis></td>
                    <td>Scale-out architecture—Scale vertically and horizontally-distributed
                        storage. Backs up and archives large amounts of data with linear
                        performance</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Account/container/object structure</emphasis></td>
                    <td>No nesting, not a traditional file system—Optimized for scale, it scales
                        to multiple petabytes and billions of objects</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Built-in replication 3✕ + data redundancy (compared
                        with 2✕ on RAID)</emphasis></td>
                    <td>A configurable number of accounts, containers and object copies for high
                        availability</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Easily add capacity (unlike RAID resize)</emphasis></td>
                    <td>Elastic data scaling with ease</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">No central database</emphasis></td>
                    <td>Higher performance, no bottlenecks</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">RAID not required</emphasis></td>
                    <td>Handle many small, random reads and writes efficiently</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Built-in management utilities</emphasis></td>
                    <td>Account management—Create, add, verify, and delete users; Container
                        management—Upload, download, and verify; Monitoring—Capacity, host,
                        network, log trawling, and cluster health</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Drive auditing</emphasis></td>
                    <td>Detect drive failures preempting data corruption</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Expiring objects</emphasis></td>
                    <td>Users can set an expiration time or a TTL on an object to control access</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Direct object access</emphasis></td>
                    <td>Enable direct browser access to content, such as for a control panel</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Realtime visibility into client requests</emphasis></td>
                    <td>Know what users are requesting</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Supports S3 API</emphasis></td>
                    <td>Utilize tools that were designed for the popular S3 API</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Restrict containers per account</emphasis></td>
                    <td>Limit access to control usage by user</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Support for NetApp, Nexenta, SolidFire</emphasis></td>
                    <td>Unified support for block volumes using a variety of storage systems</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Snapshot and backup API for block volumes</emphasis></td>
                    <td>Data protection and recovery for VM data</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Standalone volume API available</emphasis></td>
                    <td>Separate endpoint and API for integration with other compute systems</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Integration with Compute</emphasis></td>
                    <td>Fully integrated with Compute for attaching block volumes and reporting on
                        usage</td>
                </tr>
            </tbody>
        </informaltable>
    </para>
</section>

doc/common/section_objectstorage-intro.xml (new file)
@@ -0,0 +1,23 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-intro">
    <!-- ... Old module003-ch001-intro-objstore edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Introduction to Object Storage</title>
    <para>OpenStack Object Storage (code-named Swift) is open source software for creating
        redundant, scalable data storage using clusters of standardized servers to store petabytes
        of accessible data. It is a long-term storage system for large amounts of static data that
        can be retrieved, leveraged, and updated. Object Storage uses a distributed architecture
        with no central point of control, providing greater scalability, redundancy, and permanence.
        Objects are written to multiple hardware devices, with the OpenStack software responsible
        for ensuring data replication and integrity across the cluster. Storage clusters scale
        horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its
        content from other active nodes. Because OpenStack uses software logic to ensure data
        replication and distribution across different devices, inexpensive commodity hard drives and
        servers can be used in lieu of more expensive equipment.</para>
    <para>Object Storage is ideal for cost effective, scale-out storage. It provides a fully
        distributed, API-accessible storage platform that can be integrated directly into
        applications or used for backup, archiving, and data retention.</para>
</section>

doc/common/section_objectstorage-replication.xml (new file)
@@ -0,0 +1,99 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-replication">
    <!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Replication</title>
    <para>Because each replica in Object Storage functions independently, and clients generally
        require only a simple majority of nodes responding to consider an operation successful,
        transient failures like network partitions can quickly cause replicas to diverge. These
        differences are eventually reconciled by asynchronous, peer-to-peer replicator processes.
        The replicator processes traverse their local filesystems, concurrently performing
        operations in a manner that balances load across physical disks.</para>
    <para>Replication uses a push model, with records and files generally only being copied from
        local to remote replicas. This is important because data on the node may not belong there
        (as in the case of handoffs and ring changes), and a replicator can't know what data exists
        elsewhere in the cluster that it should pull in. It's the duty of any node that contains
        data to ensure that data gets to where it belongs. Replica placement is handled by the
        ring.</para>
    <para>Every deleted record or file in the system is marked by a tombstone, so that deletions
        can be replicated alongside creations. The replication process cleans up tombstones after a
        time period known as the consistency window. The consistency window encompasses replication
        duration and how long a transient failure can remove a node from the cluster. Tombstone
        cleanup must be tied to replication to reach replica convergence.</para>
    <para>If a replicator detects that a remote drive has failed, the replicator uses the
        get_more_nodes interface for the ring to choose an alternate node with which to synchronize.
        The replicator can maintain desired levels of replication in the face of disk failures,
        though some replicas may not be in an immediately usable location. Note that the replicator
        doesn't maintain desired levels of replication when other failures, such as entire node
        failures, occur because most failures are transient.</para>
    <para>Replication is an area of active development, and likely rife with potential improvements
        to speed and correctness.</para>
    <para>There are two major classes of replicator—the db replicator, which replicates accounts
        and containers, and the object replicator, which replicates object data.</para>
    <section xml:id="section_database-replication">
        <title>Database replication</title>
        <para>The first step performed by db replication is a low-cost hash comparison to determine
            whether two replicas already match. Under normal operation, this check is able to verify
            that most databases in the system are already synchronized very quickly. If the hashes
            differ, the replicator brings the databases in sync by sharing records added since the
            last sync point.</para>
        <para>This sync point is a high water mark noting the last record at which two databases
            were known to be in sync, and is stored in each database as a tuple of the remote
            database id and record id. Database ids are unique amongst all replicas of the database,
            and record ids are monotonically increasing integers. After all new records have been
            pushed to the remote database, the entire sync table of the local database is pushed, so
            the remote database can guarantee that it is in sync with everything with which the
            local database has previously synchronized.</para>
        <para>If a replica is found to be missing entirely, the whole local database file is
            transmitted to the peer using rsync(1) and vested with a new unique id.</para>
        <para>In practice, DB replication can process hundreds of databases per concurrency setting
            per second (up to the number of available CPUs or disks) and is bound by the number of
            DB transactions that must be performed.</para>
    </section>
    <section xml:id="section_object-replication">
        <title>Object replication</title>
        <para>The initial implementation of object replication simply performed an rsync to push
            data from a local partition to all remote servers it was expected to exist on. While
            this performed adequately at small scale, replication times skyrocketed once directory
            structures could no longer be held in RAM. We now use a modification of this scheme in
            which a hash of the contents for each suffix directory is saved to a per-partition
            hashes file. The hash for a suffix directory is invalidated when the contents of that
            suffix directory are modified.</para>
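        <para>Conceptually, computing one hash per suffix directory of a partition looks like the
            sketch below. This is an illustration of the idea only, not Swift's actual hashing code,
            and the on-disk layout it walks is assumed.</para>
        <programlisting language="python">import hashlib
import os

def suffix_hashes(partition_dir):
    """Return {suffix_dir_name: hash} for every suffix directory in a partition."""
    hashes = {}
    for suffix in sorted(os.listdir(partition_dir)):
        md5 = hashlib.md5()
        for root, _dirs, files in os.walk(os.path.join(partition_dir, suffix)):
            for name in sorted(files):
                # Hash the object file names (which embed timestamps), so any change
                # in a suffix directory changes that directory's hash.
                md5.update(name.encode('utf-8'))
        hashes[suffix] = md5.hexdigest()
    return hashes</programlisting>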
        <para>The object replication process reads in these hash files, calculating any invalidated
            hashes. It then transmits the hashes to each remote server that should hold the
            partition, and only suffix directories with differing hashes on the remote server are
            rsynced. After pushing files to the remote server, the replication process notifies it
            to recalculate hashes for the rsynced suffix directories.</para>
        <para>Performance of object replication is generally bound by the number of uncached
            directories it has to traverse, usually as a result of invalidated suffix directory
            hashes. Using write volume and partition counts from our running systems, it was
            designed so that around 2 percent of the hash space on a normal node will be invalidated
            per day, which has experimentally given us acceptable replication speeds.</para>
    </section>
</section>

doc/common/section_objectstorage-ringbuilder.xml (new file)
@@ -0,0 +1,129 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-ringbuilder">
    <!-- ... Old module003-ch005-the-ring edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Ring-builder</title>
    <para>Rings are built and managed manually by a utility called the ring-builder. The
        ring-builder assigns partitions to devices and writes an optimized Python structure to a
        gzipped, serialized file on disk for shipping out to the servers. The server processes just
        check the modification time of the file occasionally and reload their in-memory copies of
        the ring structure as needed. Because of how the ring-builder manages changes to the ring,
        using a slightly older ring usually just means one of the three replicas for a subset of the
        partitions will be incorrect, which can be easily worked around.</para>
    <para>The ring-builder also keeps its own builder file with the ring information and additional
        data required to build future rings. It is very important to keep multiple backup copies of
        these builder files. One option is to copy the builder files out to every server while
        copying the ring files themselves. Another is to upload the builder files into the cluster
        itself. If you lose the builder file, you have to create a new ring from scratch. Nearly all
        partitions would be assigned to different devices and, therefore, nearly all of the stored
        data would have to be replicated to new locations. So, recovery from a builder file loss is
        possible, but data would be unreachable for an extended time.</para>
    <section xml:id="section_ring-data-structure">
        <title>Ring data structure</title>
        <para>The ring data structure consists of three top level fields: a list of devices in the
            cluster, a list of lists of device ids indicating partition to device assignments, and
            an integer indicating the number of bits to shift an MD5 hash to calculate the partition
            for the hash.</para>
    </section>
    <section xml:id="section_partition-assignment">
        <title>Partition assignment list</title>
        <para>This is a list of <literal>array('H')</literal> of device ids. The outermost list
            contains an <literal>array('H')</literal> for each replica. Each
            <literal>array('H')</literal> has a length equal to the partition count for the ring.
            Each integer in the <literal>array('H')</literal> is an index into the above list of
            devices. The partition list is known internally to the Ring class as
            <literal>_replica2part2dev_id</literal>.</para>
        <para>So, to create a list of device dictionaries assigned to a partition, the Python code
            would look like:
            <programlisting>devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]</programlisting></para>
        <para>That code is a little simplistic, as it does not account for the removal of duplicate
            devices. If a ring has more replicas than devices, then a partition will have more than
            one replica on one device.</para>
        <para><literal>array('H')</literal> is used for memory conservation as there may be millions
            of partitions.</para>
    </section>
    <section xml:id="section_fractional-replicas">
        <title>Fractional replicas</title>
        <para>A ring is not restricted to having an integer number of replicas. In order to support
            the gradual changing of replica counts, the ring is able to have a real number of
            replicas.</para>
        <para>When the number of replicas is not an integer, then the last element of
            <literal>_replica2part2dev_id</literal> will have a length that is less than the
            partition count for the ring. This means that some partitions will have more replicas
            than others. For example, if a ring has 3.25 replicas, then 25 percent of its partitions
            will have four replicas, while the remaining 75 percent will have just three.</para>
    </section>
    <section xml:id="section_partition-shift-value">
        <title>Partition shift value</title>
        <para>The partition shift value is known internally to the Ring class as
            <literal>_part_shift</literal>. This value is used to shift an MD5 hash to calculate the
            partition on which the data for that hash should reside. Only the top four bytes of the
            hash are used in this process. For example, to compute the partition for the path
            /account/container/object the Python code might look like:
            <programlisting>partition = unpack_from('>I',
md5('/account/container/object').digest())[0] >>
self._part_shift</programlisting></para>
        <para>For a ring generated with part_power P, the partition shift value is
            <literal>32 - P</literal>.</para>
    </section>
    <section xml:id="section_build-ring">
        <title>Build the ring</title>
        <para>The initial building of the ring first calculates the number of partitions that should
            ideally be assigned to each device based on the device's weight. For example, given a
            partition power of 20, the ring will have 1,048,576 partitions. If there are 1,000
            devices of equal weight they will each desire 1,048.576 partitions. The devices are then
            sorted by the number of partitions they desire and kept in order throughout the
            initialization process.</para>
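        <para>As a quick illustrative sketch of that arithmetic (the device names and weights below
            are placeholders, and the real ring-builder also factors in the replica count when
            computing how many partitions each device wants):</para>
        <programlisting language="python">PART_POWER = 20
partitions = 2 ** PART_POWER                            # 1,048,576 partitions in the ring
weights = {'sdb-%03d' % i: 100.0 for i in range(1000)}  # 1,000 devices of equal weight
total_weight = sum(weights.values())

# Ideal (fractional) number of partitions each device desires, per the example above.
desired = {dev: partitions * w / total_weight for dev, w in weights.items()}
print(desired['sdb-000'])   # 1048.576</programlisting>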
        <note><para>Each device is also assigned a random tiebreaker value that is used when two
            devices desire the same number of partitions. This tiebreaker is not stored on disk
            anywhere, and so two different rings created with the same parameters will have
            different partition assignments. For repeatable partition assignments,
            <literal>RingBuilder.rebalance()</literal> takes an optional seed value that will be
            used to seed Python's pseudo-random number generator.</para></note>
        <para>Then, the ring builder assigns each replica of each partition to the device that
            requires most partitions at that point while keeping it as far away as possible from
            other replicas. The ring builder prefers to assign a replica to a device in a region
            that does not already have a replica. If no such region is available, the ring builder
            tries to find a device in a different zone. If that's not possible, it will look on a
            different server. If it doesn't find one there, it will just look for a device that has
            no replicas. Finally, if all of the other options are exhausted, the ring builder
            assigns the replica to the device that has the fewest replicas already assigned. Note
            that assignment of multiple replicas to one device will only happen if the ring has
            fewer devices than it has replicas.</para>
        <para>When building a new ring based on an old ring, the desired number of partitions each
            device wants is recalculated. Next, the partitions to be reassigned are gathered up. Any
            removed devices have all their assigned partitions unassigned and added to the gathered
            list. Any partition replicas that (due to the addition of new devices) can be spread out
            for better durability are unassigned and added to the gathered list. Any devices that
            have more partitions than they now need have random partitions unassigned from them and
            added to the gathered list. Lastly, the gathered partitions are then reassigned to
            devices using a similar method as in the initial assignment described above.</para>
        <para>Whenever a partition has a replica reassigned, the time of the reassignment is
            recorded. This is taken into account when gathering partitions to reassign so that no
            partition is moved twice in a configurable amount of time. This configurable amount of
            time is known internally to the RingBuilder class as <literal>min_part_hours</literal>.
            This restriction is ignored for replicas of partitions on devices that have been
            removed, since removing a device only happens on device failure and reassignment is the
            only choice.</para>
        <para>The above processes don't always perfectly rebalance a ring due to the random nature
            of gathering partitions for reassignment. To help reach a more balanced ring, the
            rebalance process is repeated until near perfect (less than 1 percent off) or when the
            balance doesn't improve by at least 1 percent (indicating we probably can't get perfect
            balance due to wildly imbalanced zones or too many partitions recently moved).</para>
    </section>
</section>
106
doc/common/section_objectstorage-troubleshoot.xml
Normal file
@@ -0,0 +1,106 @@
|
|||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<section xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
|
||||||
|
xml:id="troubleshooting-openstack-object-storage">
|
||||||
|
<title>Troubleshoot Object Storage</title>
|
||||||
|
<para>For Object Storage, everything is logged in <filename>/var/log/syslog</filename> (or messages on some distros).
|
||||||
|
Several settings enable further customization of logging, such as <literal>log_name</literal>, <literal>log_facility</literal>,
|
||||||
|
and <literal>log_level</literal>, within the object server configuration files.</para>
|
||||||
|
<section xml:id="drive-failure">
|
||||||
|
<title>Drive failure</title>
|
||||||
|
<para>In the event that a drive has failed, the first step is to make sure the drive is
|
||||||
|
unmounted. This will make it easier for Object Storage to work around the failure until
|
||||||
|
it has been resolved. If the drive is going to be replaced immediately, then it is just
|
||||||
|
best to replace the drive, format it, remount it, and let replication fill it up.</para>
|
||||||
|
<para>If the drive can’t be replaced immediately, then it is best to leave it
|
||||||
|
unmounted, and remove the drive from the ring. This will allow all the replicas
|
||||||
|
that were on that drive to be replicated elsewhere until the drive is replaced.
|
||||||
|
Once the drive is replaced, it can be re-added to the ring.</para>
|
||||||
|
<para>You can look at error messages in <filename>/var/log/kern.log</filename> for hints of drive failure.</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="server-failure">
|
||||||
|
<title>Server failure</title>
|
||||||
|
<para>If a server is having hardware issues, it is a good idea to make sure the
|
||||||
|
Object Storage services are not running. This will allow Object Storage to
|
||||||
|
work around the failure while you troubleshoot.</para>
<para>If the server just needs a reboot, or a small amount of work that should only
last a couple of hours, it is probably best to let Object Storage work
around the failure while you get the machine fixed and back online. When the machine
comes back online, replication makes sure that anything that was missed
during the downtime gets updated.</para>
<para>If the server has more serious issues, it is probably best to remove all
of the server's devices from the ring. Once the server has been repaired and is
back online, the server's devices can be added back into the ring. It is
important to reformat the devices before putting them back into the
ring, because they are likely to be responsible for a different set of partitions than
before.</para>
</section>
<section xml:id="detect-failed-drives">
<title>Detect failed drives</title>
<para>In our experience, when a drive is about to fail, error messages spew into
<filename>/var/log/kern.log</filename>. There is a script called <literal>swift-drive-audit</literal> that can be run via cron
to watch for bad drives. If errors are detected, it unmounts the bad drive so that
Object Storage can work around it. The script takes a configuration file with the
following settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has only been tested on Ubuntu 10.04, so if you are using a
different distribution or operating system, take some care before using it in
production.</para>
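<para>A minimal sketch of such a cron entry (the ten-minute interval, the binary path, and
the configuration file path <filename>/etc/swift/drive-audit.conf</filename> are
assumptions to adapt for your deployment):</para>
<programlisting># Illustrative /etc/cron.d entry for swift-drive-audit
*/10 * * * * root /usr/bin/swift-drive-audit /etc/swift/drive-audit.conf</programlisting>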
</section>
<section xml:id="recover-ring-builder-file">
<title>Emergency recovery of ring builder files</title>
<para>You should always keep a backup of Swift ring builder files. However, if an
emergency occurs, this procedure may assist in returning your cluster to an
operational state.</para>
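<para>A trivial illustration of such a backup (the paths are assumptions; any copy kept
off the node works just as well):</para>
<programlisting># Illustrative: keep dated copies of the builder files
$ mkdir -p /var/backups/swift-rings/$(date +%F)
$ cp /etc/swift/*.builder /var/backups/swift-rings/$(date +%F)/</programlisting>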
<para>There is no way to recover a builder file from a
<filename>ring.gz</filename> file using the existing Swift tools. However, if you have some knowledge of Python, it is possible to
construct a builder file that is pretty close to the one you have lost. Here is
what you need to do.</para>
<warning>
<para>This procedure is a last resort for emergency circumstances; it
requires knowledge of the swift Python code and may not succeed.</para>
</warning>
<para>First, load the ring and a new ringbuilder object in a Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
<para>Now, start copying the data we have in the ring into the builder.</para>
<programlisting language="python">
>>> import math
>>> from array import array
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)

>>> # The part power is recovered from the partition count
>>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring._replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(builder.parts)))
>>> builder._set_parts_wanted()
>>> # Recompute each device's partition count from the replica-to-device table
>>> for d in builder._iter_devs():
...     d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
...     for dev_id in p2d:
...         builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable fields. For
<literal>min_part_hours</literal> you'll either have to remember the value you used,
or just make up a new one.</para>
<programlisting language="python">
>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>Try some validation: if this doesn't raise an exception, you may feel some
hope. Not too much, though.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
<para>Save the builder.</para>
<programlisting language="python">
>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>You should now have a file called <filename>account.builder</filename> in the current working
directory. Next, run <literal>swift-ring-builder account.builder write_ring</literal>
and compare the new <filename>account.ring.gz</filename> to the <filename>account.ring.gz</filename> that you started
from. They probably won't be byte-for-byte identical, but if you load them up
in a REPL and their <literal>_replica2part2dev_id</literal> and
<literal>devs</literal> attributes are the same (or nearly so), then you're
in good shape.</para>
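<para>A quick way to do that comparison, sketched here as a hypothetical REPL session
(the file paths are examples):</para>
<programlisting language="python">>>> # Illustrative comparison; adjust paths for your environment
>>> from swift.common.ring import RingData
>>> old_ring = RingData.load('/path/to/original/account.ring.gz')
>>> new_ring = RingData.load('account.ring.gz')
>>> old_ring.devs == new_ring.devs  # expect True, or inspect the differences
>>> old_ring._replica2part2dev_id == new_ring._replica2part2dev_id  # likewise</programlisting>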
<para>Next, repeat the procedure for <filename>container.ring.gz</filename>
and <filename>object.ring.gz</filename>, and you might get usable builder files.</para>
</section>
</section>
@@ -1,144 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="troubleshooting-openstack-object-storage">
<title>Troubleshoot Object Storage</title>
<para>For OpenStack Object Storage, everything is logged in
<filename>/var/log/syslog</filename> (or messages on some
distros). Several settings enable further customization of
logging, such as <option>log_name</option>,
<option>log_facility</option>, and
<option>log_level</option>, within the object server
configuration files.</para>
<section xml:id="handling-drive-failure">
<title>Recover drive failures</title>
<para>If a drive fails, make sure the
drive is unmounted to make it easier for Object
Storage to work around the failure while you resolve
it. If you plan to replace the drive immediately, replace
the drive, format it, remount it, and let replication fill
it.</para>
<para>If you cannot replace the drive immediately, leave it
unmounted and remove the drive from the ring. This enables
you to replicate all the replicas on that drive elsewhere
until you can replace the drive. After you replace the
drive, you can add it to the ring again.</para>
<note>
<para>Rackspace has seen hints at drive failures by
looking at error messages in
<filename>/var/log/kern.log</filename>. Check this
file in your monitoring.</para>
</note>
</section>
<section xml:id="handling-server-failure">
<title>Recover server failures</title>
<para>If a server has hardware issues, make sure that the
Object Storage services are not running. This enables
Object Storage to work around the failure while you
troubleshoot.</para>
<para>If the server needs a reboot or a minimal amount of
work, let Object Storage work around the failure while you
fix the machine and get it back online. When the machine
comes back online, replication updates anything that was
missing during the downtime.</para>
<para>If the server has more serious issues, remove all server
devices from the ring. After you repair and put the server
online, you can add the devices for the server back to the
ring. You must reformat the devices before you add them to
the ring because they might be responsible for a different
set of partitions than before.</para>
</section>
<section xml:id="detecting-failed-drives">
<title>Detect failed drives</title>
<para>When a drive is about to fail, many error messages
appear in the <filename>/var/log/kern.log</filename> file.
You can run the <package>swift-drive-audit</package>
script through <command>cron</command> to watch for bad
drives. If errors are detected, it unmounts the bad drive
so that Object Storage can work around it. The script uses
a configuration file with these settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has been tested on only Ubuntu 10.04. If you
use a different distribution or operating system, take
care before using the script in production.</para>
</section>
<section xml:id="recover-ring-builder-file">
<title>Recover ring builder files (emergency)</title>
<para>You should always keep a backup of Swift ring builder
files. However, if an emergency occurs, use this procedure
to return your cluster to an operational state.</para>
<para>Existing Swift tools do not enable you to recover a
builder file from a <filename>ring.gz</filename> file.
However, if you have Python knowledge, you can construct a
builder file similar to the one you have lost.</para>
<warning>
<para>This procedure is a last-resort in an emergency. It
requires knowledge of the swift Python code and might
not succeed.</para>
</warning>
<procedure>
<step>
<para>Load the ring and a new ringbuilder object in a
Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
</step>
<step>
<para>Copy the data in the ring into the
builder.</para>
<programlisting language="python">>>> import math
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)

>>> builder = RingBuilder(int(Math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring.replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(self.parts)))
>>> builder._set_parts_wanted()
>>> for d in builder._iter_devs():
d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
for dev_id in p2d:
builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable
fields.</para>
</step>
<step>
<para>For <option>min_part_hours</option>, you must
remember the value that you used previously or
create a new value.</para>
<programlisting language="python">>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>If validation succeeds without raising an
exception, you have succeeded.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
</step>
<step>
<para>Save the builder.</para>
<programlisting language="python">>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>The <filename>account.builder</filename> file
appears in the current working directory.</para>
</step>
<step>
<para>Run <literal>swift-ring-builder account.builder
write_ring</literal>.</para>
<para>Compare the new
<filename>account.ring.gz</filename> to the
original <filename>account.ring.gz</filename>
file. They might not be byte-for-byte identical,
but if you load them in REPL and their
<option>_replica2part2dev_id</option> and
<option>devs</option> attributes are the same
(or nearly so), you have succeeded.</para>
</step>
<step>
<para>Repeat this procedure for the
<filename>container.ring.gz</filename> and
<filename>object.ring.gz</filename> files, and
you might get usable builder files.</para>
</step>
</procedure>
</section>
</chapter>
@@ -69,7 +69,8 @@ format="PNG" />
</imageobject>
</mediaobject>
</informalfigure>
<para>There will be three hosts in the setup.<table rules="all">
<para>There will be three hosts in the setup.</para>
<table rules="all">
<caption>Hosts for Demo</caption>
<thead>
<tr>
@@ -103,7 +104,7 @@ format="PNG" />
<td>Same as HostA</td>
</tr>
</tbody>
</table></para>
</table>
<section xml:id="multi_agent_demo_configuration">
<title>Configuration</title>
<itemizedlist>