Restructure Object Storage chapter of Cloud Admin Guide

Restores Troubleshoot Object Storage
Removes Monitoring section, which was based on a blog

backport: havana
Closes-Bug: #1251515
author: nermina miller

Change-Id: I580b077a0124d7cd54dced6c0d340e05d5d5f983
nerminamiller 2013-12-10 02:46:28 -05:00
parent 7e8c23eb28
commit 2163ad9a00
23 changed files with 959 additions and 147 deletions

@ -5,6 +5,13 @@
xml:id="ch_admin-openstack-object-storage">
<?dbhtml stop-chunking?>
<title>Object Storage</title>
<xi:include href="../common/section_about-object-storage.xml"/>
<xi:include href="../common/section_objectstorage-intro.xml"/>
<xi:include href="../common/section_objectstorage-features.xml"/>
<xi:include href="../common/section_objectstorage-characteristics.xml"/>
<xi:include href="../common/section_objectstorage-components.xml"/>
<xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
<xi:include href="../common/section_objectstorage-arch.xml"/>
<xi:include href="../common/section_objectstorage-replication.xml"/>
<xi:include href="section_object-storage-monitoring.xml"/>
<xi:include href="../common/section_objectstorage-troubleshoot.xml"/>
</chapter>

@ -3,6 +3,7 @@
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="ch_introduction-to-openstack-object-storage-monitoring">
<!-- ... Based on a blog, should be replaced with original material... -->
<title>Object Storage monitoring</title>
<?dbhtml stop-chunking?>
<para>Excerpted from a blog post by <link

@ -0,0 +1,40 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-account-reaper">
<!-- ... Old module003-ch008-account-reaper edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Account reaper</title>
<para>In the background, the account reaper removes data from deleted accounts.</para>
<para>A reseller marks an account for deletion by issuing a <code>DELETE</code> request on the account's
storage URL. This action sets the <code>status</code> column of the account_stat table in the account
database and replicas to <code>DELETED</code>, marking the account's data for deletion.</para>
<para>Typically, neither a specific retention time nor an undelete feature is provided. However, you can set a
<code>delay_reaping</code> value in the <code>[account-reaper]</code> section of the
account-server.conf file to delay the actual deletion of data. At this time, to undelete an account you have
to update the account database replicas directly, setting the status column to an empty
string and updating the put_timestamp to be greater than the delete_timestamp.
<note><para>It's on the developers' to-do list to write a utility that performs this task, preferably
through a REST call.</para></note>
</para>
<para>The account reaper runs on each account server and periodically scans the server for
account databases marked for deletion. It acts only on the accounts for which the server
is the primary node, so that multiple account servers aren't trying to reap the same account simultaneously.
Using multiple servers to delete one account might improve the deletion speed but requires
coordination to avoid duplication. Speed is not a big concern with data deletion, and
large accounts aren't deleted often.</para>
<para>Deleting an account is simple. For each account container, all objects are deleted and
then the container is deleted. Deletion requests that fail will not stop the overall process
but will cause the overall process to fail eventually (for example, if an object delete
times out, you will not be able to delete the container or the account). The account reaper
keeps trying to delete an account until it is empty, at which point the database reclaim
process within the db_replicator will remove the database files.</para>
<para>A persistent error state may prevent the deletion of an object
or container. If this happens, you will see
a message such as <code>Account &lt;name&gt; has not been reaped
since &lt;date&gt;</code> in the log. You can control when this is
logged with the <code>reap_warn_after</code> value in the <code>[account-reaper]</code>
section of the account-server.conf file. The default value is 30
days.</para>
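<para>As a rough illustration of the two settings mentioned in this section, the
<code>[account-reaper]</code> section might look like the following sketch. The values are
assumptions chosen for illustration (both are expressed in seconds), not recommendations;
check the sample configuration shipped with your release before relying on them.</para>
<programlisting language="ini"># account-server.conf (illustrative values only)
[account-reaper]
# Wait one day after an account is marked DELETED before reaping its data.
delay_reaping = 86400
# Log a warning when an account still has not been reaped after 30 days.
reap_warn_after = 2592000</programlisting>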
</section>

@ -0,0 +1,75 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-cluster-architecture">
<!-- ... Old module003-ch007-swift-cluster-architecture edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Cluster architecture</title>
<section xml:id="section_access-tier">
<title>Access tier</title>
<para>Large-scale deployments segment off an access tier, which is considered the Object Storage
system's central hub. The access tier fields the incoming API requests from clients and
moves data in and out of the system. This tier consists of front-end load balancers,
SSL terminators, and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.</para>
<figure>
<title>Object Storage architecture</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-arch.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Because access servers are collocated in their own tier, you can scale out read/write
access regardless of the storage capacity. For example, if a cluster is on the public
Internet, requires SSL termination, and has a high demand for data access, you can
provision many access servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.</para>
<para>Because this is an HTTP-addressable storage service, you may incorporate a load balancer
into the access tier.</para>
<para>Typically, the tier consists of a collection of 1U servers. These machines use a
moderate amount of RAM and are network I/O intensive. Since these systems field each
incoming API request, you should provision them with two high-throughput (10GbE)
interfaces—one for the incoming "front-end" requests and the other for the "back-end"
access to the object storage nodes to put and fetch data.</para>
<section xml:id="section_access-tier-considerations">
<title>Factors to consider</title>
<para>For most publicly facing deployments as well as private deployments available
across a wide-reaching corporate network, you use SSL to encrypt traffic to the
client. SSL adds significant processing load to establish sessions between clients,
which is why you have to provision more capacity in the access layer. SSL may not be
required for private deployments on trusted networks.</para>
</section>
</section>
<section xml:id="section_storage-nodes">
<title>Storage nodes</title>
<para>In most configurations, each of the five zones should have an equal amount of storage
capacity. Storage nodes use a reasonable amount of memory and CPU. Metadata needs to be
readily available to return objects quickly. The object stores run services not only to
field incoming requests from the access tier, but also to run replicators, auditors, and
reapers. You can provision object stores with single gigabit or 10 gigabit
network interfaces, depending on the expected workload and desired performance.</para>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-nodes.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Currently, 2TB or 3TB SATA disks deliver good price/performance value. You can use
desktop-grade drives if you have responsive remote hands in the datacenter and
enterprise-grade drives if you don't.</para>
<section xml:id="section_storage-nodes-considerations">
<title>Factors to consider</title>
<para>You should keep in mind the desired I/O performance for single-threaded requests.
This system does not use RAID, so a single disk handles each request for an object.
Disk performance impacts single-threaded response rates.</para>
<para>To achieve higher apparent throughput, the object storage system is designed to
handle concurrent uploads/downloads. The network I/O capacity (1GbE, bonded 1GbE
pair, or 10GbE) should match your desired concurrent throughput needs for reads and
writes.</para>
</section>
</section>
</section>

@ -0,0 +1,59 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="objectstorage_characteristics">
<!-- ... Old module003-ch003-obj-store-capabilities edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Object Storage characteristics</title>
<para>The key characteristics of Object Storage are that:</para>
<itemizedlist>
<listitem>
<para>All objects stored in Object Storage have a URL.</para>
</listitem>
<listitem>
<para>All objects stored are replicated 3&#10005; in as-unique-as-possible zones, which
can be defined as a group of drives, a node, a rack, and so on.</para>
</listitem>
<listitem>
<para>All objects have their own metadata.</para>
</listitem>
<listitem>
<para>Developers interact with the object storage system through a RESTful HTTP
API.</para>
</listitem>
<listitem>
<para>Object data can be located anywhere in the cluster.</para>
</listitem>
<listitem>
<para>The cluster scales by adding additional nodes without sacrificing performance,
which allows a more cost-effective linear storage expansion than fork-lift
upgrades.</para>
</listitem>
<listitem>
<para>Data doesn't have to be migrated to an entirely new storage system.</para>
</listitem>
<listitem>
<para>New nodes can be added to the cluster without downtime.</para>
</listitem>
<listitem>
<para>Failed nodes and disks can be swapped out without downtime.</para>
</listitem>
<listitem>
<para>It runs on industry-standard hardware, such as Dell, HP, and Supermicro.</para>
</listitem>
</itemizedlist>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Developers can either write directly to the Swift API or use one of the many client
libraries that exist for all of the popular programming languages, such as Java, Python,
Ruby, and C#. Amazon S3 and Rackspace Cloud Files users should be very familiar with Object
Storage. Users new to object storage systems will have to adjust to a different approach and
mindset than those required for a traditional filesystem.</para>
</section>

@ -0,0 +1,236 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-components">
<!-- ... Old module003-ch004-swift-building-blocks edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Components</title>
<para>The components that enable Object Storage to deliver high availability, high
durability, and high concurrency are:</para>
<itemizedlist>
<listitem>
<para><emphasis role="bold">Proxy servers&#151;</emphasis>Handle all of the incoming
API requests.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Rings&#151;</emphasis>Map logical names of data to
locations on particular disks.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Zones&#151;</emphasis>Isolate data from other zones. A
failure in one zone doesnt impact the rest of the cluster because data is
replicated across zones.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Accounts and containers&#151;</emphasis>Each account and
container are individual databases that are distributed across the cluster. An
account database contains the list of containers in that account. A container
database contains the list of objects in that container.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Objects&#151;</emphasis>The data itself.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Partitions&#151;</emphasis>A partition stores objects,
account databases, and container databases and helps manage locations where data
lives in the cluster.</para>
</listitem>
</itemizedlist>
<figure>
<title>Object Storage building blocks</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-buildingblocks.png"/>
</imageobject>
</mediaobject>
</figure>
<section xml:id="section_proxy-servers">
<title>Proxy servers</title>
<para>Proxy servers are the public face of Object Storage and handle all of the incoming API
requests. Once a proxy server receives a request, it determines the storage node based
on the object's URL, for example, https://swift.example.com/v1/account/container/object.
Proxy servers also coordinate responses, handle failures, and coordinate
timestamps.</para>
<para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on
projected workloads. A minimum of two proxy servers should be deployed for redundancy.
If one proxy server fails, the others take over.</para>
</section>
<section xml:id="section_ring">
<title>Rings</title>
<para>A ring represents a mapping between the names of entities stored on disk and their
physical locations. There are separate rings for accounts, containers, and objects. When
other components need to perform any operation on an object, container, or account, they
need to interact with the appropriate ring to determine their location in the
cluster.</para>
<para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each
partition in the ring is replicated, by default, three times across the cluster, and
partition locations are stored in the mapping maintained by the ring. The ring is also
responsible for determining which devices are used for handoff in failure
scenarios.</para>
<para>Data can be isolated into zones in the ring. Each partition replica is guaranteed to
reside in a different zone. A zone could represent a drive, a server, a cabinet, a
switch, or even a data center.</para>
<para>The partitions of the ring are equally divided among all of the devices in the Object
Storage installation. When partitions need to be moved around (for example, if a device
is added to the cluster), the ring ensures that a minimum number of partitions are moved
at a time, and only one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of partitions on drives across the
cluster. This can be useful, for example, when differently sized drives are used in a
cluster.</para>
<para>The ring is used by the proxy server and several background processes (like
replication).</para>
<figure>
<title>The <emphasis role="bold">ring</emphasis></title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-ring.png"/>
</imageobject>
</mediaobject>
</figure>
<para>These rings are externally managed, in that the server processes themselves do not
modify the rings, they are instead given new rings modified by other tools.</para>
<para>The ring uses a configurable number of bits from a
path's MD5 hash as a partition index that designates a
device. The number of bits kept from the hash is known as
the partition power, and 2 to the partition power
indicates the partition count. Partitioning the full MD5
hash ring allows other parts of the cluster to work in
batches of items at once, which ends up either more
efficient or at least less complex than working with each
item separately or the entire cluster all at once.</para>
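<para>As a minimal sketch of this idea, the following Python snippet derives a partition index
by keeping the top bits of a path's MD5 hash. The function name
<code>partition_for</code>, the example path, and the partition power of 10 are assumptions
made for illustration. This is not the ring's actual code, and a real deployment may differ
in details such as how the path is salted before hashing.</para>
<programlisting language="python"><![CDATA[import hashlib
import struct


def partition_for(path: str, part_power: int) -> int:
    """Return the partition index for a path such as '/account/container/object'.

    Assumes 1 <= part_power <= 32 (hypothetical helper, for illustration only).
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Interpret the first 4 bytes of the digest as a big-endian unsigned integer.
    top32 = struct.unpack_from(">I", digest)[0]
    # Keep only the top `part_power` bits of that integer.
    return top32 >> (32 - part_power)


part_power = 10                      # assumed value for the example
partition_count = 2 ** part_power    # 2 to the partition power
part = partition_for("/AUTH_demo/photos/cat.jpg", part_power)
print(partition_count, part)
]]></programlisting>
<para>Because only the top bits are kept in this sketch, raising the partition power by one
splits each partition in two rather than reshuffling every path.</para>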
<para>Another configurable value is the replica count, which indicates how many of the
partition-device assignments make up a single ring. For a given partition number, each
replica's device will not be in the same zone as any other replica's device. Zones can
be used to group devices based on physical locations, power separations, network
separations, or any other attribute that would improve the availability of multiple
replicas at the same time.</para>
</section>
<section xml:id="section_zones">
<title>Zones</title>
<para>Object Storage allows configuring zones in order to isolate failure boundaries.
Each data replica resides in a separate zone, if possible. At the smallest level, a zone
could be a single drive or a grouping of a few drives. If there were five object storage
servers, then each server would represent its own zone. Larger deployments would have an
entire rack (or multiple racks) of object servers, each representing a zone. The goal of
zones is to allow the cluster to tolerate significant outages of storage servers without
losing all replicas of the data.</para>
<para>As mentioned earlier, everything in Object Storage is stored, by default, three
times. Swift will place each replica "as-uniquely-as-possible" to ensure both high
availability and high durability. This means that when choosing a replica location,
Object Storage chooses a server in an unused zone before an unused server in a zone that
already has a replica of the data.</para>
<figure>
<title>Zones</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-zones.png"/>
</imageobject>
</mediaobject>
</figure>
<para>When a disk fails, replica data is automatically distributed to the other zones to
ensure there are three copies of the data.</para>
</section>
<section xml:id="section_accounts-containers">
<title>Accounts and containers</title>
<para>Each account and container is an individual SQLite
database that is distributed across the cluster. An
account database contains the list of containers in
that account. A container database contains the list
of objects in that container.</para>
<figure>
<title>Accounts and containers</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-accountscontainers.png"/>
</imageobject>
</mediaobject>
</figure>
<para>To keep track of object data locations, each account in the system has a database
that references all of its containers, and each container database references each
object.</para>
</section>
<section xml:id="section_partitions">
<title>Partitions</title>
<para>A partition is a collection of stored data, including account databases, container
databases, and objects. Partitions are core to the replication system.</para>
<para>Think of a partition as a bin moving throughout a fulfillment center warehouse.
Individual orders get thrown into the bin. The system treats that bin as a cohesive
entity as it moves throughout the system. A bin is easier to deal with than many little
things. It makes for fewer moving parts throughout the system.</para>
<para>System replicators and object uploads/downloads operate on partitions. As the
system scales up, its behavior continues to be predictable because the number of
partitions is a fixed number.</para>
<para>Implementing a partition is conceptually simple: a partition is just a
directory sitting on a disk with a corresponding hash table of what it contains.</para>
<figure>
<title>Partitions</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-partitions.png"/>
</imageobject>
</mediaobject>
</figure>
</section>
<section xml:id="section_replicators">
<title>Replicators</title>
<para>In order to ensure that there are three copies of the data everywhere, replicators
continuously examine each partition. For each local partition, the replicator compares
it against the replicated copies in the other zones to see if there are any
differences.</para>
<para>The replicator knows if replication needs to take place by examining hashes. A hash
file is created for each partition, which contains hashes of each directory in the
partition. For a given partition, the hash files for each of the partition's three
copies are compared. If the hashes are different, then
it is time to replicate, and the directory that needs to be replicated is copied
over.</para>
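<para>The comparison can be pictured with the following Python sketch. It treats each hash
file as a dictionary that maps a directory name to a hash, which is an assumption made for
illustration; the function name <code>directories_to_sync</code> and the sample values are
hypothetical and do not reflect the on-disk format.</para>
<programlisting language="python"><![CDATA[def directories_to_sync(local_hashes: dict, remote_hashes: dict) -> list:
    """Return the directories whose hash differs from, or is missing on, the remote copy."""
    return sorted(
        name
        for name, digest in local_hashes.items()
        if remote_hashes.get(name) != digest
    )


# Hypothetical per-directory hashes for one partition on two nodes.
local = {"a83": "9f2c", "b1e": "77d0", "c04": "e41a"}
remote = {"a83": "9f2c", "b1e": "0000"}

print(directories_to_sync(local, remote))  # prints "['b1e', 'c04']"
]]></programlisting>
<para>Only the directories that differ are copied, so matching partitions cost little more
than the hash comparison itself.</para>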
<para>This is where partitions come in handy. With fewer things in the system, larger
chunks of data are transferred around (rather than lots of little TCP connections, which
is inefficient) and there is a consistent number of hashes to compare.</para>
<para>The cluster eventually has a consistent behavior where the newest data has a
priority.</para>
<figure>
<title>Replication</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-replication.png"/>
</imageobject>
</mediaobject>
</figure>
<para>If a zone goes down, one of the nodes containing a replica notices and proactively
copies data to a handoff location.</para>
</section>
<section xml:id="section_usecases">
<title>Use cases</title>
<para>The following sections show use cases for object uploads and downloads and introduce the components.</para>
<section xml:id="upload">
<title>Upload</title>
<para>A client uses the REST API to make an HTTP request to PUT an object into an existing
container. The cluster receives the request. First, the system must figure out where
the data is going to go. To do this, the account name, container name, and object
name are all used to determine the partition where this object should live.</para>
<para>Then a lookup in the ring figures out which storage nodes contain the partitions in
question.</para>
<para>The data is then sent to each storage node where it is placed in the appropriate
partition. At least two of the three writes must be successful before the client is
notified that the upload was successful.</para>
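<para>The two-out-of-three rule is a simple majority. The following Python sketch shows the
arithmetic; the function names are hypothetical, and generalizing the rule to other replica
counts is an assumption rather than something this guide states.</para>
<programlisting language="python"><![CDATA[def quorum_size(replica_count: int) -> int:
    """Smallest number of successful writes that forms a simple majority."""
    return replica_count // 2 + 1


def upload_succeeded(write_results: list) -> bool:
    """write_results holds one boolean per storage node that was written to."""
    return sum(write_results) >= quorum_size(len(write_results))


results = [True, True, False]      # two of three storage nodes accepted the write
print(quorum_size(len(results)))   # prints "2"
print(upload_succeeded(results))   # prints "True"
]]></programlisting>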
<para>Next, the container database is updated asynchronously to reflect that there is a new
object in it.</para>
<figure>
<title>Object Storage in use</title>
<mediaobject>
<imageobject>
<imagedata fileref="../common/figures/objectstorage-usecase.png"/>
</imageobject>
</mediaobject>
</figure>
</section>
<section xml:id="section_swift-component-download">
<title>Download</title>
<para>A request comes in for an account/container/object. Using the same consistent hashing,
the partition name is generated. A lookup in the ring reveals which storage nodes
contain that partition. A request is made to one of the storage nodes to fetch the
object and, if that fails, requests are made to the other nodes.</para>
</section>
</section>
</section>

@ -0,0 +1,180 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage_features">
<!-- ... Old module003-ch002-features-benefits edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Features and benefits</title>
<para>
<informaltable class="c19">
<tbody>
<tr>
<th rowspan="1" colspan="1">Features</th>
<th rowspan="1" colspan="1">Benefits</th>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Leverages commodity hardware</emphasis></td>
<td rowspan="1" colspan="1">No lock-in, lower price/GB</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">HDD/node failure agnostic</emphasis></td>
<td rowspan="1" colspan="1">Self-healing, reliable, data redundancy protects from failures</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Unlimited storage</emphasis></td>
<td rowspan="1" colspan="1">Large and flat namespace, highly scalable read/write
access, able to serve content directly from storage system</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Multi-dimensional scalability</emphasis></td>
<td rowspan="1" colspan="1">Scale-out architecture: scale vertically and
horizontally-distributed storage. Backs up and archives large amounts of data
with linear performance</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Account/container/object structure</emphasis></td>
<td rowspan="1" colspan="1">No nesting, not a traditional file system. Optimized
for scale, it scales to multiple petabytes and billions of objects</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Built-in replication 3&#10005;
+ data redundancy (compared with 2&#10005; on RAID)</emphasis></td>
<td rowspan="1" colspan="1">A configurable number of accounts, containers and
object copies for high availability</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Easily add capacity (unlike RAID resize)</emphasis></td>
<td rowspan="1" colspan="1">Elastic data scaling with ease</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">No central database</emphasis></td>
<td rowspan="1" colspan="1">Higher performance, no bottlenecks</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">RAID not required</emphasis></td>
<td rowspan="1" colspan="1">Handle many small, random reads and writes efficiently</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Built-in management utilities</emphasis></td>
<td rowspan="1" colspan="1">Account management: create, add, verify, and
delete users. Container management: upload, download, and verify.
Monitoring: capacity, host, network, log trawling, and cluster health</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Drive auditing</emphasis></td>
<td rowspan="1" colspan="1">Detect drive failures preempting data corruption</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Expiring objects</emphasis></td>
<td rowspan="1" colspan="1">Users can set an expiration time or a TTL on an
object to control access</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Direct object access</emphasis></td>
<td rowspan="1" colspan="1">Enable direct browser access to content, such as for
a control panel</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Realtime visibility into client requests</emphasis></td>
<td rowspan="1" colspan="1">Know what users are requesting</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Supports S3 API</emphasis></td>
<td rowspan="1" colspan="1">Utilize tools that were designed for the popular S3 API</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Restrict containers per account</emphasis></td>
<td rowspan="1" colspan="1">Limit access to control usage by user</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Support for NetApp, Nexenta, SolidFire</emphasis></td>
<td rowspan="1" colspan="1">Unified support for block volumes using a variety of
storage systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Snapshot and backup API for block volumes</emphasis></td>
<td rowspan="1" colspan="1">Data protection and recovery for VM data</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Standalone volume API available</emphasis></td>
<td rowspan="1" colspan="1">Separate endpoint and API for integration with other
compute systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Integration with Compute</emphasis></td>
<td rowspan="1" colspan="1">Fully integrated with Compute for attaching block
volumes and reporting on usage</td>
</tr>
</tbody>
</informaltable>
</para>
</section>

@ -0,0 +1,23 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-intro">
<!-- ... Old module003-ch001-intro-objstore edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Introduction to Object Storage</title>
<para>OpenStack Object Storage (code-named Swift) is open source software for creating
redundant, scalable data storage using clusters of standardized servers to store petabytes
of accessible data. It is a long-term storage system for large amounts of static data that
can be retrieved, leveraged, and updated. Object Storage uses a distributed architecture
with no central point of control, providing greater scalability, redundancy, and permanence.
Objects are written to multiple hardware devices, with the OpenStack software responsible
for ensuring data replication and integrity across the cluster. Storage clusters scale
horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its
content from other active nodes. Because OpenStack uses software logic to ensure data
replication and distribution across different devices, inexpensive commodity hard drives and
servers can be used in lieu of more expensive equipment.</para>
<para>Object Storage is ideal for cost-effective, scale-out storage. It provides a fully
distributed, API-accessible storage platform that can be integrated directly into
applications or used for backup, archiving, and data retention.</para>
</section>

@ -0,0 +1,99 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-replication">
<!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Replication</title>
<para>Because each replica in Object Storage functions independently, and clients generally
require only a simple majority of nodes responding to consider an operation successful,
transient failures like network partitions can quickly cause replicas to diverge. These
differences are eventually reconciled by asynchronous, peer-to-peer replicator processes.
The replicator processes traverse their local filesystems, concurrently performing
operations in a manner that balances load across physical disks.</para>
<para>Replication uses a push model, with records and files
generally only being copied from local to remote replicas.
This is important because data on the node may not belong
there (as in the case of handoffs and ring changes), and a
replicator can't know what data exists elsewhere in the
cluster that it should pull in. It's the duty of any node that
contains data to ensure that data gets to where it belongs.
Replica placement is handled by the ring.</para>
<para>Every deleted record or file in the system is marked by a
tombstone, so that deletions can be replicated alongside
creations. The replication process cleans up tombstones after
a time period known as the consistency window. The consistency
window encompasses replication duration and how long transient
failure can remove a node from the cluster. Tombstone cleanup
must be tied to replication to reach replica
convergence.</para>
<para>If a replicator detects that a remote drive has failed, the
replicator uses the get_more_nodes interface for the ring to
choose an alternate node with which to synchronize. The
replicator can maintain desired levels of replication in the
face of disk failures, though some replicas may not be in an
immediately usable location. Note that the replicator doesn't
maintain desired levels of replication when other failures,
such as entire node failures, occur because most failures are
transient.</para>
<para>Replication is an area of active development, and likely
rife with potential improvements to speed and
correctness.</para>
<para>There are two major classes of replicator: the db replicator, which replicates
accounts and containers, and the object replicator, which replicates object data.</para>
<section xml:id="section_database-replication">
<title>Database replication</title>
<para>The first step performed by db replication is a low-cost
hash comparison to determine whether two replicas already
match. Under normal operation, this check is able to
verify that most databases in the system are already
synchronized very quickly. If the hashes differ, the
replicator brings the databases in sync by sharing records
added since the last sync point.</para>
<para>This sync point is a high water mark noting the last
record at which two databases were known to be in sync,
and is stored in each database as a tuple of the remote
database id and record id. Database ids are unique amongst
all replicas of the database, and record ids are
monotonically increasing integers. After all new records
have been pushed to the remote database, the entire sync
table of the local database is pushed, so the remote
database can guarantee that it is in sync with everything
with which the local database has previously
synchronized.</para>
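<para>The hash check and the sync point can be illustrated with a small, self-contained
sketch. It is not Swift's implementation: the <literal>SketchDB</literal> class, its
attributes, and the <literal>replicate</literal> function are hypothetical names chosen
for this example, and the final step of pushing the whole sync table is left out.</para>

```python
import hashlib


class SketchDB:
    """Toy stand-in for one replica of an account or container database."""

    def __init__(self, db_id):
        self.db_id = db_id
        self.rows = []        # payloads; a row's record id is its position + 1
        self.outgoing = {}    # remote db_id -> last local record id pushed

    def add(self, payload):
        if payload not in self.rows:   # merging the same record twice is a no-op
            self.rows.append(payload)

    def content_hash(self):
        digest = hashlib.md5()
        for payload in sorted(self.rows):
            digest.update(payload.encode("utf-8") + b"\n")
        return digest.hexdigest()


def replicate(local, remote):
    """Push rows added since the sync point; return how many rows were sent."""
    if local.content_hash() == remote.content_hash():
        return 0                       # low-cost check: replicas already match
    sync_point = local.outgoing.get(remote.db_id, 0)
    new_rows = local.rows[sync_point:]
    for payload in new_rows:
        remote.add(payload)
    local.outgoing[remote.db_id] = len(local.rows)   # new high water mark
    return len(new_rows)
```

<para>In this sketch a second call to <literal>replicate</literal> with unchanged
databases returns immediately after the hash comparison, which mirrors why most
databases are verified quickly under normal operation.</para>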
<para>If a replica is found to be missing entirely, the whole
local database file is transmitted to the peer using
rsync(1) and vested with a new unique id.</para>
<para>In practice, DB replication can process hundreds of
databases per concurrency setting per second (up to the
number of available CPUs or disks) and is bound by the
number of DB transactions that must be performed.</para>
</section>
<section xml:id="section_object-replication">
<title>Object replication</title>
<para>The initial implementation of object replication simply
performed an rsync to push data from a local partition to
all remote servers it was expected to exist on. While this
performed adequately at small scale, replication times
skyrocketed once directory structures could no longer be
held in RAM. We now use a modification of this scheme in
which a hash of the contents for each suffix directory is
saved to a per-partition hashes file. The hash for a
suffix directory is invalidated when the contents of that
suffix directory are modified.</para>
<para>The object replication process reads in these hash
files, calculating any invalidated hashes. It then
transmits the hashes to each remote server that should
hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After
pushing files to the remote server, the replication
process notifies it to recalculate hashes for the rsynced
suffix directories.</para>
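<para>A minimal sketch of the suffix-hash comparison follows. It assumes an in-memory
dictionary in place of the per-partition hashes file and hashes only file names; the
function names are hypothetical and do not correspond to Swift's own code.</para>

```python
import hashlib


def hash_suffix(file_names):
    """Hash the contents of one suffix directory (here, only its file names)."""
    digest = hashlib.md5()
    for name in sorted(file_names):
        digest.update(name.encode("utf-8") + b"\n")
    return digest.hexdigest()


def refresh_hashes(hashes, suffix_dirs):
    """Recalculate only the hashes that were invalidated (stored as None)."""
    for suffix, value in hashes.items():
        if value is None:
            hashes[suffix] = hash_suffix(suffix_dirs[suffix])
    return hashes


def suffixes_to_sync(local_hashes, remote_hashes):
    """Return the suffix directories whose hashes differ on the remote server."""
    return sorted(
        suffix
        for suffix, value in local_hashes.items()
        if remote_hashes.get(suffix) != value
    )
```

<para>Only the suffixes returned by <literal>suffixes_to_sync</literal> would need to be
pushed, which is the point of the scheme: unchanged directories are never
traversed.</para>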
<para>Performance of object replication is generally bound by the number of uncached
directories it has to traverse, usually as a result of invalidated suffix directory
hashes. Using write volume and partition counts from our running systems, it was
designed so that around 2 percent of the hash space on a normal node will be invalidated
per day, which has experimentally given us acceptable replication speeds.</para>
</section>
</section>

@ -0,0 +1,129 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-ringbuilder">
<!-- ... Old module003-ch005-the-ring edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Ring-builder</title>
<para>Rings are built and managed manually by a utility called the ring-builder. The
ring-builder assigns partitions to devices and writes an optimized Python structure to a
gzipped, serialized file on disk for shipping out to the servers. The server processes just
check the modification time of the file occasionally and reload their in-memory copies of
the ring structure as needed. Because of how the ring-builder manages changes to the ring,
using a slightly older ring usually just means one of the three replicas for a subset of the
partitions will be incorrect, which can be easily worked around.</para>
<para>The ring-builder also keeps its own builder file with the ring information and additional
data required to build future rings. It is very important to keep multiple backup copies of
these builder files. One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder files into the cluster
itself. If you lose the builder file, you have to create a new ring from scratch. Nearly all
partitions would be assigned to different devices and, therefore, nearly all of the stored
data would have to be replicated to new locations. So, recovery from a builder file loss is
possible, but data would be unreachable for an extended time.</para>
<section xml:id="section_ring-data-structure">
<title>Ring data structure</title>
<para>The ring data structure consists of three top level
fields: a list of devices in the cluster, a list of lists
of device ids indicating partition to device assignments,
and an integer indicating the number of bits to shift an
MD5 hash to calculate the partition for the hash.</para>
</section>
<section xml:id="section_partition-assignment">
<title>Partition assignment list</title>
<para>This is a list of <literal>array('H')</literal> of device ids. The
outermost list contains an <literal>array('H')</literal> for each
replica. Each <literal>array('H')</literal> has a length equal to the
partition count for the ring. Each integer in the
<literal>array('H')</literal> is an index into the above list of devices.
The partition list is known internally to the Ring
class as <literal>_replica2part2dev_id</literal>.</para>
<para>So, to create a list of device dictionaries assigned to a partition, the Python
code would look like:
<programlisting>devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]</programlisting></para>
<para>That code is a little simplistic, as it does not account for the removal of
duplicate devices. If a ring has more replicas than devices, then a partition will have
more than one replica on one device.</para>
<para><literal>array('H')</literal> is used for memory conservation as there
may be millions of partitions.</para>
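<para>As an illustration of the duplicate-device case, the following sketch builds a toy
assignment list with three replicas but only two devices and removes duplicates while
preserving order. The device dictionaries and the
<literal>devices_for_partition</literal> helper are assumptions made for this example,
not part of the Ring class.</para>

```python
from array import array

# Hypothetical two-device cluster with a partition power of 2 (four partitions).
devs = [
    {"id": 0, "zone": 1, "device": "sdb1"},
    {"id": 1, "zone": 2, "device": "sdc1"},
]
# One array('H') per replica; each entry is an index into devs.
replica2part2dev_id = [
    array("H", [0, 1, 0, 1]),
    array("H", [1, 0, 1, 0]),
    array("H", [0, 0, 1, 1]),
]


def devices_for_partition(partition):
    """Return the devices holding a partition, skipping duplicate devices."""
    seen = set()
    devices = []
    for part2dev_id in replica2part2dev_id:
        dev_id = part2dev_id[partition]
        if dev_id not in seen:
            seen.add(dev_id)
            devices.append(devs[dev_id])
    return devices
```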
</section>
<section xml:id="section_fractional-replicas">
<title>Fractional replicas</title>
<para>A ring is not restricted to having an integer number
of replicas. In order to support the gradual changing
of replica counts, the ring is able to have a real
number of replicas.</para>
<para>When the number of replicas is not an integer, then the last element of
<literal>_replica2part2dev_id</literal> will have a length that is less than the
partition count for the ring. This means that some partitions will have more replicas
than others. For example, if a ring has 3.25 replicas, then 25 percent of its partitions
will have four replicas, while the remaining 75 percent will have just three.</para>
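<para>The arithmetic can be sketched as follows. The function name is hypothetical, and
the sketch only computes the length of each row of the assignment list; it does not
assign any devices.</para>

```python
def replica_row_lengths(replicas, part_power):
    """Return the length of each assignment row for a fractional replica count."""
    parts = 2 ** part_power
    whole = int(replicas)
    lengths = [parts] * whole
    partial = int(parts * (replicas - whole))
    if partial:
        # The last row covers only a fraction of the partitions.
        lengths.append(partial)
    return lengths
```

<para>With 3.25 replicas and a partition power of 4 (16 partitions), the rows have
lengths 16, 16, 16, and 4, so 4 of the 16 partitions (25 percent) have a fourth
replica.</para>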
</section>
<section xml:id="section_partition-shift-value">
<title>Partition shift value</title>
<para>The partition shift value is known internally to the
Ring class as <literal>_part_shift</literal>. This value is used to shift an
MD5 hash to calculate the partition on which the data
for that hash should reside. Only the top four bytes
of the hash are used in this process. For example, to
compute the partition for the path
<literal>/account/container/object</literal>, the Python code might look
like:
<programlisting>from hashlib import md5
from struct import unpack_from

partition = unpack_from('&gt;I',
    md5(b'/account/container/object').digest())[0] &gt;&gt; self._part_shift</programlisting></para>
<para>For a ring generated with part_power P, the
partition shift value is <literal>32 - P</literal>.</para>
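<para>A standalone version of this calculation, written for Python 3, might look like the
following. The <literal>partition_for</literal> function is a hypothetical helper, and the
sketch hashes the bare path only; it ignores any additional salting of the path that a
deployment might apply.</para>

```python
from hashlib import md5
from struct import unpack_from


def partition_for(path, part_power):
    """Map a storage path to a partition number for the given partition power."""
    part_shift = 32 - part_power
    # Interpret the top four bytes of the digest as a big-endian unsigned int.
    top_four_bytes = unpack_from(">I", md5(path.encode("utf-8")).digest())[0]
    return top_four_bytes >> part_shift
```

<para>Because only the high-order bits are kept, the result always falls between 0 and
one less than 2 raised to <literal>part_power</literal>.</para>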
</section>
<section xml:id="section_build-ring">
<title>Build the ring</title>
<para>The initial building of the ring first calculates the
number of partitions that should ideally be assigned to
each device based on the device's weight. For example, given
a partition power of 20, the ring will have 1,048,576
partitions. If there are 1,000 devices of equal weight
they will each desire 1,048.576 partitions. The devices
are then sorted by the number of partitions they desire
and kept in order throughout the initialization
process.</para>
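<para>The weight-proportional share can be sketched in a few lines. The function below is
a hypothetical illustration, not the RingBuilder code, and it considers a single replica
only, as the example above does.</para>

```python
def desired_partitions(weights, part_power):
    """Return (device, desired partitions) pairs, hungriest device first."""
    parts = 2 ** part_power
    total_weight = sum(weights.values())
    desired = {
        dev: parts * weight / total_weight for dev, weight in weights.items()
    }
    return sorted(desired.items(), key=lambda item: item[1], reverse=True)
```

<para>Called with 1,000 devices of equal weight and a partition power of 20, each device
is reported as desiring approximately 1,048.576 partitions.</para>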
<note><para>Each device is also assigned a random tiebreaker
value that is used when two devices desire the same number
of partitions. This tiebreaker is not stored on disk
anywhere, and so two different rings created with the same
parameters will have different partition assignments. For
repeatable partition assignments, <literal>RingBuilder.rebalance()</literal>
takes an optional seed value that will be used to seed
Python's pseudo-random number generator.</para></note>
<para>Then, the ring builder assigns each replica of each partition to the device that
desires the most partitions at that point while keeping it as far away as possible from
other replicas. The ring builder prefers to assign a replica to a device in a region
that does not already have a replica. If no such region is available, the ring builder tries
to find a device in a different zone. If that's not possible, it will look on a
different server. If it doesn't find one there, it will just look for a device that has
no replicas. Finally, if all of the other options are exhausted, the ring builder
assigns the replica to the device that has the fewest replicas already assigned. Note
that assignment of multiple replicas to one device will only happen if the ring has
fewer devices than it has replicas.</para>
<para>When building a new ring based on an old ring, the desired number of partitions each
device wants is recalculated. Next, the partitions to be reassigned are gathered up. Any
removed devices have all their assigned partitions unassigned and added to the gathered
list. Any partition replicas that (due to the addition of new devices) can be spread out
for better durability are unassigned and added to the gathered list. Any devices that
have more partitions than they now need have random partitions unassigned from them and
added to the gathered list. Lastly, the gathered partitions are then reassigned to
devices using a similar method as in the initial assignment described above.</para>
<para>Whenever a partition has a replica reassigned, the time of the reassignment is
recorded. This is taken into account when gathering partitions to reassign so that no
partition is moved twice in a configurable amount of time. This configurable amount of
time is known internally to the RingBuilder class as <literal>min_part_hours</literal>.
This restriction is ignored for replicas of partitions on devices that have been removed,
since removing a device only happens on device failure and reassignment is the only
choice.</para>
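<para>The effect of <literal>min_part_hours</literal> on gathering can be sketched as a
simple filter. The function and its arguments are hypothetical; the real RingBuilder
tracks this information per partition internally.</para>

```python
def movable_partitions(hours_since_move, min_part_hours, on_removed_device=()):
    """Return the partitions that are eligible for reassignment.

    hours_since_move maps a partition to the hours since its last reassignment.
    Partitions on removed devices are always eligible.
    """
    removed = set(on_removed_device)
    return sorted(
        part
        for part, hours in hours_since_move.items()
        if part in removed or hours >= min_part_hours
    )
```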
<para>The above processes don't always perfectly rebalance a ring due to the random nature
of gathering partitions for reassignment. To help reach a more balanced ring, the
rebalance process is repeated until near perfect (less than 1 percent off) or when the
balance doesn't improve by at least 1 percent (indicating we probably can't get perfect
balance due to wildly imbalanced zones or too many partitions recently moved).</para>
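<para>The stopping rule can be sketched as a loop around a single rebalance pass. Here
<literal>rebalance_once</literal> is a hypothetical callable that performs one pass and
returns the resulting balance as a percentage away from ideal; the
<literal>max_passes</literal> safety limit is an assumption added for the sketch.</para>

```python
def rebalance_until_stable(rebalance_once, max_passes=100):
    """Repeat passes until balance is near perfect or stops improving.

    Returns the final balance and the number of passes run. max_passes must
    be at least 1.
    """
    last_balance = None
    for passes in range(1, max_passes + 1):
        balance = rebalance_once()
        improved = last_balance is None or last_balance - balance >= 1
        # Stop when within 1 percent of ideal, or when the gain was under 1 percent.
        if not (balance >= 1 and improved):
            break
        last_balance = balance
    return balance, passes
```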
</section>
</section>

@ -0,0 +1,106 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="troubleshooting-openstack-object-storage">
<title>Troubleshoot Object Storage</title>
<para>For Object Storage, everything is logged in <filename>/var/log/syslog</filename> (or <filename>/var/log/messages</filename> on some distributions).
Several settings enable further customization of logging, such as <literal>log_name</literal>, <literal>log_facility</literal>,
and <literal>log_level</literal>, within the object server configuration files.</para>
<section xml:id="drive-failure">
<title>Drive failure</title>
<para>If a drive has failed, the first step is to make sure the drive is
unmounted. This makes it easier for Object Storage to work around the failure until
it has been resolved. If the drive is going to be replaced immediately, it is
best to replace the drive, format it, remount it, and let replication fill it up.</para>
<para>If the drive can't be replaced immediately, it is best to leave it
unmounted and remove the drive from the ring. This allows all the replicas
that were on that drive to be replicated elsewhere until the drive is replaced.
Once the drive is replaced, it can be re-added to the ring.</para>
<para>You can look at error messages in <filename>/var/log/kern.log</filename> for hints of drive failure.</para>
</section>
<section xml:id="server-failure">
<title>Server failure</title>
<para>If a server is having hardware issues, it is a good idea to make sure the
Object Storage services are not running. This will allow Object Storage to
work around the failure while you troubleshoot.</para>
<para>If the server just needs a reboot, or a small amount of work that should only
last a couple of hours, then it is probably best to let Object Storage work
around the failure and get the machine fixed and back online. When the machine
comes back online, replication makes sure that anything that was missed
during the downtime gets updated.</para>
<para>If the server has more serious issues, then it is probably best to remove all
of the server's devices from the ring. Once the server has been repaired and is
back online, the server's devices can be added back into the ring. It is
important that the devices are reformatted before putting them back into the
ring because they are likely to be responsible for a different set of partitions than
before.</para>
</section>
<section xml:id="detect-failed-drives">
<title>Detect failed drives</title>
<para>In our experience, when a drive is about to fail, error messages appear in
<filename>/var/log/kern.log</filename>. A script called <command>swift-drive-audit</command> can be run via cron
to watch for bad drives. If errors are detected, it unmounts the bad drive so that
Object Storage can work around it. The script takes a configuration file with the
following settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has only been tested on Ubuntu 10.04, so if you are using a
different distro or OS, some care should be taken before using in production.
</para>
</section>
<section xml:id="recover-ring-builder-file">
<title>Emergency recovery of ring builder files</title>
<para>You should always keep a backup of Swift ring builder files. However, if an
emergency occurs, this procedure may assist in returning your cluster to an
operational state.</para>
<para>Using existing Swift tools, there is no way to recover a builder file from a
<filename>ring.gz</filename> file. However, if you have a knowledge of Python, it is possible to
construct a builder file that is pretty close to the one you have lost. The
following is what you will need to do.</para>
<warning>
<para>This procedure is a last resort for emergency circumstances. It
requires knowledge of the Swift Python code and may not succeed.</para>
</warning>
<para>First, load the ring and a new ringbuilder object in a Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
<para>Now, start copying the data we have in the ring into the builder.</para>
<programlisting language="python">
>>> import math
>>> from array import array
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)
>>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring._replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in range(builder.parts)))
>>> builder._set_parts_wanted()
>>> for d in builder._iter_devs():
...     d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
...     for dev_id in p2d:
...         builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable fields. For
<literal>min_part_hours</literal> you'll either have to remember the
value you used, or just make up a new one.</para>
<programlisting language="python">
>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>Try some validation: if this doesn't raise an exception, you may feel some
hope. Not too much, though.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
<para>Save the builder.</para>
<programlisting language="python">
>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>You should now have a file called 'account.builder' in the current working
directory. Next, run <literal>swift-ring-builder account.builder write_ring</literal>
and compare the new account.ring.gz to the account.ring.gz that you started
from. They probably won't be byte-for-byte identical, but if you load them up
in a REPL and their <literal>_replica2part2dev_id</literal> and
<literal>devs</literal> attributes are the same (or nearly so), then you're
in good shape.</para>
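<para>The comparison can be sketched as a small helper. It assumes two ring objects that
expose <literal>devs</literal> and <literal>_replica2part2dev_id</literal>, such as the
objects returned by <literal>RingData.load</literal> earlier in this procedure; the
<literal>rings_match</literal> name is hypothetical and the helper reports exact equality
only, not "nearly so".</para>

```python
def rings_match(ring_a, ring_b):
    """Compare the two attributes that matter when checking a rebuilt ring."""
    same_devs = ring_a.devs == ring_b.devs
    # Convert each row to a list so array and list rows compare by value.
    rows_a = [list(row) for row in ring_a._replica2part2dev_id]
    rows_b = [list(row) for row in ring_b._replica2part2dev_id]
    return same_devs and rows_a == rows_b
```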
<para>Next, repeat the procedure for <filename>container.ring.gz</filename>
and <filename>object.ring.gz</filename>, and you might get usable builder files.</para>
</section>
</section>

@ -1,144 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="troubleshooting-openstack-object-storage">
<title>Troubleshoot Object Storage</title>
<para>For OpenStack Object Storage, everything is logged in
<filename>/var/log/syslog</filename> (or messages on some
distros). Several settings enable further customization of
logging, such as <option>log_name</option>,
<option>log_facility</option>, and
<option>log_level</option>, within the object server
configuration files.</para>
<section xml:id="handling-drive-failure">
<title>Recover drive failures</title>
<para>If a drive fails, make sure the
drive is unmounted to make it easier for Object
Storage to work around the failure while you resolve
it. If you plan to replace the drive immediately, replace
the drive, format it, remount it, and let replication fill
it.</para>
<para>If you cannot replace the drive immediately, leave it
unmounted and remove the drive from the ring. This enables
you to replicate all the replicas on that drive elsewhere
until you can replace the drive. After you replace the
drive, you can add it to the ring again.</para>
<note>
<para>Rackspace has seen hints at drive failures by
looking at error messages in
<filename>/var/log/kern.log</filename>. Check this
file in your monitoring.</para>
</note>
</section>
<section xml:id="handling-server-failure">
<title>Recover server failures</title>
<para>If a server has hardware issues, make sure that the
Object Storage services are not running. This enables
Object Storage to work around the failure while you
troubleshoot.</para>
<para>If the server needs a reboot or a minimal amount of
work, let Object Storage work around the failure while you
fix the machine and get it back online. When the machine
comes back online, replication updates anything that was
missing during the downtime.</para>
<para>If the server has more serious issues,remove all server
devices from the ring. After you repair and put the server
online, you can add the devices for the server back to the
ring. You must reformat the devices before you add them to
the ring because they might be responsible for a different
set of partitions than before.</para>
</section>
<section xml:id="detecting-failed-drives">
<title>Detect failed drives</title>
<para>When a drive is about to fail, many error messages
appear in the <filename>/var/log/kern.log</filename> file.
You can run the <package>swift-drive-audit</package>
script through <command>cron</command> to watch for bad
drives. If errors are detected, it unmounts the bad drive
so that Object Storage can work around it. The script uses
a configuration file with these settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has been tested on only Ubuntu 10.04. If you
use a different distribution or operating system, take
care before using the script in production.</para>
</section>
<section xml:id="recover-ring-builder-file">
<title>Recover ring builder files (emergency)</title>
<para>You should always keep a backup of Swift ring builder
files. However, if an emergency occurs, use this procedure
to return your cluster to an operational state.</para>
<para>Existing Swift tools do not enable you to recover a
builder file from a <filename>ring.gz</filename> file.
However, if you have Python knowledge, you can construct a
builder file similar to the one you have lost.</para>
<warning>
<para>This procedure is a last-resort in an emergency. It
requires knowledge of the swift Python code and might
not succeed.</para>
</warning>
<procedure>
<step>
<para>Load the ring and a new ringbuilder object in a
Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
</step>
<step>
<para>Copy the data in the ring into the
builder.</para>
<programlisting language="python">>>> import math
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)
>>> builder = RingBuilder(int(Math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring.replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(self.parts)))
>>> builder._set_parts_wanted()
>>> for d in builder._iter_devs():
d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
for dev_id in p2d:
builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable
fields.</para>
</step>
<step>
<para>For <option>min_part_hours</option>, you must
remember the value that you used previously or
create a new value.</para>
<programlisting language="python">>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>If validation succeeds without raising an
exception, you have succeeded.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
</step>
<step>
<para>Save the builder.</para>
<programlisting language="python">>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>The <filename>account.builder</filename> file
appears in the current working directory.</para>
</step>
<step>
<para>Run <literal>swift-ring-builder account.builder
write_ring</literal>.</para>
<para>Compare the new
<filename>account.ring.gz</filename> to the
original <filename>account.ring.gz</filename>
file. They might not be byte-for-byte identical,
but if you load them in REPL and their
<option>_replica2part2dev_id</option> and
<option>devs</option> attributes are the same
(or nearly so), you have succeeded.</para>
</step>
<step>
<para>Repeat this procedure for the
<filename>container.ring.gz</filename> and
<filename>object.ring.gz</filename> files, and
you might get usable builder files.</para>
</step>
</procedure>
</section>
</chapter>

@ -69,7 +69,8 @@ format="PNG" />
</imageobject>
</mediaobject>
</informalfigure>
<para>There will be three hosts in the setup.<table rules="all">
<para>There will be three hosts in the setup.</para>
<table rules="all">
<caption>Hosts for Demo</caption>
<thead>
<tr>
@ -103,7 +104,7 @@ format="PNG" />
<td>Same as HostA</td>
</tr>
</tbody>
</table></para>
</table>
<section xml:id="multi_agent_demo_configuration">
<title>Configuration</title>
<itemizedlist>