Restructure Object Storage chapter of Cloud Admin Guide
Restores Troubleshoot Object Storage. Removes Monitoring section, which was based on a blog.

backport: havana
Closes-Bug: #1251515
author: nermina miller
Change-Id: I580b077a0124d7cd54dced6c0d340e05d5d5f983
@ -5,6 +5,13 @@
|
||||
xml:id="ch_admin-openstack-object-storage">
|
||||
<?dbhtml stop-chunking?>
|
||||
<title>Object Storage</title>
|
||||
<xi:include href="../common/section_about-object-storage.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-intro.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-features.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-characteristics.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-components.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-arch.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-replication.xml"/>
|
||||
<xi:include href="section_object-storage-monitoring.xml"/>
|
||||
<xi:include href="../common/section_objectstorage-troubleshoot.xml"/>
|
||||
</chapter>
|
||||
|
@ -3,6 +3,7 @@
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
|
||||
xml:id="ch_introduction-to-openstack-object-storage-monitoring">
|
||||
<!-- ... Based on a blog, should be replaced with original material... -->
|
||||
<title>Object Storage monitoring</title>
|
||||
<?dbhtml stop-chunking?>
|
||||
<para>Excerpted from a blog post by <link
|
||||
|
BIN
doc/common/figures/objectstorage-accountscontainers.png
Normal file
After Width: | Height: | Size: 32 KiB |
BIN
doc/common/figures/objectstorage-arch.png
Normal file
After Width: | Height: | Size: 56 KiB |
BIN
doc/common/figures/objectstorage-buildingblocks.png
Normal file
After Width: | Height: | Size: 48 KiB |
BIN
doc/common/figures/objectstorage-nodes.png
Normal file
After Width: | Height: | Size: 58 KiB |
BIN
doc/common/figures/objectstorage-partitions.png
Normal file
After Width: | Height: | Size: 28 KiB |
BIN
doc/common/figures/objectstorage-replication.png
Normal file
After Width: | Height: | Size: 45 KiB |
BIN
doc/common/figures/objectstorage-ring.png
Normal file
After Width: | Height: | Size: 23 KiB |
BIN
doc/common/figures/objectstorage-usecase.png
Normal file
After Width: | Height: | Size: 61 KiB |
BIN
doc/common/figures/objectstorage-zones.png
Normal file
After Width: | Height: | Size: 10 KiB |
BIN
doc/common/figures/objectstorage.png
Normal file
After Width: | Height: | Size: 23 KiB |
40
doc/common/section_objectstorage-account-reaper.xml
Normal file
@ -0,0 +1,40 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage-account-reaper">
|
||||
<!-- ... Old module003-ch008-account-reaper edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Account reaper</title>
|
||||
<para>In the background, the account reaper removes data from the deleted accounts.</para>
|
||||
<para>A reseller marks an account for deletion by issuing a <code>DELETE</code> request on the account’s
|
||||
storage URL. This action sets the <code>status</code> column of the account_stat table in the account
|
||||
database and replicas to <code>DELETED</code>, marking the account's data for deletion.</para>
|
||||
<para>Typically, no specific retention time or undelete facility is provided. However, you can set a
|
||||
<code>delay_reaping</code> value in the <code>[account-reaper]</code> section of the
|
||||
account-server.conf to delay the actual deletion of data. At this time, to undelete you have
|
||||
to update the account database replicas directly, setting the status column to an empty
|
||||
string and updating the put_timestamp to be greater than the delete_timestamp.
|
||||
<note><para>It's on the developers' to-do list to write a utility that performs this task, preferably
|
||||
through a REST call.</para></note>
|
||||
</para>
|
||||
<para>The account reaper runs on each account server and scans the server occasionally for
|
||||
account databases marked for deletion. It only fires up on the accounts for which the server
|
||||
is the primary node, so that multiple account servers aren’t trying to do it simultaneously.
|
||||
Using multiple servers to delete one account might improve the deletion speed but requires
|
||||
coordination to avoid duplication. Speed really is not a big concern with data deletion, and
|
||||
large accounts aren’t deleted often.</para>
|
||||
<para>Deleting an account is simple. For each account container, all objects are deleted and
|
||||
then the container is deleted. Deletion requests that fail will not stop the overall process
|
||||
but will cause the overall process to fail eventually (for example, if an object delete
|
||||
times out, you will not be able to delete the container or the account). The account reaper
|
||||
keeps trying to delete an account until it is empty, at which point the database reclaim
|
||||
process within the db_replicator will remove the database files.</para>
|
||||
<para>A persistent error state may prevent the deletion of an object
|
||||
or container. If this happens, you will see
|
||||
a message such as <code>“Account &lt;name&gt; has not been reaped
since &lt;date&gt;”</code> in the log. You can control when this is
|
||||
logged with the <code>reap_warn_after</code> value in the <code>[account-reaper]</code>
|
||||
section of the account-server.conf file. The default value is 30
|
||||
days.</para>
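<para>The following is a minimal sketch of the relevant <code>[account-reaper]</code>
options in the account-server.conf file; the values shown are illustrative assumptions,
not recommendations:</para>
<programlisting># Example values only; tune for your deployment.
[account-reaper]
# Wait one day (in seconds) after an account is marked DELETED before reaping it.
delay_reaping = 86400
# Warn in the log if an account has still not been reaped after 30 days.
reap_warn_after = 2592000</programlisting>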
|
||||
</section>
|
75
doc/common/section_objectstorage-arch.xml
Normal file
@ -0,0 +1,75 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage-cluster-architecture">
|
||||
<!-- ... Old module003-ch007-swift-cluster-architecture edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Cluster architecture</title>
|
||||
<section xml:id="section_access-tier">
|
||||
<title>Access tier</title>
|
||||
<para>Large-scale deployments segment off an access tier, which is considered the Object Storage
|
||||
system's central hub. The access tier fields the incoming API requests from clients and
|
||||
moves data in and out of the system. This tier consists of front-end load balancers,
|
||||
ssl-terminators, and authentication services. It runs the (distributed) brain of the
|
||||
Object Storage system—the proxy server processes.</para>
|
||||
<figure>
|
||||
<title>Object Storage architecture</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-arch.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>Because access servers are collocated in their own tier, you can scale out read/write
|
||||
access regardless of the storage capacity. For example, if a cluster is on the public
|
||||
Internet, requires SSL termination, and has a high demand for data access, you can
|
||||
provision many access servers. However, if the cluster is on a private network and used
|
||||
primarily for archival purposes, you need fewer access servers.</para>
|
||||
<para>Since this is an HTTP addressable storage service, you may incorporate a load balancer
|
||||
into the access tier.</para>
|
||||
<para>Typically, the tier consists of a collection of 1U servers. These machines use a
|
||||
moderate amount of RAM and are network I/O intensive. Since these systems field each
|
||||
incoming API request, you should provision them with two high-throughput (10GbE)
|
||||
interfaces: one for the incoming "front-end" requests and the other for the "back-end"
|
||||
access to the object storage nodes to put and fetch data.</para>
|
||||
<section xml:id="section_access-tier-considerations">
|
||||
<title>Factors to consider</title>
|
||||
<para>For most publicly facing deployments as well as private deployments available
|
||||
across a wide-reaching corporate network, you use SSL to encrypt traffic to the
|
||||
client. SSL adds significant processing load to establish sessions between clients,
|
||||
which is why you have to provision more capacity in the access layer. SSL may not be
|
||||
required for private deployments on trusted networks.</para>
|
||||
</section>
|
||||
</section>
|
||||
<section xml:id="section_storage-nodes">
|
||||
<title>Storage nodes</title>
|
||||
<para>In most configurations, each of the five zones should have an equal amount of storage
|
||||
capacity. Storage nodes use a reasonable amount of memory and CPU. Metadata needs to be
|
||||
readily available to return objects quickly. The object stores run services not only to
|
||||
field incoming requests from the access tier, but to also run replicators, auditors, and
|
||||
reapers. You can provision object stores with a single gigabit or 10 gigabit
network interface, depending on the expected workload and desired performance.</para>
|
||||
<figure>
|
||||
<title>Object Storage (Swift)</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-nodes.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>Currently, 2TB or 3TB SATA disks deliver good price/performance value. You can use
|
||||
desktop-grade drives if you have responsive remote hands in the datacenter and
|
||||
enterprise-grade drives if you don't.</para>
|
||||
<section xml:id="section_storage-nodes-considerations">
|
||||
<title>Factors to consider</title>
|
||||
<para>You should keep in mind the desired I/O performance for single-threaded requests.
|
||||
This system does not use RAID, so a single disk handles each request for an object.
|
||||
Disk performance impacts single-threaded response rates.</para>
|
||||
<para>To achieve apparent higher throughput, the object storage system is designed to
|
||||
handle concurrent uploads/downloads. The network I/O capacity (1GbE, bonded 1GbE
|
||||
pair, or 10GbE) should match your desired concurrent throughput needs for reads and
|
||||
writes.</para>
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
59
doc/common/section_objectstorage-characteristics.xml
Normal file
@ -0,0 +1,59 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="objectstorage_characteristics">
|
||||
<!-- ... Old module003-ch003-obj-store-capabilities edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Object Storage characteristics</title>
|
||||
<para>The key characteristics of Object Storage are:</para>
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>All objects stored in Object Storage have a URL.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>All objects stored are replicated 3✕ in as-unique-as-possible zones, which
|
||||
can be defined as a group of drives, a node, a rack, and so on.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>All objects have their own metadata.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Developers interact with the object storage system through a RESTful HTTP
|
||||
API.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Object data can be located anywhere in the cluster.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>The cluster scales by adding additional nodes without sacrificing performance,
|
||||
which allows a more cost-effective linear storage expansion than fork-lift
|
||||
upgrades.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Data doesn't have to be migrated to an entirely new storage system.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>New nodes can be added to the cluster without downtime.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Failed nodes and disks can be swapped out without downtime.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>It runs on industry-standard hardware, such as Dell, HP, and Supermicro.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
<figure>
|
||||
<title>Object Storage (Swift)</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>Developers can either write directly to the Swift API or use one of the many client
|
||||
libraries that exist for all of the popular programming languages, such as Java, Python,
|
||||
Ruby, and C#. Amazon S3 and Rackspace Cloud Files users should be very familiar with Object
|
||||
Storage. Users new to object storage systems will have to adjust to a different approach and
|
||||
mindset than those required for a traditional filesystem.</para>
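<para>For example, a minimal upload using the python-swiftclient library might look like the
following sketch. The authentication endpoint, credentials, container, and object names are
placeholder assumptions, not defaults:</para>
<programlisting language="python">import swiftclient

# Placeholder TempAuth-style credentials; substitute your own auth URL, user, and key.
conn = swiftclient.client.Connection(
    authurl='http://127.0.0.1:8080/auth/v1.0',
    user='myaccount:myuser',
    key='mykey')

conn.put_container('photos')                       # create (or reuse) a container
with open('cat.jpg', 'rb') as f:
    conn.put_object('photos', 'cat.jpg', contents=f)
print(conn.head_object('photos', 'cat.jpg'))       # object metadata, including ETag</programlisting>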
|
||||
</section>
|
236
doc/common/section_objectstorage-components.xml
Normal file
@ -0,0 +1,236 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage-components">
|
||||
<!-- ... Old module003-ch004-swift-building-blocks edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Components</title>
|
||||
<para>The components that enable Object Storage to deliver high availability, high
|
||||
durability, and high concurrency are:</para>
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para><emphasis role="bold">Proxy servers—</emphasis>Handle all of the incoming
|
||||
API requests.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><emphasis role="bold">Rings—</emphasis>Map logical names of data to
|
||||
locations on particular disks.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><emphasis role="bold">Zones—</emphasis>Isolate data from other zones. A
|
||||
failure in one zone doesn’t impact the rest of the cluster because data is
|
||||
replicated across zones.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><emphasis role="bold">Accounts and containers—</emphasis>Each account and
|
||||
container is an individual database that is distributed across the cluster. An
|
||||
account database contains the list of containers in that account. A container
|
||||
database contains the list of objects in that container.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><emphasis role="bold">Objects—</emphasis>The data itself.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><emphasis role="bold">Partitions—</emphasis>A partition stores objects,
|
||||
account databases, and container databases and helps manage locations where data
|
||||
lives in the cluster.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
<figure>
|
||||
<title>Object Storage building blocks</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-buildingblocks.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<section xml:id="section_proxy-servers">
|
||||
<title>Proxy servers</title>
|
||||
<para>Proxy servers are the public face of Object Storage and handle all of the incoming API
|
||||
requests. Once a proxy server receives a request, it determines the storage node based
|
||||
on the object's URL, for example, https://swift.example.com/v1/account/container/object.
|
||||
Proxy servers also coordinate responses, handle failures, and coordinate
|
||||
timestamps.</para>
|
||||
<para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on
|
||||
projected workloads. A minimum of two proxy servers should be deployed for redundancy.
|
||||
If one proxy server fails, the others take over.</para>
|
||||
</section>
|
||||
<section xml:id="section_ring">
|
||||
<title>Rings</title>
|
||||
<para>A ring represents a mapping between the names of entities stored on disk and their
|
||||
physical locations. There are separate rings for accounts, containers, and objects. When
|
||||
other components need to perform any operation on an object, container, or account, they
|
||||
need to interact with the appropriate ring to determine their location in the
|
||||
cluster.</para>
|
||||
<para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each
|
||||
partition in the ring is replicated, by default, three times across the cluster, and
|
||||
partition locations are stored in the mapping maintained by the ring. The ring is also
|
||||
responsible for determining which devices are used for handoff in failure
|
||||
scenarios.</para>
|
||||
<para>Data can be isolated into zones in the ring. Each partition replica is guaranteed to
|
||||
reside in a different zone. A zone could represent a drive, a server, a cabinet, a
|
||||
switch, or even a data center.</para>
|
||||
<para>The partitions of the ring are equally divided among all of the devices in the Object
|
||||
Storage installation. When partitions need to be moved around (for example, if a device
|
||||
is added to the cluster), the ring ensures that a minimum number of partitions are moved
|
||||
at a time, and only one replica of a partition is moved at a time.</para>
|
||||
<para>Weights can be used to balance the distribution of partitions on drives across the
|
||||
cluster. This can be useful, for example, when differently sized drives are used in a
|
||||
cluster.</para>
|
||||
<para>The ring is used by the proxy server and several background processes (like
|
||||
replication).</para>
|
||||
<figure>
|
||||
<title>The <emphasis role="bold">ring</emphasis></title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-ring.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>These rings are externally managed, in that the server processes themselves do not
|
||||
modify the rings; instead, they are given new rings modified by other tools.</para>
|
||||
<para>The ring uses a configurable number of bits from a
|
||||
path’s MD5 hash as a partition index that designates a
|
||||
device. The number of bits kept from the hash is known as
|
||||
the partition power, and 2 to the partition power
|
||||
indicates the partition count. Partitioning the full MD5
|
||||
hash ring allows other parts of the cluster to work in
|
||||
batches of items at once which ends up either more
|
||||
efficient or at least less complex than working with each
|
||||
item separately or the entire cluster all at once.</para>
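<para>As a rough sketch of the arithmetic described above (the partition power here is an
assumed example value):</para>
<programlisting language="python">partition_power = 18                      # number of bits kept from the MD5 hash
partition_count = 2 ** partition_power    # 262144 partitions for this example ring</programlisting>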
|
||||
<para>Another configurable value is the replica count, which indicates how many of the
|
||||
partition-device assignments make up a single ring. For a given partition number, each
|
||||
replica’s device will not be in the same zone as any other replica's device. Zones can
|
||||
be used to group devices based on physical locations, power separations, network
|
||||
separations, or any other attribute that would improve the availability of multiple
|
||||
replicas at the same time.</para>
|
||||
</section>
|
||||
<section xml:id="section_zones">
|
||||
<title>Zones</title>
|
||||
<para>Object Storage allows configuring zones in order to isolate failure boundaries.
|
||||
Each data replica resides in a separate zone, if possible. At the smallest level, a zone
|
||||
could be a single drive or a grouping of a few drives. If there were five object storage
|
||||
servers, then each server would represent its own zone. Larger deployments would have an
|
||||
entire rack (or multiple racks) of object servers, each representing a zone. The goal of
|
||||
zones is to allow the cluster to tolerate significant outages of storage servers without
|
||||
losing all replicas of the data.</para>
|
||||
<para>As mentioned earlier, everything in Object Storage is stored, by default, three
|
||||
times. Swift will place each replica "as-uniquely-as-possible" to ensure both high
|
||||
availability and high durability. This means that when choosing a replica location,
|
||||
Object Storage chooses a server in an unused zone before an unused server in a zone that
|
||||
already has a replica of the data.</para>
|
||||
<figure>
|
||||
<title>Zones</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-zones.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>When a disk fails, replica data is automatically distributed to the other zones to
|
||||
ensure there are three copies of the data.</para>
|
||||
</section>
|
||||
<section xml:id="section_accounts-containers">
|
||||
<title>Accounts and containers</title>
|
||||
<para>Each account and container is an individual SQLite
|
||||
database that is distributed across the cluster. An
|
||||
account database contains the list of containers in
|
||||
that account. A container database contains the list
|
||||
of objects in that container.</para>
|
||||
<figure>
|
||||
<title>Accounts and containers</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-accountscontainers.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>To keep track of object data locations, each account in the system has a database
|
||||
that references all of its containers, and each container database references each
|
||||
object.</para>
|
||||
</section>
|
||||
<section xml:id="section_partitions">
|
||||
<title>Partitions</title>
|
||||
<para>A partition is a collection of stored data, including account databases, container
|
||||
databases, and objects. Partitions are core to the replication system.</para>
|
||||
<para>Think of a partition as a bin moving throughout a fulfillment center warehouse.
|
||||
Individual orders get thrown into the bin. The system treats that bin as a cohesive
|
||||
entity as it moves throughout the system. A bin is easier to deal with than many little
|
||||
things. It makes for fewer moving parts throughout the system.</para>
|
||||
<para>System replicators and object uploads/downloads operate on partitions. As the
|
||||
system scales up, its behavior continues to be predictable because the number of
|
||||
partitions is a fixed number.</para>
|
||||
<para>Implementing a partition is conceptually simple—a partition is just a
|
||||
directory sitting on a disk with a corresponding hash table of what it contains.</para>
|
||||
<figure>
|
||||
<title>Partitions</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-partitions.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
</section>
|
||||
<section xml:id="section_replicators">
|
||||
<title>Replicators</title>
|
||||
<para>In order to ensure that there are three copies of the data everywhere, replicators
|
||||
continuously examine each partition. For each local partition, the replicator compares
|
||||
it against the replicated copies in the other zones to see if there are any
|
||||
differences.</para>
|
||||
<para>The replicator knows whether replication needs to take place by examining hashes. A hash
file is created for each partition, which contains hashes of each directory in the
partition. For a given partition, the hash files for each of the partition's copies are
compared. If the hashes are different, then it is time to replicate, and the directory that
needs to be replicated is copied over.</para>
|
||||
<para>This is where partitions come in handy. With fewer things in the system, larger
|
||||
chunks of data are transferred around (rather than lots of little TCP connections, which
|
||||
is inefficient) and there is a consistent number of hashes to compare.</para>
|
||||
<para>The cluster eventually reaches a consistent state in which the newest data takes
precedence.</para>
|
||||
<figure>
|
||||
<title>Replication</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-replication.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>If a zone goes down, one of the nodes containing a replica notices and proactively
|
||||
copies data to a handoff location.</para>
|
||||
</section>
|
||||
<section xml:id="section_usecases">
|
||||
<title>Use cases</title>
|
||||
<para>The following sections show use cases for object uploads and downloads and how the components interact.</para>
|
||||
<section xml:id="upload">
|
||||
<title>Upload</title>
|
||||
<para>A client uses the REST API to make an HTTP request to PUT an object into an existing
|
||||
container. The cluster receives the request. First, the system must figure out where
|
||||
the data is going to go. To do this, the account name, container name, and object
|
||||
name are all used to determine the partition where this object should live.</para>
|
||||
<para>Then a lookup in the ring figures out which storage nodes contain the partitions in
|
||||
question.</para>
|
||||
<para>The data is then sent to each storage node where it is placed in the appropriate
|
||||
partition. At least two of the three writes must be successful before the client is
|
||||
notified that the upload was successful.</para>
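<para>A minimal sketch of the majority rule described above (an illustration of the
arithmetic, not Swift's exact quorum code):</para>
<programlisting language="python">replica_count = 3
write_quorum = replica_count // 2 + 1    # 2 of 3 writes must succeed</programlisting>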
|
||||
<para>Next, the container database is updated asynchronously to reflect that there is a new
|
||||
object in it.</para>
|
||||
<figure>
|
||||
<title>Object Storage in use</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="../common/figures/objectstorage-usecase.png"/>
|
||||
</imageobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
</section>
|
||||
<section xml:id="section_swift-component-download">
|
||||
<title>Download</title>
|
||||
<para>A request comes in for an account/container/object. Using the same consistent hashing,
|
||||
the partition name is generated. A lookup in the ring reveals which storage nodes
|
||||
contain that partition. A request is made to one of the storage nodes to fetch the
|
||||
object and, if that fails, requests are made to the other nodes.</para>
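<para>For illustration, the lookup described above roughly corresponds to the following use
of the ring interface; the ring file path and the account, container, and object names are
assumptions:</para>
<programlisting language="python">from swift.common.ring import Ring

# Load the same object ring that the proxy server uses (path is an assumption).
object_ring = Ring('/etc/swift/object.ring.gz')

# Map account/container/object to a partition and its primary storage nodes.
partition, nodes = object_ring.get_nodes('AUTH_account', 'container', 'object')
for node in nodes:
    print(node['ip'], node['port'], node['device'])</programlisting>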
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
180
doc/common/section_objectstorage-features.xml
Normal file
@ -0,0 +1,180 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage_features">
|
||||
<!-- ... Old module003-ch002-features-benefits edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Features and benefits</title>
|
||||
<para>
|
||||
<informaltable class="c19">
|
||||
<tbody>
|
||||
<tr>
|
||||
<th rowspan="1" colspan="1">Features</th>
|
||||
<th rowspan="1" colspan="1">Benefits</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Leverages commodity
|
||||
hardware</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>No
|
||||
lock-in, lower
|
||||
price/GB</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>HDD/node failure agnostic</emphasis></td>
|
||||
<td rowspan="1" colspan="1">Self-healing, reliable, data redundancy protects
|
||||
from failures</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Unlimited storage</emphasis></td>
|
||||
<td rowspan="1" colspan="1">Large and flat namespace, highly scalable read/write
|
||||
access, able to serve content directly from storage system</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Multi-dimensional scalability</emphasis>
|
||||
</td>
|
||||
<td rowspan="1" colspan="1">Scale-out architecture—Scale vertically and
|
||||
horizontally-distributed storage. Backs up and archives large amounts of data
|
||||
with linear performance</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold">Account/container/object
|
||||
structure</emphasis></td>
|
||||
<td rowspan="1" colspan="1">No nesting, not a traditional file
|
||||
system—Optimized for scale, it scales to multiple petabytes and
|
||||
billions of objects</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold">Built-in replication 3✕
|
||||
+ data redundancy (compared with 2✕ on RAID)</emphasis></td>
|
||||
<td rowspan="1" colspan="1">A configurable number of accounts, containers and
|
||||
object copies for high availability</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Easily add capacity (unlike
|
||||
RAID resize)</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Elastic
|
||||
data scaling with
|
||||
ease</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>No central database</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Higher
|
||||
performance, no
|
||||
bottlenecks</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>RAID not required</emphasis></td>
|
||||
<td rowspan="1" colspan="1">Handle many small, random reads and writes
|
||||
efficiently</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Built-in management
|
||||
utilities</emphasis></td>
|
||||
<td rowspan="1" colspan="1">Account management—Create, add, verify, and
|
||||
delete users; Container management—Upload, download, and verify;
|
||||
Monitoring—Capacity, host, network, log trawling, and cluster
|
||||
health</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Drive auditing</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Detect
|
||||
drive failures preempting data
|
||||
corruption</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Expiring objects</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Users
|
||||
can set an expiration time or a TTL on an
|
||||
object to control
|
||||
access</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Direct object access</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Enable
|
||||
direct browser access to content, such as for
|
||||
a control
|
||||
panel</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Realtime visibility into client
|
||||
requests</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Know
|
||||
what users are
|
||||
requesting</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Supports S3 API</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Utilize
|
||||
tools that were designed for the popular S3
|
||||
API</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Restrict containers per
|
||||
account</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Limit
|
||||
access to control usage by
|
||||
user</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Support for NetApp, Nexenta,
|
||||
SolidFire</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Unified
|
||||
support for block volumes using a variety of
|
||||
storage
|
||||
systems</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Snapshot and backup API for block
|
||||
volumes</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Data
|
||||
protection and recovery for VM
|
||||
data</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Standalone volume API
|
||||
available</emphasis></td>
|
||||
<td rowspan="1" colspan="1"
|
||||
>Separate
|
||||
endpoint and API for integration with other
|
||||
compute
|
||||
systems</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="1" colspan="1"><emphasis role="bold"
|
||||
>Integration with Compute</emphasis></td>
|
||||
<td rowspan="1" colspan="1">Fully integrated with Compute for attaching block
|
||||
volumes and reporting on usage</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</informaltable>
|
||||
</para>
|
||||
</section>
|
23
doc/common/section_objectstorage-intro.xml
Normal file
@ -0,0 +1,23 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage-intro">
|
||||
<!-- ... Old module003-ch001-intro-objstore edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Introduction to Object Storage</title>
|
||||
<para>OpenStack Object Storage (code-named Swift) is open source software for creating
|
||||
redundant, scalable data storage using clusters of standardized servers to store petabytes
|
||||
of accessible data. It is a long-term storage system for large amounts of static data that
|
||||
can be retrieved, leveraged, and updated. Object Storage uses a distributed architecture
|
||||
with no central point of control, providing greater scalability, redundancy, and permanence.
|
||||
Objects are written to multiple hardware devices, with the OpenStack software responsible
|
||||
for ensuring data replication and integrity across the cluster. Storage clusters scale
|
||||
horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its
|
||||
content from other active nodes. Because OpenStack uses software logic to ensure data
|
||||
replication and distribution across different devices, inexpensive commodity hard drives and
|
||||
servers can be used in lieu of more expensive equipment.</para>
|
||||
<para>Object Storage is ideal for cost effective, scale-out storage. It provides a fully
|
||||
distributed, API-accessible storage platform that can be integrated directly into
|
||||
applications or used for backup, archiving, and data retention.</para>
|
||||
</section>
|
99
doc/common/section_objectstorage-replication.xml
Normal file
@ -0,0 +1,99 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage-replication">
|
||||
<!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Replication</title>
|
||||
<para>Because each replica in Object Storage functions independently, and clients generally
|
||||
require only a simple majority of nodes responding to consider an operation successful,
|
||||
transient failures like network partitions can quickly cause replicas to diverge. These
|
||||
differences are eventually reconciled by asynchronous, peer-to-peer replicator processes.
|
||||
The replicator processes traverse their local filesystems, concurrently performing
|
||||
operations in a manner that balances load across physical disks.</para>
|
||||
<para>Replication uses a push model, with records and files
|
||||
generally only being copied from local to remote replicas.
|
||||
This is important because data on the node may not belong
|
||||
there (as in the case of handoffs and ring changes), and a
|
||||
replicator can’t know what data exists elsewhere in the
|
||||
cluster that it should pull in. It’s the duty of any node that
|
||||
contains data to ensure that data gets to where it belongs.
|
||||
Replica placement is handled by the ring.</para>
|
||||
<para>Every deleted record or file in the system is marked by a
|
||||
tombstone, so that deletions can be replicated alongside
|
||||
creations. The replication process cleans up tombstones after
|
||||
a time period known as the consistency window. The consistency
|
||||
window encompasses replication duration and how long transient
|
||||
failure can remove a node from the cluster. Tombstone cleanup
|
||||
must be tied to replication to reach replica
|
||||
convergence.</para>
|
||||
<para>If a replicator detects that a remote drive has failed, the
|
||||
replicator uses the get_more_nodes interface for the ring to
|
||||
choose an alternate node with which to synchronize. The
|
||||
replicator can maintain desired levels of replication in the
|
||||
face of disk failures, though some replicas may not be in an
|
||||
immediately usable location. Note that the replicator doesn’t
|
||||
maintain desired levels of replication when other failures,
|
||||
such as entire node failures, occur because most failures are
|
||||
transient.</para>
|
||||
<para>Replication is an area of active development, and likely
|
||||
rife with potential improvements to speed and
|
||||
correctness.</para>
|
||||
<para>There are two major classes of replicator—the db replicator, which replicates
|
||||
accounts and containers, and the object replicator, which replicates object data.</para>
|
||||
<section xml:id="section_database-replication">
|
||||
<title>Database replication</title>
|
||||
<para>The first step performed by db replication is a low-cost
|
||||
hash comparison to determine whether two replicas already
|
||||
match. Under normal operation, this check is able to
|
||||
verify that most databases in the system are already
|
||||
synchronized very quickly. If the hashes differ, the
|
||||
replicator brings the databases in sync by sharing records
|
||||
added since the last sync point.</para>
|
||||
<para>This sync point is a high water mark noting the last
|
||||
record at which two databases were known to be in sync,
|
||||
and is stored in each database as a tuple of the remote
|
||||
database id and record id. Database ids are unique amongst
|
||||
all replicas of the database, and record ids are
|
||||
monotonically increasing integers. After all new records
|
||||
have been pushed to the remote database, the entire sync
|
||||
table of the local database is pushed, so the remote
|
||||
database can guarantee that it is in sync with everything
|
||||
with which the local database has previously
|
||||
synchronized.</para>
|
||||
<para>If a replica is found to be missing entirely, the whole
|
||||
local database file is transmitted to the peer using
|
||||
rsync(1) and vested with a new unique id.</para>
|
||||
<para>In practice, DB replication can process hundreds of
|
||||
databases per concurrency setting per second (up to the
|
||||
number of available CPUs or disks) and is bound by the
|
||||
number of DB transactions that must be performed.</para>
|
||||
</section>
|
||||
<section xml:id="section_object-replication">
|
||||
<title>Object replication</title>
|
||||
<para>The initial implementation of object replication simply
|
||||
performed an rsync to push data from a local partition to
|
||||
all remote servers it was expected to exist on. While this
|
||||
performed adequately at small scale, replication times
|
||||
skyrocketed once directory structures could no longer be
|
||||
held in RAM. We now use a modification of this scheme in
|
||||
which a hash of the contents for each suffix directory is
|
||||
saved to a per-partition hashes file. The hash for a
|
||||
suffix directory is invalidated when the contents of that
|
||||
suffix directory are modified.</para>
|
||||
<para>The object replication process reads in these hash
|
||||
files, calculating any invalidated hashes. It then
|
||||
transmits the hashes to each remote server that should
|
||||
hold the partition, and only suffix directories with
|
||||
differing hashes on the remote server are rsynced. After
|
||||
pushing files to the remote server, the replication
|
||||
process notifies it to recalculate hashes for the rsynced
|
||||
suffix directories.</para>
|
||||
<para>Performance of object replication is generally bound by the number of uncached
|
||||
directories it has to traverse, usually as a result of invalidated suffix directory
|
||||
hashes. Using write volume and partition counts from our running systems, it was
|
||||
designed so that around 2 percent of the hash space on a normal node will be invalidated
|
||||
per day, which has experimentally given us acceptable replication speeds.</para>
|
||||
</section>
|
||||
</section>
|
129
doc/common/section_objectstorage-ringbuilder.xml
Normal file
@ -0,0 +1,129 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||
version="5.0"
|
||||
xml:id="section_objectstorage-ringbuilder">
|
||||
<!-- ... Old module003-ch005-the-ring edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
|
||||
<title>Ring-builder</title>
|
||||
<para>Rings are built and managed manually by a utility called the ring-builder. The
|
||||
ring-builder assigns partitions to devices and writes an optimized Python structure to a
|
||||
gzipped, serialized file on disk for shipping out to the servers. The server processes just
|
||||
check the modification time of the file occasionally and reload their in-memory copies of
|
||||
the ring structure as needed. Because of how the ring-builder manages changes to the ring,
|
||||
using a slightly older ring usually just means one of the three replicas for a subset of the
|
||||
partitions will be incorrect, which can be easily worked around.</para>
|
||||
<para>The ring-builder also keeps its own builder file with the ring information and additional
|
||||
data required to build future rings. It is very important to keep multiple backup copies of
|
||||
these builder files. One option is to copy the builder files out to every server while
|
||||
copying the ring files themselves. Another is to upload the builder files into the cluster
|
||||
itself. If you lose the builder file, you have to create a new ring from scratch. Nearly all
|
||||
partitions would be assigned to different devices and, therefore, nearly all of the stored
|
||||
data would have to be replicated to new locations. So, recovery from a builder file loss is
|
||||
possible, but data would be unreachable for an extended time.</para>
|
||||
<section xml:id="section_ring-data-structure">
|
||||
<title>Ring data structure</title>
|
||||
<para>The ring data structure consists of three top level
|
||||
fields: a list of devices in the cluster, a list of lists
|
||||
of device ids indicating partition to device assignments,
|
||||
and an integer indicating the number of bits to shift an
|
||||
MD5 hash to calculate the partition for the hash.</para>
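<para>A minimal sketch of that structure, with made-up values for a tiny ring of four
partitions and two replicas (illustration only):</para>
<programlisting language="python">devs = [   # list of devices in the cluster
    {'id': 0, 'zone': 1, 'ip': '10.0.0.1', 'port': 6000, 'device': 'sdb1', 'weight': 100},
    {'id': 1, 'zone': 2, 'ip': '10.0.0.2', 'port': 6000, 'device': 'sdb1', 'weight': 100},
]
replica2part2dev_id = [   # one list of device ids per replica, one entry per partition
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
part_shift = 32 - 2       # partition power of 2, so shift the MD5 hash by 30 bits</programlisting>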
|
||||
</section>
|
||||
<section xml:id="section_partition-assignment">
|
||||
<title>Partition assignment list</title>
|
||||
<para>This is a list of <literal>array(‘H’)</literal> of devices ids. The
|
||||
outermost list contains an <literal>array(‘H’)</literal> for each
|
||||
replica. Each <literal>array(‘H’)</literal> has a length equal to the
|
||||
partition count for the ring. Each integer in the
|
||||
<literal>array(‘H’)</literal> is an index into the above list of devices.
|
||||
The partition list is known internally to the Ring
|
||||
class as <literal>_replica2part2dev_id</literal>.</para>
|
||||
<para>So, to create a list of device dictionaries assigned to a partition, the Python
|
||||
code would look like:
|
||||
<programlisting>devices = [self.devs[part2dev_id[partition]] for
|
||||
part2dev_id in self._replica2part2dev_id]</programlisting></para>
|
||||
<para>That code is a little simplistic, as it does not account for the removal of
|
||||
duplicate devices. If a ring has more replicas than devices, then a partition will have
|
||||
more than one replica on one device.</para>
|
||||
<para><literal>array(‘H’)</literal> is used for memory conservation as there
|
||||
may be millions of partitions.</para>
|
||||
</section>
|
||||
<section xml:id="section_fractional-replicas">
|
||||
<title>Fractional replicas</title>
|
||||
<para>A ring is not restricted to having an integer number
|
||||
of replicas. In order to support the gradual changing
|
||||
of replica counts, the ring is able to have a real
|
||||
number of replicas.</para>
|
||||
<para>When the number of replicas is not an integer, then the last element of
|
||||
<literal>_replica2part2dev_id</literal> will have a length that is less than the
|
||||
partition count for the ring. This means that some partitions will have more replicas
|
||||
than others. For example, if a ring has 3.25 replicas, then 25 percent of its partitions
|
||||
will have four replicas, while the remaining 75 percent will have just three.</para>
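<para>A small sketch of that arithmetic, using the 3.25-replica example from above:</para>
<programlisting language="python">replica_count = 3.25
partition_count = 1024                                   # assumed example ring size
full_replicas = int(replica_count)                       # 3 arrays cover every partition
partial_length = int(partition_count * (replica_count - full_replicas))
# partial_length == 256, so the fourth array covers 25 percent of the partitions</programlisting>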
|
||||
</section>
|
||||
<section xml:id="section_partition-shift-value">
|
||||
<title>Partition shift value</title>
|
||||
<para>The partition shift value is known internally to the
|
||||
Ring class as <literal>_part_shift</literal>. This value is used to shift an
|
||||
MD5 hash to calculate the partition on which the data
|
||||
for that hash should reside. Only the top four bytes
|
||||
of the hash are used in this process. For example, to
|
||||
compute the partition for the path
|
||||
/account/container/object the Python code might look
|
||||
like:
|
||||
<programlisting>from hashlib import md5
from struct import unpack_from

partition = unpack_from('>I',
    md5('/account/container/object').digest())[0] >> self._part_shift</programlisting></para>
|
||||
<para>For a ring generated with part_power P, the
|
||||
partition shift value is <literal>32 - P</literal>.</para>
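<para>For instance (an assumed example value):</para>
<programlisting language="python">part_power = 20
part_shift = 32 - part_power    # 12; hashes are shifted right by 12 bits</programlisting>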
|
||||
</section>
|
||||
<section xml:id="section_build-ring">
|
||||
<title>Build the ring</title>
|
||||
<para>The initial building of the ring first calculates the
|
||||
number of partitions that should ideally be assigned to
|
||||
each device based the device’s weight. For example, given
|
||||
a partition power of 20, the ring will have 1,048,576
|
||||
partitions. If there are 1,000 devices of equal weight
|
||||
they will each desire 1,048.576 partitions. The devices
|
||||
are then sorted by the number of partitions they desire
|
||||
and kept in order throughout the initialization
|
||||
process.</para>
|
||||
<note><para>Each device is also assigned a random tiebreaker
|
||||
value that is used when two devices desire the same number
|
||||
of partitions. This tiebreaker is not stored on disk
|
||||
anywhere, and so two different rings created with the same
|
||||
parameters will have different partition assignments. For
|
||||
repeatable partition assignments, <literal>RingBuilder.rebalance()</literal>
|
||||
takes an optional seed value that will be used to seed
|
||||
Python’s pseudo-random number generator.</para></note>
|
||||
<para>Then, the ring builder assigns each replica of each partition to the device that
|
||||
requires most partitions at that point while keeping it as far away as possible from
|
||||
other replicas. The ring builder prefers to assign a replica to a device in a region
|
||||
that does not already have a replica. If no such region is available, the ring builder tries
|
||||
to find a device in a different zone. If that's not possible, it will look on a
|
||||
different server. If it doesn't find one there, it will just look for a device that has
|
||||
no replicas. Finally, if all of the other options are exhausted, the ring builder
|
||||
assigns the replica to the device that has the fewest replicas already assigned. Note
|
||||
that assignment of multiple replicas to one device will only happen if the ring has
|
||||
fewer devices than it has replicas.</para>
|
||||
<para>When building a new ring based on an old ring, the desired number of partitions each
|
||||
device wants is recalculated. Next, the partitions to be reassigned are gathered up. Any
|
||||
removed devices have all their assigned partitions unassigned and added to the gathered
|
||||
list. Any partition replicas that (due to the addition of new devices) can be spread out
|
||||
for better durability are unassigned and added to the gathered list. Any devices that
|
||||
have more partitions than they now need have random partitions unassigned from them and
|
||||
added to the gathered list. Lastly, the gathered partitions are then reassigned to
|
||||
devices using a similar method as in the initial assignment described above.</para>
|
||||
<para>Whenever a partition has a replica reassigned, the time of the reassignment is
|
||||
recorded. This is taken into account when gathering partitions to reassign so that no
|
||||
partition is moved twice in a configurable amount of time. This configurable amount of
|
||||
time is known internally to the RingBuilder class as <literal>min_part_hours</literal>.
|
||||
This restriction is ignored for replicas of partitions on devices that have been removed
|
||||
since removing a device only happens on device failure and reassignment is the only
|
||||
choice.</para>
|
||||
<para>The above processes don’t always perfectly rebalance a ring due to the random nature
|
||||
of gathering partitions for reassignment. To help reach a more balanced ring, the
|
||||
rebalance process is repeated until near perfect (less than 1 percent off) or when the
|
||||
balance doesn’t improve by at least 1 percent (indicating we probably can’t get perfect
|
||||
balance due to wildly imbalanced zones or too many partitions recently moved).</para>
|
||||
</section>
|
||||
</section>
|
106
doc/common/section_objectstorage-troubleshoot.xml
Normal file
@ -0,0 +1,106 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<section xmlns="http://docbook.org/ns/docbook"
|
||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
|
||||
xml:id="troubleshooting-openstack-object-storage">
|
||||
<title>Troubleshoot Object Storage</title>
|
||||
<para>For Object Storage, everything is logged in <filename>/var/log/syslog</filename> (or messages on some distros).
|
||||
Several settings enable further customization of logging, such as <literal>log_name</literal>, <literal>log_facility</literal>,
|
||||
and <literal>log_level</literal>, within the object server configuration files.</para>
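<para>For example, a sketch of the logging options in an <filename>object-server.conf</filename>
file; the values shown are illustrative assumptions:</para>
<programlisting>[DEFAULT]
log_name = object-server
log_facility = LOG_LOCAL0
log_level = DEBUG</programlisting>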
|
||||
<section xml:id="drive-failure">
|
||||
<title>Drive failure</title>
|
||||
<para>In the event that a drive has failed, the first step is to make sure the drive is
|
||||
unmounted. This will make it easier for Object Storage to work around the failure until
|
||||
it has been resolved. If the drive is going to be replaced immediately, then it is just
|
||||
best to replace the drive, format it, remount it, and let replication fill it up.</para>
|
||||
<para>If the drive can’t be replaced immediately, then it is best to leave it
|
||||
unmounted, and remove the drive from the ring. This will allow all the replicas
|
||||
that were on that drive to be replicated elsewhere until the drive is replaced.
|
||||
Once the drive is replaced, it can be re-added to the ring.</para>
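<para>As a sketch, removing a failed device from the object ring and later adding it back
might look like the following; the builder file name, zone, IP address, port, device name,
and weight are all assumptions for illustration:</para>
<programlisting># Example commands only; adjust for your cluster, then redistribute the new ring files.
$ swift-ring-builder object.builder remove z2-10.0.0.3:6000/sdb1
$ swift-ring-builder object.builder rebalance
# ... replace and format the drive, then add it back ...
$ swift-ring-builder object.builder add z2-10.0.0.3:6000/sdb1 100
$ swift-ring-builder object.builder rebalance</programlisting>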
|
||||
<para>You can look at error messages in <filename>/var/log/kern.log</filename> for hints of drive failure.</para>
|
||||
</section>
|
||||
<section xml:id="server-failure">
|
||||
<title>Server failure</title>
|
||||
<para>If a server is having hardware issues, it is a good idea to make sure the
|
||||
Object Storage services are not running. This will allow Object Storage to
|
||||
work around the failure while you troubleshoot.</para>
|
||||
<para>If the server just needs a reboot, or a small amount of work that should only
|
||||
last a couple of hours, then it is probably best to let Object Storage work
|
||||
around the failure and get the machine fixed and back online. When the machine
|
||||
comes back online, replication will make sure that anything that is missing
|
||||
during the downtime will get updated.</para>
|
||||
<para>If the server has more serious issues, then it is probably best to remove all
|
||||
of the server’s devices from the ring. Once the server has been repaired and is
|
||||
back online, the server’s devices can be added back into the ring. It is
|
||||
important that the devices are reformatted before putting them back into the
|
||||
ring as they are likely to be responsible for a different set of partitions than
|
||||
before.</para>
|
||||
</section>
|
||||
<section xml:id="detect-failed-drives">
|
||||
<title>Detect failed drives</title>
|
||||
<para>In our experience, when a drive is about to fail, error messages appear in
<filename>/var/log/kern.log</filename>. You can run the <command>swift-drive-audit</command>
script through <command>cron</command> to watch for bad drives. If errors are detected, it
unmounts the bad drive so that Object Storage can work around it. The script takes a
configuration file with the following settings:</para>
|
||||
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
|
||||
<para>This script has only been tested on Ubuntu 10.04, so if you are using a
different distribution or operating system, take care before using it in production.
</para>
|
||||
</section>
|
||||
<section xml:id="recover-ring-builder-file">
|
||||
<title>Emergency recovery of ring builder files</title>
|
||||
<para>You should always keep a backup of Swift ring builder files. However, if an
|
||||
emergency occurs, this procedure may assist in returning your cluster to an
|
||||
operational state.</para>
|
||||
<para>Using existing Swift tools, there is no way to recover a builder file from a
|
||||
<filename>ring.gz</filename> file. However, if you have some knowledge of Python, it is possible to
|
||||
construct a builder file that is pretty close to the one you have lost. The
|
||||
following is what you will need to do.</para>
|
||||
<warning>
|
||||
<para>This procedure is a last-resort for emergency circumstances—it
|
||||
requires knowledge of the swift python code and may not succeed.</para>
|
||||
</warning>
|
||||
<para>First, load the ring and a new ringbuilder object in a Python REPL:</para>
|
||||
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
|
||||
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
|
||||
<para>Now, start copying the data we have in the ring into the builder.</para>
|
||||
<programlisting language="python">
|
||||
>>> import math
|
||||
>>> partitions = len(ring._replica2part2dev_id[0])
|
||||
>>> replicas = len(ring._replica2part2dev_id)
|
||||
|
||||
>>> builder = RingBuilder(int(Math.log(partitions, 2)), replicas, 1)
|
||||
>>> builder.devs = ring.devs
|
||||
>>> builder._replica2part2dev = ring.replica2part2dev_id
|
||||
>>> builder._last_part_moves_epoch = 0
|
||||
>>> builder._last_part_moves = array('B', (0 for _ in xrange(self.parts)))
|
||||
>>> builder._set_parts_wanted()
|
||||
>>> for d in builder._iter_devs():
|
||||
d['parts'] = 0
|
||||
>>> for p2d in builder._replica2part2dev:
|
||||
for dev_id in p2d:
|
||||
builder.devs[dev_id]['parts'] += 1</programlisting>
|
||||
<para>This is the extent of the recoverable fields. For
|
||||
<literal>min_part_hours</literal> you'll either have to remember what the
|
||||
value you used was, or just make up a new one.</para>
|
||||
<programlisting language="python">
|
||||
>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
|
||||
<para>Try some validation: if this doesn't raise an exception, you may feel some
|
||||
hope. Not too much, though.</para>
|
||||
<programlisting language="python">>>> builder.validate()</programlisting>
|
||||
<para>Save the builder.</para>
|
||||
<programlisting language="python">
|
||||
>>> import pickle
|
||||
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
|
||||
<para>You should now have a file called <filename>account.builder</filename> in the current working
directory. Next, run <literal>swift-ring-builder account.builder write_ring</literal>
and compare the new <filename>account.ring.gz</filename> to the <filename>account.ring.gz</filename> that you started
from. They probably will not be byte-for-byte identical, but if you load them up
in a REPL and their <literal>_replica2part2dev_id</literal> and
<literal>devs</literal> attributes are the same (or nearly so), then you are
in good shape.</para>
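<para>A quick comparison in a REPL might look like the following sketch. The paths are placeholders: point one at the ring you started from and the other at the ring written from the recovered builder.</para>
<programlisting language="python">>>> from swift.common.ring import RingData
>>> original = RingData.load('/path/to/original/account.ring.gz')
>>> rebuilt = RingData.load('account.ring.gz')
>>> # Both comparisons should print True; a False result means the builder needs closer inspection
>>> original.devs == rebuilt.devs
>>> original._replica2part2dev_id == rebuilt._replica2part2dev_id</programlisting>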
<para>Next, repeat the procedure for <filename>container.ring.gz</filename>
and <filename>object.ring.gz</filename>, and you might get usable builder files.</para>
</section>
</section>
@ -1,144 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="troubleshooting-openstack-object-storage">
<title>Troubleshoot Object Storage</title>
<para>For OpenStack Object Storage, everything is logged in
<filename>/var/log/syslog</filename> (or <filename>messages</filename> on some
distributions). Several settings enable further customization of
logging, such as <option>log_name</option>,
<option>log_facility</option>, and
<option>log_level</option>, within the object server
configuration files.</para>
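<para>For example, the <literal>[DEFAULT]</literal> section of an object server configuration file might contain settings such as the following. The values are illustrative only; choose the syslog facility and log level that fit your environment.</para>
<programlisting language="ini">[DEFAULT]
log_name = object-server
log_facility = LOG_LOCAL2
log_level = INFO</programlisting>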
<section xml:id="handling-drive-failure">
<title>Recover drive failures</title>
<para>If a drive fails, make sure the
drive is unmounted to make it easier for Object
Storage to work around the failure while you resolve
it. If you plan to replace the drive immediately, replace
the drive, format it, remount it, and let replication fill
it.</para>
<para>If you cannot replace the drive immediately, leave it
unmounted and remove the drive from the ring. This enables
the replicas that were on that drive to be rebuilt elsewhere
until you can replace the drive. After you replace the
drive, you can add it to the ring again.</para>
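<para>A hypothetical removal might look like the following commands. The builder file, zone, IP address, port, and device name are examples only; list the ring first to find the correct search value, and repeat the change for the container and object rings.</para>
<programlisting language="bash"># Show the devices in the ring to identify the failed drive
swift-ring-builder account.builder
# Remove the failed device and rebalance so its partitions are reassigned
swift-ring-builder account.builder remove z1-10.0.0.3:6002/sdb
swift-ring-builder account.builder rebalance</programlisting>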
<note>
<para>In Rackspace's experience, error messages in
<filename>/var/log/kern.log</filename> often hint at
impending drive failures. Consider including this
file in your monitoring.</para>
</note>
</section>
<section xml:id="handling-server-failure">
<title>Recover server failures</title>
<para>If a server has hardware issues, make sure that the
Object Storage services are not running. This enables
Object Storage to work around the failure while you
troubleshoot.</para>
<para>If the server needs a reboot or a minimal amount of
work, let Object Storage work around the failure while you
fix the machine and bring it back online. When the machine
comes back online, replication updates anything that was
missing during the downtime.</para>
<para>If the server has more serious issues, remove all server
devices from the ring. After you repair the server and put it
back online, you can add its devices back to the
ring. You must reformat the devices before you add them to
the ring because they might be responsible for a different
set of partitions than before.</para>
</section>
<section xml:id="detecting-failed-drives">
<title>Detect failed drives</title>
<para>When a drive is about to fail, many error messages
appear in the <filename>/var/log/kern.log</filename> file.
You can run the <package>swift-drive-audit</package>
script through <command>cron</command> to watch for bad
drives. If errors are detected, it unmounts the bad drive
so that Object Storage can work around it. The script uses
a configuration file with these settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has been tested on only Ubuntu 10.04. If you
use a different distribution or operating system, take
care before using the script in production.</para>
</section>
<section xml:id="recover-ring-builder-file">
<title>Recover ring builder files (emergency)</title>
<para>You should always keep a backup of Swift ring builder
files. However, if an emergency occurs, use this procedure
to return your cluster to an operational state.</para>
<para>Existing Swift tools do not enable you to recover a
builder file from a <filename>ring.gz</filename> file.
However, if you have Python knowledge, you can construct a
builder file similar to the one you have lost.</para>
<warning>
<para>This procedure is a last resort in an emergency. It
requires knowledge of the Swift Python code and might
not succeed.</para>
</warning>
<procedure>
<step>
<para>Load the ring and a new ringbuilder object in a
Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
</step>
<step>
<para>Copy the data in the ring into the
builder.</para>
<programlisting language="python">>>> import math
>>> from array import array
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)

>>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring._replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(partitions)))
>>> builder._set_parts_wanted()
>>> for d in builder._iter_devs():
        d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
        for dev_id in p2d:
            builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable
fields.</para>
</step>
<step>
<para>For <option>min_part_hours</option>, you must
remember the value that you used previously or
choose a new value.</para>
<programlisting language="python">>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>Validate the builder. If validation does not raise an
exception, the recovery has probably succeeded.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
</step>
<step>
<para>Save the builder.</para>
<programlisting language="python">>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>The <filename>account.builder</filename> file
appears in the current working directory.</para>
</step>
<step>
<para>Run <literal>swift-ring-builder account.builder
write_ring</literal>.</para>
<para>Compare the new
<filename>account.ring.gz</filename> to the
original <filename>account.ring.gz</filename>
file. They might not be byte-for-byte identical,
but if you load them in a REPL and their
<option>_replica2part2dev_id</option> and
<option>devs</option> attributes are the same
(or nearly so), you have succeeded.</para>
</step>
<step>
<para>Repeat this procedure for the
<filename>container.ring.gz</filename> and
<filename>object.ring.gz</filename> files, and
you might get usable builder files.</para>
</step>
</procedure>
</section>
</chapter>
@ -69,7 +69,8 @@ format="PNG" />
</imageobject>
</mediaobject>
</informalfigure>
<para>There will be three hosts in the setup.<table rules="all">
<para>There will be three hosts in the setup.</para>
<table rules="all">
<caption>Hosts for Demo</caption>
<thead>
<tr>
@ -103,7 +104,7 @@ format="PNG" />
<td>Same as HostA</td>
</tr>
</tbody>
</table></para>
</table>
<section xml:id="multi_agent_demo_configuration">
<title>Configuration</title>
<itemizedlist>