Restructure Object Storage chapter of Cloud Admin Guide
Restores Troubleshoot Object Storage. Removes Monitoring section, which was based on a blog.

backport: havana
Closes-Bug: #1251515
author: nermina miller
Change-Id: I580b077a0124d7cd54dced6c0d340e05d5d5f983
@@ -5,6 +5,13 @@
     xml:id="ch_admin-openstack-object-storage">
     <?dbhtml stop-chunking?>
     <title>Object Storage</title>
-    <xi:include href="../common/section_about-object-storage.xml"/>
+    <xi:include href="../common/section_objectstorage-intro.xml"/>
+    <xi:include href="../common/section_objectstorage-features.xml"/>
+    <xi:include href="../common/section_objectstorage-characteristics.xml"/>
+    <xi:include href="../common/section_objectstorage-components.xml"/>
+    <xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
+    <xi:include href="../common/section_objectstorage-arch.xml"/>
+    <xi:include href="../common/section_objectstorage-replication.xml"/>
     <xi:include href="section_object-storage-monitoring.xml"/>
+    <xi:include href="../common/section_objectstorage-troubleshoot.xml"/>
 </chapter>
@@ -3,6 +3,7 @@
     xmlns:xi="http://www.w3.org/2001/XInclude"
     xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
     xml:id="ch_introduction-to-openstack-object-storage-monitoring">
+    <!-- ... Based on a blog, should be replaced with original material... -->
     <title>Object Storage monitoring</title>
     <?dbhtml stop-chunking?>
     <para>Excerpted from a blog post by <link
BIN  doc/common/figures/objectstorage-accountscontainers.png (new file, 32 KiB)
BIN  doc/common/figures/objectstorage-arch.png (new file, 56 KiB)
BIN  doc/common/figures/objectstorage-buildingblocks.png (new file, 48 KiB)
BIN  doc/common/figures/objectstorage-nodes.png (new file, 58 KiB)
BIN  doc/common/figures/objectstorage-partitions.png (new file, 28 KiB)
BIN  doc/common/figures/objectstorage-replication.png (new file, 45 KiB)
BIN  doc/common/figures/objectstorage-ring.png (new file, 23 KiB)
BIN  doc/common/figures/objectstorage-usecase.png (new file, 61 KiB)
BIN  doc/common/figures/objectstorage-zones.png (new file, 10 KiB)
BIN  doc/common/figures/objectstorage.png (new file, 23 KiB)

doc/common/section_objectstorage-account-reaper.xml (new file)
@@ -0,0 +1,40 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-account-reaper">
    <!-- ... Old module003-ch008-account-reaper edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Account reaper</title>
    <para>In the background, the account reaper removes data from the deleted accounts.</para>
    <para>A reseller marks an account for deletion by issuing a <code>DELETE</code> request on the
        account's storage URL. This action sets the <code>status</code> column of the account_stat
        table in the account database and replicas to <code>DELETED</code>, marking the account's
        data for deletion.</para>
    <para>Typically, a specific retention time or undelete is not provided. However, you can set a
        <code>delay_reaping</code> value in the <code>[account-reaper]</code> section of the
        account-server.conf file to delay the actual deletion of data. At this time, to undelete you
        have to update the account database replicas directly, setting the status column to an empty
        string and updating the put_timestamp to be greater than the delete_timestamp.
        <note><para>It's on the developers' to-do list to write a utility that performs this task,
            preferably through a REST call.</para></note>
    </para>
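    <para>As an illustrative sketch only (the option names come from this section; the values, and
        the assumption that they are expressed in seconds, are examples rather than
        recommendations), the relevant part of account-server.conf might look like this:</para>
    <programlisting language="ini">[account-reaper]
# Assumed example: wait one day after an account is marked DELETED before reaping its data.
delay_reaping = 86400
# Assumed example: warn in the log if an account has still not been reaped after 30 days.
reap_warn_after = 2592000</programlisting>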
    <para>The account reaper runs on each account server and scans the server occasionally for
        account databases marked for deletion. It only fires up on the accounts for which the server
        is the primary node, so that multiple account servers aren't trying to do it simultaneously.
        Using multiple servers to delete one account might improve the deletion speed but requires
        coordination to avoid duplication. Speed really is not a big concern with data deletion, and
        large accounts aren't deleted often.</para>
    <para>Deleting an account is simple. For each account container, all objects are deleted and
        then the container is deleted. Deletion requests that fail will not stop the overall process
        but will cause the overall process to fail eventually (for example, if an object delete
        times out, you will not be able to delete the container or the account). The account reaper
        keeps trying to delete an account until it is empty, at which point the database reclaim
        process within the db_replicator will remove the database files.</para>
    <para>A persistent error state may prevent the deletion of an object or container. If this
        happens, you will see a message such as <code>"Account &lt;name&gt; has not been reaped
        since &lt;date&gt;"</code> in the log. You can control when this is logged with the
        <code>reap_warn_after</code> value in the <code>[account-reaper]</code> section of the
        account-server.conf file. The default value is 30 days.</para>
</section>

doc/common/section_objectstorage-arch.xml (new file)
@@ -0,0 +1,75 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-cluster-architecture">
    <!-- ... Old module003-ch007-swift-cluster-architecture edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Cluster architecture</title>
    <section xml:id="section_access-tier">
        <title>Access tier</title>
        <para>Large-scale deployments segment off an access tier, which is considered the Object
            Storage system's central hub. The access tier fields the incoming API requests from
            clients and moves data in and out of the system. This tier consists of front-end load
            balancers, ssl-terminators, and authentication services. It runs the (distributed) brain
            of the Object Storage system—the proxy server processes.</para>
        <figure>
            <title>Object Storage architecture</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-arch.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Because access servers are collocated in their own tier, you can scale out read/write
            access regardless of the storage capacity. For example, if a cluster is on the public
            Internet, requires SSL termination, and has a high demand for data access, you can
            provision many access servers. However, if the cluster is on a private network and used
            primarily for archival purposes, you need fewer access servers.</para>
        <para>Since this is an HTTP addressable storage service, you may incorporate a load balancer
            into the access tier.</para>
        <para>Typically, the tier consists of a collection of 1U servers. These machines use a
            moderate amount of RAM and are network I/O intensive. Since these systems field each
            incoming API request, you should provision them with two high-throughput (10GbE)
            interfaces: one for the incoming "front-end" requests and the other for the "back-end"
            access to the object storage nodes to put and fetch data.</para>
        <section xml:id="section_access-tier-considerations">
            <title>Factors to consider</title>
            <para>For most publicly facing deployments as well as private deployments available
                across a wide-reaching corporate network, you use SSL to encrypt traffic to the
                client. SSL adds significant processing load to establish sessions between clients,
                which is why you have to provision more capacity in the access layer. SSL may not be
                required for private deployments on trusted networks.</para>
        </section>
    </section>
    <section xml:id="section_storage-nodes">
        <title>Storage nodes</title>
        <para>In most configurations, each of the five zones should have an equal amount of storage
            capacity. Storage nodes use a reasonable amount of memory and CPU. Metadata needs to be
            readily available to return objects quickly. The object stores run services not only to
            field incoming requests from the access tier, but also to run replicators, auditors, and
            reapers. You can provision object stores with a single gigabit or 10 gigabit network
            interface depending on the expected workload and desired performance.</para>
        <figure>
            <title>Object Storage (Swift)</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-nodes.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Currently, 2TB or 3TB SATA disks deliver good price/performance value. You can use
            desktop-grade drives if you have responsive remote hands in the datacenter and
            enterprise-grade drives if you don't.</para>
        <section xml:id="section_storage-nodes-considerations">
            <title>Factors to consider</title>
            <para>You should keep in mind the desired I/O performance for single-threaded requests.
                This system does not use RAID, so a single disk handles each request for an object.
                Disk performance impacts single-threaded response rates.</para>
            <para>To achieve apparent higher throughput, the object storage system is designed to
                handle concurrent uploads/downloads. The network I/O capacity (1GbE, bonded 1GbE
                pair, or 10GbE) should match your desired concurrent throughput needs for reads and
                writes.</para>
        </section>
    </section>
</section>

doc/common/section_objectstorage-characteristics.xml (new file)
@@ -0,0 +1,59 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="objectstorage_characteristics">
    <!-- ... Old module003-ch003-obj-store-capabilities edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Object Storage characteristics</title>
    <para>The key characteristics of Object Storage are:</para>
    <itemizedlist>
        <listitem><para>All objects stored in Object Storage have a URL.</para></listitem>
        <listitem><para>All objects stored are replicated 3✕ in as-unique-as-possible zones, which
            can be defined as a group of drives, a node, a rack, and so on.</para></listitem>
        <listitem><para>All objects have their own metadata.</para></listitem>
        <listitem><para>Developers interact with the object storage system through a RESTful HTTP
            API.</para></listitem>
        <listitem><para>Object data can be located anywhere in the cluster.</para></listitem>
        <listitem><para>The cluster scales by adding additional nodes without sacrificing
            performance, which allows a more cost-effective linear storage expansion than fork-lift
            upgrades.</para></listitem>
        <listitem><para>Data doesn't have to be migrated to an entirely new storage
            system.</para></listitem>
        <listitem><para>New nodes can be added to the cluster without downtime.</para></listitem>
        <listitem><para>Failed nodes and disks can be swapped out without downtime.</para></listitem>
        <listitem><para>It runs on industry-standard hardware, such as Dell, HP, and
            Supermicro.</para></listitem>
    </itemizedlist>
    <figure>
        <title>Object Storage (Swift)</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="../common/figures/objectstorage.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>Developers can either write directly to the Swift API or use one of the many client
        libraries that exist for all of the popular programming languages, such as Java, Python,
        Ruby, and C#. Amazon S3 and RackSpace Cloud Files users should be very familiar with Object
        Storage. Users new to object storage systems will have to adjust to a different approach and
        mindset than those required for a traditional filesystem.</para>
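    <para>As an illustration only, the following minimal sketch uses the
        <literal>python-swiftclient</literal> library to store and retrieve an object; the
        authentication URL, credentials, container, and object names are placeholders, not values
        from this guide.</para>
    <programlisting language="python">import swiftclient

# Placeholder credentials for a TempAuth-style endpoint.
conn = swiftclient.Connection(authurl='http://127.0.0.1:8080/auth/v1.0',
                              user='account:user', key='secret')

conn.put_container('photos')                           # create (or reuse) a container
conn.put_object('photos', 'cat.jpg', contents=b'...')  # upload object data
headers, body = conn.get_object('photos', 'cat.jpg')   # download it again
print(headers['etag'], len(body))</programlisting>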
</section>

doc/common/section_objectstorage-components.xml (new file)
@@ -0,0 +1,236 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-components">
    <!-- ... Old module003-ch004-swift-building-blocks edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Components</title>
    <para>The components that enable Object Storage to deliver high availability, high durability,
        and high concurrency are:</para>
    <itemizedlist>
        <listitem><para><emphasis role="bold">Proxy servers—</emphasis>Handle all of the incoming
            API requests.</para></listitem>
        <listitem><para><emphasis role="bold">Rings—</emphasis>Map logical names of data to
            locations on particular disks.</para></listitem>
        <listitem><para><emphasis role="bold">Zones—</emphasis>Isolate data from other zones. A
            failure in one zone doesn't impact the rest of the cluster because data is replicated
            across zones.</para></listitem>
        <listitem><para><emphasis role="bold">Accounts and containers—</emphasis>Each account and
            container are individual databases that are distributed across the cluster. An account
            database contains the list of containers in that account. A container database contains
            the list of objects in that container.</para></listitem>
        <listitem><para><emphasis role="bold">Objects—</emphasis>The data itself.</para></listitem>
        <listitem><para><emphasis role="bold">Partitions—</emphasis>A partition stores objects,
            account databases, and container databases and helps manage locations where data lives
            in the cluster.</para></listitem>
    </itemizedlist>
    <figure>
        <title>Object Storage building blocks</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="../common/figures/objectstorage-buildingblocks.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <section xml:id="section_proxy-servers">
        <title>Proxy servers</title>
        <para>Proxy servers are the public face of Object Storage and handle all of the incoming API
            requests. Once a proxy server receives a request, it determines the storage node based
            on the object's URL, for example, https://swift.example.com/v1/account/container/object.
            Proxy servers also coordinate responses, handle failures, and coordinate
            timestamps.</para>
        <para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on
            projected workloads. A minimum of two proxy servers should be deployed for redundancy.
            If one proxy server fails, the others take over.</para>
    </section>
    <section xml:id="section_ring">
        <title>Rings</title>
        <para>A ring represents a mapping between the names of entities stored on disk and their
            physical locations. There are separate rings for accounts, containers, and objects. When
            other components need to perform any operation on an object, container, or account, they
            need to interact with the appropriate ring to determine their location in the
            cluster.</para>
        <para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each
            partition in the ring is replicated, by default, three times across the cluster, and
            partition locations are stored in the mapping maintained by the ring. The ring is also
            responsible for determining which devices are used for handoff in failure
            scenarios.</para>
        <para>Data can be isolated into zones in the ring. Each partition replica is guaranteed to
            reside in a different zone. A zone could represent a drive, a server, a cabinet, a
            switch, or even a data center.</para>
        <para>The partitions of the ring are equally divided among all of the devices in the Object
            Storage installation. When partitions need to be moved around (for example, if a device
            is added to the cluster), the ring ensures that a minimum number of partitions are moved
            at a time, and only one replica of a partition is moved at a time.</para>
        <para>Weights can be used to balance the distribution of partitions on drives across the
            cluster. This can be useful, for example, when differently sized drives are used in a
            cluster.</para>
        <para>The ring is used by the proxy server and several background processes (like
            replication).</para>
        <figure>
            <title>The <emphasis role="bold">ring</emphasis></title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-ring.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>These rings are externally managed, in that the server processes themselves do not
            modify the rings, they are instead given new rings modified by other tools.</para>
        <para>The ring uses a configurable number of bits from a path's MD5 hash as a partition
            index that designates a device. The number of bits kept from the hash is known as the
            partition power, and 2 to the partition power indicates the partition count.
            Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches
            of items at once, which ends up either more efficient or at least less complex than
            working with each item separately or the entire cluster all at once.</para>
        <para>Another configurable value is the replica count, which indicates how many of the
            partition-device assignments make up a single ring. For a given partition number, each
            replica's device will not be in the same zone as any other replica's device. Zones can
            be used to group devices based on physical locations, power separations, network
            separations, or any other attribute that would improve the availability of multiple
            replicas at the same time.</para>
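        <para>The following minimal Python sketch illustrates the idea of mapping a path to a
            partition with a partition power; the partition power value here is an arbitrary
            example, and real Swift additionally mixes a cluster-wide hash path prefix and suffix
            into the hashed path.</para>
        <programlisting language="python">import hashlib
from struct import unpack_from

PART_POWER = 18               # example value; chosen when the ring is built
PART_SHIFT = 32 - PART_POWER  # bits of the hash that are discarded

def partition_for(account, container=None, obj=None):
    # Keep only the top bits of the path's MD5 hash as the partition index.
    path = '/' + '/'.join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode('utf-8')).digest()
    return unpack_from('>I', digest)[0] >> PART_SHIFT

print(partition_for('AUTH_test', 'photos', 'cat.jpg'))  # an integer in [0, 2**PART_POWER)</programlisting>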
    </section>
    <section xml:id="section_zones">
        <title>Zones</title>
        <para>Object Storage allows configuring zones in order to isolate failure boundaries. Each
            data replica resides in a separate zone, if possible. At the smallest level, a zone
            could be a single drive or a grouping of a few drives. If there were five object storage
            servers, then each server would represent its own zone. Larger deployments would have an
            entire rack (or multiple racks) of object servers, each representing a zone. The goal of
            zones is to allow the cluster to tolerate significant outages of storage servers without
            losing all replicas of the data.</para>
        <para>As mentioned earlier, everything in Object Storage is stored, by default, three times.
            Swift will place each replica "as-uniquely-as-possible" to ensure both high availability
            and high durability. This means that when choosing a replica location, Object Storage
            chooses a server in an unused zone before an unused server in a zone that already has a
            replica of the data.</para>
        <figure>
            <title>Zones</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-zones.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>When a disk fails, replica data is automatically distributed to the other zones to
            ensure there are three copies of the data.</para>
    </section>
    <section xml:id="section_accounts-containers">
        <title>Accounts and containers</title>
        <para>Each account and container is an individual SQLite database that is distributed across
            the cluster. An account database contains the list of containers in that account. A
            container database contains the list of objects in that container.</para>
        <figure>
            <title>Accounts and containers</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-accountscontainers.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>To keep track of object data locations, each account in the system has a database that
            references all of its containers, and each container database references each
            object.</para>
    </section>
    <section xml:id="section_partitions">
        <title>Partitions</title>
        <para>A partition is a collection of stored data, including account databases, container
            databases, and objects. Partitions are core to the replication system.</para>
        <para>Think of a partition as a bin moving throughout a fulfillment center warehouse.
            Individual orders get thrown into the bin. The system treats that bin as a cohesive
            entity as it moves throughout the system. A bin is easier to deal with than many little
            things. It makes for fewer moving parts throughout the system.</para>
        <para>System replicators and object uploads/downloads operate on partitions. As the system
            scales up, its behavior continues to be predictable because the number of partitions is
            a fixed number.</para>
        <para>Implementing a partition is conceptually simple—a partition is just a directory
            sitting on a disk with a corresponding hash table of what it contains.</para>
        <figure>
            <title>Partitions</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-partitions.png"/>
                </imageobject>
            </mediaobject>
        </figure>
    </section>
    <section xml:id="section_replicators">
        <title>Replicators</title>
        <para>In order to ensure that there are three copies of the data everywhere, replicators
            continuously examine each partition. For each local partition, the replicator compares
            it against the replicated copies in the other zones to see if there are any
            differences.</para>
        <para>The replicator knows if replication needs to take place by examining hashes. A hash
            file is created for each partition, which contains hashes of each directory in the
            partition. For a given partition, the hash files for each of the partition's three
            copies are compared. If the hashes are different, then it is time to replicate, and the
            directory that needs to be replicated is copied over.</para>
        <para>This is where partitions come in handy. With fewer things in the system, larger chunks
            of data are transferred around (rather than lots of little TCP connections, which is
            inefficient) and there is a consistent number of hashes to compare.</para>
        <para>The cluster eventually has a consistent behavior where the newest data has a
            priority.</para>
        <figure>
            <title>Replication</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-replication.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>If a zone goes down, one of the nodes containing a replica notices and proactively
            copies data to a handoff location.</para>
    </section>
    <section xml:id="section_usecases">
        <title>Use cases</title>
        <para>The following sections show use cases for object uploads and downloads and introduce
            the components.</para>
        <section xml:id="upload">
            <title>Upload</title>
            <para>A client uses the REST API to make an HTTP request to PUT an object into an
                existing container. The cluster receives the request. First, the system must figure
                out where the data is going to go. To do this, the account name, container name, and
                object name are all used to determine the partition where this object should
                live.</para>
            <para>Then a lookup in the ring figures out which storage nodes contain the partitions
                in question.</para>
            <para>The data is then sent to each storage node where it is placed in the appropriate
                partition. At least two of the three writes must be successful before the client is
                notified that the upload was successful.</para>
            <para>Next, the container database is updated asynchronously to reflect that there is a
                new object in it.</para>
            <figure>
                <title>Object Storage in use</title>
                <mediaobject>
                    <imageobject>
                        <imagedata fileref="../common/figures/objectstorage-usecase.png"/>
                    </imageobject>
                </mediaobject>
            </figure>
        </section>
        <section xml:id="section_swift-component-download">
            <title>Download</title>
            <para>A request comes in for an account/container/object. Using the same consistent
                hashing, the partition name is generated. A lookup in the ring reveals which storage
                nodes contain that partition. A request is made to one of the storage nodes to fetch
                the object and, if that fails, requests are made to the other nodes.</para>
        </section>
    </section>
</section>

doc/common/section_objectstorage-features.xml (new file)
@@ -0,0 +1,180 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage_features">
    <!-- ... Old module003-ch002-features-benefits edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Features and benefits</title>
    <para>
        <informaltable class="c19">
            <tbody>
                <tr>
                    <th>Features</th>
                    <th>Benefits</th>
                </tr>
                <tr>
                    <td><emphasis role="bold">Leverages commodity hardware</emphasis></td>
                    <td>No lock-in, lower price/GB</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">HDD/node failure agnostic</emphasis></td>
                    <td>Self-healing, reliable, data redundancy protects from failures</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Unlimited storage</emphasis></td>
                    <td>Large and flat namespace, highly scalable read/write access, able to serve
                        content directly from storage system</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Multi-dimensional scalability</emphasis></td>
                    <td>Scale-out architecture—Scale vertically and horizontally-distributed
                        storage. Backs up and archives large amounts of data with linear
                        performance</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Account/container/object structure</emphasis></td>
                    <td>No nesting, not a traditional file system—Optimized for scale, it scales
                        to multiple petabytes and billions of objects</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Built-in replication 3✕ + data redundancy (compared
                        with 2✕ on RAID)</emphasis></td>
                    <td>A configurable number of accounts, containers and object copies for high
                        availability</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Easily add capacity (unlike RAID resize)</emphasis></td>
                    <td>Elastic data scaling with ease</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">No central database</emphasis></td>
                    <td>Higher performance, no bottlenecks</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">RAID not required</emphasis></td>
                    <td>Handle many small, random reads and writes efficiently</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Built-in management utilities</emphasis></td>
                    <td>Account management—Create, add, verify, and delete users; Container
                        management—Upload, download, and verify; Monitoring—Capacity, host,
                        network, log trawling, and cluster health</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Drive auditing</emphasis></td>
                    <td>Detect drive failures preempting data corruption</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Expiring objects</emphasis></td>
                    <td>Users can set an expiration time or a TTL on an object to control access</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Direct object access</emphasis></td>
                    <td>Enable direct browser access to content, such as for a control panel</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Realtime visibility into client requests</emphasis></td>
                    <td>Know what users are requesting</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Supports S3 API</emphasis></td>
                    <td>Utilize tools that were designed for the popular S3 API</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Restrict containers per account</emphasis></td>
                    <td>Limit access to control usage by user</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Support for NetApp, Nexenta, SolidFire</emphasis></td>
                    <td>Unified support for block volumes using a variety of storage systems</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Snapshot and backup API for block volumes</emphasis></td>
                    <td>Data protection and recovery for VM data</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Standalone volume API available</emphasis></td>
                    <td>Separate endpoint and API for integration with other compute systems</td>
                </tr>
                <tr>
                    <td><emphasis role="bold">Integration with Compute</emphasis></td>
                    <td>Fully integrated with Compute for attaching block volumes and reporting on
                        usage</td>
                </tr>
            </tbody>
        </informaltable>
    </para>
</section>

doc/common/section_objectstorage-intro.xml (new file)
@@ -0,0 +1,23 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-intro">
    <!-- ... Old module003-ch001-intro-objstore edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Introduction to Object Storage</title>
    <para>OpenStack Object Storage (code-named Swift) is open source software for creating
        redundant, scalable data storage using clusters of standardized servers to store petabytes
        of accessible data. It is a long-term storage system for large amounts of static data that
        can be retrieved, leveraged, and updated. Object Storage uses a distributed architecture
        with no central point of control, providing greater scalability, redundancy, and permanence.
        Objects are written to multiple hardware devices, with the OpenStack software responsible
        for ensuring data replication and integrity across the cluster. Storage clusters scale
        horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its
        content from other active nodes. Because OpenStack uses software logic to ensure data
        replication and distribution across different devices, inexpensive commodity hard drives and
        servers can be used in lieu of more expensive equipment.</para>
    <para>Object Storage is ideal for cost effective, scale-out storage. It provides a fully
        distributed, API-accessible storage platform that can be integrated directly into
        applications or used for backup, archiving, and data retention.</para>
</section>

doc/common/section_objectstorage-replication.xml (new file)
@@ -0,0 +1,99 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-replication">
    <!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Replication</title>
    <para>Because each replica in Object Storage functions independently, and clients generally
        require only a simple majority of nodes responding to consider an operation successful,
        transient failures like network partitions can quickly cause replicas to diverge. These
        differences are eventually reconciled by asynchronous, peer-to-peer replicator processes.
        The replicator processes traverse their local filesystems, concurrently performing
        operations in a manner that balances load across physical disks.</para>
    <para>Replication uses a push model, with records and files generally only being copied from
        local to remote replicas. This is important because data on the node may not belong there
        (as in the case of handoffs and ring changes), and a replicator can't know what data exists
        elsewhere in the cluster that it should pull in. It's the duty of any node that contains
        data to ensure that data gets to where it belongs. Replica placement is handled by the
        ring.</para>
    <para>Every deleted record or file in the system is marked by a tombstone, so that deletions
        can be replicated alongside creations. The replication process cleans up tombstones after a
        time period known as the consistency window. The consistency window encompasses replication
        duration and how long a transient failure can remove a node from the cluster. Tombstone
        cleanup must be tied to replication to reach replica convergence.</para>
    <para>If a replicator detects that a remote drive has failed, the replicator uses the
        get_more_nodes interface for the ring to choose an alternate node with which to synchronize.
        The replicator can maintain desired levels of replication in the face of disk failures,
        though some replicas may not be in an immediately usable location. Note that the replicator
        doesn't maintain desired levels of replication when other failures, such as entire node
        failures, occur because most failures are transient.</para>
    <para>Replication is an area of active development, and likely rife with potential improvements
        to speed and correctness.</para>
    <para>There are two major classes of replicator—the db replicator, which replicates accounts
        and containers, and the object replicator, which replicates object data.</para>
    <section xml:id="section_database-replication">
        <title>Database replication</title>
        <para>The first step performed by db replication is a low-cost hash comparison to determine
            whether two replicas already match. Under normal operation, this check is able to verify
            that most databases in the system are already synchronized very quickly. If the hashes
            differ, the replicator brings the databases in sync by sharing records added since the
            last sync point.</para>
        <para>This sync point is a high water mark noting the last record at which two databases
            were known to be in sync, and is stored in each database as a tuple of the remote
            database id and record id. Database ids are unique amongst all replicas of the database,
            and record ids are monotonically increasing integers. After all new records have been
            pushed to the remote database, the entire sync table of the local database is pushed, so
            the remote database can guarantee that it is in sync with everything with which the
            local database has previously synchronized.</para>
        <para>If a replica is found to be missing entirely, the whole local database file is
            transmitted to the peer using rsync(1) and vested with a new unique id.</para>
        <para>In practice, DB replication can process hundreds of databases per concurrency setting
            per second (up to the number of available CPUs or disks) and is bound by the number of
            DB transactions that must be performed.</para>
    </section>
    <section xml:id="section_object-replication">
        <title>Object replication</title>
        <para>The initial implementation of object replication simply performed an rsync to push
            data from a local partition to all remote servers it was expected to exist on. While
            this performed adequately at small scale, replication times skyrocketed once directory
            structures could no longer be held in RAM. We now use a modification of this scheme in
            which a hash of the contents for each suffix directory is saved to a per-partition
            hashes file. The hash for a suffix directory is invalidated when the contents of that
            suffix directory are modified.</para>
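        <para>Conceptually, computing one hash per suffix directory of a partition looks like the
            sketch below. This is an illustration of the idea only, not Swift's actual hashing code,
            and the on-disk layout it walks is assumed.</para>
        <programlisting language="python">import hashlib
import os

def suffix_hashes(partition_dir):
    """Return {suffix_dir_name: hash} for every suffix directory in a partition."""
    hashes = {}
    for suffix in sorted(os.listdir(partition_dir)):
        md5 = hashlib.md5()
        for root, _dirs, files in os.walk(os.path.join(partition_dir, suffix)):
            for name in sorted(files):
                # Hash the object file names (which embed timestamps), so any change
                # in a suffix directory changes that directory's hash.
                md5.update(name.encode('utf-8'))
        hashes[suffix] = md5.hexdigest()
    return hashes</programlisting>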
        <para>The object replication process reads in these hash files, calculating any invalidated
            hashes. It then transmits the hashes to each remote server that should hold the
            partition, and only suffix directories with differing hashes on the remote server are
            rsynced. After pushing files to the remote server, the replication process notifies it
            to recalculate hashes for the rsynced suffix directories.</para>
        <para>Performance of object replication is generally bound by the number of uncached
            directories it has to traverse, usually as a result of invalidated suffix directory
            hashes. Using write volume and partition counts from our running systems, it was
            designed so that around 2 percent of the hash space on a normal node will be invalidated
            per day, which has experimentally given us acceptable replication speeds.</para>
    </section>
</section>

doc/common/section_objectstorage-ringbuilder.xml (new file)
@@ -0,0 +1,129 @@
<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-ringbuilder">
    <!-- ... Old module003-ch005-the-ring edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Ring-builder</title>
    <para>Rings are built and managed manually by a utility called the ring-builder. The
        ring-builder assigns partitions to devices and writes an optimized Python structure to a
        gzipped, serialized file on disk for shipping out to the servers. The server processes just
        check the modification time of the file occasionally and reload their in-memory copies of
        the ring structure as needed. Because of how the ring-builder manages changes to the ring,
        using a slightly older ring usually just means one of the three replicas for a subset of the
        partitions will be incorrect, which can be easily worked around.</para>
    <para>The ring-builder also keeps its own builder file with the ring information and additional
        data required to build future rings. It is very important to keep multiple backup copies of
        these builder files. One option is to copy the builder files out to every server while
        copying the ring files themselves. Another is to upload the builder files into the cluster
        itself. If you lose the builder file, you have to create a new ring from scratch. Nearly all
        partitions would be assigned to different devices and, therefore, nearly all of the stored
        data would have to be replicated to new locations. So, recovery from a builder file loss is
        possible, but data would be unreachable for an extended time.</para>
    <section xml:id="section_ring-data-structure">
        <title>Ring data structure</title>
        <para>The ring data structure consists of three top level fields: a list of devices in the
            cluster, a list of lists of device ids indicating partition to device assignments, and
            an integer indicating the number of bits to shift an MD5 hash to calculate the partition
            for the hash.</para>
    </section>
    <section xml:id="section_partition-assignment">
        <title>Partition assignment list</title>
        <para>This is a list of <literal>array('H')</literal> of device ids. The outermost list
            contains an <literal>array('H')</literal> for each replica. Each
            <literal>array('H')</literal> has a length equal to the partition count for the ring.
            Each integer in the <literal>array('H')</literal> is an index into the above list of
            devices. The partition list is known internally to the Ring class as
            <literal>_replica2part2dev_id</literal>.</para>
        <para>So, to create a list of device dictionaries assigned to a partition, the Python code
            would look like:
            <programlisting>devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]</programlisting></para>
        <para>That code is a little simplistic, as it does not account for the removal of duplicate
            devices. If a ring has more replicas than devices, then a partition will have more than
            one replica on one device.</para>
        <para><literal>array('H')</literal> is used for memory conservation as there may be millions
            of partitions.</para>
    </section>
    <section xml:id="section_fractional-replicas">
        <title>Fractional replicas</title>
        <para>A ring is not restricted to having an integer number of replicas. In order to support
            the gradual changing of replica counts, the ring is able to have a real number of
            replicas.</para>
        <para>When the number of replicas is not an integer, then the last element of
            <literal>_replica2part2dev_id</literal> will have a length that is less than the
            partition count for the ring. This means that some partitions will have more replicas
            than others. For example, if a ring has 3.25 replicas, then 25 percent of its partitions
            will have four replicas, while the remaining 75 percent will have just three.</para>
    </section>
    <section xml:id="section_partition-shift-value">
        <title>Partition shift value</title>
        <para>The partition shift value is known internally to the Ring class as
            <literal>_part_shift</literal>. This value is used to shift an MD5 hash to calculate the
            partition on which the data for that hash should reside. Only the top four bytes of the
            hash are used in this process. For example, to compute the partition for the path
            /account/container/object the Python code might look like:
            <programlisting>partition = unpack_from('>I',
md5('/account/container/object').digest())[0] >>
self._part_shift</programlisting></para>
        <para>For a ring generated with part_power P, the partition shift value is
            <literal>32 - P</literal>.</para>
    </section>
    <section xml:id="section_build-ring">
        <title>Build the ring</title>
        <para>The initial building of the ring first calculates the number of partitions that should
            ideally be assigned to each device based on the device's weight. For example, given a
            partition power of 20, the ring will have 1,048,576 partitions. If there are 1,000
            devices of equal weight they will each desire 1,048.576 partitions. The devices are then
            sorted by the number of partitions they desire and kept in order throughout the
            initialization process.</para>
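        <para>As a quick illustrative sketch of that arithmetic (the device names and weights below
            are placeholders, and the real ring-builder also factors in the replica count when
            computing how many partitions each device wants):</para>
        <programlisting language="python">PART_POWER = 20
partitions = 2 ** PART_POWER                            # 1,048,576 partitions in the ring
weights = {'sdb-%03d' % i: 100.0 for i in range(1000)}  # 1,000 devices of equal weight
total_weight = sum(weights.values())

# Ideal (fractional) number of partitions each device desires, per the example above.
desired = {dev: partitions * w / total_weight for dev, w in weights.items()}
print(desired['sdb-000'])   # 1048.576</programlisting>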
        <note><para>Each device is also assigned a random tiebreaker value that is used when two
            devices desire the same number of partitions. This tiebreaker is not stored on disk
            anywhere, and so two different rings created with the same parameters will have
            different partition assignments. For repeatable partition assignments,
            <literal>RingBuilder.rebalance()</literal> takes an optional seed value that will be
            used to seed Python's pseudo-random number generator.</para></note>
        <para>Then, the ring builder assigns each replica of each partition to the device that
            requires most partitions at that point while keeping it as far away as possible from
            other replicas. The ring builder prefers to assign a replica to a device in a region
            that does not already have a replica. If no such region is available, the ring builder
            tries to find a device in a different zone. If that's not possible, it will look on a
            different server. If it doesn't find one there, it will just look for a device that has
            no replicas. Finally, if all of the other options are exhausted, the ring builder
            assigns the replica to the device that has the fewest replicas already assigned. Note
            that assignment of multiple replicas to one device will only happen if the ring has
            fewer devices than it has replicas.</para>
        <para>When building a new ring based on an old ring, the desired number of partitions each
            device wants is recalculated. Next, the partitions to be reassigned are gathered up. Any
            removed devices have all their assigned partitions unassigned and added to the gathered
            list. Any partition replicas that (due to the addition of new devices) can be spread out
            for better durability are unassigned and added to the gathered list. Any devices that
            have more partitions than they now need have random partitions unassigned from them and
            added to the gathered list. Lastly, the gathered partitions are then reassigned to
            devices using a similar method as in the initial assignment described above.</para>
        <para>Whenever a partition has a replica reassigned, the time of the reassignment is
            recorded. This is taken into account when gathering partitions to reassign so that no
            partition is moved twice in a configurable amount of time. This configurable amount of
            time is known internally to the RingBuilder class as <literal>min_part_hours</literal>.
            This restriction is ignored for replicas of partitions on devices that have been
            removed, since removing a device only happens on device failure and reassignment is the
            only choice.</para>
        <para>The above processes don't always perfectly rebalance a ring due to the random nature
            of gathering partitions for reassignment. To help reach a more balanced ring, the
            rebalance process is repeated until near perfect (less than 1 percent off) or when the
            balance doesn't improve by at least 1 percent (indicating we probably can't get perfect
            balance due to wildly imbalanced zones or too many partitions recently moved).</para>
    </section>
</section>
106
doc/common/section_objectstorage-troubleshoot.xml
Normal file
@@ -0,0 +1,106 @@
|
|||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<section xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
|
||||||
|
xml:id="troubleshooting-openstack-object-storage">
|
||||||
|
<title>Troubleshoot Object Storage</title>
|
||||||
|
<para>For Object Storage, everything is logged in <filename>/var/log/syslog</filename> (or messages on some distros).
|
||||||
|
Several settings enable further customization of logging, such as <literal>log_name</literal>, <literal>log_facility</literal>,
|
||||||
|
and <literal>log_level</literal>, within the object server configuration files.</para>
|
||||||
|
<section xml:id="drive-failure">
|
||||||
|
<title>Drive failure</title>
|
||||||
|
<para>In the event that a drive has failed, the first step is to make sure the drive is
|
||||||
|
unmounted. This will make it easier for Object Storage to work around the failure until
|
||||||
|
it has been resolved. If the drive is going to be replaced immediately, then it is just
|
||||||
|
best to replace the drive, format it, remount it, and let replication fill it up.</para>
|
||||||
|
<para>If the drive can’t be replaced immediately, then it is best to leave it
|
||||||
|
unmounted, and remove the drive from the ring. This will allow all the replicas
|
||||||
|
that were on that drive to be replicated elsewhere until the drive is replaced.
|
||||||
|
Once the drive is replaced, it can be re-added to the ring.</para>
|
||||||
|
<para>You can look at error messages in <filename>/var/log/kern.log</filename> for hints of drive failure.</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="server-failure">
|
||||||
|
<title>Server failure</title>
|
||||||
|
<para>If a server is having hardware issues, it is a good idea to make sure the
|
||||||
|
Object Storage services are not running. This will allow Object Storage to
|
||||||
|
work around the failure while you troubleshoot.</para>
<para>If the server just needs a reboot, or a small amount of work that should only
last a couple of hours, it is probably best to let Object Storage work
around the failure while you get the machine fixed and back online. When the machine
comes back online, replication makes sure that anything that was missed
during the downtime gets updated.</para>
<para>If the server has more serious issues, it is probably best to remove all
of the server's devices from the ring. Once the server has been repaired and is
back online, the server's devices can be added back into the ring. It is
important to reformat the devices before putting them back into the
ring, because they are likely to be responsible for a different set of partitions than
before.</para>
</section>
<section xml:id="detect-failed-drives">
<title>Detect failed drives</title>
<para>In our experience, when a drive is about to fail, error messages spew into
<filename>/var/log/kern.log</filename>. There is a script called <literal>swift-drive-audit</literal> that can be run via cron
to watch for bad drives. If errors are detected, it unmounts the bad drive so that
Object Storage can work around it. The script takes a configuration file with the
following settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has only been tested on Ubuntu 10.04, so if you are using a
different distribution or operating system, take some care before using it in
production.</para>
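<para>A minimal sketch of such a cron entry (the ten-minute interval, the binary path, and
the configuration file path <filename>/etc/swift/drive-audit.conf</filename> are
assumptions to adapt for your deployment):</para>
<programlisting># Illustrative /etc/cron.d entry for swift-drive-audit
*/10 * * * * root /usr/bin/swift-drive-audit /etc/swift/drive-audit.conf</programlisting>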
</section>
<section xml:id="recover-ring-builder-file">
<title>Emergency recovery of ring builder files</title>
<para>You should always keep a backup of Swift ring builder files. However, if an
emergency occurs, this procedure may assist in returning your cluster to an
operational state.</para>
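<para>A trivial illustration of such a backup (the paths are assumptions; any copy kept
off the node works just as well):</para>
<programlisting># Illustrative: keep dated copies of the builder files
$ mkdir -p /var/backups/swift-rings/$(date +%F)
$ cp /etc/swift/*.builder /var/backups/swift-rings/$(date +%F)/</programlisting>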
<para>There is no way to recover a builder file from a
<filename>ring.gz</filename> file using the existing Swift tools. However, if you have some knowledge of Python, it is possible to
construct a builder file that is pretty close to the one you have lost. Here is
what you need to do.</para>
<warning>
<para>This procedure is a last resort for emergency circumstances; it
requires knowledge of the swift Python code and may not succeed.</para>
</warning>
<para>First, load the ring and a new ringbuilder object in a Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
<para>Now, start copying the data we have in the ring into the builder.</para>
<programlisting language="python">
>>> import math
>>> from array import array
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)

>>> # The part power is recovered from the partition count
>>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring._replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(builder.parts)))
>>> builder._set_parts_wanted()
>>> # Recompute each device's partition count from the replica-to-device table
>>> for d in builder._iter_devs():
...     d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
...     for dev_id in p2d:
...         builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable fields. For
<literal>min_part_hours</literal> you'll either have to remember the value you used,
or just make up a new one.</para>
<programlisting language="python">
>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>Try some validation: if this doesn't raise an exception, you may feel some
hope. Not too much, though.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
<para>Save the builder.</para>
<programlisting language="python">
>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>You should now have a file called <filename>account.builder</filename> in the current working
directory. Next, run <literal>swift-ring-builder account.builder write_ring</literal>
and compare the new <filename>account.ring.gz</filename> to the <filename>account.ring.gz</filename> that you started
from. They probably won't be byte-for-byte identical, but if you load them up
in a REPL and their <literal>_replica2part2dev_id</literal> and
<literal>devs</literal> attributes are the same (or nearly so), then you're
in good shape.</para>
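<para>A quick way to do that comparison, sketched here as a hypothetical REPL session
(the file paths are examples):</para>
<programlisting language="python">>>> # Illustrative comparison; adjust paths for your environment
>>> from swift.common.ring import RingData
>>> old_ring = RingData.load('/path/to/original/account.ring.gz')
>>> new_ring = RingData.load('account.ring.gz')
>>> old_ring.devs == new_ring.devs  # expect True, or inspect the differences
>>> old_ring._replica2part2dev_id == new_ring._replica2part2dev_id  # likewise</programlisting>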
<para>Next, repeat the procedure for <filename>container.ring.gz</filename>
and <filename>object.ring.gz</filename>, and you might get usable builder files.</para>
</section>
</section>
@@ -1,144 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="troubleshooting-openstack-object-storage">
<title>Troubleshoot Object Storage</title>
<para>For OpenStack Object Storage, everything is logged in
<filename>/var/log/syslog</filename> (or messages on some
distros). Several settings enable further customization of
logging, such as <option>log_name</option>,
<option>log_facility</option>, and
<option>log_level</option>, within the object server
configuration files.</para>
<section xml:id="handling-drive-failure">
<title>Recover drive failures</title>
<para>If a drive fails, make sure the
drive is unmounted to make it easier for Object
Storage to work around the failure while you resolve
it. If you plan to replace the drive immediately, replace
the drive, format it, remount it, and let replication fill
it.</para>
<para>If you cannot replace the drive immediately, leave it
unmounted and remove the drive from the ring. This enables
you to replicate all the replicas on that drive elsewhere
until you can replace the drive. After you replace the
drive, you can add it to the ring again.</para>
<note>
<para>Rackspace has seen hints at drive failures by
looking at error messages in
<filename>/var/log/kern.log</filename>. Check this
file in your monitoring.</para>
</note>
</section>
<section xml:id="handling-server-failure">
<title>Recover server failures</title>
<para>If a server has hardware issues, make sure that the
Object Storage services are not running. This enables
Object Storage to work around the failure while you
troubleshoot.</para>
<para>If the server needs a reboot or a minimal amount of
work, let Object Storage work around the failure while you
fix the machine and get it back online. When the machine
comes back online, replication updates anything that was
missing during the downtime.</para>
<para>If the server has more serious issues, remove all server
devices from the ring. After you repair and put the server
online, you can add the devices for the server back to the
ring. You must reformat the devices before you add them to
the ring because they might be responsible for a different
set of partitions than before.</para>
</section>
<section xml:id="detecting-failed-drives">
<title>Detect failed drives</title>
<para>When a drive is about to fail, many error messages
appear in the <filename>/var/log/kern.log</filename> file.
You can run the <package>swift-drive-audit</package>
script through <command>cron</command> to watch for bad
drives. If errors are detected, it unmounts the bad drive
so that Object Storage can work around it. The script uses
a configuration file with these settings:</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
<para>This script has been tested on only Ubuntu 10.04. If you
use a different distribution or operating system, take
care before using the script in production.</para>
</section>
<section xml:id="recover-ring-builder-file">
<title>Recover ring builder files (emergency)</title>
<para>You should always keep a backup of Swift ring builder
files. However, if an emergency occurs, use this procedure
to return your cluster to an operational state.</para>
<para>Existing Swift tools do not enable you to recover a
builder file from a <filename>ring.gz</filename> file.
However, if you have Python knowledge, you can construct a
builder file similar to the one you have lost.</para>
<warning>
<para>This procedure is a last-resort in an emergency. It
requires knowledge of the swift Python code and might
not succeed.</para>
</warning>
<procedure>
<step>
<para>Load the ring and a new ringbuilder object in a
Python REPL:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
</step>
<step>
<para>Copy the data in the ring into the
builder.</para>
<programlisting language="python">>>> import math
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)

>>> builder = RingBuilder(int(Math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring.replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(self.parts)))
>>> builder._set_parts_wanted()
>>> for d in builder._iter_devs():
d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
for dev_id in p2d:
builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable
fields.</para>
</step>
<step>
<para>For <option>min_part_hours</option>, you must
remember the value that you used previously or
create a new value.</para>
<programlisting language="python">>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>If validation succeeds without raising an
exception, you have succeeded.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
</step>
<step>
<para>Save the builder.</para>
<programlisting language="python">>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>The <filename>account.builder</filename> file
appears in the current working directory.</para>
</step>
<step>
<para>Run <literal>swift-ring-builder account.builder
write_ring</literal>.</para>
<para>Compare the new
<filename>account.ring.gz</filename> to the
original <filename>account.ring.gz</filename>
file. They might not be byte-for-byte identical,
but if you load them in REPL and their
<option>_replica2part2dev_id</option> and
<option>devs</option> attributes are the same
(or nearly so), you have succeeded.</para>
</step>
<step>
<para>Repeat this procedure for the
<filename>container.ring.gz</filename> and
<filename>object.ring.gz</filename> files, and
you might get usable builder files.</para>
</step>
</procedure>
</section>
</chapter>
@@ -69,7 +69,8 @@ format="PNG" />
</imageobject>
</mediaobject>
</informalfigure>
<para>There will be three hosts in the setup.<table rules="all">
<para>There will be three hosts in the setup.</para>
<table rules="all">
<caption>Hosts for Demo</caption>
<thead>
<tr>
@@ -103,7 +104,7 @@ format="PNG" />
<td>Same as HostA</td>
</tr>
</tbody>
</table></para>
</table>
<section xml:id="multi_agent_demo_configuration">
<title>Configuration</title>
<itemizedlist>