<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-components">
    <title>Components</title>
    <para>The components that enable Object Storage to deliver high availability, high
        durability, and high concurrency are:</para>
    <itemizedlist>
        <listitem>
            <para><emphasis role="bold">Proxy servers.</emphasis> Handle all of the incoming
                API requests.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Rings.</emphasis> Map logical names of data to
                locations on particular disks.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Zones.</emphasis> Isolate data from other zones. A
                failure in one zone doesn’t impact the rest of the cluster because data is
                replicated across zones.</para>
        </listitem>
        <listitem>
<para><emphasis role="bold">Accounts and containers.</emphasis> Each account and
|
2013-12-10 07:46:28 +00:00
|
|
|
|
container are individual databases that are distributed across the cluster. An
|
|
|
|
|
account database contains the list of containers in that account. A container
|
|
|
|
|
database contains the list of objects in that container.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Objects.</emphasis> The data itself.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Partitions.</emphasis> A partition stores objects,
                account databases, and container databases and helps manage locations where data
                lives in the cluster.</para>
        </listitem>
    </itemizedlist>
    <figure>
        <title>Object Storage building blocks</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="../common/figures/objectstorage-buildingblocks.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <section xml:id="section_proxy-servers">
        <title>Proxy servers</title>
        <para>Proxy servers are the public face of Object Storage and handle all of the incoming API
            requests. Once a proxy server receives a request, it determines the storage node based
            on the object's URL, for example, https://swift.example.com/v1/account/container/object.
            Proxy servers also coordinate responses, handle failures, and coordinate
            timestamps.</para>
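        <para>For illustration, the account, container, and object names can be read straight
            from that request path. The short Python sketch below shows the split; the helper
            name and URL are placeholders for this example, not part of the Swift code base:</para>
        <programlisting language="python">from urllib.parse import urlparse

def split_object_url(url):
    """Split a URL of the form https://host/v1/account/container/object."""
    path = urlparse(url).path            # "/v1/account/container/object"
    version, account, container, obj = path.lstrip("/").split("/", 3)
    return account, container, obj

print(split_object_url(
    "https://swift.example.com/v1/account/container/object"))
# ('account', 'container', 'object')</programlisting>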
        <para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on
            projected workloads. A minimum of two proxy servers should be deployed for redundancy.
            If one proxy server fails, the others take over.</para>
    </section>
    <section xml:id="section_ring">
        <title>Rings</title>
        <para>A ring represents a mapping between the names of entities stored on disk and their
            physical locations. There are separate rings for accounts, containers, and objects. When
            other components need to perform any operation on an object, container, or account, they
            need to interact with the appropriate ring to determine their location in the
            cluster.</para>
        <para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each
            partition in the ring is replicated, by default, three times across the cluster, and
            partition locations are stored in the mapping maintained by the ring. The ring is also
            responsible for determining which devices are used for handoff in failure
            scenarios.</para>
        <para>Data can be isolated into zones in the ring. Each partition replica is guaranteed to
            reside in a different zone. A zone could represent a drive, a server, a cabinet, a
            switch, or even a data center.</para>
        <para>The partitions of the ring are equally divided among all of the devices in the Object
            Storage installation. When partitions need to be moved around (for example, if a device
            is added to the cluster), the ring ensures that a minimum number of partitions are moved
            at a time, and only one replica of a partition is moved at a time.</para>
        <para>You can use weights to balance the distribution of partitions on drives across the
            cluster. This can be useful, for example, when differently sized drives are used in a
            cluster.</para>
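        <para>As a rough sketch of how weights influence placement, the share of partition
            replicas a drive should receive is proportional to its weight. The drive names,
            weights, and partition power below are invented for illustration:</para>
        <programlisting language="python"># Hypothetical drives: weight is typically proportional to drive size.
drives = {"sdb1": 100.0, "sdc1": 100.0, "sdd1": 200.0}

part_power = 10                      # 2**10 = 1024 partitions
replicas = 3
total_replica_slots = (2 ** part_power) * replicas

total_weight = sum(drives.values())
for name, weight in drives.items():
    desired = total_replica_slots * weight / total_weight
    print(f"{name}: ~{desired:.0f} partition replicas")
# sdd1 ends up with roughly twice as many replicas as sdb1 or sdc1.</programlisting>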
        <para>The ring is used by the proxy server and several background processes (like
            replication).</para>
        <figure>
            <title>The <emphasis role="bold">ring</emphasis></title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-ring.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>These rings are externally managed, in that the server processes themselves do not
            modify the rings; instead, they are given new rings modified by other tools.</para>
        <para>The ring uses a configurable number of bits from an MD5 hash of a path as a
            partition index that designates a device. The number of bits kept from the hash is
            known as the partition power, and 2 to the partition power indicates the partition
            count. Partitioning the full MD5 hash ring allows other parts of the cluster to work
            in batches of items at once, which is more efficient, or at least less complex, than
            working with each item separately or with the entire cluster all at once.</para>
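        <para>The arithmetic is compact. The sketch below keeps the top partition-power bits of
            the MD5 hash of a path; it is a simplified illustration rather than Swift's exact
            ring code, and the partition power and path are example values:</para>
        <programlisting language="python">import hashlib

part_power = 18                       # 2**18 = 262,144 partitions
part_shift = 128 - part_power         # an MD5 digest is 128 bits wide

def partition_for(path):
    """Map an /account/container/object path to a partition number by
    keeping the top part_power bits of its MD5 hash."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) >> part_shift

part = partition_for("/account/container/object")
print(part)                           # between 0 and 2**part_power - 1</programlisting>
        <para>The ring then maps the resulting partition number to the set of devices that hold
            that partition's replicas.</para>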
        <para>Another configurable value is the replica count, which indicates how many of the
            partition-device assignments make up a single ring. For a given partition number, each
            replica's device will not be in the same zone as any other replica's device. Zones can
            be used to group devices based on physical locations, power separations, network
            separations, or any other attribute that would improve the availability of multiple
            replicas at the same time.</para>
    </section>
    <section xml:id="section_zones">
        <title>Zones</title>
        <para>Object Storage allows configuring zones in order to isolate failure boundaries.
            Each data replica resides in a separate zone, if possible. At the smallest level, a zone
            could be a single drive or a grouping of a few drives. If there were five object storage
            servers, then each server would represent its own zone. Larger deployments would have an
            entire rack (or multiple racks) of object servers, each representing a zone. The goal of
            zones is to allow the cluster to tolerate significant outages of storage servers without
            losing all replicas of the data.</para>
        <para>As mentioned earlier, everything in Object Storage is stored, by default, three
            times. Swift will place each replica "as-uniquely-as-possible" to ensure both high
            availability and high durability. This means that when choosing a replica location,
            Object Storage chooses a server in an unused zone before an unused server in a zone that
            already has a replica of the data.</para>
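        <para>A greedy sketch of that "as-uniquely-as-possible" preference is shown below. The
            device list is hypothetical, and the real ring builder also takes device weights and
            other factors into account, which this sketch ignores:</para>
        <programlisting language="python"># Hypothetical devices: (device name, zone) pairs for illustration only.
devices = [("d1", 1), ("d2", 1), ("d3", 2), ("d4", 3), ("d5", 3)]

def place_replicas(devices, replica_count=3):
    """Greedy sketch: prefer a device in a zone that has no replica yet."""
    chosen, used_zones = [], set()
    for _ in range(replica_count):
        # First try devices in unused zones, then fall back to any unused device.
        candidates = [d for d in devices
                      if d not in chosen and d[1] not in used_zones]
        if not candidates:
            candidates = [d for d in devices if d not in chosen]
        dev = candidates[0]
        chosen.append(dev)
        used_zones.add(dev[1])
    return chosen

print(place_replicas(devices))   # one device from each of zones 1, 2, and 3</programlisting>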
        <figure>
            <title>Zones</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-zones.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>When a disk fails, replica data is automatically distributed to the other zones to
            ensure there are three copies of the data.</para>
    </section>
    <section xml:id="section_accounts-containers">
        <title>Accounts and containers</title>
        <para>Each account and container is an individual SQLite
            database that is distributed across the cluster. An
            account database contains the list of containers in
            that account. A container database contains the list
            of objects in that container.</para>
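        <para>As a rough illustration of what a per-container database means, the sketch below
            creates a toy container database with a single object table and lists its rows. The
            schema and data are invented for this example and are not Swift's actual container
            schema:</para>
        <programlisting language="python">import sqlite3

# Toy stand-in for a container database; much simpler than the real one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER)")
db.executemany("INSERT INTO object VALUES (?, ?)",
               [("photos/cat.jpg", 48211), ("photos/dog.jpg", 51307)])

for name, size in db.execute("SELECT name, size FROM object ORDER BY name"):
    print(name, size)</programlisting>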
        <figure>
            <title>Accounts and containers</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-accountscontainers.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>To keep track of object data locations, each account in the system has a database
            that references all of its containers, and each container database references each
            object.</para>
    </section>
    <section xml:id="section_partitions">
        <title>Partitions</title>
        <para>A partition is a collection of stored data, including account databases, container
            databases, and objects. Partitions are core to the replication system.</para>
        <para>Think of a partition as a bin moving throughout a fulfillment center warehouse.
            Individual orders get thrown into the bin. The system treats that bin as a cohesive
            entity as it moves throughout the system. A bin is easier to deal with than many little
            things. It makes for fewer moving parts throughout the system.</para>
        <para>System replicators and object uploads/downloads operate on partitions. As the
            system scales up, its behavior continues to be predictable because the number of
            partitions is a fixed number.</para>
        <para>Implementing a partition is conceptually simple: a partition is just a
            directory sitting on a disk with a corresponding hash table of what it contains.</para>
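        <para>On an object server this typically appears as one numbered directory per partition
            under each device's objects directory. The walk below only illustrates that layout;
            /srv/node is the conventional devices root and may differ in your deployment:</para>
        <programlisting language="python">import os

# Conventional devices root on an object server; adjust for your deployment.
DEVICES = "/srv/node"

for device in sorted(os.listdir(DEVICES)):
    objects_dir = os.path.join(DEVICES, device, "objects")
    if not os.path.isdir(objects_dir):
        continue
    partitions = sorted(os.listdir(objects_dir))
    print(f"{device}: {len(partitions)} partitions, e.g. {partitions[:3]}")</programlisting>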
        <figure>
            <title>Partitions</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-partitions.png"/>
                </imageobject>
            </mediaobject>
        </figure>
    </section>
    <section xml:id="section_replicators">
        <title>Replicators</title>
        <para>In order to ensure that there are three copies of the data everywhere, replicators
            continuously examine each partition. For each local partition, the replicator compares
            it against the replicated copies in the other zones to see if there are any
            differences.</para>
        <para>The replicator knows if replication needs to take place by examining hashes. A hash
            file is created for each partition, which contains hashes of each directory in the
            partition. For a given partition, the hash files for each of the partition's copies
            are compared. If the hashes are different, then it is time to replicate, and the
            directory that needs to be replicated is copied over.</para>
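        <para>The comparison itself amounts to diffing two small maps of per-directory hashes.
            The sketch below assumes each copy's hash file has already been loaded into a
            dictionary; the directory names and hash values are made up for illustration:</para>
        <programlisting language="python"># Per-directory hashes for two copies of the same partition (made-up values).
local_hashes = {"0ab": "d41d8cd9", "1f3": "9e107d9d", "a77": "6f1ed002"}
remote_hashes = {"0ab": "d41d8cd9", "1f3": "0cc175b9", "a77": "6f1ed002"}

# Any directory whose hash differs (or is missing remotely) must be pushed.
to_sync = [suffix for suffix, digest in local_hashes.items()
           if remote_hashes.get(suffix) != digest]

print(to_sync)   # ['1f3'] -- only that directory needs to be copied over</programlisting>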
        <para>This is where partitions come in handy. With fewer things in the system, larger
            chunks of data are transferred around (rather than lots of little TCP connections, which
            is inefficient) and there is a consistent number of hashes to compare.</para>
        <para>The cluster eventually reaches a consistent state in which the newest data takes
            priority.</para>
        <figure>
            <title>Replication</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-replication.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>If a zone goes down, one of the nodes containing a replica notices and proactively
            copies data to a handoff location.</para>
    </section>
    <section xml:id="section_usecases">
        <title>Use cases</title>
        <para>The following sections show how these components work together during object
            uploads and downloads.</para>
        <section xml:id="upload">
            <title>Upload</title>
            <para>A client uses the REST API to make an HTTP request to PUT an object into an existing
                container. The cluster receives the request. First, the system must figure out where
                the data is going to go. To do this, the account name, container name, and object
                name are all used to determine the partition where this object should live.</para>
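            <para>A minimal client-side version of that request, using only the Python standard
                library, might look like the sketch below; the host, account path, and token are
                placeholders for whatever your authentication service returns:</para>
            <programlisting language="python">import http.client

# Placeholders: in a real cluster these come from your auth service.
STORAGE_HOST = "swift.example.com"
STORAGE_PATH = "/v1/AUTH_account"     # account portion of the storage URL
TOKEN = "AUTH_tk_example"

conn = http.client.HTTPSConnection(STORAGE_HOST)
conn.request(
    "PUT",
    STORAGE_PATH + "/container/object",
    body=b"hello object storage",
    headers={"X-Auth-Token": TOKEN, "Content-Type": "text/plain"},
)
resp = conn.getresponse()
print(resp.status)    # 201 Created on success</programlisting>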
            <para>Then a lookup in the ring figures out which storage nodes contain the partitions in
                question.</para>
            <para>The data is then sent to each storage node where it is placed in the appropriate
                partition. At least two of the three writes must be successful before the client is
                notified that the upload was successful.</para>
            <para>Next, the container database is updated asynchronously to reflect that there is a new
                object in it.</para>
            <figure>
                <title>Object Storage in use</title>
                <mediaobject>
                    <imageobject>
                        <imagedata fileref="../common/figures/objectstorage-usecase.png"/>
                    </imageobject>
                </mediaobject>
            </figure>
        </section>
        <section xml:id="section_swift-component-download">
            <title>Download</title>
            <para>A request comes in for an account/container/object. Using the same consistent hashing,
                the partition name is generated. A lookup in the ring reveals which storage nodes
                contain that partition. A request is made to one of the storage nodes to fetch the
                object and, if that fails, requests are made to the other nodes.</para>
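            <para>That failover can be pictured as trying each primary node for the partition in
                turn and returning the first successful response. The node list and helper below
                are hypothetical placeholders, not part of an actual Swift client:</para>
            <programlisting language="python">import http.client

# Hypothetical primary nodes for the object's partition (e.g. from a ring lookup).
NODES = ["storage-01.example.com",
         "storage-02.example.com",
         "storage-03.example.com"]

def fetch_object(path, token):
    """Try each storage node in turn; return the first successful body."""
    for host in NODES:
        try:
            conn = http.client.HTTPConnection(host, timeout=5)
            conn.request("GET", path, headers={"X-Auth-Token": token})
            resp = conn.getresponse()
            if resp.status == 200:
                return resp.read()
        except OSError:
            continue   # node unreachable; fall through to the next one
    raise RuntimeError("all replicas failed")

# fetch_object("/v1/AUTH_account/container/object", "AUTH_tk_example")</programlisting>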
        </section>
    </section>
</section>