Convert Object Storage files to RST
- Object Storage Intro to RST.
- Object Storage Features table to RST.
- Object Storage Components to RST.
- Added related figures.

Change-Id: I835a4387d64afa38705fbf8e67ff89f1d7f45a3f
Implements: blueprint reorganise-user-guides

New binary figure files (including):
  doc/admin-guide-cloud-rst/source/figures/objectstorage-ring.png
  doc/admin-guide-cloud-rst/source/figures/objectstorage-zones.png

@@ -8,13 +8,12 @@ Contents

.. toctree::
   :maxdepth: 2

   objectstorage_intro.rst
   objectstorage_features.rst
   objectstorage_characteristics.rst
   objectstorage_components.rst

.. TODO (karenb)
   objectstorage_ringbuilder.rst
   objectstorage_arch.rst
   objectstorage_replication.rst

doc/admin-guide-cloud-rst/source/objectstorage_components.rst (new file)
@@ -0,0 +1,283 @@

==========
Components
==========

The components that enable Object Storage to deliver high availability,
high durability, and high concurrency are:

- **Proxy servers.** Handle all of the incoming API requests.

- **Rings.** Map logical names of data to locations on particular
  disks.

- **Zones.** Isolate data from other zones. A failure in one zone
  does not impact the rest of the cluster because data is replicated
  across zones.

- **Accounts and containers.** Each account and container is an
  individual database that is distributed across the cluster. An
  account database contains the list of containers in that account. A
  container database contains the list of objects in that container.

- **Objects.** The data itself.

- **Partitions.** A partition stores objects, account databases, and
  container databases and helps manage locations where data lives in
  the cluster.

.. _objectstorage-building-blocks-figure:

**Object Storage building blocks**

.. figure:: figures/objectstorage-buildingblocks.png

Proxy servers
-------------

Proxy servers are the public face of Object Storage and handle all of
the incoming API requests. Once a proxy server receives a request, it
determines the storage node based on the object's URL, for example,
https://swift.example.com/v1/account/container/object. Proxy servers
also coordinate responses, handle failures, and coordinate timestamps.

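The URL carries everything the proxy needs for routing. As an
illustration only (the endpoint, account, container, object name, and
token below are placeholders, not values defined in this guide), a
client request to store an object might look like this:

.. code-block:: python

   # Hypothetical client request; endpoint and token are placeholders.
   import requests

   endpoint = "https://swift.example.com/v1/account"
   headers = {"X-Auth-Token": "AUTH_tk_example"}  # issued by the Identity service

   # PUT the object into an existing container; the proxy server maps
   # /account/container/object to the right partition and storage nodes.
   resp = requests.put(endpoint + "/container/object",
                       headers=headers,
                       data=b"hello object storage")
   print(resp.status_code)  # 201 on a successful upload
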
Proxy servers use a shared-nothing architecture and can be scaled as
needed based on projected workloads. A minimum of two proxy servers
should be deployed for redundancy. If one proxy server fails, the others
take over.

For more information concerning proxy server configuration, see the
`Configuration Reference
<http://docs.openstack.org/trunk/config-reference/content/proxy-server-configuration.html>`__.

Rings
-----

A ring represents a mapping between the names of entities stored on disk
and their physical locations. There are separate rings for accounts,
containers, and objects. When other components need to perform any
operation on an object, container, or account, they need to interact
with the appropriate ring to determine its location in the cluster.

The ring maintains this mapping using zones, devices, partitions, and
replicas. Each partition in the ring is replicated, by default, three
times across the cluster, and partition locations are stored in the
mapping maintained by the ring. The ring is also responsible for
determining which devices are used for handoff in failure scenarios.

Data can be isolated into zones in the ring. Each partition replica is
guaranteed to reside in a different zone. A zone could represent a
drive, a server, a cabinet, a switch, or even a data center.

The partitions of the ring are equally divided among all of the devices
in the Object Storage installation. When partitions need to be moved
around (for example, if a device is added to the cluster), the ring
ensures that a minimum number of partitions are moved at a time, and
only one replica of a partition is moved at a time.

You can use weights to balance the distribution of partitions on drives
across the cluster. This can be useful, for example, when differently
sized drives are used in a cluster.

The ring is used by the proxy server and several background processes
(like replication).

.. _objectstorage-ring-figure:

**The ring**

.. figure:: figures/objectstorage-ring.png

These rings are externally managed: the server processes themselves do
not modify the rings; they are instead given new rings modified by other
tools.

The ring uses a configurable number of bits from the MD5 hash of a path
as a partition index that designates a device. The number of bits kept
from the hash is known as the partition power, and 2 to the partition
power indicates the partition count. Partitioning the full MD5 hash ring
allows other parts of the cluster to work in batches of items at once,
which ends up being either more efficient or at least less complex than
working with each item separately or with the entire cluster all at
once.

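The following is a simplified sketch of that idea, not the actual ring
code (which also mixes a cluster-specific prefix and suffix into the
hash): the partition for a path is taken from the top bits of its MD5
hash.

.. code-block:: python

   # Simplified sketch: map a /account/container/object path to a
   # partition by keeping the top ``part_power`` bits of its MD5 hash.
   import hashlib
   import struct

   def get_partition(path, part_power):
       digest = hashlib.md5(path.encode("utf-8")).digest()
       part_shift = 32 - part_power
       return struct.unpack_from(">I", digest)[0] >> part_shift

   # With a partition power of 10 there are 2**10 = 1024 partitions.
   print(get_partition("/account/container/object", 10))
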
Another configurable value is the replica count, which indicates how
many of the partition-device assignments make up a single ring. For a
given partition number, each replica's device will not be in the same
zone as any other replica's device. Zones can be used to group devices
based on physical locations, power separations, network separations, or
any other attribute that reduces the chance of multiple replicas
becoming unavailable at the same time.

Zones
-----

Object Storage allows configuring zones in order to isolate failure
boundaries. Each data replica resides in a separate zone, if possible.
At the smallest level, a zone could be a single drive or a grouping of a
few drives. If there were five object storage servers, then each server
would represent its own zone. Larger deployments would have an entire
rack (or multiple racks) of object servers, each representing a zone.
The goal of zones is to allow the cluster to tolerate significant
outages of storage servers without losing all replicas of the data.

As mentioned earlier, everything in Object Storage is stored, by
default, three times. Swift will place each replica
"as-uniquely-as-possible" to ensure both high availability and high
durability. This means that when choosing a replica location, Object
Storage chooses a server in an unused zone before an unused server in a
zone that already has a replica of the data.

.. _objectstorage-zones-figure:

**Zones**

.. figure:: figures/objectstorage-zones.png

When a disk fails, replica data is automatically distributed to the
other zones to ensure there are three copies of the data.

Accounts and containers
-----------------------

Each account and container is an individual SQLite database that is
distributed across the cluster. An account database contains the list of
containers in that account. A container database contains the list of
objects in that container.

.. _objectstorage-accountscontainers-figure:

**Accounts and containers**

.. figure:: figures/objectstorage-accountscontainers.png

To keep track of object data locations, each account in the system has a
database that references all of its containers, and each container
database references each object.

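These listings are also what the API returns: a GET on an account URL
lists its containers, and a GET on a container URL lists its objects.
The endpoint and token in this sketch are placeholders.

.. code-block:: python

   # Hypothetical listing requests; endpoint and token are placeholders.
   import requests

   account_url = "https://swift.example.com/v1/account"
   headers = {"X-Auth-Token": "AUTH_tk_example"}

   # The account database backs this listing of containers.
   containers = requests.get(account_url, headers=headers).text.splitlines()

   # Each container database backs the listing of its objects.
   for container in containers:
       objects = requests.get(account_url + "/" + container,
                              headers=headers).text.splitlines()
       print(container, len(objects), "objects")
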
Partitions
----------

A partition is a collection of stored data, including account databases,
container databases, and objects. Partitions are core to the replication
system.

Think of a partition as a bin moving throughout a fulfillment center
warehouse. Individual orders get thrown into the bin. The system treats
that bin as a cohesive entity as it moves throughout the system. A bin
is easier to deal with than many little things. It makes for fewer
moving parts throughout the system.

System replicators and object uploads/downloads operate on partitions.
As the system scales up, its behavior continues to be predictable
because the number of partitions is a fixed number.

Implementing a partition is conceptually simple: a partition is just a
directory sitting on a disk with a corresponding hash table of what it
contains.

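As a purely illustrative sketch (the mount point, device name, and
directory scheme below are assumptions, not taken from this guide), the
directory holding one object replica could be derived like this:

.. code-block:: python

   # Illustrative only: build a directory path for one object replica.
   import hashlib
   import os

   def object_dir(device, partition, path):
       name_hash = hashlib.md5(path.encode("utf-8")).hexdigest()
       suffix = name_hash[-3:]  # small grouping directory inside the partition
       return os.path.join("/srv/node", device, "objects",
                           str(partition), suffix, name_hash)

   print(object_dir("sdb1", 417, "/account/container/object"))
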
.. _objectstorage-partitions-figure:

**Partitions**

.. figure:: figures/objectstorage-partitions.png

Replicators
-----------

In order to ensure that there are three copies of the data everywhere,
replicators continuously examine each partition. For each local
partition, the replicator compares it against the replicated copies in
the other zones to see if there are any differences.

The replicator knows if replication needs to take place by examining
hashes. A hash file is created for each partition, which contains hashes
of each directory in the partition. For a given partition, the hash
files for each of the partition's copies are compared. If the hashes are
different, then it is time to replicate, and the directory that needs to
be replicated is copied over.

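A minimal sketch of that comparison (the suffix names and hash values
below are made up for illustration) looks like this:

.. code-block:: python

   # Compare one partition's directory hashes against the hashes reported
   # by a copy of the partition in another zone.
   def dirs_to_sync(local_hashes, remote_hashes):
       return [suffix for suffix, value in local_hashes.items()
               if remote_hashes.get(suffix) != value]

   local = {"a1f": "3c9d", "b07": "91e2"}
   remote = {"a1f": "3c9d", "b07": "0000"}  # stale copy in another zone
   print(dirs_to_sync(local, remote))       # ['b07'] must be copied over
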
This is where partitions come in handy. With fewer things in the system,
larger chunks of data are transferred around (rather than lots of little
TCP connections, which is inefficient) and there is a consistent number
of hashes to compare.

The cluster eventually has a consistent behavior where the newest data
has priority.

.. _objectstorage-replication-figure:

**Replication**

.. figure:: figures/objectstorage-replication.png

If a zone goes down, one of the nodes containing a replica notices and
proactively copies data to a handoff location.

Use cases
---------

The following sections show use cases for object uploads and downloads
and introduce the components.

Upload
~~~~~~

A client uses the REST API to make an HTTP request to PUT an object into
an existing container. The cluster receives the request. First, the
system must figure out where the data is going to go. To do this, the
account name, container name, and object name are all used to determine
the partition where this object should live.

Then a lookup in the ring figures out which storage nodes contain the
partitions in question.

The data is then sent to each storage node where it is placed in the
appropriate partition. At least two of the three writes must be
successful before the client is notified that the upload was successful.

Next, the container database is updated asynchronously to reflect that
there is a new object in it.

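Condensed into a sketch, the write path looks like the following, where
``ring`` and ``send_to_node`` are hypothetical stand-ins for the
proxy-server internals rather than real Swift calls:

.. code-block:: python

   # Hypothetical outline of the upload path: find the partition, look up
   # its storage nodes, write to each, and succeed on a two-of-three quorum.
   def upload(ring, account, container, obj, data, send_to_node):
       path = "/%s/%s/%s" % (account, container, obj)
       partition = ring.get_partition(path)
       nodes = ring.get_nodes(partition)  # typically three storage nodes
       successes = sum(1 for node in nodes
                       if send_to_node(node, partition, path, data))
       return successes >= 2  # at least two writes must succeed
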
.. _objectstorage-usecase-figure:

**Object Storage in use**

.. figure:: figures/objectstorage-usecase.png

Download
~~~~~~~~

A request comes in for an account/container/object. Using the same
consistent hashing, the partition name is generated. A lookup in the
ring reveals which storage nodes contain that partition. A request is
made to one of the storage nodes to fetch the object and, if that fails,
requests are made to the other nodes.

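A sketch of that read path, again with hypothetical helpers rather than
real Swift calls, simply tries the nodes in turn:

.. code-block:: python

   # Hypothetical outline of the download path with failover between nodes.
   def download(ring, path, fetch_from_node):
       partition = ring.get_partition(path)
       for node in ring.get_nodes(partition):
           try:
               return fetch_from_node(node, partition, path)
           except IOError:
               continue  # try the next node holding a replica
       raise IOError("no replica of %s could be fetched" % path)
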
doc/admin-guide-cloud-rst/source/objectstorage_features.rst (new file)
@@ -0,0 +1,72 @@

=====================
Features and benefits
=====================

+-----------------------------+--------------------------------------------------+
| Features                    | Benefits                                         |
+=============================+==================================================+
| Leverages commodity         | No lock-in, lower price/GB.                      |
| hardware                    |                                                  |
+-----------------------------+--------------------------------------------------+
| HDD/node failure agnostic   | Self-healing, reliable, data redundancy protects |
|                             | from failures.                                   |
+-----------------------------+--------------------------------------------------+
| Unlimited storage           | Large and flat namespace, highly scalable        |
|                             | read/write access, able to serve content         |
|                             | directly from storage system.                    |
+-----------------------------+--------------------------------------------------+
| Multi-dimensional           | Scale-out architecture: Scale vertically and     |
| scalability                 | horizontally-distributed storage. Backs up       |
|                             | and archives large amounts of data with          |
|                             | linear performance.                              |
+-----------------------------+--------------------------------------------------+
| Account/container/object    | No nesting, not a traditional file system:       |
| structure                   | Optimized for scale, it scales to multiple       |
|                             | petabytes and billions of objects.               |
+-----------------------------+--------------------------------------------------+
| Built-in replication        | A configurable number of accounts, containers    |
| 3✕ + data redundancy        | and object copies for high availability.         |
| (compared with 2✕ on RAID)  |                                                  |
+-----------------------------+--------------------------------------------------+
| Easily add capacity (unlike | Elastic data scaling with ease                   |
| RAID resize)                |                                                  |
+-----------------------------+--------------------------------------------------+
| No central database         | Higher performance, no bottlenecks               |
+-----------------------------+--------------------------------------------------+
| RAID not required           | Handle many small, random reads and writes       |
|                             | efficiently                                      |
+-----------------------------+--------------------------------------------------+
| Built-in management         | Account management: Create, add, verify,         |
| utilities                   | and delete users; Container management: Upload,  |
|                             | download, and verify; Monitoring: Capacity,      |
|                             | host, network, log trawling, and cluster health. |
+-----------------------------+--------------------------------------------------+
| Drive auditing              | Detect drive failures preempting data corruption |
+-----------------------------+--------------------------------------------------+
| Expiring objects            | Users can set an expiration time or a TTL on an  |
|                             | object to control access                         |
+-----------------------------+--------------------------------------------------+
| Direct object access        | Enable direct browser access to content, such as |
|                             | for a control panel                              |
+-----------------------------+--------------------------------------------------+
| Realtime visibility into    | Know what users are requesting.                  |
| client requests             |                                                  |
+-----------------------------+--------------------------------------------------+
| Supports S3 API             | Utilize tools that were designed for the popular |
|                             | S3 API.                                          |
+-----------------------------+--------------------------------------------------+
| Restrict containers per     | Limit access to control usage by user.           |
| account                     |                                                  |
+-----------------------------+--------------------------------------------------+
| Support for NetApp,         | Unified support for block volumes using a        |
| Nexenta, SolidFire          | variety of storage systems.                      |
+-----------------------------+--------------------------------------------------+
| Snapshot and backup API for | Data protection and recovery for VM data.        |
| block volumes               |                                                  |
+-----------------------------+--------------------------------------------------+
| Standalone volume API       | Separate endpoint and API for integration with   |
| available                   | other compute systems.                           |
+-----------------------------+--------------------------------------------------+
| Integration with Compute    | Fully integrated with Compute for attaching      |
|                             | block volumes and reporting on usage.            |
+-----------------------------+--------------------------------------------------+

doc/admin-guide-cloud-rst/source/objectstorage_intro.rst (new file)
@@ -0,0 +1,23 @@

==============================
Introduction to Object Storage
==============================

OpenStack Object Storage (code-named swift) is open source software for
creating redundant, scalable data storage using clusters of standardized
servers to store petabytes of accessible data. It is a long-term storage
system for large amounts of static data that can be retrieved,
leveraged, and updated. Object Storage uses a distributed architecture
with no central point of control, providing greater scalability,
redundancy, and permanence. Objects are written to multiple hardware
devices, with the OpenStack software responsible for ensuring data
replication and integrity across the cluster. Storage clusters scale
horizontally by adding new nodes. Should a node fail, OpenStack works to
replicate its content from other active nodes. Because OpenStack uses
software logic to ensure data replication and distribution across
different devices, inexpensive commodity hard drives and servers can be
used in lieu of more expensive equipment.

Object Storage is ideal for cost-effective, scale-out storage. It
provides a fully distributed, API-accessible storage platform that can
be integrated directly into applications or used for backup, archiving,
and data retention.