Handle non-referenced and duplicated files

Admin Guide/Training Guides:
* section_rootwrap.xml was renamed to
  compute/section-compute-rootwrap.xml in change
  Ie300a9ce25d305b80bb0b21d3cfc318909f3a123. The old file is now an
  unused duplicate; remove it.
* section_object-storage-admin.xml and
  common/section_objectstorage_tenant-specific-image-storage.xml
  were not referenced anywhere, add them to ch_objectstorage.xml.
* The file common/section_objectstorage-account-reaper.xml has the
  comment: "Old module003-ch008-account-reaper edited, renamed, and
  stored in doc/common for use by both Cloud Admin and
  Operator Training Guides..." - do as the comment suggests.
  More files were moved the same way; update the Training Guides to
  use the new files and remove the duplicate files.
  Also, remove the comment from these files; we have git for history,
  so there is no need to duplicate the information.

Config Reference:
* Object Storage:
  - Add new "Container sync realms configuration" section and add tables.
  - Include file common/tables/swift-proxy-server-filter-container_sync.xml.
* Identity:
  - Include file common/tables/keystone-auth_token.xml.

Change-Id: I6c8f41a01815485904c48db0695deb7813634df1
Andreas Jaeger 2014-05-11 20:12:27 +02:00
parent 41f9ff0834
commit 6959cf0557
22 changed files with 38 additions and 1215 deletions

View File

@@ -11,6 +11,9 @@
<xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
<xi:include href="../common/section_objectstorage-arch.xml"/>
<xi:include href="../common/section_objectstorage-replication.xml"/>
<xi:include href="../common/section_objectstorage-account-reaper.xml"/>
<xi:include href="../common/section_objectstorage_tenant-specific-image-storage.xml"/>
<xi:include href="section_object-storage-monitoring.xml"/>
<xi:include href="section_object-storage-admin.xml"/>
<xi:include href="../common/section_objectstorage-troubleshoot.xml"/>
</chapter>

View File

@@ -1,122 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xml:id="root-wrap-reference"
xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
<title>Secure with root wrappers</title>
<para>The root wrapper enables the Compute
unprivileged user to run a number of actions as the root user
in the safest manner possible. Historically, Compute used a
specific <filename>sudoers</filename> file that listed every
command that the Compute user was allowed to run, and used
<command>sudo</command> to run that command as
<literal>root</literal>. However this was difficult to
maintain (the <filename>sudoers</filename> file was in
packaging), and did not enable complex filtering of parameters
(advanced filters). The rootwrap was designed to solve those
issues.</para>
<simplesect>
<title>How rootwrap works</title>
<para>Instead of calling <command>sudo make me a
sandwich</command>, Compute services that start with
nova- call <command>sudo nova-rootwrap
/etc/nova/rootwrap.conf make me a sandwich</command>.
A generic sudoers entry lets the Compute user run
nova-rootwrap as root. The nova-rootwrap code looks for
filter definition directories in its configuration file,
and loads command filters from them. Then it checks if the
command requested by Compute matches one of those filters,
in which case it executes the command (as root). If no
filter matches, it denies the request.</para>
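<para>For illustration only (the exact binary path varies by
distribution and is an assumption here), such a generic sudoers
entry could look like the following:</para>
<programlisting>nova ALL = (root) NOPASSWD: /usr/bin/nova-rootwrap /etc/nova/rootwrap.conf *</programlisting>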
</simplesect>
<simplesect>
<title>Security model</title>
<para>The escalation path is fully controlled by the root
user. A sudoers entry (owned by root) allows Compute to
run (as root) a specific rootwrap executable, and only
with a specific configuration file (which should be owned
by root). nova-rootwrap imports the Python modules it
needs from a cleaned (and system-default) PYTHONPATH. The
configuration file (also root-owned) points to root-owned
filter definition directories, which contain root-owned
filters definition files. This chain ensures that the
Compute user itself is not in control of the configuration
or modules used by the nova-rootwrap executable.</para>
</simplesect>
<simplesect>
<title>Details of rootwrap.conf</title>
<para>You configure nova-rootwrap in the
<filename>rootwrap.conf</filename> file. Because it's
in the trusted security path, it must be owned and
writable by only the root user. Its location is specified
both in the sudoers entry and in the
<filename>nova.conf</filename> configuration file with
the <code>rootwrap_config=</code> entry.</para>
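<para>For example, the corresponding <filename>nova.conf</filename>
setting might look like this (the path is only an example):</para>
<programlisting language="ini">[DEFAULT]
rootwrap_config=/etc/nova/rootwrap.conf</programlisting>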
<para>The <filename>rootwrap.conf</filename> file uses an INI file format with these sections and
parameters:</para>
<table rules="all" frame="border"
xml:id="rootwrap-conf-table-filter-path" width="100%">
<caption>rootwrap.conf configuration options</caption>
<col width="50%"/>
<col width="50%"/>
<thead>
<tr>
<td><para>Configuration option=Default
value</para></td>
<td><para>(Type) Description</para></td>
</tr>
</thead>
<tbody>
<tr>
<td><para>[DEFAULT]</para>
<para>filters_path=/etc/nova/rootwrap.d,/usr/share/nova/rootwrap
</para></td>
<td><para>(ListOpt) Comma-separated list of
directories containing filter definition
files. Defines where filters for root wrap
are stored. Directories defined on this
line should all exist, be owned and
writable only by the root
user.</para></td>
</tr>
</tbody>
</table>
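<para>As a minimal sketch, a <filename>rootwrap.conf</filename>
that sets only the option described above could look like
this:</para>
<programlisting language="ini">[DEFAULT]
filters_path=/etc/nova/rootwrap.d,/usr/share/nova/rootwrap</programlisting>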
</simplesect>
<simplesect>
<title>Details of .filters files</title>
<para>Filters definition files contain lists of filters that
nova-rootwrap will use to allow or deny a specific
command. They are generally suffixed by .filters. Since
they are in the trusted security path, they need to be
owned and writable only by the root user. Their location
is specified in the rootwrap.conf file.</para>
<para>It uses an INI file format with a [Filters] section and
several lines, each with a unique parameter name
(different for each filter that you define):</para>
<table rules="all" frame="border"
xml:id="rootwrap-conf-table-filter-name" width="100%">
<caption>.filters configuration options</caption>
<col width="50%"/>
<col width="50%"/>
<thead>
<tr>
<td><para>Configuration option=Default
value</para></td>
<td><para>(Type) Description</para></td>
</tr>
</thead>
<tbody>
<tr>
<td><para>[Filters]</para>
<para>filter_name=kpartx: CommandFilter,
/sbin/kpartx, root</para></td>
<td><para>(ListOpt) Comma-separated list
containing first the Filter class to use,
followed by that Filter's arguments (which
vary depending on the Filter class
selected).</para></td>
</tr>
</tbody>
</table>
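<para>As a minimal sketch, a <filename>.filters</filename> file
defining only the example filter from the table above could look
like this:</para>
<programlisting language="ini">[Filters]
kpartx: CommandFilter, /sbin/kpartx, root</programlisting>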
</simplesect>
</section>

View File

@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-account-reaper">
<!-- ... Old module003-ch008-account-reaper edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Account reaper</title>
<para>In the background, the account reaper removes data from the deleted accounts.</para>
<para>A reseller marks an account for deletion by issuing a <code>DELETE</code> request on the accounts
@@ -37,4 +36,4 @@
logged with the <code>reap_warn_after</code> value in the <code>[account-reaper]</code>
section of the account-server.conf file. The default value is 30
days.</para>
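<para>For illustration, the relevant settings in the
<filename>account-server.conf</filename> file could look like the
following; the values shown are examples only, and both options
take a number of seconds:</para>
<programlisting language="ini">[account-reaper]
# Example: delay the actual deletion of reaped account data by 7 days
delay_reaping = 604800
# Example: warn in the log if an account is not reaped within 30 days
reap_warn_after = 2592000</programlisting>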
</section>
</section>

View File

@@ -8,7 +8,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-cluster-architecture">
<!-- ... Old module003-ch007-swift-cluster-architecture edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Cluster architecture</title>
<section xml:id="section_access-tier">
<title>Access tier</title>

View File

@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="objectstorage_characteristics">
<!-- ... Old module003-ch003-obj-store-capabilities edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Object Storage characteristics</title>
<para>The key characteristics of Object Storage are that:</para>
<itemizedlist>
@@ -56,4 +55,4 @@
Ruby, and C#. Amazon S3 and RackSpace Cloud Files users should be very familiar with Object
Storage. Users new to object storage systems will have to adjust to a different approach and
mindset than those required for a traditional filesystem.</para>
</section>
</section>

View File

@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-components">
<!-- ... Old module003-ch004-swift-building-blocks edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Components</title>
<para>The components that enable Object Storage to deliver high availability, high
durability, and high concurrency are:</para>

View File

@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage_features">
<!-- ... Old module003-ch002-features-benefits edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Features and benefits</title>
<para>
<informaltable class="c19">

View File

@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-intro">
<!-- ... Old module003-ch001-intro-objstore edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Introduction to Object Storage</title>
<para>OpenStack Object Storage (code-named Swift) is open source software for creating
redundant, scalable data storage using clusters of standardized servers to store petabytes

View File

@@ -3,7 +3,6 @@
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="section_objectstorage-replication">
<!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Replication</title>
<para>Because each replica in Object Storage functions
independently and clients generally require only a simple

View File

@@ -3,7 +3,6 @@
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="section_objectstorage-ringbuilder">
<!-- ... Old module003-ch005-the-ring edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Ring-builder</title>
<para>Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized

View File

@@ -22,6 +22,7 @@ options. For installation prerequisites and step-by-step walkthroughs, see the
<xi:include href="../common/tables/keystone-api.xml"/>
<xi:include href="../common/tables/keystone-assignment.xml"/>
<xi:include href="../common/tables/keystone-auth.xml"/>
<xi:include href="../common/tables/keystone-auth_token.xml"/>
<xi:include href="../common/tables/keystone-cache.xml"/>
<xi:include href="../common/tables/keystone-catalog.xml"/>
<xi:include href="../common/tables/keystone-credential.xml"/>

View File

@@ -93,6 +93,23 @@
</section>
</section>
<section xml:id="container-sync-realms-configuration">
<title>Container sync realms configuration</title>
<para>Find an example container sync realms configuration at
<filename>etc/container-sync-realms.conf-sample</filename>
in the source code repository.</para>
<para>The available configuration options are:</para>
<xi:include
href="../common/tables/swift-container-sync-realms-DEFAULT.xml"/>
<xi:include
href="../common/tables/swift-container-sync-realms-realm1.xml"/>
<xi:include
href="../common/tables/swift-container-sync-realms-realm2.xml"/>
<section xml:id="container-sync-realms-conf">
<title>Sample container sync realms configuration file</title>
<programlisting language="ini"><xi:include parse="text" href="http://git.openstack.org/cgit/openstack/swift/plain/etc/container-sync-realms.conf-sample?h=stable/icehouse"/></programlisting>
</section>
</section>
<section xml:id="account-server-configuration">
<title>Account server configuration</title>
<para>Find an example account server configuration at
@@ -140,6 +157,8 @@
href="../common/tables/swift-proxy-server-filter-cache.xml"/>
<xi:include
href="../common/tables/swift-proxy-server-filter-catch_errors.xml"/>
<xi:include
href="../common/tables/swift-proxy-server-filter-container_sync.xml"/>
<xi:include
href="../common/tables/swift-proxy-server-filter-dlo.xml"/>
<xi:include

View File

@@ -12,15 +12,15 @@
</section>
<section xml:id="associate-intro-object-store">
<title>Introduction to Object Storage</title>
<xi:include href="./module003-ch001-intro-objstore.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch001-intro-objectstore']/*[not(self::db:title)])">
<xi:include href="../common/section_objectstorage-intro.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'section_objectstorage-intro']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="associate-object-store-features-benefits">
<title>Features and Benefits</title>
<xi:include href="./module003-ch002-features-benefits.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch002-features-benefits']/*[not(self::db:title)])">
<xi:include href="../common/section_objectstorage-features.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'section_objectstorage_features']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>

View File

@@ -12,42 +12,19 @@
</section>
<section xml:id="operator-intro-object-store">
<title>Review Associate Introduction to Object Storage</title>
<xi:include href="./module003-ch001-intro-objstore.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch001-intro-objectstore']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-object-store-features-benefits">
<title>Review Associate Features and Benefits</title>
<xi:include href="./module003-ch002-features-benefits.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch002-features-benefits']/*[not(self::db:title)])">
<xi:include href="../common/section_objectstorage-intro.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'section_objectstorage-intro']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<xi:include href="../common/section_objectstorage-features.xml"/>
<section xml:id="operator-object-store-node-administration-tasks">
<title>Review Associate Administration Tasks</title>
<para></para>
</section>
<section xml:id="operator-object-store-capabilities">
<title>Object Storage Capabilities</title>
<xi:include href="./module003-ch003-obj-store-capabilities.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch003-obj-store-capabilities']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-building-blocks">
<title>Object Storage Building Blocks</title>
<xi:include href="./module003-ch004-swift-building-blocks.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch004-swift-building-blocks']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-the-ring">
<title>Swift Ring Builder</title>
<xi:include href="./module003-ch005-the-ring.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch005-the-ring']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include></section>
<xi:include href="../common/section_objectstorage-characteristics.xml"/>
<xi:include href="../common/section_objectstorage-components.xml"/>
<xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
<section xml:id="operator-swift-more-concepts">
<title>More Swift Concepts</title>
<xi:include href="./module003-ch006-more-concepts.xml"
@@ -55,25 +32,7 @@
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-cluster-architecture">
<title>Swift Cluster Architecture</title>
<xi:include href="./module003-ch007-swift-cluster-architecture.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch007-cluster-architecture']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-account-reaper">
<title>Swift Account Reaper</title>
<xi:include href="./module003-ch008-account-reaper.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch008-account-reaper']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-replication">
<title>Swift Replication</title>
<xi:include href="./module003-ch009-replication.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch009-replication']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<xi:include href="../common/section_objectstorage-arch.xml"/>
<xi:include href="../common/section_objectstorage-account-reaper.xml"/>
<xi:include href="../common/section_objectstorage-replication.xml"/>
</chapter>

View File

@@ -1,32 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch001-intro-objectstore">
<title>Introduction to Object Storage</title>
<para>OpenStack Object Storage (code-named Swift) is open source
software for creating redundant, scalable data storage using
clusters of standardized servers to store petabytes of
accessible data. It is a long-term storage system for large
amounts of static data that can be retrieved, leveraged, and
updated. Object Storage uses a distributed architecture with
no central point of control, providing greater scalability,
redundancy and permanence. Objects are written to multiple
hardware devices, with the OpenStack software responsible for
ensuring data replication and integrity across the cluster.
Storage clusters scale horizontally by adding new nodes.
Should a node fail, OpenStack works to replicate its content
from other active nodes. Because OpenStack uses software logic
to ensure data replication and distribution across different
devices, inexpensive commodity hard drives and servers can be
used in lieu of more expensive equipment.</para>
<para>Object Storage is ideal for cost effective, scale-out
storage. It provides a fully distributed, API-accessible
storage platform that can be integrated directly into
applications or used for backup, archiving and data retention.
Block Storage allows block devices to be exposed and connected
to compute instances for expanded storage, better performance
and integration with enterprise storage platforms, such as
NetApp, Nexenta and SolidFire.</para>
</chapter>

View File

@@ -1,204 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch002-features-benefits">
<title>Features and Benefits</title>
<para>
<informaltable class="c19">
<tbody>
<tr>
<th rowspan="1" colspan="1">Features</th>
<th rowspan="1" colspan="1">Benefits</th>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Leverages commodity
hardware</emphasis></td>
<td rowspan="1" colspan="1"
>No
lock-in, lower
price/GB</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>HDD/node failure agnostic</emphasis></td>
<td rowspan="1" colspan="1"
>Self
healingReliability, data redundancy protecting
from
failures</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Unlimited storage</emphasis></td>
<td rowspan="1" colspan="1"
>Huge
&amp; flat namespace, highly scalable
read/write accessAbility to serve content
directly from storage
system</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Multi-dimensional scalability</emphasis>
(scale out architecture)Scale vertically and
horizontally-distributed storage</td>
<td rowspan="1" colspan="1"
>Backup
and archive large amounts of data with linear
performance</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Account/Container/Object
structure</emphasis>No nesting, not a
traditional file system</td>
<td rowspan="1" colspan="1"
>Optimized
for scaleScales to multiple petabytes,
billions of
objects</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Built-in replication3x+ data
redundancy</emphasis> compared to 2x on
RAID</td>
<td rowspan="1" colspan="1"
>Configurable
number of accounts, container and object
copies for high
availability</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Easily add capacity</emphasis> unlike
RAID resize</td>
<td rowspan="1" colspan="1"
>Elastic
data scaling with
ease</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>No central database</emphasis></td>
<td rowspan="1" colspan="1"
>Higher
performance, no
bottlenecks</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>RAID not required</emphasis></td>
<td rowspan="1" colspan="1"
>Handle
lots of small, random reads and writes
efficiently</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Built-in management
utilities</emphasis></td>
<td rowspan="1" colspan="1"
>Account
Management: Create, add, verify, delete
usersContainer Management: Upload, download,
verifyMonitoring: Capacity, host, network, log
trawling, cluster
health</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Drive auditing</emphasis></td>
<td rowspan="1" colspan="1"
>Detect
drive failures preempting data
corruption</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Expiring objects</emphasis></td>
<td rowspan="1" colspan="1"
>Users
can set an expiration time or a TTL on an
object to control
access</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Direct object access</emphasis></td>
<td rowspan="1" colspan="1"
>Enable
direct browser access to content, such as for
a control
panel</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Realtime visibility into client
requests</emphasis></td>
<td rowspan="1" colspan="1"
>Know
what users are
requesting</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Supports S3 API</emphasis></td>
<td rowspan="1" colspan="1"
>Utilize
tools that were designed for the popular S3
API</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Restrict containers per
account</emphasis></td>
<td rowspan="1" colspan="1"
>Limit
access to control usage by
user</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Support for NetApp, Nexenta,
SolidFire</emphasis></td>
<td rowspan="1" colspan="1"
>Unified
support for block volumes using a variety of
storage
systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Snapshot and backup API for block
volumes</emphasis></td>
<td rowspan="1" colspan="1"
>Data
protection and recovery for VM
data</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Standalone volume API
available</emphasis></td>
<td rowspan="1" colspan="1"
>Separate
endpoint and API for integration with other
compute
systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Integration with Compute</emphasis></td>
<td rowspan="1" colspan="1"
>Fully
integrated to Compute for attaching block
volumes and reporting on usage</td>
</tr>
</tbody>
</informaltable>
</para>
</chapter>

View File

@@ -1,100 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch003-obj-store-capabilities">
<title>Object Storage Capabilities</title>
<itemizedlist>
<listitem>
<para>OpenStack provides redundant, scalable object
storage using clusters of standardized servers capable
of storing petabytes of data</para>
</listitem>
<listitem>
<para>Object Storage is not a traditional file system, but
rather a distributed storage system for static data
such as virtual machine images, photo storage, email
storage, backups and archives. Having no central
"brain" or master point of control provides greater
scalability, redundancy and durability.</para>
</listitem>
<listitem>
<para>Objects and files are written to multiple disk
drives spread throughout servers in the data center,
with the OpenStack software responsible for ensuring
data replication and integrity across the
cluster.</para>
</listitem>
<listitem>
<para>Storage clusters scale horizontally simply by adding
new servers. Should a server or hard drive fail,
OpenStack replicates its content from other active
nodes to new locations in the cluster. Because
OpenStack uses software logic to ensure data
replication and distribution across different devices,
inexpensive commodity hard drives and servers can be
used in lieu of more expensive equipment.</para>
</listitem>
</itemizedlist>
<para><guilabel>Swift Characteristics</guilabel></para>
<para>The key characteristics of Swift include:</para>
<itemizedlist>
<listitem>
<para>All objects stored in Swift have a URL</para>
</listitem>
<listitem>
<para>All objects stored are replicated 3x in
as-unique-as-possible zones, which can be defined as a
group of drives, a node, a rack etc.</para>
</listitem>
<listitem>
<para>All objects have their own metadata</para>
</listitem>
<listitem>
<para>Developers interact with the object storage system
through a RESTful HTTP API</para>
</listitem>
<listitem>
<para>Object data can be located anywhere in the
cluster</para>
</listitem>
<listitem>
<para>The cluster scales by adding additional nodes --
without sacrificing performance, which allows a more
cost-effective linear storage expansion vs. fork-lift
upgrades</para>
</listitem>
<listitem>
<para>Data doesn't have to be migrated to an entirely new
storage system</para>
</listitem>
<listitem>
<para>New nodes can be added to the cluster without
downtime</para>
</listitem>
<listitem>
<para>Failed nodes and disks can be swapped out with no
downtime</para>
</listitem>
<listitem>
<para>Runs on industry-standard hardware, such as Dell,
HP, Supermicro etc.</para>
</listitem>
</itemizedlist>
<figure>
<title>Object Storage(Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image39.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Developers can either write directly to the Swift API or use
one of the many client libraries that exist for all popular
programming languages, such as Java, Python, Ruby and C#.
Amazon S3 and RackSpace Cloud Files users should feel very
familiar with Swift. For users who have not used an object
storage system before, it will require a different approach
and mindset than using a traditional filesystem.</para>
</chapter>

View File

@@ -1,295 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch004-swift-building-blocks">
<title>Building Blocks of Swift</title>
<para>The components that enable Swift to deliver high
availability, high durability and high concurrency
are:</para>
<itemizedlist>
<listitem>
<para><emphasis role="bold">Proxy
Servers:</emphasis>Handles all incoming API
requests.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Rings:</emphasis>Maps
logical names of data to locations on particular
disks.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Zones:</emphasis>Each Zone
isolates data from other Zones. A failure in one Zone
doesn't impact the rest of the cluster because data is
replicated across the Zones.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Accounts &amp;
Containers:</emphasis>Each Account and Container
are individual databases that are distributed across
the cluster. An Account database contains the list of
Containers in that Account. A Container database
contains the list of Objects in that Container</para>
</listitem>
<listitem>
<para><emphasis role="bold">Objects:</emphasis>The
data itself.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Partitions:</emphasis>A
Partition stores Objects, Account databases and
Container databases. It's an intermediate 'bucket'
that helps manage locations where data lives in the
cluster.</para>
</listitem>
</itemizedlist>
<figure>
<title>Building Blocks</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image40.png"/>
</imageobject>
</mediaobject>
</figure>
<para><guilabel>Proxy Servers</guilabel></para>
<para>The Proxy Servers are the public face of Swift and
handle all incoming API requests. Once a Proxy Server
receives a request, it determines the storage node
based on the URL of the object, such as <literal>
https://swift.example.com/v1/account/container/object
</literal>. The Proxy Servers also coordinate responses,
handle failures, and coordinate timestamps.</para>
<para>Proxy servers use a shared-nothing architecture and can
be scaled as needed based on projected workloads. A
minimum of two Proxy Servers should be deployed for
redundancy. Should one proxy server fail, the others will
take over.</para>
<para><guilabel>The Ring</guilabel></para>
<para>A ring represents a mapping between the names of entities
stored on disk and their physical location. There are separate
rings for accounts, containers, and objects. When other
components need to perform any operation on an object,
container, or account, they need to interact with the
appropriate ring to determine its location in the
cluster.</para>
<para>The Ring maintains this mapping using zones, devices,
partitions, and replicas. Each partition in the ring is
replicated, by default, 3 times across the cluster, and the
locations for a partition are stored in the mapping maintained
by the ring. The ring is also responsible for determining
which devices are used for hand off in failure
scenarios.</para>
<para>Data can be isolated with the concept of zones in the
ring. Each replica of a partition is guaranteed to reside
in a different zone. A zone could represent a drive, a
server, a cabinet, a switch, or even a data center.</para>
<para>The partitions of the ring are equally divided among all
the devices in the OpenStack Object Storage installation.
When partitions need to be moved around, such as when a
device is added to the cluster, the ring ensures that a
minimum number of partitions are moved at a time, and only
one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of
partitions on drives across the cluster. This can be
useful, for example, when different sized drives are used
in a cluster.</para>
<para>The ring is used by the Proxy server and several
background processes (like replication).</para>
<para>The Ring maps Partitions to physical locations on disk.
When other components need to perform any operation on an
object, container, or account, they need to interact with
the Ring to determine its location in the cluster.</para>
<para>The Ring maintains this mapping using zones, devices,
partitions, and replicas. Each partition in the Ring is
replicated three times by default across the cluster, and
the locations for a partition are stored in the mapping
maintained by the Ring. The Ring is also responsible for
determining which devices are used for handoff should a
failure occur.</para>
<figure>
<title>The Lord of the <emphasis role="bold"
>Ring</emphasis>s</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image41.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The Ring maps partitions to physical locations on
disk.</para>
<para>The rings determine where data should reside in the
cluster. There is a separate ring for account databases,
container databases, and individual objects but each ring
works in the same way. These rings are externally managed,
in that the server processes themselves do not modify the
rings, they are instead given new rings modified by other
tools.</para>
<para>The ring uses a configurable number of bits from a
path's MD5 hash as a partition index that designates a
device. The number of bits kept from the hash is known as
the partition power, and 2 to the partition power
indicates the partition count. Partitioning the full MD5
hash ring allows other parts of the cluster to work in
batches of items at once which ends up either more
efficient or at least less complex than working with each
item separately or the entire cluster all at once.</para>
<para>Another configurable value is the replica count, which
indicates how many of the partition-&gt;device assignments
comprise a single ring. For a given partition number, each
replica's device will not be in the same zone as any other
replica's device. Zones can be used to group devices based on
physical locations, power separations, network separations, or
any other attribute that would lessen multiple replicas being
unavailable at the same time.</para>
<para><guilabel>Zones: Failure Boundaries</guilabel></para>
<para>Swift allows zones to be configured to isolate
failure boundaries. Each replica of the data resides
in a separate zone, if possible. At the smallest
level, a zone could be a single drive or a grouping of
a few drives. If there were five object storage
servers, then each server would represent its own
zone. Larger deployments would have an entire rack (or
multiple racks) of object servers, each representing a
zone. The goal of zones is to allow the cluster to
tolerate significant outages of storage servers
without losing all replicas of the data.</para>
<para>As we learned earlier, everything in Swift is
stored, by default, three times. Swift will place each
replica "as-uniquely-as-possible" to ensure both high
availability and high durability. This means that when
choosing a replica location, Swift will choose a server
in an unused zone before an unused server in a zone
that already has a replica of the data.</para>
<figure>
<title>image33.png</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image42.png"/>
</imageobject>
</mediaobject>
</figure>
<para>When a disk fails, replica data is automatically
distributed to the other zones to ensure there are
three copies of the data</para>
<para><guilabel>Accounts &amp;
Containers</guilabel></para>
<para>Each account and container is an individual SQLite
database that is distributed across the cluster. An
account database contains the list of containers in
that account. A container database contains the list
of objects in that container.</para>
<figure>
<title>Accounts and Containers</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image43.png"/>
</imageobject>
</mediaobject>
</figure>
<para>To keep track of object data location, each account
in the system has a database that references all its
containers, and each container database references
each object</para>
<para><guilabel>Partitions</guilabel></para>
<para>A Partition is a collection of stored data,
including Account databases, Container databases, and
objects. Partitions are core to the replication
system.</para>
<para>Think of a Partition as a bin moving throughout a
fulfillment center warehouse. Individual orders get
thrown into the bin. The system treats that bin as a
cohesive entity as it moves throughout the system. A
bin full of things is easier to deal with than lots of
little things. It makes for fewer moving parts
throughout the system.</para>
<para>The system replicators and object uploads/downloads
operate on Partitions. As the system scales up,
behavior continues to be predictable as the number of
Partitions is a fixed number.</para>
<para>The implementation of a Partition is conceptually
simple -- a partition is just a directory sitting on a
disk with a corresponding hash table of what it
contains.</para>
<figure>
<title>Partitions</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image44.png"/>
</imageobject>
</mediaobject>
</figure>
<para>*Swift partitions contain all data in the
system.</para>
<para><guilabel>Replication</guilabel></para>
<para>In order to ensure that there are three copies of
the data everywhere, replicators continuously examine
each Partition. For each local Partition, the
replicator compares it against the replicated copies
in the other Zones to see if there are any
differences.</para>
<para>How does the replicator know if replication needs to
take place? It does this by examining hashes. A hash
file is created for each Partition, which contains
hashes of each directory in the Partition. Each of the
three hash files is compared. For a given Partition,
the hash files for each of the Partition's copies are
compared. If the hashes are different, then it is time
to replicate and the directory that needs to be
replicated is copied over.</para>
<para>This is where the Partitions come in handy. With
fewer "things" in the system, larger chunks of data
are transferred around (rather than lots of little TCP
connections, which is inefficient) and there are a
consistent number of hashes to compare.</para>
<para>The cluster has eventually consistent behavior where
the newest data wins.</para>
<figure>
<title>Replication</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image45.png"/>
</imageobject>
</mediaobject>
</figure>
<para>*If a zone goes down, one of the nodes containing a
replica notices and proactively copies data to a
handoff location.</para>
<para>To describe how these pieces all come together, let's walk
through a few scenarios and introduce the components.</para>
<para><guilabel>Bird-eye View</guilabel></para>
<para><emphasis role="bold">Upload</emphasis></para>
<para>A client uses the REST API to make a HTTP request to PUT
an object into an existing Container. The cluster receives
the request. First, the system must figure out where the
data is going to go. To do this, the Account name,
Container name and Object name are all used to determine
the Partition where this object should live.</para>
<para>Then a lookup in the Ring figures out which storage
nodes contain the Partitions in question.</para>
<para>The data then is sent to each storage node where it is
placed in the appropriate Partition. A quorum is required
-- at least two of the three writes must be successful
before the client is notified that the upload was
successful.</para>
<para>Next, the Container database is updated asynchronously
to reflect that there is a new object in it.</para>
<figure>
<title>When End-User uses Swift</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image46.png"/>
</imageobject>
</mediaobject>
</figure>
<para><emphasis role="bold">Download</emphasis></para>
<para>A request comes in for an Account/Container/object.
Using the same consistent hashing, the Partition name is
generated. A lookup in the Ring reveals which storage
nodes contain that Partition. A request is made to one of
the storage nodes to fetch the object and if that fails,
requests are made to the other nodes.</para>
</chapter>

View File

@@ -1,146 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch005-the-ring">
<title>Ring Builder</title>
<para>The rings are built and managed manually by a utility called
the ring-builder. The ring-builder assigns partitions to
devices and writes an optimized Python structure to a gzipped,
serialized file on disk for shipping out to the servers. The
server processes just check the modification time of the file
occasionally and reload their in-memory copies of the ring
structure as needed. Because of how the ring-builder manages
changes to the ring, using a slightly older ring usually just
means one of the three replicas for a subset of the partitions
will be incorrect, which can be easily worked around.</para>
<para>The ring-builder also keeps its own builder file with the
ring information and additional data required to build future
rings. It is very important to keep multiple backup copies of
these builder files. One option is to copy the builder files
out to every server while copying the ring files themselves.
Another is to upload the builder files into the cluster
itself. Complete loss of a builder file will mean creating a
new ring from scratch, nearly all partitions will end up
assigned to different devices, and therefore nearly all data
stored will have to be replicated to new locations. So,
recovery from a builder file loss is possible, but data will
definitely be unreachable for an extended time.</para>
<para><guilabel>Ring Data Structure</guilabel></para>
<para>The ring data structure consists of three top level
fields: a list of devices in the cluster, a list of lists
of device ids indicating partition to device assignments,
and an integer indicating the number of bits to shift an
MD5 hash to calculate the partition for the hash.</para>
<para><guilabel>Partition Assignment
List</guilabel></para>
<para>This is a list of array(H) of devices ids. The
outermost list contains an array(H) for each
replica. Each array(H) has a length equal to the
partition count for the ring. Each integer in the
array(H) is an index into the above list of devices.
The partition list is known internally to the Ring
class as _replica2part2dev_id.</para>
<para>So, to create a list of device dictionaries assigned
to a partition, the Python code would look like:
devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]</para>
<para>That code is a little simplistic, as it does not
account for the removal of duplicate devices. If a
ring has more replicas than devices, then a partition
will have more than one replica on one device; that's
simply the pigeonhole principle at work.</para>
<para>array(H) is used for memory conservation as there
may be millions of partitions.</para>
<para><guilabel>Fractional Replicas</guilabel></para>
<para>A ring is not restricted to having an integer number
of replicas. In order to support the gradual changing
of replica counts, the ring is able to have a real
number of replicas.</para>
<para>When the number of replicas is not an integer, then
the last element of _replica2part2dev_id will have a
length that is less than the partition count for the
ring. This means that some partitions will have more
replicas than others. For example, if a ring has 3.25
replicas, then 25% of its partitions will have four
replicas, while the remaining 75% will have just
three.</para>
<para><guilabel>Partition Shift Value</guilabel></para>
<para>The partition shift value is known internally to the
Ring class as _part_shift. This value is used to shift an
MD5 hash to calculate the partition on which the data
for that hash should reside. Only the top four bytes
of the hash are used in this process. For example, to
compute the partition for the path
/account/container/object the Python code might look
like: partition = unpack_from('&gt;I',
md5('/account/container/object').digest())[0] &gt;&gt;
self._part_shift</para>
<para>For a ring generated with part_power P, the
partition shift value is 32 - P.</para>
<para><guilabel>Building the Ring</guilabel></para>
<para>The initial building of the ring first calculates the
number of partitions that should ideally be assigned to
each device based on the device's weight. For example, given
a partition power of 20, the ring will have 1,048,576
partitions. If there are 1,000 devices of equal weight
they will each desire 1,048.576 partitions. The devices
are then sorted by the number of partitions they desire
and kept in order throughout the initialization
process.</para>
<para>Note: each device is also assigned a random tiebreaker
value that is used when two devices desire the same number
of partitions. This tiebreaker is not stored on disk
anywhere, and so two different rings created with the same
parameters will have different partition assignments. For
repeatable partition assignments, RingBuilder.rebalance()
takes an optional seed value that will be used to seed
Python's pseudo-random number generator.</para>
<para>Then, the ring builder assigns each replica of each
partition to the device that desires the most partitions
at that point while keeping it as far away as possible
from other replicas. The ring builder prefers to assign a
replica to a device in a region that has no replicas
already; should there be no such region available, the
ring builder will try to find a device in a different
zone; if not possible, it will look on a different server;
failing that, it will just look for a device that has no
replicas; finally, if all other options are exhausted, the
ring builder will assign the replica to the device that
has the fewest replicas already assigned. Note that
assignment of multiple replicas to one device will only
happen if the ring has fewer devices than it has
replicas.</para>
<para>When building a new ring based on an old ring, the
desired number of partitions each device wants is
recalculated. Next the partitions to be reassigned are
gathered up. Any removed devices have all their assigned
partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new
devices) can be spread out for better durability are
unassigned and added to the gathered list. Any devices
that have more partitions than they now desire have random
partitions unassigned from them and added to the gathered
list. Lastly, the gathered partitions are then reassigned
to devices using a similar method as in the initial
assignment described above.</para>
<para>Whenever a partition has a replica reassigned, the time
of the reassignment is recorded. This is taken into
account when gathering partitions to reassign so that no
partition is moved twice in a configurable amount of time.
This configurable amount of time is known internally to
the RingBuilder class as min_part_hours. This restriction
is ignored for replicas of partitions on devices that have
been removed, as removing a device only happens on device
failure and theres no choice but to make a
reassignment.</para>
<para>The above processes don't always perfectly rebalance a
ring due to the random nature of gathering partitions for
reassignment. To help reach a more balanced ring, the
rebalance process is repeated until near perfect (less than 1%
off) or when the balance doesn't improve by at least 1%
(indicating we probably can't get perfect balance due to
wildly imbalanced zones or too many partitions recently
moved).</para>
</chapter>

View File

@@ -1,93 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE section [
<!ENTITY % openstack SYSTEM "../common/entities/openstack.ent">
%openstack;
]>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch007-cluster-architecture">
<title>Cluster architecture</title>
<para><guilabel>Access Tier</guilabel></para>
<figure>
<title>Object Storage cluster architecture</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image47.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Large-scale deployments segment off an "Access Tier".
This tier is the “Grand Central” of the Object Storage
system. It fields incoming API requests from clients and
moves data in and out of the system. This tier is composed
of front-end load balancers, SSL terminators,
authentication services, and it runs the (distributed)
brain of the Object Storage system — the proxy server
processes.</para>
<para>Having the access servers in their own tier enables
read/write access to be scaled out independently of
storage capacity. For example, if the cluster is on the
public Internet and requires SSL-termination and has high
demand for data access, many access servers can be
provisioned. However, if the cluster is on a private
network and it is being used primarily for archival
purposes, fewer access servers are needed.</para>
<para>A load balancer can be incorporated into the access tier,
because this is an HTTP addressable storage service.</para>
<para>Typically, this tier comprises a collection of 1U
servers. These machines use a moderate amount of RAM and
are network I/O intensive. It is wise to provision them with
two high-throughput (10GbE) interfaces, because these systems
field each incoming API request. One interface is
used for 'front-end' incoming requests and the other for
'back-end' access to the Object Storage nodes to put and
fetch data.</para>
<para><guilabel>Factors to consider</guilabel></para>
<para>For most publicly facing deployments as well as
private deployments available across a wide-reaching
corporate network, SSL is used to encrypt traffic
to the client. SSL adds significant processing load to
establish sessions between clients; it adds more capacity to
the access layer that will need to be provisioned. SSL may
not be required for private deployments on trusted
networks.</para>
<para><guilabel>Storage Nodes</guilabel></para>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image48.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The next component is the storage servers themselves.
Generally, most configurations should provide each of the
five Zones with an equal amount of storage capacity.
Storage nodes use a reasonable amount of memory and CPU.
Metadata needs to be readily available to quickly return
objects. The object stores run services not only to field
incoming requests from the Access Tier, but to also run
replicators, auditors, and reapers. Object stores can be
provisioned with a single gigabit or a 10-gigabit network
interface depending on expected workload and desired
performance.</para>
<para>Currently, a 2&nbsp;TB or 3&nbsp;TB SATA disk delivers
good performance for the price. Desktop-grade drives can
be used where there are responsive remote hands in the
datacenter, and enterprise-grade drives can be used where
this is not the case.</para>
<para><guilabel>Factors to Consider</guilabel></para>
<para>Desired I/O performance for single-threaded requests
should be kept in mind. This system does not use RAID,
so each request for an object is handled by a single
disk. Disk performance impacts single-threaded
response rates.</para>
<para>To achieve apparent higher throughput, the object
storage system is designed with concurrent
uploads/downloads in mind. The network I/O capacity
(1GbE, bonded 1GbE pair, or 10GbE) should match your
desired concurrent throughput needs for reads and
writes.</para>
</chapter>

View File

@@ -1,58 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch008-account-reaper">
<title>Account Reaper</title>
<para>The Account Reaper removes data from deleted accounts in the
background.</para>
<para>An account is marked for deletion by a reseller issuing a
DELETE request on the account's storage URL. This simply puts
the value DELETED into the status column of the account_stat
table in the account database (and replicas), indicating the
data for the account should be deleted later.</para>
<para>There is normally no set retention time and no undelete; it
is assumed the reseller will implement such features and only
call DELETE on the account once it is truly desired the
account's data be removed. However, in order to protect the
Swift cluster accounts from an improper or mistaken delete
request, you can set a delay_reaping value in the
[account-reaper] section of the account-server.conf to delay
the actual deletion of data. At this time, there is no utility
to undelete an account; one would have to update the account
database replicas directly, setting the status column to an
empty string and updating the put_timestamp to be greater than
the delete_timestamp. (On the TODO list is writing a utility
to perform this task, preferably through a ReST call.)</para>
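<para>A minimal sketch of such a setting in the [account-reaper]
section of account-server.conf; the one-week value shown here is
purely illustrative (the option takes seconds, and the default is
no delay):</para>
<programlisting language="ini">[account-reaper]
# Wait 7 days (604800 seconds) after an account is marked DELETED
# before actually removing its data.
delay_reaping = 604800</programlisting>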
<para>The account reaper runs on each account server and scans the
server occasionally for account databases marked for deletion.
It will only trigger on accounts that server is the primary
node for, so that multiple account servers aren't all trying
to do the same work at the same time. Using multiple servers
to delete one account might improve deletion speed, but
requires coordination so they aren't duplicating efforts. Speed
really isn't as much of a concern with data deletion and large
accounts aren't deleted that often.
<para>The deletion process for an account itself is fairly
straightforward. For each container in the account, each
object is deleted and then the container is deleted. Any
deletion requests that fail won't stop the overall process,
but will keep the account from being fully reaped in that
pass (for example, if an object delete times out, the
container won't be able to be deleted later and therefore the
account won't be deleted either). The overall process continues
even on a failure so that it doesn't get hung up reclaiming cluster
space because of one troublesome spot. The account reaper will
keep trying to delete an account until it eventually becomes
empty, at which point the database reclaim process within the
db_replicator will eventually remove the database
files.</para>
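<para>The following self-contained Python sketch mirrors that
ordering against an in-memory account structure. It is only an
illustration of the flow described above, not Swift's
implementation, and every name in it is hypothetical:</para>
<programlisting language="python">def reap_account(store, account):
    """Delete every object, then each container, then the account.

    ``store`` models an account as a dict:
    {account: {container: set(object_names)}}. Illustrative only.
    """
    account_clean = True
    for container in list(store[account]):
        container_clean = True
        for obj in list(store[account][container]):
            try:
                store[account][container].remove(obj)   # "delete" one object
            except KeyError:
                container_clean = False                  # retried on a later pass
        if container_clean and not store[account][container]:
            del store[account][container]                # delete the container
        else:
            account_clean = False
    if account_clean and not store[account]:
        del store[account]                               # the account is reaped


store = {'AUTH_test': {'photos': {'a.jpg', 'b.jpg'}, 'logs': {'x.log'}}}
reap_account(store, 'AUTH_test')
print(store)   # {} -- everything under the account is gone</programlisting>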
<para>Sometimes a persistent error state can prevent some object
or container from being deleted. If this happens, you will see
a message such as "Account &lt;name&gt; has not been reaped
since &lt;date&gt;" in the log. You can control when this is
logged with the reap_warn_after value in the [account-reaper]
section of the account-server.conf file. By default this is 30
days.</para>
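<para>For example, an explicit setting equivalent to the default
described above would look like this (the value is in seconds;
2592000 seconds is 30 days):</para>
<programlisting language="ini">[account-reaper]
# Warn in the log when an account marked for deletion has not been
# reaped after this many seconds.
reap_warn_after = 2592000</programlisting>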
</chapter>

View File

@@ -1,101 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch009-replication">
<title>Replication</title>
<para>Because each replica in Swift functions independently, and
clients generally require only a simple majority of nodes
responding to consider an operation successful, transient
failures like network partitions can quickly cause replicas to
diverge. These differences are eventually reconciled by
asynchronous, peer-to-peer replicator processes. The
replicator processes traverse their local filesystems,
concurrently performing operations in a manner that balances
load across physical disks.</para>
<para>Replication uses a push model, with records and files
generally only being copied from local to remote replicas.
This is important because data on the node may not belong
there (as in the case of handoffs and ring changes), and a
replicator can't know what data exists elsewhere in the
cluster that it should pull in. It's the duty of any node that
contains data to ensure that data gets to where it belongs.
Replica placement is handled by the ring.</para>
<para>Every deleted record or file in the system is marked by a
tombstone, so that deletions can be replicated alongside
creations. The replication process cleans up tombstones after
a time period known as the consistency window. The consistency
window encompasses replication duration and how long transient
failure can remove a node from the cluster. Tombstone cleanup
must be tied to replication to reach replica
convergence.</para>
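<para>In Swift, the length of the consistency window is typically
governed by a reclaim_age setting on the replicators. The section
name and value shown below are assumptions based on a typical
object-server.conf and should be checked against the release in
use; the default is commonly one week:</para>
<programlisting language="ini">[object-replicator]
# Tombstones older than this many seconds are removed during
# replication. 604800 seconds = 7 days.
reclaim_age = 604800</programlisting>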
<para>If a replicator detects that a remote drive has failed, the
replicator uses the get_more_nodes interface for the ring to
choose an alternate node with which to synchronize. The
replicator can maintain desired levels of replication in the
face of disk failures, though some replicas may not be in an
immediately usable location. Note that the replicator doesn't
maintain desired levels of replication when other failures occur,
such as entire node failures, because most failures are
transient.</para>
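<para>A minimal sketch of querying the ring for handoff nodes,
assuming the swift Python package is installed and a standard
object ring exists at /etc/swift/object.ring.gz (account,
container, and object names are illustrative):</para>
<programlisting language="python">from swift.common.ring import Ring

# Load the object ring and look up the primary nodes for one object.
ring = Ring('/etc/swift/object.ring.gz')
part, primary_nodes = ring.get_nodes('AUTH_test', 'photos', 'a.jpg')

# get_more_nodes() walks the ring for handoff nodes in the same
# partition; a replicator can push to one of these while a primary
# device is unavailable.
for handoff in ring.get_more_nodes(part):
    print(handoff['ip'], handoff['port'], handoff['device'])
    break</programlisting>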
<para>Replication is an area of active development, and likely
rife with potential improvements to speed and
accuracy.</para>
<para>There are two major classes of replicator: the db
replicator, which replicates accounts and containers, and the
object replicator, which replicates object data.</para>
<para><guilabel>DB Replication</guilabel></para>
<para>The first step performed by db replication is a low-cost
hash comparison to determine whether two replicas already
match. Under normal operation, this check is able to
verify that most databases in the system are already
synchronized very quickly. If the hashes differ, the
replicator brings the databases in sync by sharing records
added since the last sync point.</para>
<para>This sync point is a high water mark noting the last
record at which two databases were known to be in sync,
and is stored in each database as a tuple of the remote
database id and record id. Database ids are unique amongst
all replicas of the database, and record ids are
monotonically increasing integers. After all new records
have been pushed to the remote database, the entire sync
table of the local database is pushed, so the remote
database can guarantee that it is in sync with everything
with which the local database has previously
synchronized.</para>
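<para>The following self-contained sketch illustrates the hash
check and the incremental push from the last sync point. It is not
Swift's code; the data structures are simplified stand-ins:</para>
<programlisting language="python">import hashlib

def db_hash(records):
    """Cheap fingerprint of a replica's records (illustrative only)."""
    return hashlib.md5(repr(sorted(records)).encode()).hexdigest()

def replicate_db(local, remote, sync_points, local_id):
    """Push records added since the last sync point if the hashes differ.

    ``local`` and ``remote`` are lists of (record_id, row) tuples, and
    ``sync_points`` maps a database id to the last record id pushed;
    record ids are monotonically increasing integers.
    """
    if db_hash(local) == db_hash(remote):
        return                                  # already in sync (common case)
    last_sync = sync_points.get(local_id, -1)
    for record_id, row in sorted(local):
        if record_id > last_sync:
            remote.append((record_id, row))     # push the missing record
            sync_points[local_id] = record_id   # advance the high-water mark

local = [(1, 'obj-a'), (2, 'obj-b'), (3, 'obj-c')]
remote = [(1, 'obj-a')]
points = {'local-db': 1}
replicate_db(local, remote, points, 'local-db')
print(remote)   # all three records are now on the remote replica
print(points)   # {'local-db': 3}</programlisting>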
<para>If a replica is found to be missing entirely, the whole
local database file is transmitted to the peer using
rsync(1) and vested with a new unique id.</para>
<para>In practice, DB replication can process hundreds of
databases per concurrency setting per second (up to the
number of available CPUs or disks) and is bound by the
number of DB transactions that must be performed.</para>
<para><guilabel>Object Replication</guilabel></para>
<para>The initial implementation of object replication simply
performed an rsync to push data from a local partition to
all remote servers it was expected to exist on. While this
performed adequately at small scale, replication times
skyrocketed once directory structures could no longer be
held in RAM. We now use a modification of this scheme in
which a hash of the contents for each suffix directory is
saved to a per-partition hashes file. The hash for a
suffix directory is invalidated when the contents of that
suffix directory are modified.</para>
<para>The object replication process reads in these hash
files, calculating any invalidated hashes. It then
transmits the hashes to each remote server that should
hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After
pushing files to the remote server, the replication
process notifies it to recalculate hashes for the rsynced
suffix directories.</para>
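<para>A simplified, self-contained sketch of the suffix-hash
comparison; the real process hashes on-disk suffix directories and
transfers them with rsync, whereas this illustration works on
dictionaries:</para>
<programlisting language="python">import hashlib

def suffix_hashes(partition):
    """Hash each suffix directory's contents.

    ``partition`` is modelled as {suffix: {object_name: contents}}.
    """
    return {suffix: hashlib.md5(repr(sorted(objs.items())).encode()).hexdigest()
            for suffix, objs in partition.items()}

def replicate_partition(local, remote):
    """Push only the suffix directories whose hashes differ."""
    remote_hashes = suffix_hashes(remote)
    for suffix, digest in suffix_hashes(local).items():
        if remote_hashes.get(suffix) != digest:
            remote[suffix] = dict(local[suffix])    # stands in for an rsync

local = {'3a1': {'a.jpg': 'v2'}, '7f0': {'b.jpg': 'v1'}}
remote = {'3a1': {'a.jpg': 'v1'}, '7f0': {'b.jpg': 'v1'}}
replicate_partition(local, remote)
print(remote['3a1'])   # {'a.jpg': 'v2'} -- only the changed suffix moved</programlisting>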
<para>Performance of object replication is generally bound by
the number of uncached directories it has to traverse,
usually as a result of invalidated suffix directory
hashes. Using write volume and partition counts from our
running systems, it was designed so that around 2% of the
hash space on a normal node will be invalidated per day,
which has experimentally given us acceptable replication
speeds.</para>
</chapter>