<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
         xml:id="ch_running-openstack-object-storage">
<title>System Administration for OpenStack Object Storage</title>
<para>By understanding the concepts inherent to the Object Storage system, you can better monitor and administer your storage solution.</para>
<section xml:id="understanding-how-object-storage-works">
<title>Understanding How Object Storage Works</title>
<para>This section offers a brief overview of each concept in administering Object Storage.</para>
<simplesect xml:id="the-ring"><title>The Ring</title>
<para>A ring represents a mapping between the names of entities stored on disk and their physical locations. There are separate rings for accounts, containers, and objects. When other components need to perform any operation on an object, container, or account, they interact with the appropriate ring to determine its location in the cluster.</para>
<para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the ring is replicated, by default, three times across the cluster, and the locations for a partition are stored in the mapping maintained by the ring. The ring is also responsible for determining which devices are used as handoffs in failure scenarios.</para>
<para>Data can be isolated with the concept of zones in the ring. Each replica of a partition is guaranteed to reside in a different zone. A zone could represent a drive, a server, a cabinet, a switch, or even a datacenter.</para>
<para>The partitions of the ring are equally divided among all the devices in the OpenStack Object Storage installation. When partitions need to be moved around (for example, if a device is added to the cluster), the ring ensures that a minimum number of partitions are moved at a time, and only one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of partitions on drives across the cluster. This can be useful, for example, when drives of different sizes are used in a cluster.</para>
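The effect of weights can be sketched in a few lines. This is an illustrative model, not Swift's actual ring-builder code; the device names and weights are made up:

```python
def desired_partitions(weights, total_partitions):
    """Split partitions among devices in proportion to their weights."""
    total_weight = sum(weights.values())
    return {dev: round(total_partitions * w / total_weight)
            for dev, w in weights.items()}

# A drive weighted twice as heavily as its peers receives
# roughly twice as many partitions.
counts = desired_partitions({"sdb1": 100.0, "sdc1": 100.0, "sdd1": 200.0}, 2**16)
```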
<para>The ring is used by the proxy server and several background processes (such as replication).</para></simplesect>
<simplesect><title>Proxy Server</title>
<para>The Proxy Server is responsible for tying together the rest of the OpenStack Object Storage architecture. For each request, it looks up the location of the account, container, or object in the appropriate ring and routes the request accordingly. The public API is also exposed through the Proxy Server.</para>
<para>A large number of failures are also handled by the Proxy Server. For example, if a server is unavailable for an object PUT, the proxy asks the ring for a handoff server and routes the request there instead.</para>
<para>When objects are streamed to or from an object server, they are streamed directly through the proxy server to or from the user; the proxy server does not spool them.</para>
<para>You can use a proxy server with account management enabled by configuring it in the proxy server configuration file.</para></simplesect>
<simplesect xml:id="object-server"><title>Object Server</title>
<para>The Object Server is a very simple blob storage server that can store, retrieve, and delete objects stored on local devices. Objects are stored as binary files on the filesystem with metadata stored in the file's extended attributes (xattrs). This requires that the underlying filesystem for object servers support xattrs on files. Some filesystems, such as ext3, have xattrs turned off by default.</para>
<para>Each object is stored using a path derived from the object name's hash and the operation's timestamp. The last write always wins, ensuring that the latest object version is served. A deletion is also treated as a version of the file (a zero-byte file ending with ".ts", which stands for tombstone). This ensures that deleted files are replicated correctly and older versions do not magically reappear due to failure scenarios.</para>
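The "last write wins" scheme can be illustrated with a short sketch. This is not the actual Object Server code; the filenames are hypothetical, but the idea matches the description above: each write lands in a file named after its timestamp, and a deletion is just a newer file with a .ts suffix:

```python
import hashlib

def object_dir(name):
    """Derive a storage directory from the hash of the object name."""
    return hashlib.md5(name.encode()).hexdigest()

def latest(files):
    """The file with the newest timestamp wins; a .ts file means deleted."""
    newest = max(files, key=lambda f: float(f.rsplit(".", 1)[0]))
    return None if newest.endswith(".ts") else newest

# Two data versions followed by a deletion: the tombstone wins,
# so the object reads as deleted even after replication.
state = latest(["1322098243.01.data", "1322098250.87.data", "1322098300.00.ts"])
```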
</simplesect>
<simplesect xml:id="container-server"><title>Container Server</title>
<para>The Container Server's primary job is to handle listings of objects. It does not know where those objects are, just which objects are in a specific container. The listings are stored as SQLite database files and replicated across the cluster similarly to objects. Statistics are also tracked, including the total number of objects and the total storage usage for that container.</para>
</simplesect>
<simplesect><title>Account Server</title>
<para>The Account Server is very similar to the Container Server, except that it is responsible for listings of containers rather than objects.</para>
</simplesect>
<simplesect xml:id="swift-replication"><title>Replication</title>
<para>Replication is designed to keep the system in a consistent state in the face of temporary error conditions such as network outages or drive failures.</para>
<para>The replication processes compare local data with each remote copy to ensure they all contain the latest version. Object replication uses a hash list to quickly compare subsections of each partition, while container and account replication use a combination of hashes and shared high water marks.</para>
<para>Replication updates are push based. For object replication, updating is just a matter of rsyncing files to the peer. Account and container replication push missing records over HTTP or rsync whole database files.</para>
<para>The replicator also ensures that data is removed from the system. When an item (object, container, or account) is deleted, a tombstone is set as the latest version of the item. The replicator sees the tombstone and ensures that the item is removed from the entire system.</para></simplesect>
<simplesect xml:id="updaters"><title>Updaters</title>
<para>There are times when container or account data cannot be immediately updated, usually during failure scenarios or periods of high load. If an update fails, it is queued locally on the filesystem, and the updater processes the failed updates later. This is where an eventual consistency window will most likely come into play. For example, suppose a container server is under load and a new object is put into the system. The object is immediately available for reads as soon as the proxy server responds to the client with success. However, because the container server did not update the object listing, the update is queued for later. Container listings, therefore, may not immediately contain the object.</para>
<para>In practice, the consistency window is only as large as the frequency at which the updater runs, and it may not even be noticed because the proxy server routes listing requests to the first container server that responds. The server under load may not be the one that serves subsequent listing requests; one of the other two replicas may handle the listing.</para>
</simplesect>
<simplesect xml:id="auditors"><title>Auditors</title>
<para>Auditors crawl the local server checking the integrity of objects, containers, and accounts. If corruption is found (in the case of bit rot, for example), the file is quarantined, and replication replaces the bad file from another replica. If other errors are found, they are logged (for example, an object's listing cannot be found on any container server where it should be).</para></simplesect>
</section>
<section xml:id="configuring-and-tuning-openstack-object-storage">
<title>Configuring and Tuning OpenStack Object Storage</title>
<para>This section walks through deployment options and considerations.</para>
<para>You have multiple deployment options to choose from. The swift services run completely autonomously, which provides a lot of flexibility when designing the hardware deployment for swift. The four main services are:</para>
<itemizedlist>
<listitem><para>Proxy Services</para></listitem>
<listitem><para>Object Services</para></listitem>
<listitem><para>Container Services</para></listitem>
<listitem><para>Account Services</para></listitem>
</itemizedlist>
<para>The Proxy Services are more CPU and network I/O intensive. If you are using 10 Gbps networking to the proxy, or are terminating SSL traffic at the proxy, greater CPU power is required.</para>
<para>The Object, Container, and Account Services (Storage Services) are more disk and network I/O intensive.</para>
<para>The easiest deployment is to install all services on each server. There is nothing wrong with doing this, as it scales each service out horizontally.</para>
<para>At Rackspace, we put the Proxy Services on their own servers and all of the Storage Services on the same server. This allows us to send 10 Gbps networking to the proxies and 1 Gbps to the storage servers, and keeps load balancing to the proxies more manageable. Storage Services scale out horizontally as storage servers are added, and overall API throughput can be scaled by adding more proxies.</para>
<para>If you need more throughput from either Account or Container Services, they may each be deployed on their own servers. For example, you might use faster (but more expensive) SAS or even SSD drives to get faster disk I/O for the databases.</para>
<para>Load balancing and network design are left as an exercise to the reader, but they are a very important part of the cluster, so time should be spent designing the network for a Swift cluster.</para>
</section>
<section xml:id="preparing-the-ring">
<title>Preparing the Ring</title>
<para>The first step is to determine the number of partitions that will be in the ring. We recommend a minimum of 100 partitions per drive to ensure even distribution across the drives. A good starting point is to figure out the maximum number of drives the cluster will contain, multiply by 100, and then round up to the nearest power of two.</para>
<para>For example, imagine we are building a cluster that will have no more than 5,000 drives. That would mean a total of 500,000 partitions, which is pretty close to 2^19, rounded up.</para>
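The sizing rule above is a one-liner to compute. A minimal sketch of the arithmetic (100 partitions per drive, rounded up to a power of two):

```python
import math

def part_power(max_drives, partitions_per_drive=100):
    """Smallest power of two covering partitions_per_drive per drive."""
    return math.ceil(math.log2(max_drives * partitions_per_drive))

# 5,000 drives -> 500,000 partitions -> round up to 2^19.
power = part_power(5000)
```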
<para>It is also a good idea to keep the number of partitions relatively small. The more partitions there are, the more work has to be done by the replicators and other backend jobs, and the more memory the rings consume in process. The goal is to find a good balance between small rings and maximum cluster size.</para>
<para>The next step is to determine the number of replicas of the data to store. Currently it is recommended to use 3, as this is the only value that has been tested. The higher the number, the more storage is used, but the less likely you are to lose data.</para>
<para>It is also important to determine how many zones the cluster should have. It is recommended to start with a minimum of five zones. You can start with fewer, but our testing has shown that having at least five zones is optimal when failures occur. We also recommend configuring the zones at as high a level as possible to create as much isolation as possible. Things to take into consideration include physical location, power availability, and network connectivity. For example, in a small cluster you might decide to split the zones up by cabinet, with each cabinet having its own power and network connectivity. The zone concept is very abstract, so feel free to use it in whatever way best isolates your data from failure. Zones are referenced by number, beginning with 1.</para>
<para>You can now start building the ring with:</para>
<literallayout>swift-ring-builder &lt;builder_file&gt; create &lt;part_power&gt; &lt;replicas&gt; &lt;min_part_hours&gt;</literallayout>
<para>This starts the ring build process, creating the &lt;builder_file&gt; with 2^&lt;part_power&gt; partitions. &lt;min_part_hours&gt; is the time in hours before a specific partition can be moved in succession (24 is a good value for this).</para>
<para>Devices can be added to the ring with:</para>
<literallayout>swift-ring-builder &lt;builder_file&gt; add z&lt;zone&gt;-&lt;ip&gt;:&lt;port&gt;/&lt;device_name&gt;_&lt;meta&gt; &lt;weight&gt;</literallayout>
<para>This adds a device to the ring, where &lt;builder_file&gt; is the name of the builder file that was created previously, &lt;zone&gt; is the number of the zone this device is in, &lt;ip&gt; is the IP address of the server the device is in, &lt;port&gt; is the port number that the server is running on, &lt;device_name&gt; is the name of the device on the server (for example: sdb1), &lt;meta&gt; is an optional string of metadata for the device, and &lt;weight&gt; is a float weight that determines how many partitions are put on the device relative to the rest of the devices in the cluster (a good starting point is 100.0 x TB on the drive). Add each device that will initially be in the cluster.</para>
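As a hypothetical illustration of the weight rule of thumb (100.0 x TB) and the device-argument format described above, a small helper can assemble the argument string. All values here (zone, IP, port, device name) are made up for illustration:

```python
def add_argument(zone, ip, port, device, size_tb, meta=""):
    """Build the device argument for swift-ring-builder's add command."""
    weight = 100.0 * size_tb  # rule of thumb: 100.0 per terabyte
    return "z%d-%s:%d/%s_%s %s" % (zone, ip, port, device, meta, weight)

# A 2 TB drive gets weight 200.0.
arg = add_argument(1, "10.0.0.1", 6000, "sdb1", 2.0)
```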
<para>Once all of the devices are added to the ring, run:</para>
<literallayout>swift-ring-builder &lt;builder_file&gt; rebalance</literallayout>
<para>This distributes the partitions across the drives in the ring. Whenever you make changes to the ring, it is important to make all the required changes before running rebalance. This ensures that the ring stays as balanced as possible and that as few partitions as possible are moved.</para>
<para>The above process should be done to make a ring for each storage service (Account, Container, and Object). The builder files will be needed for future changes to the ring, so it is very important that they be kept and backed up. The resulting .ring.gz ring files should be pushed to all of the servers in the cluster. For more information about building rings, running swift-ring-builder with no options displays help text with available commands and options.</para>
</section>
<section xml:id="server-configuration-reference">
<title>Server Configuration Reference</title>
<para>Swift uses paste.deploy to manage server configurations. Default configuration options are set in the <code>[DEFAULT]</code> section, and any options specified there can be overridden in any of the other sections.</para>
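As a hypothetical illustration of this layering (the option names are examples, not a complete configuration), a value set in [DEFAULT] applies to every section unless a section sets its own:

```ini
[DEFAULT]
# Applies to every section below unless overridden there.
log_level = INFO
user = swift

[object-server]
# Overrides the [DEFAULT] value for this section only.
log_level = DEBUG
```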
<section xml:id="object-server-configuration">
<title>Object Server Configuration</title>
<para>An example Object Server configuration can be found at etc/object-server.conf-sample in the source code repository.</para>
<para>The following configuration options are available:</para>
<table rules="all">
<caption>object-server.conf Default Options in the [DEFAULT] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>swift_dir</td>
<td>/etc/swift</td>
<td>Swift configuration directory</td>
</tr>
<tr>
<td>devices</td>
<td>/srv/node</td>
<td>Parent directory of where devices are mounted</td>
</tr>
<tr>
<td>mount_check</td>
<td>true</td>
<td>Whether or not to check if the devices are mounted, to prevent accidentally writing to the root device</td>
</tr>
<tr>
<td>bind_ip</td>
<td>0.0.0.0</td>
<td>IP address for the server to bind to</td>
</tr>
<tr>
<td>bind_port</td>
<td>6000</td>
<td>Port for the server to bind to</td>
</tr>
<tr>
<td>workers</td>
<td>1</td>
<td>Number of workers to fork</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>object-server.conf Server Options in the [object-server] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>use</td>
<td> </td>
<td>paste.deploy entry point for the object server. For most cases, this should be <code>egg:swift#object</code>.</td>
</tr>
<tr>
<td>log_name</td>
<td>object-server</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>log_requests</td>
<td>True</td>
<td>Whether or not to log each request</td>
</tr>
<tr>
<td>user</td>
<td>swift</td>
<td>User to run as</td>
</tr>
<tr>
<td>node_timeout</td>
<td>3</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
<tr>
<td>network_chunk_size</td>
<td>65536</td>
<td>Size of chunks to read/write over the network</td>
</tr>
<tr>
<td>disk_chunk_size</td>
<td>65536</td>
<td>Size of chunks to read/write to disk</td>
</tr>
<tr>
<td>max_upload_time</td>
<td>86400</td>
<td>Maximum time allowed to upload an object</td>
</tr>
<tr>
<td>slow</td>
<td>0</td>
<td>If greater than 0, the minimum time in seconds for a PUT or DELETE request to complete</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>object-server.conf Replicator Options in the [object-replicator] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>object-replicator</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>daemonize</td>
<td>yes</td>
<td>Whether or not to run replication as a daemon</td>
</tr>
<tr>
<td>run_pause</td>
<td>30</td>
<td>Time in seconds to wait between replication passes</td>
</tr>
<tr>
<td>concurrency</td>
<td>1</td>
<td>Number of replication workers to spawn</td>
</tr>
<tr>
<td>timeout</td>
<td>5</td>
<td>Timeout value passed to the rsync --timeout and --contimeout options</td>
</tr>
<tr>
<td>stats_interval</td>
<td>3600</td>
<td>Interval in seconds between logging replication statistics</td>
</tr>
<tr>
<td>reclaim_age</td>
<td>604800</td>
<td>Time elapsed in seconds before an object can be reclaimed</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>object-server.conf Updater Options in the [object-updater] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>object-updater</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>interval</td>
<td>300</td>
<td>Minimum time for a pass to take</td>
</tr>
<tr>
<td>concurrency</td>
<td>1</td>
<td>Number of updater workers to spawn</td>
</tr>
<tr>
<td>node_timeout</td>
<td>10</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
<tr>
<td>slowdown</td>
<td>0.01</td>
<td>Time in seconds to wait between objects</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>object-server.conf Auditor Options in the [object-auditor] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>object-auditor</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>files_per_second</td>
<td>20</td>
<td>Maximum files audited per second. Should be tuned according to individual system specs. 0 is unlimited.</td>
</tr>
<tr>
<td>bytes_per_second</td>
<td>10000000</td>
<td>Maximum bytes audited per second. Should be tuned according to individual system specs. 0 is unlimited.</td>
</tr>
</tbody>
</table>
</section>
<section xml:id="container-server-configuration">
<title>Container Server Configuration</title>
<para>An example Container Server configuration can be found at etc/container-server.conf-sample in the source code repository.</para>
<para>The following configuration options are available:</para>
<table rules="all">
<caption>container-server.conf Default Options in the [DEFAULT] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>swift_dir</td>
<td>/etc/swift</td>
<td>Swift configuration directory</td>
</tr>
<tr>
<td>devices</td>
<td>/srv/node</td>
<td>Parent directory of where devices are mounted</td>
</tr>
<tr>
<td>mount_check</td>
<td>true</td>
<td>Whether or not to check if the devices are mounted, to prevent accidentally writing to the root device</td>
</tr>
<tr>
<td>bind_ip</td>
<td>0.0.0.0</td>
<td>IP address for the server to bind to</td>
</tr>
<tr>
<td>bind_port</td>
<td>6001</td>
<td>Port for the server to bind to</td>
</tr>
<tr>
<td>workers</td>
<td>1</td>
<td>Number of workers to fork</td>
</tr>
<tr>
<td>user</td>
<td>swift</td>
<td>User to run as</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>container-server.conf Server Options in the [container-server] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>use</td>
<td> </td>
<td>paste.deploy entry point for the container server. For most cases, this should be <code>egg:swift#container</code>.</td>
</tr>
<tr>
<td>log_name</td>
<td>container-server</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>node_timeout</td>
<td>3</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>container-server.conf Replicator Options in the [container-replicator] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>container-replicator</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>per_diff</td>
<td>1000</td>
<td> </td>
</tr>
<tr>
<td>concurrency</td>
<td>8</td>
<td>Number of replication workers to spawn</td>
</tr>
<tr>
<td>run_pause</td>
<td>30</td>
<td>Time in seconds to wait between replication passes</td>
</tr>
<tr>
<td>node_timeout</td>
<td>10</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
<tr>
<td>reclaim_age</td>
<td>604800</td>
<td>Time elapsed in seconds before a container can be reclaimed</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>container-server.conf Updater Options in the [container-updater] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>container-updater</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>interval</td>
<td>300</td>
<td>Minimum time for a pass to take</td>
</tr>
<tr>
<td>concurrency</td>
<td>4</td>
<td>Number of updater workers to spawn</td>
</tr>
<tr>
<td>node_timeout</td>
<td>3</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
<tr>
<td>slowdown</td>
<td>0.01</td>
<td>Time in seconds to wait between containers</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>container-server.conf Auditor Options in the [container-auditor] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>container-auditor</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>interval</td>
<td>1800</td>
<td>Minimum time for a pass to take</td>
</tr>
</tbody>
</table>
</section>
<section xml:id="account-server-configuration">
<title>Account Server Configuration</title>
<para>An example Account Server configuration can be found at etc/account-server.conf-sample in the source code repository.</para>
<para>The following configuration options are available:</para>
<table rules="all">
<caption>account-server.conf Default Options in the [DEFAULT] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>swift_dir</td>
<td>/etc/swift</td>
<td>Swift configuration directory</td>
</tr>
<tr>
<td>devices</td>
<td>/srv/node</td>
<td>Parent directory of where devices are mounted</td>
</tr>
<tr>
<td>mount_check</td>
<td>true</td>
<td>Whether or not to check if the devices are mounted, to prevent accidentally writing to the root device</td>
</tr>
<tr>
<td>bind_ip</td>
<td>0.0.0.0</td>
<td>IP address for the server to bind to</td>
</tr>
<tr>
<td>bind_port</td>
<td>6002</td>
<td>Port for the server to bind to</td>
</tr>
<tr>
<td>workers</td>
<td>1</td>
<td>Number of workers to fork</td>
</tr>
<tr>
<td>user</td>
<td>swift</td>
<td>User to run as</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>account-server.conf Server Options in the [account-server] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>use</td>
<td> </td>
<td>paste.deploy entry point for the account server. For most cases, this should be <code>egg:swift#account</code>.</td>
</tr>
<tr>
<td>log_name</td>
<td>account-server</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>account-server.conf Replicator Options in the [account-replicator] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>account-replicator</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>per_diff</td>
<td>1000</td>
<td> </td>
</tr>
<tr>
<td>concurrency</td>
<td>8</td>
<td>Number of replication workers to spawn</td>
</tr>
<tr>
<td>run_pause</td>
<td>30</td>
<td>Time in seconds to wait between replication passes</td>
</tr>
<tr>
<td>node_timeout</td>
<td>10</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
<tr>
<td>reclaim_age</td>
<td>604800</td>
<td>Time elapsed in seconds before an account can be reclaimed</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>account-server.conf Auditor Options in the [account-auditor] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>account-auditor</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>interval</td>
<td>1800</td>
<td>Minimum time for a pass to take</td>
</tr>
</tbody>
</table>
|
||
<table rules="all">
<caption>account-server.conf Reaper Options in the [account-reaper] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>log_name</td>
<td>account-reaper</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Logging level</td>
</tr>
<tr>
<td>concurrency</td>
<td>25</td>
<td>Number of reaper workers to spawn</td>
</tr>
<tr>
<td>interval</td>
<td>3600</td>
<td>Minimum time for a pass to take</td>
</tr>
<tr>
<td>node_timeout</td>
<td>10</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
</tbody>
</table>
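For convenience, the daemon options from the tables above can be written into account-server.conf as a fragment like the following (the values shown are simply the defaults from the tables, made explicit):

```ini
[account-auditor]
log_name = account-auditor
log_facility = LOG_LOCAL0
log_level = INFO
interval = 1800

[account-reaper]
log_name = account-reaper
concurrency = 25
interval = 3600
node_timeout = 10
conn_timeout = 0.5
```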
</section>
<section xml:id="proxy-server-configuration">
<title>Proxy Server Configuration</title>
<para>An example Proxy Server configuration can be found at etc/proxy-server.conf-sample
in the source code repository.</para>
<para>The following configuration options are available:</para>
<table rules="all">
<caption>proxy-server.conf Default Options in the [DEFAULT] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>bind_ip</td>
<td>0.0.0.0</td>
<td>IP Address for server to bind to</td>
</tr>
<tr>
<td>bind_port</td>
<td>80</td>
<td>Port for server to bind to</td>
</tr>
<tr>
<td>swift_dir</td>
<td>/etc/swift</td>
<td>Swift configuration directory</td>
</tr>
<tr>
<td>workers</td>
<td>1</td>
<td>Number of workers to fork</td>
</tr>
<tr>
<td>user</td>
<td>swift</td>
<td>User to run as</td>
</tr>
<tr>
<td>cert_file</td>
<td> </td>
<td>Path to the ssl .crt</td>
</tr>
<tr>
<td>key_file</td>
<td> </td>
<td>Path to the ssl .key</td>
</tr>
</tbody>
</table>
<table rules="all">
<caption>proxy-server.conf Server Options in the [proxy-server] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>use</td>
<td> </td>
<td>Entry point for paste.deploy for the proxy server. For most cases, this
should be <code>egg:swift#proxy</code>.</td>
</tr>
<tr>
<td>log_name</td>
<td>proxy-server</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Log level</td>
</tr>
<tr>
<td>log_headers</td>
<td>True</td>
<td>If True, log headers in each request</td>
</tr>
<tr>
<td>recheck_account_existence</td>
<td>60</td>
<td>Cache timeout in seconds for account existence checks stored in memcached</td>
</tr>
<tr>
<td>recheck_container_existence</td>
<td>60</td>
<td>Cache timeout in seconds for container existence checks stored in memcached</td>
</tr>
<tr>
<td>object_chunk_size</td>
<td>65536</td>
<td>Chunk size to read from object servers</td>
</tr>
<tr>
<td>client_chunk_size</td>
<td>65536</td>
<td>Chunk size to read from clients</td>
</tr>
<tr>
<td>memcache_servers</td>
<td>127.0.0.1:11211</td>
<td>Comma-separated list of memcached servers as ip:port</td>
</tr>
<tr>
<td>node_timeout</td>
<td>10</td>
<td>Request timeout to external services</td>
</tr>
<tr>
<td>client_timeout</td>
<td>60</td>
<td>Timeout to read one chunk from a client</td>
</tr>
<tr>
<td>conn_timeout</td>
<td>0.5</td>
<td>Connection timeout to external services</td>
</tr>
<tr>
<td>error_suppression_interval</td>
<td>60</td>
<td>Time in seconds that must elapse since the last error for a node to be
considered no longer error limited</td>
</tr>
<tr>
<td>error_suppression_limit</td>
<td>10</td>
<td>Error count to consider a node error limited</td>
</tr>
<tr>
<td>allow_account_management</td>
<td>false</td>
<td>Whether account PUT and DELETE requests are allowed at all</td>
</tr>
</tbody>
</table>

<table rules="all">
<caption>proxy-server.conf Paste.deploy Options in the [filter:swauth] section</caption>
<tbody>
<tr>
<td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr>
<td>use</td>
<td> </td>
<td>Entry point for paste.deploy to use for auth. Set to
<code>egg:swauth#swauth</code> to use Swauth, downloaded from
https://github.com/gholt/swauth</td>
</tr>
<tr>
<td>log_name</td>
<td>auth-server</td>
<td>Label used when logging</td>
</tr>
<tr>
<td>log_facility</td>
<td>LOG_LOCAL0</td>
<td>Syslog log facility</td>
</tr>
<tr>
<td>log_level</td>
<td>INFO</td>
<td>Log level</td>
</tr>
<tr>
<td>log_headers</td>
<td>True</td>
<td>If True, log headers in each request</td>
</tr>
<tr>
<td>reseller_prefix</td>
<td>AUTH</td>
<td>The naming scope for the auth service. Swift storage accounts and auth
tokens will begin with this prefix.</td>
</tr>
<tr>
<td>auth_prefix</td>
<td>/auth/</td>
<td>The HTTP request path prefix for the auth service. Swift itself reserves
anything beginning with the letter <code>v</code>.</td>
</tr>
<tr>
<td>default_swift_cluster</td>
<td>local#http://127.0.0.1:8080/v1</td>
<td>The default Swift cluster to place newly created accounts on; only
needed if you are using Swauth for authentication.</td>
</tr>
<tr>
<td>token_life</td>
<td>86400</td>
<td>The number of seconds a token is valid.</td>
</tr>
<tr>
<td>node_timeout</td>
<td>10</td>
<td>Request timeout</td>
</tr>
<tr>
<td>super_admin_key</td>
<td>None</td>
<td>The key for the .super_admin account.</td>
</tr>
</tbody>
</table>
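Pulling the tables above together, a minimal proxy-server.conf sketch might look like the following. The pipeline line and the cache and healthcheck filter sections are assumptions about a typical deployment and are not described by the tables above; the conf-sample file in the source tree remains the authoritative reference:

```ini
[DEFAULT]
bind_ip = 0.0.0.0
bind_port = 80
workers = 16
user = swift

[pipeline:main]
pipeline = healthcheck cache swauth proxy-server

[app:proxy-server]
use = egg:swift#proxy
log_facility = LOG_LOCAL0
allow_account_management = true

[filter:swauth]
use = egg:swauth#swauth
reseller_prefix = AUTH
super_admin_key = CHANGEME

[filter:cache]
use = egg:swift#memcache
memcache_servers = 127.0.0.1:11211

[filter:healthcheck]
use = egg:swift#healthcheck
```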
</section>
</section>
<section xml:id="considerations-and-tuning">
<title>Considerations and Tuning</title>
<para>Fine-tuning your deployment and installation may take some time and effort. Here are some considerations for improving performance of an OpenStack Object Storage installation.</para>
<section xml:id="memcached-considerations">
<title>Memcached Considerations</title>
<para>Several of the services rely on Memcached for caching certain types of
lookups, such as auth tokens and container/account existence. Swift does
not cache any actual object data. Memcached should be able to run
on any servers that have available RAM and CPU. At Rackspace, we run
Memcached on the proxy servers. The <code>memcache_servers</code> config option
in <code>proxy-server.conf</code> should contain all memcached servers.</para>
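For example, in a cluster with three proxy nodes each running a memcached instance, every proxy would list all three instances. One common placement for this option is the cache filter section of proxy-server.conf (the IP addresses below are placeholders):

```ini
[filter:cache]
use = egg:swift#memcache
memcache_servers = 10.0.0.1:11211,10.0.0.2:11211,10.0.0.3:11211
```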
</section>
<section xml:id="system-time">
<title>System Time</title>
<para>Time may be relative, but it is relatively important for Swift! Swift uses
timestamps to determine which is the most recent version of an object.
It is very important for the system time on each server in the cluster to
be synced as closely as possible (more so for the proxy server, but in general
it is a good idea for all the servers). At Rackspace, we use NTP with a local
NTP server to ensure that the system times are as close as possible. This
should also be monitored to ensure that the times do not vary too much.</para>
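As a sketch, a minimal /etc/ntp.conf pointing every node at a pair of local time sources might look like the following (the hostnames are placeholders for your own internal NTP servers):

```
server ntp1.example.internal iburst
server ntp2.example.internal iburst
driftfile /var/lib/ntp/ntp.drift
```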
</section>
<section xml:id="general-service-tuning">
<title>General Service Tuning</title>
<para>Most services support either a worker or concurrency value in their settings.
This allows the services to make effective use of the cores available. A good
starting point is to set the concurrency level for the proxy and storage services
to 2 times the number of cores available. If more than one service is
sharing a server, then some experimentation may be needed to find the best
balance.</para>
<para>At Rackspace, our Proxy servers have dual quad core processors, giving us 8
cores. Our testing has shown 16 workers to be a pretty good balance when
saturating a 10g network and gives good CPU utilization.</para>
<para>Our Storage servers all run together on the same servers. These servers have
dual quad core processors, for 8 cores total. We run the Account, Container,
and Object servers with 8 workers each. Most of the background jobs are run
at a concurrency of 1, with the exception of the replicators, which are run at
a concurrency of 2.</para>
<para>The above configuration settings should be taken as suggestions; test your own
configuration settings to ensure the best utilization of CPU,
network connectivity, and disk I/O.</para>
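The 2-times-cores starting point is easy to compute on each host; a small sketch (the doubling factor is just the rule of thumb above, not a hard requirement):

```python
import multiprocessing

def suggested_workers(factor=2):
    """Return a starting worker count: factor times the available cores."""
    return factor * multiprocessing.cpu_count()

# A dual quad-core proxy (8 cores) would start at 16 workers,
# matching the Rackspace configuration described above.
```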
</section>
<section xml:id="filesystem-considerations">
<title>Filesystem Considerations</title>
<para>Swift is designed to be mostly filesystem agnostic–the only requirement
being that the filesystem supports extended attributes (xattrs). After
thorough testing with our use cases and hardware configurations, XFS was
the best all-around choice. If you decide to use a filesystem other than
XFS, we highly recommend thorough testing.</para>
<para>If you are using XFS, there are some settings that can dramatically impact
performance. We recommend the following when creating the XFS
partition:</para>
<para><code>mkfs.xfs -i size=1024 -f /dev/sda1</code></para>
<para>Setting the inode size is important, as XFS stores xattr data in the inode.
If the metadata is too large to fit in the inode, a new extent is created,
which can cause quite a performance problem. Upping the inode size to 1024
bytes provides enough room to write the default metadata, plus a little
headroom. We do not recommend running Swift on RAID, but if you are using
RAID it is also important to make sure that the proper sunit and swidth
settings get set so that XFS can make the most efficient use of the RAID array.</para>
<para>We also recommend the following example mount options when using XFS:</para>
<para><code>mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /srv/node/sda</code>
</para>
<para>For a standard Swift install, all data drives are mounted directly under /srv/node
(as can be seen in the above example of mounting /dev/sda1 as /srv/node/sda). If you
choose to mount the drives in another directory, be sure to set the
<code>devices</code> config option in all of the server configs to point to the
correct directory.</para>
</section>
<section xml:id="general-system-tuning">
<title>General System Tuning</title>
<para>Rackspace currently runs Swift on Ubuntu Server 10.04, and the following
changes have been found to be useful for our use cases.</para>
<para>The following settings should be in <code>/etc/sysctl.conf</code>:</para>
<literallayout>
# disable TIME_WAIT.. wait..
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1

# disable syn cookies
net.ipv4.tcp_syncookies = 0

# double amount of allowed conntrack
net.ipv4.netfilter.ip_conntrack_max = 262144
</literallayout>
<para>To load the updated sysctl settings, run <code>sudo sysctl -p</code>.</para>
<para>A note about changing the TIME_WAIT values: by default, the OS will hold
a port open for 60 seconds to ensure that any remaining packets can be
received. During high usage, and with the number of connections that are
created, it is easy to run out of ports. We can change this since we are
in control of the network. If you are not in control of the network, or
do not expect high loads, then you may not want to adjust those values.</para>
</section>
<section xml:id="logging-considerations">
<title>Logging Considerations</title>
<para>Swift is set up to log directly to syslog. Every service can be configured with
the <code>log_facility</code> option to set the syslog log facility destination. We
recommend using syslog-ng to route the logs to specific log files locally on the
server and also to remote log collecting servers.</para>
</section>
<section xml:id="working-with-rings">
<title>Working with Rings</title>
<para>The rings determine where data should reside in the cluster. There is a
separate ring for account databases, container databases, and individual
objects, but each ring works in the same way. These rings are externally
managed, in that the server processes themselves do not modify the rings; they
are instead given new rings modified by other tools.</para>
<para>The ring uses a configurable number of bits from a path's MD5 hash as a
partition index that designates a device. The number of bits kept from the hash
is known as the partition power, and 2 to the partition power indicates the
partition count. Partitioning the full MD5 hash ring allows other parts of the
cluster to work in batches of items at once, which ends up either more efficient
or at least less complex than working with each item separately or the entire
cluster all at once.</para>
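A quick illustration of the partition-power arithmetic (the power of 20 used here is just an example value; real clusters choose it when the ring is first built):

```python
# Partition power: the number of bits of the MD5 hash kept as the
# partition index. 2 to the partition power is the partition count.
partition_power = 20
partition_count = 2 ** partition_power

# A partition power of 20 yields 1,048,576 partitions.
```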
<para>Another configurable value is the replica count, which indicates how many of
the partition->device assignments comprise a single ring. For a given partition
number, each replica's device will not be in the same zone as any other
replica's device. Zones can be used to group devices based on physical
locations, power separations, network separations, or any other attribute that
would lessen the chance of multiple replicas being unavailable at the same time.</para>
<section xml:id="managing-rings-with-the-ring-builder">
<title>Managing Rings with the Ring Builder</title>
<para>The rings are built and managed manually by a utility called the ring-builder.
The ring-builder assigns partitions to devices and writes an optimized Python
structure to a gzipped, pickled file on disk for shipping out to the servers.
The server processes just check the modification time of the file occasionally
and reload their in-memory copies of the ring structure as needed. Because of
how the ring-builder manages changes to the ring, using a slightly older ring
usually just means one of the three replicas for a subset of the partitions
will be incorrect, which can be easily worked around.</para>
<para>The ring-builder also keeps its own builder file with the ring information and
additional data required to build future rings. It is very important to keep
multiple backup copies of these builder files. One option is to copy the
builder files out to every server while copying the ring files themselves.
Another is to upload the builder files into the cluster itself. Complete loss
of a builder file will mean creating a new ring from scratch, nearly all
partitions will end up assigned to different devices, and therefore nearly all
data stored will have to be replicated to new locations. So, recovery from a
builder file loss is possible, but data will definitely be unreachable for an
extended time.</para>
<section xml:id="about-the-ring-data-structure">
<title>About the Ring Data Structure</title>
<para>The ring data structure consists of three top level fields: a list of devices
in the cluster, a list of lists of device ids indicating partition to device
assignments, and an integer indicating the number of bits to shift an MD5 hash
to calculate the partition for the hash.</para>
<section xml:id="list-of-devices-in-the-ring">
<title>List of Devices in the Ring</title>
<para>The list of devices is known internally to the Ring class as devs. Each item in
the list of devices is a dictionary with the following keys:</para>
<table rules="all">
<caption>List of Devices and Keys</caption>
<tbody>
<tr>
<td>Key</td>
<td>Type</td>
<td>Description</td></tr>
<tr><td>id</td>
<td>integer</td>
<td>The index into the list of devices.</td>
</tr>
<tr><td>zone</td>
<td>integer</td>
<td>The zone the device resides in.</td>
</tr>
<tr><td>weight</td>
<td>float</td>
<td>The relative weight of the device in comparison to other
devices. This usually corresponds directly to the amount of
disk space the device has compared to other devices. For
instance, a device with 1 terabyte of space might have a weight
of 100.0 and another device with 2 terabytes of space might
have a weight of 200.0. This weight can also be used to bring
back into balance a device that has ended up with more or less
data than desired over time. A good average weight of 100.0
allows flexibility in lowering the weight later if necessary.</td>
</tr>
<tr><td>ip</td>
<td>string</td>
<td>The IP address of the server containing the device.</td>
</tr>
<tr><td>port</td>
<td>int</td>
<td>The TCP port on which the server process listens to serve
requests for the device.</td>
</tr>
<tr><td>device</td>
<td>string</td>
<td>The on-disk name of the device on the server.
For example: sdb1</td>
</tr>
<tr><td>meta</td>
<td>string</td>
<td>A general-use field for storing additional information for the
device. This information isn't used directly by the server
processes, but can be useful in debugging. For example, the
date and time of installation and hardware manufacturer could
be stored here.</td>
</tr>
</tbody>
</table>
<para>Note: The list of devices may contain holes, or indexes set to None, for
devices that have been removed from the cluster. Generally, device ids are not
reused. Also, some devices may be temporarily disabled by setting their weight
to 0.0.</para>
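As an illustration, a single entry in devs might look like the following (all values here are hypothetical):

```python
# A hypothetical entry in the ring's list of devices (devs).
device = {
    "id": 0,            # index into the list of devices
    "zone": 1,          # zone the device resides in
    "weight": 100.0,    # relative weight, e.g. 100.0 for a 1 TB drive
    "ip": "10.0.0.5",   # server holding the device
    "port": 6002,       # port of the listening server process
    "device": "sdb1",   # on-disk device name
    "meta": "installed 2011-03-01, vendor X",  # free-form debugging info
}
```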
</section>
</section>
<section xml:id="partition-assignment-list">
<title>Partition Assignment List</title>
<para>This is a list of array('I') of device ids. The outermost list contains an
array('I') for each replica. Each array('I') has a length equal to the
partition count for the ring. Each integer in the array('I') is an index into
the above list of devices. The partition list is known internally to the Ring
class as _replica2part2dev_id.</para>
<para>So, to create a list of device dictionaries assigned to a partition, the Python
code would look like: <code>devices = [self.devs[part2dev_id[partition]] for part2dev_id in self._replica2part2dev_id]</code></para>
<para>array('I') is used for memory conservation, as there may be millions of
partitions.</para>
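A toy version of this lookup, with two replicas, four partitions, and three devices (all values are hypothetical):

```python
from array import array

# Hypothetical ring state: 3 devices, 4 partitions, 2 replicas.
devs = [
    {"id": 0, "device": "sda"},
    {"id": 1, "device": "sdb"},
    {"id": 2, "device": "sdc"},
]

# One array('I') per replica; each array maps partition number to device id.
replica2part2dev_id = [
    array("I", [0, 1, 2, 0]),  # replica 0 assignments for partitions 0..3
    array("I", [1, 2, 0, 1]),  # replica 1 assignments for partitions 0..3
]

partition = 2
devices = [devs[part2dev_id[partition]] for part2dev_id in replica2part2dev_id]
# Partition 2 is held by device ids 2 and 0, i.e. sdc and sda.
```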
</section>
<section xml:id="partition-shift-value">
<title>Partition Shift Value</title>
<para>The partition shift value is known internally to the Ring class as _part_shift.
This value is used to shift an MD5 hash to calculate the partition on which the
data for that hash should reside. Only the top four bytes of the hash are used
in this process. For example, to compute the partition for the path
/account/container/object, the Python code might look like: <code>partition = unpack_from('>I', md5('/account/container/object').digest())[0] >> self._part_shift</code></para>
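Putting the shift together with the partition power, a self-contained sketch (the path and partition power are arbitrary example values):

```python
from hashlib import md5
from struct import unpack_from

partition_power = 20
# 32 bits are unpacked from the top of the hash; keep only the top
# partition_power bits of those.
part_shift = 32 - partition_power

path = "/account/container/object"
top_four_bytes = unpack_from(">I", md5(path.encode()).digest())[0]
partition = top_four_bytes >> part_shift

# partition always falls within the ring's partition count.
```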
</section>
</section>
<section xml:id="building-the-ring">
<title>Building the Ring</title>
<para>The initial building of the ring first calculates the number of partitions that
should ideally be assigned to each device, based on the device's weight. For
example, if the partition power is 20, the ring will have 1,048,576 partitions.
If there are 1,000 devices of equal weight, they will each desire 1,048.576
partitions. The devices are then sorted by the number of partitions they desire
and kept in order throughout the initialization process.</para>
<para>Then, the ring builder assigns each partition's replica to the device that
desires the most partitions at that point, with the restriction that the device
is not in the same zone as any other replica for that partition. Once assigned,
the device's desired partition count is decremented and moved to its new sorted
location in the list of devices, and the process continues.</para>
<para>When building a new ring based on an old ring, the desired number of partitions
each device wants is recalculated. Next, the partitions to be reassigned are
gathered up. Any removed devices have all their assigned partitions unassigned
and added to the gathered list. Any devices that have more partitions than they
now desire have random partitions unassigned from them and added to the
gathered list. Lastly, the gathered partitions are then reassigned to devices
using a similar method as in the initial assignment described above.</para>
<para>Whenever a partition has a replica reassigned, the time of the reassignment is
recorded. This is taken into account when gathering partitions to reassign, so
that no partition is moved twice in a configurable amount of time. This
configurable amount of time is known internally to the RingBuilder class as
min_part_hours. This restriction is ignored for replicas of partitions on
devices that have been removed, as removing a device only happens on device
failure and there's no choice but to make a reassignment.</para>
<para>The above processes don't always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more balanced
ring, the rebalance process is repeated until nearly perfect (less than 1% off) or
until the balance doesn't improve by at least 1% (indicating we probably can't
get perfect balance due to wildly imbalanced zones or too many partitions
recently moved).</para>
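The first step, computing each device's desired partition count from its weight, can be sketched as follows (the device weights are hypothetical):

```python
partition_power = 20
replicas = 3
partition_count = 2 ** partition_power

# Hypothetical devices with their relative weights.
devices = [
    {"id": 0, "weight": 100.0},  # e.g. a 1 TB drive
    {"id": 1, "weight": 100.0},
    {"id": 2, "weight": 200.0},  # a 2 TB drive desires twice as much
]

total_weight = sum(d["weight"] for d in devices)
for d in devices:
    # Each device ideally holds a share of all replica assignments
    # proportional to its weight.
    d["desired_parts"] = partition_count * replicas * d["weight"] / total_weight
```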
</section>
<section xml:id="history-of-the-ring-design">
<title>History of the Ring Design</title>
<para>The ring code went through many iterations before arriving at its current
state, and while it has been stable for a while, the algorithm may be tweaked or
perhaps even fundamentally changed if new ideas emerge. This section describes
the previous ideas attempted and explains why they were
discarded.</para>
<para>A “live ring” option was considered, where each server could maintain its own
copy of the ring and the servers would use a gossip protocol to communicate the
changes they made. This was discarded as too complex and error prone to code
correctly in the project time span available. One bug could easily gossip bad
data out to the entire cluster and be difficult to recover from. Having an
externally managed ring simplifies the process, allows full validation of data
before it's shipped out to the servers, and guarantees each server is using a
ring from the same timeline. It also means that the servers themselves aren't
spending a lot of resources maintaining rings.</para>
<para>A couple of “ring server” options were considered. One was where all ring
lookups would be done by calling a service on a separate server or set of
servers, but this was discarded due to the latency involved. Another was much
like the current process, but where servers could submit change requests to the
ring server to have a new ring built and shipped back out to the servers. This
was discarded due to project time constraints and because ring changes are
currently infrequent enough that manual control was sufficient. However, the lack
of quick automatic ring changes did mean that other parts of the system had to
be coded to handle devices being unavailable for a period of hours until
someone could manually update the ring.</para>
<para>The current ring process has each replica of a partition independently assigned
to a device. A version of the ring that used a third of the memory was tried,
where the first replica of a partition was directly assigned and the other two
were determined by “walking” the ring until finding additional devices in other
zones. This was discarded because control was lost over how many replicas for a
given partition moved at once. Keeping each replica independent allows for
moving only one partition replica within a given time window (except due to
device failures). Using the additional memory was deemed a good tradeoff for
moving data around the cluster much less often.</para>
<para>Another ring design was tried where the partition to device assignments weren't
stored in a big list in memory but instead each device was assigned a set of
hashes, or anchors. The partition would be determined from the data item's hash
and the nearest device anchors would determine where the replicas should be
stored. However, to get a reasonable distribution of data, each device had to have
a lot of anchors, and walking through those anchors to find replicas started to
add up. In the end, the memory savings wasn't that great and more processing
power was used, so the idea was discarded.</para>
<para>A completely non-partitioned ring was also tried but discarded, as the
partitioning helps many other parts of the system, especially replication.
Replication can be attempted and retried in a partition batch with the other
replicas rather than each data item independently attempted and retried. Hashes
of directory structures can be calculated and compared with other replicas to
reduce directory walking and network traffic.</para>
<para>Partitioning and independently assigning partition replicas also allowed for
the best balanced cluster. The best of the other strategies tended to give
±10% variance on device balance with devices of equal weight and ±15% with
devices of varying weights. The current strategy allows us to get ±3% and ±8%,
respectively.</para>
<para>Various hashing algorithms were tried. SHA offers better security, but the ring
doesn't need to be cryptographically secure and SHA is slower. Murmur was much
faster, but MD5 was built-in and hash computation is a small percentage of the
overall request handling time. In all, once it was decided the servers wouldn't
be maintaining the rings themselves anyway and would only be doing hash lookups, MD5 was
chosen for its general availability, good distribution, and adequate speed.</para>
</section>
</section>
<section xml:id="the-account-reaper">
<title>The Account Reaper</title>
<para>The Account Reaper removes data from deleted accounts in the background.</para>
<para>An account is marked for deletion by a reseller through the services server's
remove_storage_account XMLRPC call. This simply puts the value DELETED into the
status column of the account_stat table in the account database (and replicas),
indicating the data for the account should be deleted later. There is no set
retention time and no undelete; it is assumed the reseller will implement such
features and only call remove_storage_account once it is truly desired the
account's data be removed.</para>
<para>The account reaper runs on each account server and scans the server
occasionally for account databases marked for deletion. It will only trigger on
accounts that server is the primary node for, so that multiple account servers
aren't all trying to do the same work at the same time. Using multiple servers
to delete one account might improve deletion speed, but requires coordination
so they aren't duplicating effort. Speed really isn't as much of a concern with
data deletion and large accounts aren't deleted that often.</para>
<para>The deletion process for an account itself is pretty straightforward. For each
container in the account, each object is deleted and then the container is
deleted. Any deletion requests that fail won't stop the overall process, but
will cause the overall process to fail eventually (for example, if an object
delete times out, the container won't be able to be deleted later and therefore
the account won't be deleted either). The overall process continues even on a
failure so that it doesn't get hung up reclaiming cluster space because of one
troublesome spot. The account reaper will keep trying to delete an account
until it eventually becomes empty, at which point the database reclaim process
within the db_replicator will eventually remove the database files.</para>
<section xml:id="account-reaper-background-and-history">
<title>Account Reaper Background and History</title>
<para>At first, a simple approach of deleting an account through completely external
calls was considered, as it required no changes to the system. All data would
simply be deleted in the same way the actual user would, through the public
REST API. However, the downside was that it would use proxy resources and log
everything when it didn't really need to. Also, it would likely need a
dedicated server or two, just for issuing the delete requests.</para>
<para>A completely bottom-up approach was also considered, where the object and
container servers would occasionally scan the data they held and check if the
account was deleted, removing the data if so. The upside was the speed of
reclamation with no impact on the proxies or logging, but the downside was that
nearly 100% of the scanning would result in no action, creating a lot of I/O
load for no reason.</para>
<para>A more container server centric approach was also considered, where the account
server would mark all the containers for deletion and the container servers
would delete the objects in each container and then themselves. This has the
benefit of still speedy reclamation for accounts with a lot of containers, but
has the downside of a pretty big load spike. The process could be slowed down
to alleviate the load spike possibility, but then the benefit of speedy
reclamation is lost and what's left is just a more complex process. Also,
scanning all the containers for those marked for deletion when the majority
wouldn't be seemed wasteful. The db_replicator could do this work while
performing its replication scan, but it would have to spawn and track deletion
processes, which seemed needlessly complex.</para>
<para>In the end, an account server centric approach seemed best, as described above.</para>
</section>
</section>
</section>
<section xml:id="replication">
<title>Replication</title>
<para>Since each replica in OpenStack Object Storage functions independently, and clients generally require only a simple majority of nodes responding to consider an operation successful, transient failures like network partitions can quickly cause replicas to diverge. These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes. The replicator processes traverse their local filesystems, concurrently performing operations in a manner that balances load across physical disks.</para>
<para>Replication uses a push model, with records and files generally only being copied from local to remote replicas. This is important because data on the node may not belong there (as in the case of handoffs and ring changes), and a replicator can't know what data exists elsewhere in the cluster that it should pull in. It's the duty of any node that contains data to ensure that data gets to where it belongs. Replica placement is handled by the ring.</para>
<para>Every deleted record or file in the system is marked by a tombstone, so that deletions can be replicated alongside creations. These tombstones are cleaned up by the replication process after a period of time referred to as the consistency window, which is related to replication duration and how long transient failures can remove a node from the cluster. Tombstone cleanup must be tied to replication to reach replica convergence.</para>
<para>If a replicator detects that a remote drive has failed, it will use the ring's “get_more_nodes” interface to choose an alternate node to synchronize with. The replicator can generally maintain desired levels of replication in the face of hardware failures, though some replicas may not be in an immediately usable location.</para>
<para>Replication is an area of active development, and likely rife with potential improvements to speed and correctness.</para>
<para>There are two major classes of replicator: the db replicator, which replicates accounts and containers, and the object replicator, which replicates object data.</para>
<section xml:id="database-replication">
<title>Database Replication</title>
<para>The first step performed by db replication is a low-cost hash comparison to find out whether or not two replicas already match. Under normal operation, this check is able to verify very quickly that most databases in the system are already synchronized. If the hashes differ, the replicator brings the databases in sync by sharing records added since the last sync point.</para>
<para>This sync point is a high water mark noting the last record at which two databases were known to be in sync, and is stored in each database as a tuple of the remote database id and record id. Database ids are unique among all replicas of the database, and record ids are monotonically increasing integers. After all new records have been pushed to the remote database, the entire sync table of the local database is pushed, so the remote database knows it's now in sync with everyone the local database has previously synchronized with.</para>
<para>If a replica is found to be missing entirely, the whole local database file is transmitted to the peer using rsync(1) and assigned a new unique id.</para>
<para>In practice, db replication can process hundreds of databases per concurrency setting per second (up to the number of available CPUs or disks) and is bound by the number of database transactions that must be performed.</para>
</section>
<section xml:id="object-replication">
<title>Object Replication</title>
<para>The initial implementation of object replication simply performed an rsync to push data from a local partition to all remote servers on which it was expected to exist. While this performed adequately at small scale, replication times skyrocketed once directory structures could no longer be held in RAM. We now use a modification of this scheme in which a hash of the contents for each suffix directory is saved to a per-partition hashes file. The hash for a suffix directory is invalidated when the contents of that suffix directory are modified.</para>
<para>The object replication process reads in these hash files, calculating any invalidated hashes. It then transmits the hashes to each remote server that should hold the partition, and only suffix directories with differing hashes on the remote server are rsynced. After pushing files to the remote server, the replication process notifies it to recalculate hashes for the rsynced suffix directories.</para>
<para>Performance of object replication is generally bound by the number of uncached directories it has to traverse, usually as a result of invalidated suffix directory hashes. Using write volume and partition counts from our running systems, it was designed so that around 2% of the hash space on a normal node will be invalidated per day, which has experimentally given us acceptable replication speeds.</para>
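<para>The per-suffix hashing described above can be sketched as follows; the directory layout and helper names are hypothetical, not the object replicator's actual code.</para>

```python
# Illustrative sketch of per-suffix hashing for object replication;
# helper names and on-disk layout are hypothetical.
import hashlib
import os


def hash_suffix(path):
    # Hash the sorted file names in one suffix directory.
    h = hashlib.md5()
    for name in sorted(os.listdir(path)):
        h.update(name.encode())
    return h.hexdigest()


def refresh_hashes(partition_path, cached, invalidated):
    # Recalculate only the suffixes whose cached hashes were invalidated.
    for suffix in invalidated:
        cached[suffix] = hash_suffix(os.path.join(partition_path, suffix))
    return cached


def suffixes_to_sync(local_hashes, remote_hashes):
    # Only suffix dirs whose hashes differ on the remote need rsyncing.
    return sorted(s for s, h in local_hashes.items()
                  if remote_hashes.get(s) != h)
```

After rsyncing the differing suffix directories, the remote end recalculates its own hashes for just those suffixes, mirroring the notification step described above.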
</section></section>
<section xml:id="managing-large-objects">
<title>Managing Large Objects (Greater than 5 GB)</title>
<para>OpenStack Object Storage has a limit on the size of a single uploaded object; by default this is 5 GB. However, the download size of a single object is virtually unlimited with the concept of segmentation. Segments of the larger object are uploaded, and a special manifest file is created that, when downloaded, sends all the segments concatenated as a single object. Segmentation also offers much greater upload speed, with the possibility of parallel uploads of the segments.</para>
<section xml:id="using-swift-to-manage-segmented-objects">
<title>Using swift to Manage Segmented Objects</title>
<para>The quickest way to try out this feature is to use the included swift OpenStack Object Storage client tool. You can use the -S option to specify the segment size to use when splitting a large file. For example:</para>
<literallayout>swift upload test_container -S 1073741824 large_file</literallayout>
<para>This splits large_file into 1 GB segments and begins uploading those segments in parallel. Once all the segments have been uploaded, swift creates the manifest file so the segments can be downloaded as one.</para>
<para>Now the following swift command would download the entire large object:</para>
<literallayout>swift download test_container large_file</literallayout>
<para>The swift CLI uses a strict convention for its segmented object support. In the above example it uploads all the segments into a second container named test_container_segments. These segments have names like large_file/1290206778.25/21474836480/00000000, large_file/1290206778.25/21474836480/00000001, and so on.</para>
<para>The main benefit of using a separate container is that the main container listings will not be polluted with all the segment names. The reason for using the segment name format of &lt;name&gt;/&lt;timestamp&gt;/&lt;size&gt;/&lt;segment&gt; is so that an upload of a new file with the same name won't overwrite the contents of the first until the last moment, when the manifest file is updated.</para>
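<para>The naming convention can be illustrated with a short sketch; the helper below is hypothetical, showing only how names of the form &lt;name&gt;/&lt;timestamp&gt;/&lt;size&gt;/&lt;segment&gt; sort in upload order.</para>

```python
# Hypothetical helper illustrating the segment-name convention
# <name>/<timestamp>/<size>/<segment>; zero-padding the segment index
# keeps the names sorting in concatenation order.
def segment_names(name, timestamp, total_size, segment_size):
    count = -(-total_size // segment_size)  # ceiling division
    return ['%s/%s/%s/%08d' % (name, timestamp, total_size, i)
            for i in range(count)]


# The values from the example above: a ~20 GB file in 1 GB segments.
names = segment_names('large_file', '1290206778.25', 21474836480, 1073741824)
```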
<para>The swift CLI manages these segment files for you, deleting old segments on deletes and overwrites, and so on. You can override this behavior with the --leave-segments option if desired; this is useful if you want to keep multiple versions of the same large object available.</para>
</section>
<section xml:id="direct-api-management-of-large-objects">
<title>Direct API Management of Large Objects</title>
<para>You can also work with the segments and manifests directly with HTTP requests instead of having swift do that for you. You upload the segments like you would any other object, and the manifest is just a zero-byte file with an extra X-Object-Manifest header.</para>
<para>All the object segments need to be in the same container, have a common object name prefix, and have names that sort in the order in which they should be concatenated. They don't have to be in the same container as the manifest file, which is useful for keeping container listings clean, as explained above.</para>
<para>The manifest file is simply a zero-byte file with the extra <code>X-Object-Manifest: &lt;container&gt;/&lt;prefix&gt;</code> header, where &lt;container&gt; is the container the object segments are in and &lt;prefix&gt; is the common prefix for all the segments.</para>
<para>It is best to upload all the segments first and then create or update the manifest. In this way, the full object won't be available for downloading until the upload is complete. Also, you can upload a new set of segments to a second location and then update the manifest to point to this new location. During the upload of the new segments, the original manifest will still be available to download the first set of segments.</para>
<para>Here's an example using curl with tiny 1-byte segments:</para>
<literallayout>
# First, upload the segments
curl -X PUT -H 'X-Auth-Token: &lt;token&gt;' \
    http://&lt;storage_url&gt;/container/myobject/1 --data-binary '1'
curl -X PUT -H 'X-Auth-Token: &lt;token&gt;' \
    http://&lt;storage_url&gt;/container/myobject/2 --data-binary '2'
curl -X PUT -H 'X-Auth-Token: &lt;token&gt;' \
    http://&lt;storage_url&gt;/container/myobject/3 --data-binary '3'

# Next, create the manifest file
curl -X PUT -H 'X-Auth-Token: &lt;token&gt;' \
    -H 'X-Object-Manifest: container/myobject/' \
    http://&lt;storage_url&gt;/container/myobject --data-binary ''

# And now we can download the segments as a single object
curl -H 'X-Auth-Token: &lt;token&gt;' \
    http://&lt;storage_url&gt;/container/myobject</literallayout>
</section>
<section xml:id="additional-notes-on-large-objects">
<title>Additional Notes on Large Objects</title>
<itemizedlist>
<listitem><para>With a GET or HEAD of a manifest file, the <code>X-Object-Manifest: &lt;container&gt;/&lt;prefix&gt;</code> header will be returned with the concatenated object, so you can tell where it's getting its segments from.</para></listitem>
<listitem><para>The response's Content-Length for a GET or HEAD on the manifest file will be the sum of all the segments in the &lt;container&gt;/&lt;prefix&gt; listing, computed dynamically. So, uploading additional segments after the manifest is created will cause the concatenated object to be that much larger; there's no need to recreate the manifest file.</para></listitem>
<listitem><para>The response's Content-Type for a GET or HEAD on the manifest will be the same as the Content-Type set during the PUT request that created the manifest. You can easily change the Content-Type by reissuing the PUT.</para></listitem>
<listitem><para>The response's ETag for a GET or HEAD on the manifest file will be the MD5 sum of the concatenated string of ETags for each of the segments in the &lt;container&gt;/&lt;prefix&gt; listing, computed dynamically. Usually in OpenStack Object Storage the ETag is the MD5 sum of the contents of the object, and that holds true for each segment independently. But it's not feasible to generate such an ETag for the manifest itself, so this method was chosen to at least offer change detection.</para></listitem>
</itemizedlist>
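<para>The manifest ETag rule in the last item can be reproduced client-side for change detection. A minimal sketch, assuming each segment's ETag is the MD5 of its contents; the segment values are the tiny 1-byte segments from the curl example:</para>

```python
# Client-side computation of the manifest ETag described above:
# MD5 over the concatenated segment ETags, where each segment ETag
# is the MD5 of that segment's contents.
import hashlib


def etag(data):
    return hashlib.md5(data).hexdigest()


def manifest_etag(segment_etags):
    return etag(''.join(segment_etags).encode())


segments = [b'1', b'2', b'3']  # the 1-byte segments from the curl example
combined = manifest_etag([etag(s) for s in segments])
```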
</section>
<section xml:id="large-object-storage-history-and-background">
<title>Large Object Storage History and Background</title>
<para>Large object support went through various iterations before settling on this implementation.</para>
<para>The primary factor driving the limitation of object size in OpenStack Object Storage is maintaining balance among the partitions of the ring. To maintain an even dispersion of disk usage throughout the cluster, the obvious storage pattern was to simply split larger objects into smaller segments, which could then be glued together during a read.</para>
<para>Before the introduction of large object support, some applications were already splitting their uploads into segments and re-assembling them on the client side after retrieving the individual pieces. This design allowed the client to support backup and archiving of large data sets, but was also frequently employed to improve performance or reduce errors due to network interruption. The major disadvantage of this method is that knowledge of the original partitioning scheme is required to properly reassemble the object, which is not practical for some use cases, such as CDN origination.</para>
<para>In order to eliminate any barrier to entry for clients wanting to store objects larger than 5 GB, we initially prototyped fully transparent support for large object uploads. A fully transparent implementation would support a larger maximum size by automatically splitting objects into segments during upload within the proxy, without any changes to the client API. All segments were completely hidden from the client API.</para>
<para>This solution introduced a number of challenging failure conditions into the cluster, wouldn't provide the client with any option to do parallel uploads, and had no basis for a resume feature. The transparent implementation was deemed just too complex for the benefit.</para>
<para>The current “user manifest” design was chosen in order to provide a transparent download of large objects to the client and still provide the uploading client a clean API to support segmented uploads.</para>
<para>Alternative “explicit” user manifest options were discussed, which would have required a pre-defined format for listing the segments to “finalize” the segmented upload. While this may offer some potential advantages, it was decided that pushing an added burden onto the client, which could potentially limit adoption, should be avoided in favor of a simpler “API” (essentially just the format of the X-Object-Manifest header).</para>
<para>During development it was noted that this “implicit” user manifest approach, which is based on the path prefix, can be affected by the eventual consistency window of the container listings, which could theoretically cause a GET on the manifest object to return an invalid whole object for that short term. In reality, you're unlikely to encounter this scenario unless you're running very high concurrency uploads against a small testing environment that isn't running the object-updaters or container-replicators.</para>
<para>Like all of OpenStack Object Storage, large object support is a living feature that will continue to improve and may change over time.</para>
</section>
</section>
<section xml:id="throttling-resources-by-setting-rate-limits">
<title>Throttling Resources by Setting Rate Limits</title>
<para>Rate limiting in OpenStack Object Storage is implemented as pluggable middleware that you configure on the proxy server. Rate limiting is performed on requests that result in database writes to the account and container sqlite databases. It uses memcached and depends on the proxy servers having closely synchronized time. The rate limits are limited by the accuracy of the proxy server clocks.</para>
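<para>The clock-based approach can be sketched as follows. This is a simplified, illustrative single-process model, not the actual memcached-backed middleware: each request advances a shared “next allowed time” by 1/limit seconds, and a request sleeps until its slot arrives.</para>

```python
# Hedged sketch of clock-based rate limiting; the class and its
# bookkeeping are illustrative, not Swift's middleware code.
import time


class RateLimiter:
    def __init__(self, max_rate, clock=time.monotonic):
        self.interval = 1.0 / max_rate  # seconds between allowed requests
        self.next_allowed = 0.0         # shared "next allowed time"
        self.clock = clock

    def sleep_time(self):
        # Reserve the next slot and return how long this request
        # must sleep before it may proceed.
        now = self.clock()
        self.next_allowed = max(self.next_allowed, now) + self.interval
        return max(self.next_allowed - self.interval - now, 0.0)
```

With a frozen clock, the first request proceeds immediately and each subsequent request is pushed back by one interval, which is the behavior the real middleware approximates across proxies via memcached.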
<section xml:id="configuration-for-rate-limiting">
<title>Configuration for Rate Limiting</title>
<para>All configuration is optional. If no account or container limits are provided, there is no rate limiting. The available configuration options are:</para>
<table rules="all">
<caption>Configuration options for rate limiting in the proxy-server.conf file</caption>
<tbody>
<tr><td>Option</td>
<td>Default</td>
<td>Description</td>
</tr>
<tr><td>clock_accuracy</td>
<td>1000</td>
<td>Represents how accurate the proxy servers' system clocks are with each other. 1000 means that all the proxies' clocks are accurate to each other within 1 millisecond. No rate limit should be higher than the clock accuracy.</td>
</tr>
<tr><td>max_sleep_time_seconds</td>
<td>60</td>
<td>The app immediately returns a 498 response if the necessary sleep time ever exceeds the given max_sleep_time_seconds.</td>
</tr>
<tr><td>log_sleep_time_seconds</td>
<td>0</td>
<td>To allow visibility into rate limiting, set this value > 0 and all sleeps greater than the number will be logged.</td>
</tr>
<tr><td>account_ratelimit</td>
<td>0</td>
<td>If set, limits all requests to /account_name and PUTs to /account_name/container_name. The number is in requests per second.</td>
</tr>
<tr><td>account_whitelist</td>
<td>''</td>
<td>Comma-separated list of account names that will not be rate limited.</td>
</tr>
<tr><td>account_blacklist</td>
<td>''</td>
<td>Comma-separated list of account names that will not be allowed. Returns a 497 response.</td>
</tr>
<tr><td>container_ratelimit_size</td>
<td>''</td>
<td>When set with container_ratelimit_x = r: for containers of size x, limit requests per second to r. Limits GET and HEAD requests to /account_name/container_name and PUTs and DELETEs to /account_name/container_name/object_name.</td>
</tr></tbody>
</table>
<para>The container rate limits are linearly interpolated from the values given. A sample container rate limiting configuration could be:</para>
<literallayout>container_ratelimit_100 = 100
container_ratelimit_200 = 50
container_ratelimit_500 = 20</literallayout>
<para>This would result in:</para>
<table rules="all">
<caption>Values for Rate Limiting with Sample Configuration Settings</caption>
<tbody>
<tr><td>Container Size</td>
<td>Rate Limit</td>
</tr>
<tr><td>0-99</td>
<td>No limiting</td>
</tr>
<tr><td>100</td>
<td>100</td>
</tr>
<tr><td>150</td>
<td>75</td>
</tr>
<tr><td>500</td>
<td>20</td>
</tr>
<tr><td>1000</td>
<td>20</td>
</tr>
</tbody>
</table>
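<para>The linear interpolation behind the table above can be sketched as follows; the function is illustrative, not the middleware's code.</para>

```python
# Linear interpolation of container rate limits from configured
# (size, requests-per-second) points, as in the sample table above.
def container_ratelimit(size, limits):
    """limits: sorted list of (container_size, requests_per_second)."""
    if size < limits[0][0]:
        return None  # below the smallest configured size: no limiting
    for (s1, r1), (s2, r2) in zip(limits, limits[1:]):
        if s1 <= size <= s2:
            # interpolate between the two surrounding points
            return r1 + (r2 - r1) * (size - s1) / (s2 - s1)
    return limits[-1][1]  # at or beyond the largest configured size


# The sample configuration: container_ratelimit_100/200/500.
limits = [(100, 100), (200, 50), (500, 20)]
```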
</section>
</section>
<section xml:id="configuring-openstack-object-storage-with-s3_api">
<title>Configuring Object Storage with the S3 API</title>
<para>The Swift3 middleware emulates the S3 REST API on top of Object Storage.</para>
<para>The following operations are currently supported:</para>
<itemizedlist>
<listitem><para>GET Service</para></listitem>
<listitem><para>DELETE Bucket</para></listitem>
<listitem><para>GET Bucket (List Objects)</para></listitem>
<listitem><para>PUT Bucket</para></listitem>
<listitem><para>DELETE Object</para></listitem>
<listitem><para>GET Object</para></listitem>
<listitem><para>HEAD Object</para></listitem>
<listitem><para>PUT Object</para></listitem>
<listitem><para>PUT Object (Copy)</para></listitem></itemizedlist>
<para>To add this middleware to your configuration, add the swift3 middleware in front of the auth middleware and before any other middleware that looks at swift requests (such as rate limiting).</para>
<para>Ensure that your proxy-server.conf file contains swift3 in the pipeline and the [filter:swift3] section, as shown below:</para>
<literallayout class="monospaced">[pipeline:main]
pipeline = healthcheck cache swift3 swauth proxy-server

[filter:swift3]
use = egg:swift#swift3
</literallayout>
<para>Next, configure the tool that you use to connect to the S3 API. For s3curl, for example, you need to add your host IP information to the @endpoints array (line 33 in s3curl.pl):</para>
<literallayout class="monospaced">my @endpoints = ( '1.2.3.4');</literallayout>
<para>Now you can send commands to the endpoint, such as:</para>
<literallayout class="monospaced">./s3curl.pl - 'myacc:myuser' -key mypw -get - -s -v http://1.2.3.4:8080
</literallayout>
<para>To set up your client, note that the access key is the concatenation of the account and user strings (for example, test:tester), and the secret access key is the account password. The host should also point to the Swift storage node's hostname. The client also has to use the old-style (path-based) calling format, not the hostname-based container format. Here is an example client setup using the Python boto library against a locally installed all-in-one Swift installation:</para>
<literallayout class="monospaced">
connection = boto.s3.connection.S3Connection(
    aws_access_key_id='test:tester',
    aws_secret_access_key='testing',
    port=8080,
    host='127.0.0.1',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())
</literallayout></section>
<section xml:id="managing-openstack-object-storage-with-swift-cli">
<title>Managing OpenStack Object Storage with the Swift CLI</title><para>The Object Storage (swift) project includes a command-line tool, named swift, that can perform a variety of tasks on your storage cluster. This client utility can be used for ad hoc processing, gathering statistics, listing items, updating metadata, and uploading, downloading, and deleting files. It is based on the native swift client library, client.py. Incorporating client.py into swift provides many benefits, such as seamlessly re-authorizing if the current token expires in the middle of processing, retrying operations up to five times, and a default processing concurrency of 10. All of these things help make the swift tool robust and great for operational use.</para>
<section xml:id="swift-cli-basics">
<title>Swift CLI Basics</title>
<para>The command-line usage for swift, the CLI tool, is:
<literallayout>swift (command) [options] [args]</literallayout></para>
<para>Here are the available commands for swift.</para>
<simplesect><title>stat [container] [object]</title>
<para>Displays information for the account, container, or object, depending on the args given (if any).</para></simplesect>
<simplesect><title>list [options] [container]</title>
<para>Lists the containers for the account or the objects for a container. The -p or --prefix option lists only items beginning with that prefix. The -d or --delimiter option (for container listings only) rolls up items with the given delimiter, a character that can act as a nested directory organizer.</para></simplesect>
<simplesect><title>upload [options] container file_or_directory [file_or_directory] […]</title><para>Uploads to the given container the files and directories specified by the remaining args. The -c or --changed option uploads only files that have changed since the last upload.</para></simplesect>
<simplesect><title>post [options] [container] [object]</title>
<para>Updates meta information for the account, container, or object, depending on the args given. If the container is not found, it is created automatically; this is not true for accounts and objects. Containers also allow the -r (or --read-acl) and -w (or --write-acl) options. The -m or --meta option is allowed on all and is used to define user metadata items to set, in the form Name:Value. This option can be repeated.</para>
<para>Example: post -m Color:Blue -m Size:Large</para></simplesect>
<simplesect><title>download --all OR download container [object] [object] …</title>
<para>Downloads everything in the account (with --all), or everything in a container, or a list of objects, depending on the args given. For a single object download, you may use the -o or --output (filename) option to redirect the output to a specific file, or to stdout if the filename is “-”.</para></simplesect>
<simplesect><title>delete --all OR delete container [object] [object] …</title>
<para>Deletes everything in the account (with --all), or everything in a container, or a list of objects, depending on the args given.</para>
<para>Example: swift -A https://auth.api.rackspacecloud.com/v1.0 -U user -K key stat</para></simplesect>
<simplesect><title>Options for swift</title>
<para>--version show program's version number and exit</para>
<para>-h, --help show this help message and exit</para>
<para>-s, --snet Use SERVICENET internal network</para>
<para>-v, --verbose Print more info</para>
<para>-q, --quiet Suppress status output</para>
<para>-A AUTH, --auth=AUTH URL for obtaining an auth token</para>
<para>-U USER, --user=USER User name for obtaining an auth token</para>
<para>-K KEY, --key=KEY Key for obtaining an auth token</para></simplesect></section>
<section xml:id="analyzing-log-files-with-swift-cli">
<title>Analyzing Log Files with Swift CLI</title>
<para>When you want quick, command-line answers to questions about logs, you can use swift with the -o or --output option. The -o or --output option can only be used with a single object download, to redirect the data stream to either a different file name or to stdout (-). The ability to redirect the output to stdout allows you to pipe data without saving it to disk first. One common use case is quick log file analysis. First, let's use swift to set up some data for the examples. The “logtest” directory contains four log files with the following line format:</para>
<para><literallayout>files:
2010-11-16-21_access.log
2010-11-16-22_access.log
2010-11-15-21_access.log
2010-11-15-22_access.log

log lines:
Nov 15 21:53:52 lucid64 proxy-server - 127.0.0.1 15/Nov/2010/22/53/52 DELETE /v1/AUTH_cd4f57824deb4248a533f2c28bf156d3/2eefc05599d44df38a7f18b0b42ffedd HTTP/1.0 204 - - test%3Atester%2CAUTH_tkcdab3c6296e249d7b7e2454ee57266ff - - - txaba5984c-aac7-460e-b04b-afc43f0c6571 - 0.0432</literallayout></para>
<para>The swift tool can easily upload the four log files into a container named “logtest”:</para><para>
<literallayout>
$ cd logs
$ swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K \
  testing upload logtest *.log
2010-11-16-21_access.log
2010-11-16-22_access.log
2010-11-15-21_access.log
2010-11-15-22_access.log

get statistics on the account:
$ swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K \
  testing -q stat
   Account: AUTH_cd4f57824deb4248a533f2c28bf156d3
Containers: 1
   Objects: 4
     Bytes: 5888268

get statistics on the container:
$ swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K \
  testing stat logtest
  Account: AUTH_cd4f57824deb4248a533f2c28bf156d3
Container: logtest
  Objects: 4
    Bytes: 5864468
 Read ACL:
Write ACL:

list all the objects in the container:
$ swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K \
  testing list logtest
2010-11-15-21_access.log
2010-11-15-22_access.log
2010-11-16-21_access.log
2010-11-16-22_access.log</literallayout></para>
<para>These next three examples use the -o or --output option with a hyphen (-) to help answer questions about the uploaded log files. The swift command downloads an object and streams it to awk to determine the breakdown of requests by return code for everything during 2200 on November 16th, 2010. Based on the log line format, column 9 is the type of request and column 12 is the return code. After awk processes the data stream, it is piped to sort and then uniq -c to sum up the number of occurrences for each combination of request type and return code.</para>
<para><literallayout>$ swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K \
  testing download -o - logtest 2010-11-16-22_access.log \
  | awk '{ print $9"-"$12}' | sort | uniq -c

 805 DELETE-204
  12 DELETE-404
   2 DELETE-409
 723 GET-200
 142 GET-204
  74 GET-206
  80 GET-304
  34 GET-401
   5 GET-403
  18 GET-404
 166 GET-412
   2 GET-416
  50 HEAD-200
  17 HEAD-204
  20 HEAD-401
   8 HEAD-404
  30 POST-202
  25 POST-204
  22 POST-400
   6 POST-404
 842 PUT-201
   2 PUT-202
  32 PUT-400
   4 PUT-403
   4 PUT-404
   2 PUT-411
   6 PUT-412
   6 PUT-413
   2 PUT-422
   8 PUT-499
</literallayout></para>
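<para>The same breakdown can be computed in Python instead of awk; a minimal sketch using the log line format shown earlier (1-indexed columns 9 and 12 are the request method and return code):</para>

```python
# Python equivalent of the awk pipeline above: tally request-method /
# return-code pairs. Column numbers follow the log format shown earlier
# (1-indexed columns 9 and 12 -> list indexes 8 and 11).
from collections import Counter


def count_by_method_status(lines):
    counts = Counter()
    for line in lines:
        cols = line.split()
        if len(cols) >= 12:
            counts['%s-%s' % (cols[8], cols[11])] += 1
    return counts


# The sample log line from the beginning of this section.
sample = ('Nov 15 21:53:52 lucid64 proxy-server - 127.0.0.1 '
          '15/Nov/2010/22/53/52 DELETE '
          '/v1/AUTH_cd4f57824deb4248a533f2c28bf156d3/'
          '2eefc05599d44df38a7f18b0b42ffedd HTTP/1.0 204 - - '
          'test%3Atester%2CAUTH_tkcdab3c6296e249d7b7e2454ee57266ff '
          '- - - txaba5984c-aac7-460e-b04b-afc43f0c6571 - 0.0432')
```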
<para>This example uses a bash for loop with awk and swift with its -o or --output option with a hyphen (-) to find out how many PUT requests are in each log file. First, create a list of objects by running swift with the list command on the “logtest” container; then, for each item in the list, run swift with download -o -, then pipe the output into grep to filter the PUT requests, and finally into wc -l to count the lines.</para>
<para><literallayout>$ for f in `swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K testing list logtest` ; \
  do echo -ne "$f - PUTS - " ; swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K \
  testing download -o - logtest $f | grep PUT | wc -l ; done

2010-11-15-21_access.log - PUTS - 402
2010-11-15-22_access.log - PUTS - 1091
2010-11-16-21_access.log - PUTS - 892
2010-11-16-22_access.log - PUTS - 910
</literallayout></para>
<para>By adding the -p or --prefix option, a prefix query is performed on the list to return only the object names that begin with a specific string. Let's determine how many PUT requests are in each object with a name beginning with “2010-11-15”. First, create a list of objects by running swift with the list command on the “logtest” container with the prefix option -p 2010-11-15. Then, on each item returned, run swift with download -o -, then pipe the output to grep and wc as in the previous example. The echo command is added to display the object name.</para>
<para><literallayout>$ for f in `swift -A http://swift-auth.com:11000/v1.0 -U test:tester -K testing list \
  -p 2010-11-15 logtest` ; do echo -ne "$f - PUTS - " ; \
  swift -A http://127.0.0.1:11000/v1.0 -U test:tester -K testing \
  download -o - logtest $f | grep PUT | wc -l ; done

2010-11-15-21_access.log - PUTS - 402
2010-11-15-22_access.log - PUTS - 910
</literallayout></para>
<para>The swift utility is simple, scalable, and flexible, and provides useful solutions, all of which are core principles of cloud computing; the -o or --output option is just one of its many features.</para></section></section>
</chapter>
|