Added first cut of the deployment guide, and updated auth overview to better
represent the current code
This commit is contained in:
parent 651d2be91d
commit 9916adba75
450
doc/source/deployment_guide.rst
Normal file
@@ -0,0 +1,450 @@
================
Deployment Guide
================

-----------------------
Hardware Considerations
-----------------------

Swift is designed to run on commodity hardware. At Rackspace, our storage
servers are currently fairly generic 4U servers with 24 2T SATA drives and
8 cores of processing power. RAID on the storage drives is not required and
not recommended. Swift's disk usage pattern is the worst case possible for
RAID, and performance degrades very quickly using RAID 5 or 6.

------------------
Deployment Options
------------------

The Swift services run completely autonomously, which provides for a lot of
flexibility when architecting the hardware deployment for Swift. The four main
services are:

#. Proxy Services
#. Object Services
#. Container Services
#. Account Services

The Proxy Services are more CPU and network I/O intensive. If you are using
10g networking to the proxy, or are terminating SSL traffic at the proxy,
greater CPU power will be required.

The Object, Container, and Account Services (Storage Services) are more disk
and network I/O intensive.

The easiest deployment is to install all services on each server. There is
nothing wrong with doing this, as it scales each service out horizontally.

At Rackspace, we put the Proxy Services on their own servers and all of the
Storage Services on the same server. This allows us to send 10g networking to
the proxy and 1g to the storage servers, and keep load balancing to the
proxies more manageable. Storage Services scale out horizontally as storage
servers are added, and we can scale overall API throughput by adding more
proxies.

If you need more throughput to either Account or Container Services, they may
each be deployed to their own servers. For example, you might use faster (but
more expensive) SAS or even SSD drives to get faster disk I/O to the databases.

Load balancing and network design are left as an exercise to the reader,
but this is a very important part of the cluster, so time should be spent
designing the network for a Swift cluster.

------------------
Preparing the Ring
------------------

The first step is to determine the number of partitions that will be in the
ring. We recommend that there be a minimum of 100 partitions per drive to
ensure even distribution across the drives. A good starting point is to
figure out the maximum number of drives the cluster will contain, multiply
that by 100, and then round up to the nearest power of two.

For example, imagine we are building a cluster that will have no more than
5,000 drives. That would mean a total of 500,000 partitions, which is pretty
close to 2^19, rounded up.
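
Spelled out, that calculation looks like this::

    5,000 drives * 100 partitions/drive = 500,000 partitions
    2^18 = 262,144  (too small)
    2^19 = 524,288  (the nearest power of two, so use a partition power of 19)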

It is also a good idea to keep the number of partitions relatively small.
The more partitions there are, the more work has to be done by the
replicators and other backend jobs, and the more memory the rings consume
in process. The goal is to find a good balance between small rings and
maximum cluster size.

The next step is to determine the number of replicas of the data to store.
Currently it is recommended to use 3 (as this is the only value that has
been tested). The higher the number, the more storage is used, but the less
likely you are to lose data.

It is also important to determine how many zones the cluster should have. We
recommend starting with a minimum of 5 zones. You can start with fewer, but
our testing has shown that having at least five zones is optimal when
failures occur. We also recommend configuring the zones at as high a level
as possible to create as much isolation as possible. Some things to take
into consideration include physical location, power availability, and
network connectivity. For example, in a small cluster you might decide to
split the zones up by cabinet, with each cabinet having its own power and
network connectivity. The zone concept is very abstract, so feel free to
use it in whatever way best isolates your data from failure. Zones are
referenced by number, beginning with 1.

You can now start building the ring with::

    swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>

This will start the ring build process, creating the <builder_file> with
2^<part_power> partitions. <min_part_hours> is the minimum time in hours
before a given partition can be moved in succession (24 is a good value
for this).
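
For example, the object ring for the hypothetical 5,000-drive cluster above
could be started with::

    swift-ring-builder object.builder create 19 3 24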

Devices can be added to the ring with::

    swift-ring-builder <builder_file> add z<zone>-<ip>:<port>/<device_name>_<meta> <weight>

This will add a device to the ring, where <builder_file> is the name of the
builder file that was created previously, <zone> is the number of the zone
this device is in, <ip> is the IP address of the server the device is in,
<port> is the port number that the server is running on, <device_name> is
the name of the device on the server (for example: sdb1), <meta> is an
optional string of metadata for the device, and <weight> is a float weight
that determines how many partitions are put on the device relative to the
rest of the devices in the cluster (a good starting point is 100.0 x TB on
the drive). Add each device that will be initially in the cluster.
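
For example, a 2T drive named sdb1 in zone 1 could be added with the
following (the IP address is only an illustration; the port matches the
object server default used later in this guide)::

    swift-ring-builder object.builder add z1-10.0.0.1:6000/sdb1 200.0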

Once all of the devices are added to the ring, run::

    swift-ring-builder <builder_file> rebalance

This will distribute the partitions across the drives in the ring. Whenever
making changes to the ring, it is important to make all the required changes
before running rebalance. This ensures that the ring stays as balanced as
possible, and that as few partitions as possible are moved.

The above process should be done to make a ring for each storage service
(Account, Container and Object). The builder files will be needed in future
changes to the ring, so it is very important that these be kept and backed
up. The resulting .tar.gz ring file should be pushed to all of the servers
in the cluster. For more information about building rings, running
swift-ring-builder with no options will display help text with available
commands and options. More information on how the ring works internally
can be found in the :doc:`Ring Overview <overview_ring>`.
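
As a sketch of the whole process for one service, using hypothetical
addresses and the default object server port::

    swift-ring-builder object.builder create 19 3 24
    swift-ring-builder object.builder add z1-10.0.0.1:6000/sdb1 200.0
    swift-ring-builder object.builder add z2-10.0.0.2:6000/sdb1 200.0
    swift-ring-builder object.builder rebalance

Repeat the same steps for account.builder and container.builder, using the
account and container server ports.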

---------------------------
Object Server Configuration
---------------------------

An Example Object Server configuration can be found at
etc/object-server.conf-sample in the source code repository.

The following configuration options are available:

[object-server]

================== ========== =============================================
Option             Default    Description
------------------ ---------- ---------------------------------------------
swift_dir          /etc/swift Swift configuration directory
devices            /srv/node  Parent directory of where devices are mounted
mount_check        true       Whether or not to check if the devices are
                              mounted to prevent accidentally writing
                              to the root device
bind_ip            0.0.0.0    IP Address for server to bind to
bind_port          6000       Port for server to bind to
workers            1          Number of workers to fork
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
log_requests       True       Whether or not to log each request
user               swift      User to run as
node_timeout       3          Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
network_chunk_size 65536      Size of chunks to read/write over the
                              network
disk_chunk_size    65536      Size of chunks to read/write to disk
max_upload_time    86400      Maximum time allowed to upload an object
slow               0          If > 0, minimum time in seconds for a PUT
                              or DELETE request to complete
================== ========== =============================================

[object-replicator]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
daemonize          yes        Whether or not to run replication as a
                              daemon
run_pause          30         Time in seconds to wait between replication
                              passes
concurrency        1          Number of replication workers to spawn
timeout            5          Timeout value sent to rsync --timeout and
                              --contimeout options
stats_interval     3600       Interval in seconds between logging
                              replication statistics
reclaim_age        604800     Time elapsed in seconds before an object
                              can be reclaimed
================== ========== ===========================================

[object-updater]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           300        Minimum time for a pass to take
concurrency        1          Number of updater workers to spawn
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
slowdown           0.01       Time in seconds to wait between objects
================== ========== ===========================================

[object-auditor]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           1800       Minimum time for a pass to take
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================
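
As an illustrative sketch only (the sample file in the source repository is
the authoritative reference), an object-server.conf assembled from the
defaults in the tables above might look like::

    # hypothetical minimal config built from the documented defaults
    [object-server]
    devices = /srv/node
    mount_check = true
    bind_ip = 0.0.0.0
    bind_port = 6000
    workers = 1
    user = swift

    [object-replicator]
    run_pause = 30
    concurrency = 1

    [object-updater]
    interval = 300

    [object-auditor]
    interval = 1800

The Container and Account Server configurations that follow use the same
layout, with their own sections and ports.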

------------------------------
Container Server Configuration
------------------------------

An example Container Server configuration can be found at
etc/container-server.conf-sample in the source code repository.

The following configuration options are available:

[container-server]

================== ========== ============================================
Option             Default    Description
------------------ ---------- --------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
swift_dir          /etc/swift Swift configuration directory
devices            /srv/node  Parent directory of where devices are mounted
mount_check        true       Whether or not to check if the devices are
                              mounted to prevent accidentally writing
                              to the root device
bind_ip            0.0.0.0    IP Address for server to bind to
bind_port          6001       Port for server to bind to
workers            1          Number of workers to fork
user               swift      User to run as
node_timeout       3          Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ============================================

[container-replicator]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
per_diff           1000
concurrency        8          Number of replication workers to spawn
run_pause          30         Time in seconds to wait between replication
                              passes
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
reclaim_age        604800     Time elapsed in seconds before a container
                              can be reclaimed
================== ========== ===========================================

[container-updater]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           300        Minimum time for a pass to take
concurrency        4          Number of updater workers to spawn
node_timeout       3          Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
slowdown           0.01       Time in seconds to wait between containers
================== ========== ===========================================

[container-auditor]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           1800       Minimum time for a pass to take
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================

----------------------------
Account Server Configuration
----------------------------

An example Account Server configuration can be found at
etc/account-server.conf-sample in the source code repository.

The following configuration options are available:

[account-server]

================== ========== =============================================
Option             Default    Description
------------------ ---------- ---------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
swift_dir          /etc/swift Swift configuration directory
devices            /srv/node  Parent directory of where devices are mounted
mount_check        true       Whether or not to check if the devices are
                              mounted to prevent accidentally writing
                              to the root device
bind_ip            0.0.0.0    IP Address for server to bind to
bind_port          6002       Port for server to bind to
workers            1          Number of workers to fork
user               swift      User to run as
================== ========== =============================================

[account-replicator]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
per_diff           1000
concurrency        8          Number of replication workers to spawn
run_pause          30         Time in seconds to wait between replication
                              passes
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
reclaim_age        604800     Time elapsed in seconds before an account
                              can be reclaimed
================== ========== ===========================================

[account-auditor]

==================== ========== ===========================================
Option               Default    Description
-------------------- ---------- -------------------------------------------
log_facility         LOG_LOCAL0 Syslog log facility
log_level            INFO       Logging level
interval             1800       Minimum time for a pass to take
max_container_count 100         Maximum containers randomly picked for
                                a given account audit
node_timeout         10         Request timeout to external services
conn_timeout         0.5        Connection timeout to external services
==================== ========== ===========================================

[account-reaper]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
concurrency        25         Number of reaper workers to spawn
interval           3600       Minimum time for a pass to take
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================

----------------------
General Service Tuning
----------------------

Most services support either a workers or concurrency value in the settings.
This allows the services to make effective use of the cores available. A
good starting point is to set the concurrency level for the proxy and
storage services to 2 times the number of cores available. If more than one
service is sharing a server, then some experimentation may be needed to
find the best balance.
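
As a rough sketch of that starting point (the core count command is
Linux-specific)::

    # count the cores available
    grep -c ^processor /proc/cpuinfo
    # with 8 cores, a starting point in each server config would be:
    #   workers = 16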

At Rackspace, our Proxy servers have dual quad core processors, giving us 8
cores. Our testing has shown 16 workers to be a pretty good balance when
saturating a 10g network, while giving good CPU utilization.

Our Storage Services all run together on the same servers. These servers
have dual quad core processors, for 8 cores total. We run the Account,
Container, and Object servers with 8 workers each. Most of the background
jobs are run at a concurrency of 1, with the exception of the replicators,
which are run at a concurrency of 2.

The above configuration settings should be taken as suggestions; test your
configuration settings to ensure the best utilization of CPU, network
connectivity, and disk I/O.

-------------------------
Filesystem Considerations
-------------------------

Swift is designed to be mostly filesystem agnostic--the only requirement
being that the filesystem supports extended attributes (xattrs). After
thorough testing with our use cases and hardware configurations, XFS was
the best all-around choice. If you decide to use a filesystem other than
XFS, we highly recommend thorough testing.

If you are using XFS, some settings can dramatically impact performance.
We recommend the following when creating the XFS partition::

    mkfs.xfs -i size=1024 -f /dev/sda1

Setting the inode size is important, as XFS stores xattr data in the inode.
If the metadata is too large to fit in the inode, a new extent is created,
which can cause quite a performance problem. Upping the inode size to 1024
bytes provides enough room to write the default metadata, plus a little
headroom. We do not recommend running Swift on RAID, but if you are using
RAID it is also important to make sure that the proper sunit and swidth
settings get set so that XFS can make the most efficient use of the RAID
array.

We also recommend the following example mount options when using XFS::

    mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /srv/node/sda

For a standard Swift install, all data drives are mounted directly under
/srv/node (as can be seen in the above example of mounting /dev/sda1 as
/srv/node/sda). If you choose to mount the drives in another directory,
be sure to set the `devices` config option in all of the server configs to
point to the correct directory.
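
To make such a mount persistent, an /etc/fstab entry along these lines could
be used (a sketch; adjust the device and mount point to your hardware)::

    /dev/sda1 /srv/node/sda xfs noatime,nodiratime,nobarrier,logbufs=8 0 0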

---------------------
General System Tuning
---------------------

Rackspace currently runs Swift on Ubuntu Server 10.04, and the following
changes have been found to be useful for our use cases.

The following settings should be in `/etc/sysctl.conf`::

    # disable TIME_WAIT.. wait..
    net.ipv4.tcp_tw_recycle=1
    net.ipv4.tcp_tw_reuse=1

    # disable syn cookies
    net.ipv4.tcp_syncookies = 0

    # double amount of allowed conntrack
    net.ipv4.netfilter.ip_conntrack_max = 262144

To load the updated sysctl settings, run ``sudo sysctl -p``.

A note about changing the TIME_WAIT values: by default, the OS will hold
a port open for 60 seconds to ensure that any remaining packets can be
received. During high usage, and with the number of connections that are
created, it is easy to run out of ports. We can change this since we are
in control of the network. If you are not in control of the network, or
do not expect high loads, then you may not want to adjust these values.

----------------------
Logging Considerations
----------------------

Swift is set up to log directly to syslog. Every service can be configured
with the `log_facility` option to set the syslog log facility destination.
It is recommended to use syslog-ng to route the logs to specific log
files locally on the server, and also to remote log collecting servers.
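
As a sketch of that kind of routing (syslog-ng syntax; the source name s_all
and the file path are assumptions for illustration)::

    # route facility local0 (Swift's default log_facility) to its own file
    filter f_local0 { facility(local0); };
    destination d_swift { file("/var/log/swift/swift.log"); };
    log { source(s_all); filter(f_local0); destination(d_swift); };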

@@ -33,6 +33,13 @@ Development:
    development_guidelines
    development_saio

Deployment:

.. toctree::
    :maxdepth: 1

    deployment_guide

Source:

.. toctree::

@@ -2,9 +2,13 @@
The Auth System
===============

The auth system for Swift is based on the auth system from an existing
architecture -- actually from a few existing auth systems -- and is therefore a
bit disjointed. The distilled points about it are:
--------------
Developer Auth
--------------

The auth system for Swift is based on the auth system from the existing
Rackspace architecture -- actually from a few existing auth systems --
and is therefore a bit disjointed. The distilled points about it are:

* The authentication/authorization part is outside Swift itself
* The user of Swift passes in an auth token with each request

@@ -19,29 +23,29 @@ of something unique, some use "something else" but the salient point is that
the token is a string which can be sent as-is back to the auth system for
validation.

The validation call is, for historical reasons, an XMLRPC call. There are two
types of auth systems, type 0 and type 1. With type 0, the XMLRPC call is given
the token and the Swift account name (also known as the account hash because
it's usually of the format <reseller>_<hash>). With type 1, the call is given
the container name and HTTP method as well as the token and account hash. Both
types are also given a service login and password recorded in Swift's
resellers.conf. For a valid token, both auth system types respond with a
session TTL and overall expiration in seconds from now. Swift does not honor
the session TTL but will cache the token up to the expiration time. Tokens can
be purged through a call to Swift's services server.
An auth call is given the auth token and the Swift account hash. For a valid
token, the auth system responds with a session TTL and overall expiration in
seconds from now. Swift does not honor the session TTL but will cache the
token up to the expiration time. Tokens can be purged through a call to the
auth system.

How the user gets the token to use with Swift is up to the reseller software
itself. For instance, with Cloud Files the user has a starting URL to an auth
system. The user starts a session by sending a ReST request to that auth system
to receive the auth token, a URL to the Swift system, and a URL to the CDN
system.
The user starts a session by sending a ReST request to that auth system
to receive the auth token and a URL to the Swift system.

--------------
Extending Auth
--------------

Auth is written as WSGI middleware, so implementing your own auth is as easy
as writing new WSGI middleware and plugging it in to the proxy server.

The current middleware is implemented in the DevAuthMiddleware class in
swift/common/auth.py, and should be a good starting place for implementing
your own auth.

------------------
History and Future
------------------

What's established in Swift for authentication/authorization has history from
before Swift, so that won't be recorded here. It was minimally integrated with
Swift to meet project deadlines, but in the near future Swift should have a
pluggable auth/reseller system to support the above as well as other
architectures.
before Swift, so that won't be recorded here.