Added initial admin guide, and added more to the deployment guide, plus
cleaned up some of the docstring warnings
This commit is contained in:
Chuck Thier 2010-07-30 14:57:20 -05:00
parent 0baceef8ad
commit e051495715
8 changed files with 261 additions and 14 deletions

View File

@ -2,3 +2,4 @@
*.sw?
doc/build/*
dist
swift.egg-info

doc/source/admin_guide.rst Normal file
View File

@ -0,0 +1,154 @@
=====================
Administrator's Guide
=====================
------------------
Managing the Rings
------------------
Removing a device from the ring::
swift-ring-builder <builder-file> remove <ip_address>/<device_name>
Removing a server from the ring::
swift-ring-builder <builder-file> remove <ip_address>
Adding devices to the ring:
See :ref:`ring-preparing`
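For example, adding a device might look like the following rough sketch (the
zone, IP, port, device name, and weight shown are illustrative; see the link
above for the full syntax)::

    swift-ring-builder <builder-file> add z1-10.0.0.1:6000/sdb1 100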
See what devices for a server are in the ring::
swift-ring-builder <builder-file> search <ip_address>
Once you are done with all changes to the ring, the changes need to be
"committed"::
swift-ring-builder <builder-file> rebalance
Once the new rings are built, they should be pushed out to all the servers
in the cluster.
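For example, a rough sketch of pushing the rings with rsync, assuming the
rings live in `/etc/swift` and a hypothetical `cluster_hosts` file lists the
servers in the cluster::

    for server in $(cat /etc/swift/cluster_hosts); do
        rsync -a /etc/swift/*.ring.gz ${server}:/etc/swift/
    done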
-----------------------
Handling System Updates
-----------------------
It is recommended that system updates and reboots are done one zone at a time.
This allows the update to happen while the Swift cluster stays available
and responsive to requests. It is also advisable, after updating a zone, to
let it run for a while before updating the other zones, to make sure the
update doesn't have any adverse effects.
----------------------
Handling Drive Failure
----------------------
In the event that a drive has failed, the first step is to make sure the drive
is unmounted. This will make it easier for Swift to work around the failure
until it has been resolved. If the drive is going to be replaced immediately,
then it is just best to replace the drive, format it, remount it, and let
replication fill it up.
If the drive can't be replaced immediately, then it is best to leave it
unmounted, and remove the drive from the ring. This will allow all the
replicas that were on that drive to be replicated elsewhere until the drive
is replaced. Once the drive is replaced, it can be re-added to the ring.
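For example, a rough sketch for a failed `sdb1` on 10.0.0.1 that can't be
replaced immediately (assuming devices are mounted under `/srv/node`)::

    umount /srv/node/sdb1
    swift-ring-builder object.builder remove 10.0.0.1/sdb1
    swift-ring-builder object.builder rebalance

The remove and rebalance would be repeated for the account and container
builders as needed, and the new rings pushed out to the cluster.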
-----------------------
Handling Server Failure
-----------------------
If a server is having hardware issues, it is a good idea to make sure the
Swift services are not running. This will allow Swift to work around the
failure while you troubleshoot.
If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Swift work
around the failure and get the machine fixed and back online. When the
machine comes back online, replication will make sure that anything that is
missing during the downtime will get updated.
If the server has more serious issues, then it is probably best to remove
all of the server's devices from the ring. Once the server has been repaired
and is back online, the server's devices can be added back into the ring.
It is important that the devices are reformatted before putting them back
into the ring, as they are likely to be responsible for a different set of
partitions than before.
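For example, a rough sketch for a server at 10.0.0.2 that needs lengthy
repairs::

    swift-init all stop                                 # on the failed server
    swift-ring-builder object.builder remove 10.0.0.2   # on the builder machine
    swift-ring-builder object.builder rebalance

again repeating the remove and rebalance for the account and container rings,
and pushing the results out to the cluster.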
-----------------------
Detecting Failed Drives
-----------------------
It has been our experience that when a drive is about to fail, error messages
will spew into `/var/log/kern.log`. There is a script called
`swift-drive-audit` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Swift can
work around it. The script takes a configuration file with the following
settings:
[drive-audit]
================== ========== ===========================================
Option Default Description
------------------ ---------- -------------------------------------------
log_facility LOG_LOCAL0 Syslog log facility
log_level INFO Log level
device_dir /srv/node Directory devices are mounted under
minutes 60 Number of minutes to look back in
`/var/log/kern.log`
error_limit 1 Number of errors to find before a device
is unmounted
================== ========== ===========================================
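A sample configuration using only the options from the table above (all
values shown are the defaults)::

    [drive-audit]
    log_facility = LOG_LOCAL0
    log_level = INFO
    device_dir = /srv/node
    minutes = 60
    error_limit = 1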
This script has only been tested on Ubuntu 10.04, so if you are using a
different distro or OS, some care should be taken before using it in production.
--------------
Cluster Health
--------------
TODO: Greg, add docs here about how to use swift-stats-populate, and
swift-stats-report
------------------------
Debugging Tips and Tools
------------------------
When a request is made to Swift, it is given a unique transaction id. This
id should be in every log line that has to do with that request. This can
be useful when looking at all the services that are hit by a single request.
If you need to know where a specific account, container or object is in the
cluster, `swift-get-nodes` will show the location where each replica should be.
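For example, a rough invocation (the exact arguments may vary by version)::

    swift-get-nodes /etc/swift/object.ring.gz <account> <container> <object>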
If you are looking at an object on the server and need more info,
`swift-object-info` will display the account, container, replica locations
and metadata of the object.
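For example, pointed at an object's datafile on disk (the path layout shown
is illustrative)::

    swift-object-info /srv/node/sdb1/objects/<partition>/<suffix>/<hash>/<timestamp>.data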
If you want to audit the data for an account, `swift-account-audit` can be
used to crawl the account, checking that all containers and objects can be
found.
-----------------
Managing Services
-----------------
Swift services are generally managed with `swift-init`. The general usage is
``swift-init <service> <command>``, where service is the Swift service to
manage (for example object, container, account, proxy) and command is one of:
========== ===============================================
Command Description
---------- -----------------------------------------------
start Start the service
stop Stop the service
restart Restart the service
shutdown Attempt to gracefully shutdown the service
reload Attempt to gracefully restart the service
========== ===============================================
A graceful shutdown or reload will finish any current requests before
completely stopping the old service. There is also a special case of
`swift-init all <command>`, which will run the command for all swift services.
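For example::

    swift-init object restart
    swift-init proxy shutdown
    swift-init all reload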

View File

@ -51,6 +51,8 @@ Load balancing and network design is left as an exercise to the reader,
but this is a very important part of the cluster, so time should be spent
designing the network for a Swift cluster.
.. _ring-preparing:
------------------
Preparing the Ring
------------------
@ -320,7 +322,7 @@ per_diff 1000
concurrency 8 Number of replication workers to spawn
run_pause 30 Time in seconds to wait between replication
passes
node_timeout 10 Request timeout to external services
conn_timeout 0.5 Connection timeout to external services
reclaim_age 604800 Time elapsed in seconds before an account
can be reclaimed
@ -353,6 +355,99 @@ node_timeout 10 Request timeout to external services
conn_timeout 0.5 Connection timeout to external services
================== ========== ===========================================
--------------------------
Proxy Server Configuration
--------------------------
[proxy-server]
============================ =============== =============================
Option Default Description
---------------------------- --------------- -----------------------------
log_facility LOG_LOCAL0 Syslog log facility
log_level INFO Log level
bind_ip 0.0.0.0 IP Address for server to
bind to
bind_port 80 Port for server to bind to
cert_file Path to the ssl .crt
key_file Path to the ssl .key
swift_dir /etc/swift Swift configuration directory
log_headers True If True, log headers in each
request
workers 1 Number of workers to fork
user swift User to run as
recheck_account_existence 60 Cache timeout in seconds to
send memcached for account
existence
recheck_container_existence 60 Cache timeout in seconds to
send memcached for container
existence
object_chunk_size 65536 Chunk size to read from
object servers
client_chunk_size 65536 Chunk size to read from
clients
memcache_servers 127.0.0.1:11211 Comma separated list of
memcached servers ip:port
node_timeout 10 Request timeout to external
services
client_timeout 60 Timeout to read one chunk
from a client
conn_timeout 0.5 Connection timeout to
external services
error_suppression_interval 60 Time in seconds that must
elapse since the last error
for a node to be considered
no longer error limited
error_suppression_limit 10 Error count to consider a
node error limited
rate_limit 20000.0 Max container level ops per
second
account_rate_limit 200.0 Max account level ops per
second
rate_limit_account_whitelist Comma separated list of
account name hashes to not
rate limit
rate_limit_account_blacklist Comma separated list of
account name hashes to block
completely
============================ =============== =============================
[auth-server]
============ =================================== ========================
Option Default Description
------------ ----------------------------------- ------------------------
class swift.common.auth.DevAuthMiddleware Auth wsgi middleware
to use
ip 127.0.0.1 IP address of auth
server
port 11000 Port of auth server
node_timeout 10 Request timeout
============ =================================== ========================
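As a rough sketch, a minimal `proxy-server.conf` built only from the options
above (all values are the listed defaults)::

    [proxy-server]
    bind_port = 80
    workers = 1
    user = swift
    recheck_account_existence = 60
    recheck_container_existence = 60
    memcache_servers = 127.0.0.1:11211

    [auth-server]
    class = swift.common.auth.DevAuthMiddleware
    ip = 127.0.0.1
    port = 11000
    node_timeout = 10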
------------------------
Memcached Considerations
------------------------
Several of the services rely on Memcached for caching certain types of
lookups, such as auth tokens, and container/account existence. Swift does
not do any caching of actual object data. Memcached should be able to run
on any servers that have available RAM and CPU. At Rackspace, we run
Memcached on the proxy servers. The `memcache_servers` config option
in the `proxy-server.conf` should contain all memcached servers.
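For example, with memcached running on three hypothetical proxy servers::

    memcache_servers = 10.1.2.3:11211,10.1.2.4:11211,10.1.2.5:11211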
-----------
System Time
-----------
Time may be relative but it is relatively important for Swift! Swift uses
timestamps to determine which is the most recent version of an object.
It is very important for the system time on each server in the cluster to
be synced as closely as possible (more so for the proxy server, but in general
it is a good idea for all the servers). At Rackspace, we use NTP with a local
NTP server to ensure that the system times are as close as possible. This
should also be monitored to ensure that the times do not vary too much.
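One rough way to spot-check a server is `ntpq -p`, which lists each
configured NTP peer along with its measured offset::

    ntpq -p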
----------------------
General Service Tuning
----------------------

View File

@ -39,6 +39,7 @@ Deployment:
:maxdepth: 1
deployment_guide
admin_guide
Source:

View File

@ -12,8 +12,6 @@
# recheck_account_existence = 60
# recheck_container_existence = 60
# object_chunk_size = 8192
# container_chunk_size = 8192
# account_chunk_size = 8192
# client_chunk_size = 8192
# Default for memcache_servers is below, but you can specify multiple servers
# with the format: 10.1.2.3:11211,10.1.2.4:11211
@ -32,7 +30,6 @@
# account_rate_limit = 200.0
# rate_limit_account_whitelist = acct1,acct2,etc
# rate_limit_account_blacklist = acct3,acct4,etc
# container_put_lock_timeout = 5
# [auth-server]
# class = swift.common.auth.DevAuthMiddleware

View File

@ -201,10 +201,14 @@ class AccountReaper(object):
:param partition: The partition in the account ring the account is on.
:param nodes: The primary node dicts for the account to delete.
* See also: :class:`swift.common.db.AccountBroker` for the broker
class.
* See also: :func:`swift.common.ring.Ring.get_nodes` for a description
of the node dicts.
.. seealso::
    :class:`swift.common.db.AccountBroker` for the broker class.
.. seealso::
    :func:`swift.common.ring.Ring.get_nodes` for a description
    of the node dicts.
""" """
begin = time() begin = time()
account = broker.get_info()['account'] account = broker.get_info()['account']

View File

@ -123,7 +123,7 @@ def invalidate_hash(suffix_dir):
Invalidates the hash for a suffix_dir in the partition's hashes file.
:param suffix_dir: absolute path to suffix dir whose hash needs
invalidating
"""
suffix = os.path.basename(suffix_dir)

View File

@ -949,9 +949,6 @@ class BaseApplication(object):
self.conn_timeout = float(conf.get('conn_timeout', 0.5))
self.client_timeout = int(conf.get('client_timeout', 60))
self.object_chunk_size = int(conf.get('object_chunk_size', 65536))
self.container_chunk_size = \
int(conf.get('container_chunk_size', 65536))
self.account_chunk_size = int(conf.get('account_chunk_size', 65536))
self.client_chunk_size = int(conf.get('client_chunk_size', 65536))
self.log_headers = conf.get('log_headers') == 'True'
self.error_suppression_interval = \
@ -979,8 +976,6 @@ class BaseApplication(object):
self.rate_limit_blacklist = [x.strip() for x in
conf.get('rate_limit_account_blacklist', '').split(',')
if x.strip()]
self.container_put_lock_timeout = \
int(conf.get('container_put_lock_timeout', 5))
def get_controller(self, path):
"""